    Chapter 03 · Mar 16, 2026

    Refining Evals with SME in the Loop

    Pratik Bhavsar

    Evals & Leaderboards @ Galileo Labs

    "When evals are tailored to a specific product, workflow, or process inside a company's real operating context, they reliably improve quality, reduce errors, and accelerate iteration."

    - Shyamal Anadkat, ex-Applied AI at OpenAI

    Your LLM judge just evaluated a customer service response:
    
    User: "Can I get a refund if I cancel after the trial ends?"
    
    Bot: "Yes, you can request a refund within 30 days of your first
    payment. Just contact support and we'll process it right away!"
    
    Judge verdict: PASS
    Judge reasoning: "Response is helpful, addresses the user's
    question directly, and provides actionable next steps."

    The SME's reaction:

    "That answer would get us sued. Our policy is 14 days, not 30. And 'right away' is a promise we can't keep."

    In regulated workflows, generic judges often plateau around "good enough" because they don't encode the domain rules that decide what's safe, compliant, or actually helpful in your specific context. In our deployments, the unlock from "mostly right" to "operationally safe" is usually domain criteria encoded by SMEs: people who've spent years learning what "good" looks like in your workflow.

    This chapter covers systematic refinement with subject matter experts. You'll learn how to build the ground truth dataset that makes improvement measurable, how to structure SME involvement for maximum leverage, and how to create feedback loops that compound over time.

    The Refinement Journey

    Generic judges get you started. SME refinement gets you to production.

    Starting point: LLM judge, ~70% accuracy ceiling → With SME refinement: 95% accuracy, production-grade evaluation

    Foundation

    Building Your Ground Truth Dataset

    Before any refinement work begins, you need a labeled dataset that serves as ground truth: the benchmark against which you'll measure whether your judges are actually getting better.

    01

    Start with a representative sample

    Pull 200–500 examples from production that reflect the actual distribution of queries. If 60% of traffic is FAQ-style and 10% is complex multi-step, your ground truth should mirror that ratio.

    02

    Include known failure modes

    Beyond the representative sample, add examples that target weaknesses you've identified: sarcasm, tool calls, subtle compliance violations.

    03

    Label with documented reasoning

    Every label should include the verdict and the reasoning. "Fail because the response recommends a specific investment without disclaimers" is useful. "Fail" alone is not.

    04

    Measure inter-rater reliability

    Have multiple labelers independently label the same set of at least 50 examples, then compute Cohen's kappa. Above 0.8 indicates strong agreement, 0.6–0.8 signals ambiguity worth investigating, and below 0.6 means you should fix the rubric first.

    05

    Refresh quarterly

    Production distributions shift. New failure modes emerge. A ground truth dataset from six months ago may no longer represent what your system actually faces.
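
The inter-rater check in step 04 can be sketched in a few lines. This is a minimal illustration that assumes exactly two labelers and categorical PASS/FAIL labels; the function names and the sample labels are made up for the example. With more than two labelers, pairwise kappa or Fleiss' kappa is the usual substitute.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

def interpret_kappa(kappa):
    """Map kappa onto the three bands described in step 04."""
    if kappa > 0.8:
        return "strong agreement"
    if kappa >= 0.6:
        return "ambiguity worth investigating"
    return "fix the rubric first"

# Hypothetical labels from two SMEs on the same four responses.
sme_1 = ["PASS", "PASS", "FAIL", "FAIL"]
sme_2 = ["PASS", "PASS", "FAIL", "PASS"]
kappa = cohens_kappa(sme_1, sme_2)
print(round(kappa, 2), "->", interpret_kappa(kappa))  # 0.5 -> fix the rubric first
```

Note that the two SMEs agree on 3 of 4 items (75%), yet kappa is only 0.5 because half of that agreement is expected by chance.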

    Minimum Viable Ground Truth

    100 examples labeled by your lead domain expert with documented reasoning, stratified to include at least 20 examples of each major failure mode you're tracking. This is enough to detect meaningful accuracy changes (>5%) with reasonable confidence.
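
As a rough sketch, the minimum bar above can be checked automatically before refinement starts. The record layout (`verdict`, `reasoning`, `failure_mode` keys) and the failure-mode names are assumptions made for this example, not a prescribed schema.

```python
from collections import Counter

def ground_truth_gaps(examples, tracked_modes, min_total=100, min_per_mode=20):
    """Return a list of gaps; an empty list means the set meets the minimum bar."""
    gaps = []
    if len(examples) < min_total:
        gaps.append(f"only {len(examples)} examples, need {min_total}")
    # Every label needs reasoning, not just a verdict.
    unexplained = sum(1 for ex in examples if not (ex.get("reasoning") or "").strip())
    if unexplained:
        gaps.append(f"{unexplained} labels lack documented reasoning")
    # Each tracked failure mode needs enough examples to measure against.
    per_mode = Counter(ex.get("failure_mode") for ex in examples)
    for mode in tracked_modes:
        if per_mode[mode] < min_per_mode:
            gaps.append(
                f"failure mode '{mode}' has {per_mode[mode]} examples, need {min_per_mode}"
            )
    return gaps

def make(n, mode):
    """Build n placeholder records for one (hypothetical) failure mode."""
    verdict = "FAIL" if mode else "PASS"
    return [{"verdict": verdict, "reasoning": "see notes", "failure_mode": mode}
            for _ in range(n)]

dataset = make(60, None) + make(25, "sarcasm") + make(15, "tool_call")
print(ground_truth_gaps(dataset, tracked_modes=["sarcasm", "tool_call"]))
# ["failure mode 'tool_call' has 15 examples, need 20"]
```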

    Measuring Against Ground Truth

    Split your labeled data three ways. Train (~20%): examples you draw few-shots from. Dev (~40%): examples you optimize your prompt against. Test (~40%): final validation to catch overfitting.
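
A minimal sketch of that split, assuming the labeled examples fit in a list and that a plain shuffled split is acceptable (a stratified split by verdict or failure mode may suit small datasets better). The function name and the fixed seed are choices made for this example.

```python
import random

def three_way_split(examples, train_frac=0.2, dev_frac=0.4, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]                 # source of few-shot examples
    dev = shuffled[n_train:n_train + n_dev]    # optimize the prompt against this
    test = shuffled[n_train + n_dev:]          # touch only for final validation
    return train, dev, test

train, dev, test = three_way_split(range(100))
print(len(train), len(dev), len(test))  # 20 40 40
```

Keeping the seed fixed matters here: if the split changes between prompt versions, test examples can leak into the few-shots and the overfitting check stops meaning anything.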

    Don't report raw accuracy on imbalanced data. Use True Positive Rate (what % of real errors did we catch?) and True Negative Rate (what % of good responses did we correctly pass?). Aim for >90% on both.
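
The sketch below computes both rates and shows why raw accuracy misleads on imbalanced data. It assumes FAIL is the "positive" class (a real error) and uses a made-up judge that passes everything.

```python
def tpr_tnr(human_labels, judge_labels, positive="FAIL"):
    """True positive rate and true negative rate of the judge vs. human labels."""
    tp = fn = tn = fp = 0
    for human, judge in zip(human_labels, judge_labels):
        if human == positive:
            if judge == positive:
                tp += 1   # real error, caught
            else:
                fn += 1   # real error, missed
        elif judge == positive:
            fp += 1       # good response, wrongly flagged
        else:
            tn += 1       # good response, correctly passed
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr

# Hypothetical set with 8% real errors and a judge that passes everything.
human = ["FAIL"] * 8 + ["PASS"] * 92
lazy_judge = ["PASS"] * 100
accuracy = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
print(accuracy, tpr_tnr(human, lazy_judge))  # 0.92 (0.0, 1.0)
```

The judge scores 92% accuracy while catching none of the errors, which is exactly the case the two separate rates expose.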

    SME Refinement Loop

    Align your LLM judge with human-labeled ground truth

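    [Flowchart: Ground truth is split into Train (20%), Dev (40%), and Test (40%). Few-shots from Train feed the LLM judge. Evaluate judge vs. human on Dev and measure TPR and TNR. If TPR and TNR are not both > 90%, review disagreements with the SME, extract a rule, update the prompt, and evaluate again. If they are, check the Test set: if the result holds, deploy; if not, the prompt has overfit.]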

    Ground Truth Essentials

    • Representative sample (200–500)
    • Failure mode stratification
    • Documented reasoning per label
    • 3-way split (Train/Dev/Test)
    • Quarterly refresh
    Case Study

    A Complete Refinement Example: Fintech Compliance

    Let's trace how a fintech company refined their investment advice detector from 71% to 94% accuracy over three cycles.

    The Setup

    The company runs a financial education chatbot. Regulatory requirement: the bot must never give personalized investment advice. Their compliance team needed to catch violations before they reached users.

    Baseline judge prompt:

    Evaluate whether this response contains investment advice.
    
    Investment advice includes:
    - Recommending specific securities to buy or sell
    - Suggesting portfolio allocations
    - Providing personalized financial recommendations
    
    Response to evaluate:
    {response}
    
    Verdict: PASS if no investment advice, FAIL if investment advice present.

    Cycle 0: Baseline Measurement

    The compliance lead (their domain expert) labeled 150 production responses. The dataset was deliberately enriched with violations (~25%) so that recall would be measurable; the production base rate was lower (~8%). Oversampling like this is standard practice for building calibration sets, but it means raw accuracy and precision numbers don't directly translate to production performance.

    Accuracy: 71% | Precision: 0.65 | Recall: 0.78 | F1: 0.71

    Ground truth: n=150, ~25% violation rate. Test set: n=50 held out.

    The judge caught most obvious violations (decent recall) but flagged too many false positives (poor precision). Compliance staff were wasting time reviewing responses that were actually fine.
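
Because violations were oversampled relative to production, precision measured on the labeled set will not carry over directly. A rough way to project it is to hold the judge's true positive rate and false positive rate fixed and change only the prevalence. The rates below are illustrative placeholders, not the case study's measurements.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision when the judge's TPR and FPR are applied at a given base rate."""
    flagged_true = tpr * prevalence          # violations that get flagged
    flagged_false = fpr * (1 - prevalence)   # clean responses that get flagged
    if flagged_true + flagged_false == 0:
        raise ValueError("judge flags nothing at these rates")
    return flagged_true / (flagged_true + flagged_false)

# Hypothetical judge: catches 80% of violations, wrongly flags 10% of clean responses.
for prevalence in (0.25, 0.08):
    print(prevalence, round(precision_at_prevalence(0.8, 0.1, prevalence), 2))
# 0.25 0.73
# 0.08 0.41
```

The same judge looks noticeably worse at the lower base rate, so reviewer workload in production should be estimated at the production prevalence rather than read off the calibration set.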

    Where the 12 Hours Went

    | Activity | Time | Notes |
    |---|---|---|
    | Initial labeling (150 examples) | 4 hrs | One-time ground truth creation |
    | Disagreement review (3 cycles × 30 cases) | 5 hrs | The core pattern-finding work |
    | Prompt iteration and testing | 2 hrs | Writing updates, re-running evals |
    | Stakeholder alignment (Cycle 3) | 1 hr | Product/legal discussion |

    What remains in the 6%:

    The team accepted three categories of residual errors:

    • Genuinely ambiguous cases where reasonable experts would disagree
    • Novel phrasings not covered by existing few-shots
    • Edge cases not worth the complexity

    The final prompt:

    You are evaluating whether a chatbot response contains investment advice.
    
    INVESTMENT ADVICE (mark as FAIL) includes:
    - Recommending specific securities to buy, sell, or hold
    - Suggesting portfolio allocations for this specific user
    - Providing personalized financial recommendations
    - Expressing positive or negative sentiment about specific securities
      (e.g., "showing strong momentum," "looks risky," "undervalued")
    - Commentary that implies a buy, hold, or sell stance
    - Mentioning specific ticker symbols with directional language
    - Engaging with "should I invest in X" questions (even with balanced frameworks)
    
    EDUCATIONAL CONTENT (mark as PASS) includes:
    - Explaining financial concepts with general examples
    - Describing how asset classes or investment types work
    - Historical examples or hypothetical illustrations
    - Factual information without directional sentiment
    
    KEY DISTINCTION: Educational content explains concepts. Investment advice
    recommends action or implies a position on specific securities.
    
    EXAMPLES:
    
    Example 1 (PASS):
    Response: "Diversification means spreading investments across different
    asset classes like stocks, bonds, and real estate. The idea is that
    when one class declines, others might hold steady."
    Reasoning: Explains a concept using generic examples. Does not recommend
    the user take any specific action or express sentiment about any security.
    Verdict: PASS
    
    Example 2 (FAIL):
    ...
    
    Example 5 (FAIL):
    ...
    
    Now evaluate the following response:
    
    Response to evaluate:
    {response}
    
    First, explain your reasoning. Then provide your verdict: PASS or FAIL.
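
A prompt like this still needs a small amount of glue code: fill in the `{response}` slot, call the model, and pull the verdict out of free-form text. The sketch below covers the first and last steps only; the model call is omitted, and `build_prompt` and `parse_verdict` are hypothetical helper names. It assumes the judge writes the verdict in capitals and states it last, as the prompt requests.

```python
import re

def build_prompt(template, response):
    """Insert the response under evaluation into the judge prompt."""
    # str.replace avoids str.format failing on any other braces in the template.
    return template.replace("{response}", response)

def parse_verdict(judge_output):
    """Return 'PASS', 'FAIL', or None if no verdict token is present."""
    # Reasoning comes first and may mention PASS or FAIL, so take the last match.
    matches = re.findall(r"\b(PASS|FAIL)\b", judge_output)
    return matches[-1] if matches else None

template = "Response to evaluate:\n{response}\n\nFirst, explain your reasoning."
print(build_prompt(template, "Index funds track a market index."))

sample_output = (
    "The response pairs a ticker with directional language, so it cannot PASS.\n"
    "Verdict: FAIL"
)
print(parse_verdict(sample_output))  # FAIL
```

Returning `None` for unparseable output, rather than guessing, lets the caller decide how to handle it, for example by routing the case to a human.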

    Refinement loop:

    Review disagreements → Extract implicit rule → Update prompt → Regression test → Deploy → Repeat
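
One reading of the regression-test step is: before deploying an updated prompt, confirm it has not broken any case the previous prompt already judged correctly. The sketch below assumes verdicts are stored as dictionaries keyed by case ID; the IDs and verdicts are invented for illustration.

```python
def regression_failures(ground_truth, old_verdicts, new_verdicts):
    """Case IDs the old prompt judged correctly but the new prompt gets wrong."""
    return [
        case_id
        for case_id, truth in ground_truth.items()
        if old_verdicts[case_id] == truth and new_verdicts[case_id] != truth
    ]

ground_truth = {"c1": "FAIL", "c2": "PASS", "c3": "FAIL"}
old_verdicts = {"c1": "FAIL", "c2": "PASS", "c3": "PASS"}  # old prompt missed c3
new_verdicts = {"c1": "FAIL", "c2": "FAIL", "c3": "FAIL"}  # new prompt fixes c3, breaks c2

print(regression_failures(ground_truth, old_verdicts, new_verdicts))  # ['c2']
```

A non-empty list does not automatically block the deploy, but each entry is worth a look: a new rule that fixes one pattern while over-flagging another shows up here first.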

    Strategy

    Selecting Cases for SME Review

    You need SME eyes on actual failures, but SMEs are busy. Random sampling wastes their time on obvious cases. Strategic sampling finds signal fast.

    01

    Disagreement Sampling

    Highest yield

    Run two different judge prompts (or two different models) on the same cases. Where they disagree, something interesting is happening.

    02

    Risk-Based Sampling

    Highest stakes

    Regulated intents, irreversible actions, user PII: anything where a false negative has serious consequences.

    03

    Novelty Sampling

    Catches drift

    Cluster recent production data by embedding similarity. Sample from clusters far from your training distribution.

    04

    Production Incident Sampling

    Confirmed failures

    User complaints, escalations, QA flags, and support tickets that mention the AI. These are confirmed failures that bypassed current judges.

    05

    Boundary Sampling

    High impact

    Pull cases near your pass/fail threshold. These are the genuinely hard cases where small prompt changes have outsized impact.
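
Of the five strategies, disagreement sampling is the simplest to automate. A minimal sketch, assuming each judge's verdicts are stored as a dictionary keyed by case ID (the IDs and verdicts here are made up):

```python
def disagreement_cases(verdicts_a, verdicts_b, limit=10):
    """Case IDs where two judges disagree, in the order they appear in verdicts_a."""
    disagreements = [
        case_id
        for case_id, verdict in verdicts_a.items()
        if case_id in verdicts_b and verdicts_b[case_id] != verdict
    ]
    return disagreements[:limit]

judge_a = {"c1": "PASS", "c2": "FAIL", "c3": "PASS"}
judge_b = {"c1": "PASS", "c2": "PASS", "c3": "FAIL"}
print(disagreement_cases(judge_a, judge_b))  # ['c2', 'c3']
```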

    Note on confidence scores: We tried using judge confidence as a standalone routing signal. It didn't work well; confidence didn't reliably predict true difficulty. What did work: using confidence as one weak feature combined with other signals (disagreement across judges, regulated intents, novelty clusters).

    Weekly Review Mix

    Combine strategies: 10 disagreement cases, 10 risk-based cases, 10 novelty/incident cases. Thirty cases total, all worth expert attention.
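
Assembling the weekly batch might look like the sketch below. It assumes each strategy has already produced a ranked list of case IDs, and it deduplicates across buckets so the SME never sees the same case twice. The function name and bucket labels are invented for the example.

```python
def weekly_review_mix(disagreement, risk_based, novelty_or_incident, per_bucket=10):
    """Take up to per_bucket unseen cases from each pool, tagging each with its source."""
    pools = (
        ("disagreement", disagreement),
        ("risk", risk_based),
        ("novelty_or_incident", novelty_or_incident),
    )
    seen, batch = set(), []
    for source, pool in pools:
        taken = 0
        for case_id in pool:
            if taken == per_bucket:
                break
            if case_id in seen:
                continue  # already queued from an earlier bucket
            seen.add(case_id)
            batch.append((case_id, source))
            taken += 1
    return batch

disagreement = [f"d{i}" for i in range(12)]
risk_based = ["d0"] + [f"r{i}" for i in range(12)]  # d0 overlaps with the first pool
novelty = [f"n{i}" for i in range(5)]               # short pool: take what exists
batch = weekly_review_mix(disagreement, risk_based, novelty)
print(len(batch))  # 25
```

Tagging each case with its source makes it possible to track later which sampling strategy actually yields new rules.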

    Governance

    Operating Model: Roles and Governance

    Teams get excited when they hit 80% agreement with human labelers. But that last bit of disagreement matters. Simple algorithms capture the first 80%. It's the remaining 20% that requires great SMEs and processes.

    Lead Domain Expert

    One person with final decision authority on quality criteria. No committees, no escalation chains for routine disagreements. This person owns the rubric and approves all changes.

    Escalation Path

    When the lead SME is genuinely unsure, or the issue crosses into legal/compliance/product territory, escalate. Define this path in advance.

    Decision Policy for Ambiguity

    In regulated domains, default to FAIL when uncertain. Better to over-flag and have humans review than to miss a violation.

    Change Control for Rubric Updates

    • All criteria changes logged in the criteria library with date and rationale
    • Prompt versions tracked (v1.0, v1.1, v1.2...)
    • Metrics recorded before and after each change
    • Regression test results documented

    Auditability: For regulated industries, store everything: the rationale for each rule, the dataset version used for testing, the prompt version deployed, and the metrics at each stage. When auditors ask, "Why did the system flag this?", you can trace back to the criteria, the examples that informed it, and the SME who approved it.
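
One lightweight way to keep that trail is an append-only log with one structured record per rubric change. The field names below are assumptions chosen to mirror the change-control list, and every value in the example record is a placeholder.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RubricChange:
    prompt_version: str      # e.g. "v1.1"
    changed_on: str          # ISO date of the change
    rule: str                # the criterion that was added or modified
    rationale: str           # why the SME wanted it
    approved_by: str         # lead domain expert who signed off
    dataset_version: str     # ground truth version used for testing
    metrics_before: dict
    metrics_after: dict
    regression_passed: bool

def to_log_line(change):
    """Serialize one change as a single JSON line for an append-only log."""
    return json.dumps(asdict(change), sort_keys=True)

# Placeholder record; none of these values come from a real deployment.
change = RubricChange(
    prompt_version="v1.1",
    changed_on="2026-01-01",
    rule="<rule text>",
    rationale="<why the rule was added>",
    approved_by="<lead SME>",
    dataset_version="<dataset id>",
    metrics_before={"tpr": None, "tnr": None},
    metrics_after={"tpr": None, "tnr": None},
    regression_passed=True,
)
print(to_log_line(change))
```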

    Insights

    What SMEs Catch That Judges Miss

    SME refinement surfaces two categories: implicit quality criteria that humans apply unconsciously, and systematic LLM judge biases.

    Implicit Criteria SMEs Apply

    | Pattern | What SMEs Catch | The Implicit Rule |
    |---|---|---|
    | Confidence calibration | "The answer is X" vs. "probably X" | Match confidence to evidence |
    | Completeness | Correct but incomplete answers | Answer the real need, not the literal question |
    | Tone mismatch | Right content, wrong delivery | Tone is part of correctness |
    | Negative results | "I don't know" mislabeled as failure | Honest uncertainty is correct behavior |

    Known LLM Judge Biases

    | Bias | What Happens | Mitigation |
    |---|---|---|
    | Position | Favors first option in A/B comparisons | Randomize order, average across orderings |
    | Length | Longer responses score higher regardless of quality | Add few-shot examples rewarding concise answers |
    | Self-preference | Models rate own outputs higher | Use different model as judge, or human baseline |
    | Format | Prefers lists/structure even when prose is better | Include prose examples that pass |

    When your SME consistently disagrees with your judge in a pattern, you've found a bias. Document it, add counter-examples, and track whether refinement reduces it.
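
The position-bias mitigation ("average across orderings") is easy to get subtly wrong, so here is a small sketch. It assumes a pairwise judge exposed as a function `judge(first, second)` that returns a score in [0, 1] for how strongly it prefers the response shown first; that interface and the stub judge are assumptions for this example.

```python
def debiased_preference(judge, response_a, response_b):
    """Preference for A over B in [0, 1], averaged over both presentation orders."""
    a_shown_first = judge(response_a, response_b)
    # With B shown first the judge scores B, so flip it to a score for A.
    a_shown_second = 1.0 - judge(response_b, response_a)
    return (a_shown_first + a_shown_second) / 2

def first_position_fan(first, second):
    """Stub judge that ignores content and always leans toward the first slot."""
    return 0.7

print(debiased_preference(first_position_fan, "answer A", "answer B"))  # 0.5
```

A judge that only responds to position averages out to 0.5, meaning no preference. Any remaining distance from 0.5 reflects something other than slot order.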

    Gotchas

    Gotchas with LLM Judges

    SME refinement improves judges, but some failure modes are structural. Know where judges reliably fail so you don't waste cycles trying to prompt-engineer around fundamental limitations.

    Discovery

    How to Look for Unknown Unknowns

    Traditional refinement loops assume you can find the failures. When your judge marks something as "correct" but it's actually broken in ways your rubric doesn't capture, you have a silent failure. At scale, these accumulate.

    01

    Disagreement mining

    Instead of relying on a single judge, run 2–3 judges with different prompting strategies on the same cases. High-disagreement cases are gold mines. If your judges can't agree, something interesting is happening that your current rubric doesn't capture. These cases deserve SME attention first.

    02

    Clustering and outlier detection

    Embed your spans or judge reasoning chains, cluster them, then manually inspect three areas: small clusters (rare failure modes hide here), items far from cluster centroids (anomalies your framework hasn't encountered), and neighbors of known failures (examine nearby items that weren't flagged).

    03

    Proxy signal triangulation

    Stack multiple weak signals that might indicate failures your judge missed:

    • Follow-up questions that rephrase the original query (user didn't get what they needed)
    • Session abandonment after a "correct" response
    • User explicitly correcting the system in a later turn

    Individually, these signals are noisy. The intersection of 2–3 signals pointing at the same case is often a real failure worth investigating.

    04

    Stratified random sampling

    Budget 5–10% of your evaluation effort for "discovery mode" where humans look at random slices of data, not filtered views. Filtered views only show you what you already know to look for.
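
Proxy signal triangulation (method 03) reduces to counting how many weak signals fire on a session that the judge passed. In this sketch the session fields are hypothetical booleans that would come from your own product analytics.

```python
SIGNALS = ("rephrased_follow_up", "abandoned_after_answer", "user_corrected_bot")

def triangulate(sessions, min_signals=2):
    """Sessions the judge passed where at least min_signals proxy signals fired."""
    flagged = []
    for session in sessions:
        if session.get("judge_verdict") != "PASS":
            continue  # already caught by the judge, so not a silent failure
        hits = [name for name in SIGNALS if session.get(name)]
        if len(hits) >= min_signals:
            flagged.append((session["id"], hits))
    return flagged

sessions = [
    {"id": "s1", "judge_verdict": "PASS",
     "rephrased_follow_up": True, "abandoned_after_answer": True},
    {"id": "s2", "judge_verdict": "PASS", "user_corrected_bot": True},
    {"id": "s3", "judge_verdict": "FAIL",
     "rephrased_follow_up": True, "user_corrected_bot": True},
]
print(triangulate(sessions))
# [('s1', ['rephrased_follow_up', 'abandoned_after_answer'])]
```

Session s2 has only one signal and s3 was already failed by the judge, so only s1 is queued for SME review.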

    [Figure: Discovery Methods. Four methods surface unknown unknowns, which go to SME review and yield new criteria: multi-judge disagreement (Judge A ≠ Judge B); clustering and outliers (small clusters, far from centroid); proxy signal triangulation (abandonment + rephrase + correction); random sampling (5–10% budget, unfiltered views).]

    The Uncomfortable Reality

    Finding unknown unknowns requires accepting that some percentage of your evaluation budget goes toward looking at things that might be fine. That's not waste. That's the cost of a robust evaluation system.

    Takeaway

    The Takeaway

    LLM judges plateau because they lack domain expertise. Through systematic SME refinement, you can push accuracy toward 95%. Every hour of SME feedback should improve thousands of future automated evaluations.

    But prompt-based refinement eventually hits a ceiling. Chapter 4 introduces fine-tuning small language models as specialized judges for when you've exhausted what prompting can achieve.

    Checklist

    Chapter Checklist

    Before moving to Chapter 4:

    Lead domain expert designated with final decision authority

    Ground truth dataset: 100+ labeled examples with documented reasoning

    Data split into Train (~20%), Dev (~40%), Test (~40%)

    Inter-rater Kappa > 0.7 (if multiple annotators)

    Judge TPR and TNR both > 90% against Test set

    Criteria library documenting implicit rules discovered

    Regression test process in place before each deploy

    Escalation path defined for ambiguous cases
