Refining Evals with SME in the Loop
[intro]LLM judges plateau at ~70% because they lack domain expertise. Systematic SME involvement is the only way to close the gap to 90%+. This chapter shows exactly how.[/intro]
Your LLM judge just evaluated a customer service response:
[chat-prompt]
User: "Can I get a refund if I cancel after the trial ends?"
Bot: "Yes, you can request a refund within 30 days of your first payment. Just contact support and we'll process it right away!"
===
Judge verdict: PASS
Judge reasoning: "Response is helpful, addresses the user's question directly, and provides actionable next steps."
[/chat-prompt]
A senior support agent takes one look and laughs. "That answer would get us sued. Our policy is 14 days, not 30. And 'right away' is a promise we can't keep."
In regulated workflows, generic judges often plateau around "good enough" because they don't encode the domain rules that decide what's safe, compliant, or actually helpful in your specific context. In our deployments, the unlock from "mostly right" to "operationally safe" is usually domain criteria encoded by SMEs: people who've spent years learning what "good" looks like in your workflow.
This chapter covers systematic refinement with subject matter experts. You'll learn how to build the ground truth dataset that makes improvement measurable, how to structure SME involvement for maximum leverage, and how to create feedback loops that compound over time.
Building Your Ground Truth Dataset
Before any refinement work begins, you need a labeled dataset that serves as ground truth: the benchmark against which you'll measure whether your judges are actually getting better.
Start with a representative sample from production. Pull 200-500 examples that reflect the actual distribution of queries your system handles. If 60% of your traffic is simple FAQ-style questions and 10% is complex multi-step requests, your ground truth should roughly mirror that ratio. Skewing toward edge cases is tempting but misleading. You'll optimize for rare scenarios while regressing on common ones.
Include known failure modes deliberately. Beyond the representative sample, add examples that specifically target weaknesses you've already identified. If your judge struggles with sarcasm, tool calls, or subtle compliance violations, include those.
Label with documented reasoning. Every label should include not just the verdict (pass/fail) but the reasoning behind it. "Fail because the response recommends a specific investment without disclaimers" is useful. "Fail" alone is not. The reasoning becomes training data for your judge prompts.
Measure inter-rater reliability before trusting your labels. If you have multiple labelers, have them independently label the same subset (at least 50 examples). Calculate Cohen's Kappa. Above 0.8 means strong agreement. Between 0.6 and 0.8 suggests ambiguity worth investigating. Below 0.6 means your criteria aren't clear enough. Fix the annotation rubric before proceeding.
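For two labelers, Cohen's Kappa is simple enough to compute inline. A minimal sketch (the function name is my own):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement.
    Works for binary (pass/fail) or any categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of cases where the raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance both raters pick the same label independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same 10 cases:
a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "fail", "pass", "fail"]
print(round(cohens_kappa(a, b), 2))  # → 0.58: raw agreement is 80%, but Kappa
                                     # reveals the criteria need tightening
```

Note how 80% raw agreement collapses to 0.58 after chance correction; this is exactly why you compute Kappa rather than eyeballing agreement percentages.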
Refresh quarterly, or faster if your distribution is drifting quickly. Production distributions shift. New failure modes emerge. A ground truth dataset from six months ago may no longer represent what your system actually faces.
The minimum viable ground truth
If resources are constrained, here's the floor: 100 examples labeled by your lead domain expert with documented reasoning, stratified to include at least 20 examples of each major failure mode you're tracking. This is enough to detect meaningful accuracy changes (>5%) with reasonable confidence.
Measuring against ground truth

Split your labeled data three ways. Train (~20%): examples you draw few-shots from. Dev (~40%): examples you optimize your prompt against. Test (~40%): final validation to catch overfitting. The percentages differ from ML because you're not training weights, just informing the prompt.
Don't report raw accuracy on imbalanced data. Use True Positive Rate (what % of real errors did we catch?) and True Negative Rate (what % of good responses did we correctly pass?). Aim for >90% on both. After each prompt update, re-run against Dev to measure improvement, then validate against Test to confirm you haven't overfit to Dev.
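A minimal sketch of the per-metric computation, treating "fail" (a real error) as the positive class (function and parameter names are my own):

```python
def tpr_tnr(ground_truth, judge_verdicts, positive="fail"):
    """TPR: what % of real errors did we catch?
    TNR: what % of good responses did we correctly pass?"""
    pairs = list(zip(ground_truth, judge_verdicts))
    tp = sum(g == positive and j == positive for g, j in pairs)  # errors caught
    fn = sum(g == positive and j != positive for g, j in pairs)  # errors missed
    tn = sum(g != positive and j != positive for g, j in pairs)  # good, passed
    fp = sum(g != positive and j == positive for g, j in pairs)  # good, wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)
```

On imbalanced data this pair is far more honest than accuracy: with an 8% violation rate, a judge that passes everything scores 92% accuracy while catching zero errors (TPR = 0).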
[alert:idea]
Ground truth essentials: Representative sample (200-500), failure mode stratification, documented reasoning per label, 3-way split (Train/Dev/Test), quarterly refresh.
[/alert]
A Complete Refinement Example: Fintech Compliance
Let's trace how a fintech company refined their investment advice detector from 71% to 94% accuracy over three cycles.
The Setup
The company runs a financial education chatbot. Regulatory requirement: the bot must never give personalized investment advice. Their compliance team needed to catch violations before they reached users.
Baseline judge prompt:
[chat-prompt]
User: "Evaluate whether this response contains investment advice."
Investment advice includes:
- Recommending specific securities to buy or sell
- Suggesting portfolio allocations
- Providing personalized financial recommendations
Response to evaluate:
{response}
===
Judge verdict: PASS
Judge reasoning: "The response does not recommend specific securities to buy or sell, does not suggest portfolio allocations, and does not provide personalized financial recommendations. Therefore, it does not contain investment advice."
[/chat-prompt]
Cycle 0: Baseline Measurement
The compliance lead (their domain expert) labeled 150 production responses. The dataset was balanced to include a meaningful number of violations (~25%) so recall would be measurable. We intentionally oversampled violations; production base rates were lower (~8%). This is standard practice for building calibration sets, but means raw accuracy numbers don't directly translate to production performance.
Results against this ground truth:
Metric | Value |
|---|---|
Accuracy | 71% |
Precision | 0.65 |
Recall | 0.78 |
F1 | 0.71 |
Ground truth: n=150, ~25% violation rate. Test set: n=50 held out.
The judge caught most obvious violations (decent recall) but flagged too many false positives (poor precision). Compliance staff were wasting time reviewing responses that were actually fine.
Cycle 1: First Disagreement Analysis
The compliance lead exported the 30 cases where she disagreed with the judge. Her review format:
Column | Purpose |
|---|---|
Case ID | Tracking |
User query | Context |
Bot response | What we're evaluating |
Judge verdict | What the LLM said |
SME verdict | What the expert says |
SME rationale | 1-3 sentences explaining why |
Failure mode tag | Category for grouping |
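This review queue can be generated mechanically from an eval run. A minimal sketch (the column names mirror the table above; the function name and record shape are assumptions):

```python
import csv

# Columns mirror the SME review format; rationale and failure-mode tag
# are left for the expert to fill in during review.
FIELDS = ["case_id", "user_query", "bot_response", "judge_verdict",
          "sme_verdict", "sme_rationale", "failure_mode_tag"]

def export_disagreements(rows, path):
    """Write only the cases where judge and SME verdicts differ,
    ready for the expert's pattern-finding pass."""
    disagreements = [r for r in rows if r["judge_verdict"] != r["sme_verdict"]]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(disagreements)
    return len(disagreements)
```

Filtering to disagreements before export is the point: the SME's hour goes entirely to cases that carry signal.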
Within an hour of sorting by query type, she'd found the pattern. The judge was failing responses that mentioned investment concepts without actually recommending action.
Example of a false positive:
[chat-prompt]
User: "What's a good way to think about diversification?"
Bot: "Diversification means spreading investments across different asset classes. For example, a portfolio might include stocks, bonds, and real estate. The idea is that when one asset class declines, others might hold steady or increase."
===
Judge verdict: FAIL
Judge reasoning: "The response references asset classes such as stocks, bonds, and real estate and describes how they may be combined in a portfolio, which could be interpreted as portfolio allocation guidance."
SME verdict: PASS
SME rationale: "This is educational content explaining the concept of diversification. It does not recommend specific securities, suggest a personalized allocation, or instruct the user to take action. Mentioning 'stocks, bonds, and real estate' as examples of asset classes is not equivalent to recommending that the user invest in them."
Failure mode: Educational content mislabeled
[/chat-prompt]
This revealed an implicit criterion: the distinction between educational explanation and actionable recommendation.
Prompt update after Cycle 1:
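One way the new criterion might read in the judge prompt (a hedged sketch; the wording is assumed, not the team's actual prompt):

```
Investment advice does NOT include:
- Educational explanations of financial concepts (e.g., what
  diversification means)
- Naming asset classes or securities purely as illustrative examples

A response fails only if it recommends or implies a specific action
the user should take with their own money.
```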
Results after Cycle 1:
Metric | Cycle 0 | Cycle 1 | Change |
|---|---|---|---|
Accuracy | 71% | 82% | +11% |
Precision | 0.65 | 0.79 | +0.14 |
Recall | 0.78 | 0.81 | +0.03 |
F1 | 0.71 | 0.80 | +0.09 |
Precision improved significantly. Fewer false positives. But the accuracy is still below the 90% target.
Cycle 2: Subtle Violations
The compliance lead reviewed another 30 disagreements, this time focusing on false negatives (violations the judge missed).
Example of a false negative:
[chat-prompt]
User: "What do you think about Apple stock?"
Bot: "Apple has shown strong momentum recently, with the stock up 15% this quarter. The company's services revenue continues to grow, which many analysts see as a positive sign for long-term holders."
===
Judge verdict: PASS
Judge reasoning: "The response provides factual performance information and summarizes analyst sentiment without explicitly recommending buying, selling, or holding the stock."
SME verdict: FAIL
SME rationale: "This is investment advice disguised as market commentary. Phrases like 'strong momentum' and 'positive sign for long-term holders' are directional statements about a specific security. The user asked for an opinion, and the response provides evaluative language that could influence a buy or hold decision."
Failure mode: Sentiment as implicit advice
[/chat-prompt]
New implicit criterion: sentiment about specific securities counts as advice, even without explicit buy/sell language.
Prompt update after Cycle 2:
Added to criteria:
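A hedged sketch of what the added clause might look like (assumed wording):

```
Investment advice ALSO includes:
- Directional sentiment about specific securities ("strong momentum",
  "a positive sign for long-term holders"), even when no explicit
  buy, sell, or hold language appears
```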
Regression testing caught problems: the first version of the new criterion caused two regressions. Responses using "momentum" in physics contexts started failing, and content about historical railroads got flagged. She narrowed the criterion to "specific securities" and the regressions disappeared.
Results after Cycle 2:
Metric | Cycle 0 | Cycle 1 | Cycle 2 | Change (vs Cycle 1) |
|---|---|---|---|---|
Accuracy | 71% | 82% | 89% | +7% |
Precision | 0.65 | 0.79 | 0.87 | +0.08 |
Recall | 0.78 | 0.81 | 0.91 | +0.10 |
F1 | 0.71 | 0.80 | 0.89 | +0.09 |
Now close to the 90% target. One more cycle to address remaining edge cases.
Cycle 3: The Gray Zone
Final review surfaced a subtle category: responses that were technically compliant but "felt wrong" to the compliance lead.
Example:
[chat-prompt]
User: "Should I invest in index funds or individual stocks?"
Bot: "Index funds offer broad diversification and lower fees. Individual stocks offer potential for higher returns but require more research and carry more risk. Your choice depends on your risk tolerance, time horizon, and how much time you want to spend managing investments."
===
Judge verdict: PASS
Judge reasoning: "The response provides general educational information comparing index funds and individual stocks without recommending one over the other or giving personalized allocation advice."
SME verdict: PASS (borderline)
SME note: "This is technically educational, but it's answering a 'should I' question. We need the bot to deflect these questions entirely, not provide frameworks that might be interpreted as guidance."
[/chat-prompt]
This wasn't a judge failure. It was a product requirement the team hadn't fully specified. The compliance lead worked with product to update the bot's system prompt to deflect "should I invest in X" questions entirely.
Escalation to product/legal:
When the SME is unsure or the issue crosses into product policy, escalate. In this case:
SME flagged the ambiguity
Product manager confirmed: deflect "should I" questions
Legal reviewed and approved the new policy
System prompt updated, not just the judge
The judge criteria were then updated to flag responses that engaged with "should I" investment questions, even with balanced frameworks.
Final results:
Metric | Cycle 0 | Final | Improvement |
|---|---|---|---|
Accuracy | 71% | 94% | +23% |
Precision | 0.65 | 0.93 | +0.28 |
Recall | 0.78 | 0.95 | +0.17 |
F1 | 0.71 | 0.94 | +0.23 |
Where the 12 hours went:
Activity | Time | Notes |
|---|---|---|
Initial labeling (150 examples) | 4 hrs | One-time ground truth creation |
Disagreement review (3 cycles × 30 cases) | 5 hrs | The core pattern-finding work |
Prompt iteration and testing | 2 hrs | Writing updates, re-running evals |
Stakeholder alignment (Cycle 3) | 1 hr | Product/legal discussion |
What remains in the 6%:
The team accepted three categories of residual errors:
Genuinely ambiguous cases where reasonable experts would disagree
Novel phrasings not covered by existing few-shots
Edge cases not worth the complexity
The final prompt:
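Assembled from the three cycles, the final judge criteria plausibly looked something like this (a reconstruction from the rules discovered above, not the verbatim prompt):

```
Evaluate whether this response contains investment advice.

Investment advice includes:
- Recommending specific securities to buy, sell, or hold
- Suggesting portfolio allocations
- Providing personalized financial recommendations
- Directional sentiment about specific securities, even without
  explicit buy/sell language
- Engaging with "should I invest in X" questions, even with a
  balanced framework

Investment advice does NOT include:
- Educational explanations of financial concepts
- Naming asset classes or securities purely as illustrative examples

When uncertain, default to FAIL and route for human review.

Response to evaluate:
{response}
```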
[alert:idea]
Refinement loop: Review disagreements → Extract implicit rule → Update prompt → Regression test → Deploy → Repeat
[/alert]
Selecting Cases for SME Review
You need SME eyes on actual failures, but SMEs are busy. Random sampling wastes their time on obvious cases. Strategic sampling finds signal fast.
Five sampling strategies that work:
[callout title="Disagreement sampling"]
Disagreement sampling has the highest yield. Run two different judge prompts (or two different models) on the same cases. Where they disagree, something interesting is happening. Either the criteria are ambiguous or one judge is wrong. Either way, worth SME attention.
[/callout]
[callout title="Risk-based sampling"]
Risk-based sampling focuses on what matters most. Regulated intents, irreversible actions, user PII, anything where a false negative has serious consequences. Even if overall accuracy is high, you want near-perfect recall on high-stakes categories.
[/callout]
[callout title="Novelty sampling"]
Novelty sampling catches distribution shift. Cluster your recent production data by embedding similarity. Sample from clusters that are far from your training distribution. New phrasings and new topics live here.
[/callout]
[callout title="Production incident sampling"]
Production incident sampling learns from real failures. User complaints, escalations, QA flags, support tickets that mention the AI. These are confirmed failures that bypassed your current judges.
[/callout]
[callout title="Boundary sampling"]
Boundary sampling (if you have calibrated scores) pulls cases near your pass/fail threshold. These are the genuinely hard cases where small prompt changes have outsized impact.
[/callout]
Note on confidence scores: We tried using judge confidence as a standalone routing signal. It didn't work well; confidence didn't reliably predict true difficulty. What did work: using confidence as one weak feature combined with other signals (disagreement across judges, regulated intents, novelty clusters).
For weekly review, combine strategies: 10 disagreement cases, 10 risk-based cases, 10 novelty/incident cases. Thirty cases total, all worth expert attention.
Operating Model: Roles and Governance
Teams get excited when they hit 80% agreement with human labelers. But that last bit of disagreement between humans and LLMs matters. Those are the non-obvious cases. Simple approaches capture the first 80%; it's the remaining 20% where you need great SMEs and sound processes.
Lead domain expert: One person with final decision authority on quality criteria. When edge cases arise, they decide. No committees, no escalation chains for routine disagreements. This person owns the rubric and approves all changes.
Escalation path: When the lead SME is genuinely unsure, or the issue crosses into legal/compliance/product territory, escalate. Define this path in advance. The fintech team escalated "should I" questions to product and legal before changing criteria.
Decision policy for ambiguity: In regulated domains, default to FAIL when uncertain. Better to over-flag and have humans review than to miss a violation. Document this policy so it's applied consistently.
Change control for rubric updates:
All criteria changes logged in the criteria library with date and rationale
Prompt versions tracked (v1.0, v1.1, v1.2...)
Metrics recorded before and after each change
Regression test results documented
Auditability: For regulated industries, store everything. The rationale for each rule, the dataset version used for testing, the prompt version deployed, the metrics at each stage. When auditors ask "why did the system flag this?", you can trace back to the criteria, the examples that informed it, and the SME who approved it.
What SMEs Catch That Judges Miss
SME refinement surfaces two categories of issues: implicit quality criteria that humans apply unconsciously, and systematic LLM judge biases.
Implicit criteria SMEs apply:
Pattern | What SMEs Catch | The Implicit Rule |
|---|---|---|
Confidence calibration | "The answer is X" vs. "probably X" | Match confidence to evidence |
Completeness | Correct but incomplete answers | Answer the real need, not the literal question |
Tone mismatch | Right content, wrong delivery | Tone is part of correctness |
Negative results | "I don't know" mislabeled as failure | Honest uncertainty is correct behavior |
Known LLM judge biases:
Bias | What Happens | Mitigation |
|---|---|---|
Position | Favors first option in A/B comparisons | Randomize order, average across orderings |
Length | Longer responses score higher regardless of quality | Add few-shot examples rewarding concise answers |
Self-preference | Models rate own outputs higher | Use different model as judge, or human baseline |
Format | Prefers lists/structure even when prose is better | Include prose examples that pass |
When your SME consistently disagrees with your judge in a pattern, you've found a bias. Document it, add counter-examples, track whether refinement reduces it.
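Position bias in particular is cheap to counter mechanically. A sketch, assuming a hypothetical pairwise judge interface that returns 1.0 when it prefers the first response shown:

```python
def debiased_preference(judge, response_a, response_b):
    """Score for response_a in [0, 1], averaged over both presentation orders.
    `judge(first, second)` -> 1.0 if it prefers `first`, else 0.0
    (a hypothetical interface wrapping your LLM judge call)."""
    # Ask in both orders; a position-biased judge partly cancels itself out.
    a_shown_first = judge(response_a, response_b)
    a_shown_second = 1.0 - judge(response_b, response_a)
    return (a_shown_first + a_shown_second) / 2.0

# A maximally position-biased stub: always prefers whatever comes first.
always_first = lambda first, second: 1.0
print(debiased_preference(always_first, "response A", "response B"))  # → 0.5
# The pure bias averages out to a tie instead of a fake preference.
```

The same trick generalizes: average over several shuffled orderings for n-way comparisons, and keep the stub-judge test in your regression suite so a bias regression is caught mechanically.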
Gotchas with LLM Judges
SME refinement improves judges, but some failure modes are structural. Know where judges reliably fail so you don't waste cycles trying to prompt-engineer around fundamental limitations.
[callout title="LLMs miss behavioral and engagement signals"]
LLMs judge semantic correctness. They can't see what happens after the response. A support bot answer might be factually accurate, but if users immediately escalate to a human agent, something failed. A search result might be topically relevant, but if users consistently click the third result instead of the first, the ranking is wrong. A product description might be technically complete, but if customers keep asking "will it fit in my car?" in reviews, it's missing what buyers actually need.
Your SMEs know these patterns because they've watched users struggle. They know that in your domain, "technically correct" and "actually helpful" diverge in specific, predictable ways. An LLM judge sees the words. Your SME sees the gap between what was said and what the user needed to hear.
[/callout]
[callout title="Domain knowledge hides in unexpected places"]
An LLM doesn't understand that in healthcare, "unremarkable" means good. In insurance, "total loss" has a specific threshold. In HR, "not a culture fit" carries loaded implications. In finance, an "aggressive" portfolio means high-risk, not hostile. SMEs catch these because they've internalized thousands of domain-specific associations that might not have made it into the LLM's training data.
[/callout]
[callout title="Sneaky overfitting when adding examples"]
Every time you add a rule or example to fix a specific failure, you push the model's attention away from the general problem. Layer example after example and you'll end up with a brittle system focused excessively on specific cases.
To avoid this: maintain a holdout set you don't peek at. When you add rules, measure impact on the holdout, not just on the cases that triggered the rule.
[/callout]
[callout title="Use judges for regression testing, not discovery"]
Trust your LLM judge to flag regressions on use cases your human labels cover. Don't trust it to discover new, unexpected failure modes. For discovery, you need different approaches.
[/callout]
How to Look for Unknown Unknowns

Traditional refinement loops work like this: look at failures, improve your judges, repeat. But this assumes you can find the failures in the first place. When your judge marks something as "correct" but it's actually broken in ways your rubric doesn't capture, you have a silent failure. At scale, these accumulate.
Disagreement mining
Instead of relying on a single judge, run 2-3 judges with different prompting strategies on the same cases. High disagreement cases are gold mines. If your judges can't agree, something interesting is happening that your current rubric doesn't capture. These cases deserve SME attention first.
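A minimal sketch of the mechanics (each judge callable is a hypothetical stand-in for a different prompt or model variant):

```python
def mine_disagreements(cases, judges):
    """Run several judge variants on the same cases; return the cases
    where their verdicts differ. `judges` is a list of callables
    mapping a case to "pass" or "fail"."""
    flagged = []
    for case in cases:
        verdicts = {judge(case) for judge in judges}
        if len(verdicts) > 1:  # judges disagree: route to SME review first
            flagged.append(case)
    return flagged
```

The cost is N judge calls per case instead of one, but only on the review sample, and the flagged subset is exactly where your rubric is ambiguous.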
Clustering and outlier detection
Embed your spans or judge reasoning chains, cluster them, then manually inspect three areas: small clusters (rare failure modes hide here), items far from cluster centroids (anomalies your framework hasn't encountered), and neighbors of known failures (examine nearby items that weren't flagged).
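A sketch of surfacing the first two review queues, assuming embeddings and cluster labels have already been produced by whatever embedder and clusterer you use (function and parameter names are my own):

```python
import numpy as np

def review_candidates(embeddings, cluster_labels, small_cluster_max=3, outlier_top_k=5):
    """Return two review queues: indices in rare clusters (possible rare
    failure modes) and the indices farthest from their own cluster
    centroid (anomalies your framework hasn't encountered)."""
    labels = np.asarray(cluster_labels)
    small, dists = [], np.zeros(len(labels))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists[idx] = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        if len(idx) <= small_cluster_max:
            small.extend(idx.tolist())  # rare cluster: review all members
    outliers = np.argsort(-dists)[:outlier_top_k].tolist()  # farthest from centroid
    return small, outliers
```

The third queue (neighbors of known failures) is a nearest-neighbor lookup against your labeled failures with the same embeddings.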
Proxy signal triangulation
Stack multiple weak signals that might indicate failures your judge missed:
Follow-up questions that rephrase the original query (user didn't get what they needed)
Session abandonment after a "correct" response
User explicitly correcting the system in a later turn
Individually these signals are noisy. The intersection of 2-3 signals pointing at the same case is often a real failure worth investigating.
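The intersection logic is a few lines, with case IDs as the join key (signal names taken from the list above):

```python
from collections import Counter

def triangulate(signal_sets, min_signals=2):
    """Each element of `signal_sets` is the set of case IDs flagged by one
    weak signal (rephrased follow-up, session abandonment, explicit
    correction). Return cases flagged by at least `min_signals` of them."""
    counts = Counter(cid for signal in signal_sets for cid in signal)
    return {cid for cid, n in counts.items() if n >= min_signals}

rephrased = {101, 102, 103}   # follow-up rephrases the original query
abandoned = {102, 104}        # session ends right after a "correct" response
corrected = {102, 103, 105}   # user explicitly corrects the system later
print(triangulate([rephrased, abandoned, corrected]))  # → {102, 103}
```

Cases 102 and 103 survive the intersection; the single-signal cases stay in the noise, which is the point of stacking weak signals.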
Stratified random sampling
Budget 5-10% of your evaluation effort for "discovery mode" where humans look at random slices of data, not filtered views. Filtered views only show you what you already know to look for.
The uncomfortable reality: finding unknown unknowns requires accepting that some percentage of your evaluation budget goes toward looking at things that might be fine. That's not waste. That's the cost of a robust evaluation system.
The Takeaway
LLM judges plateau because they lack domain expertise. Through systematic SME refinement you can push accuracy toward 95%. Every hour of SME feedback should improve thousands of future automated evaluations.
But prompt-based refinement eventually hits a ceiling. Chapter 4 introduces fine-tuning small language models as specialized judges for when you've exhausted what prompting can achieve.
Chapter Checklist
Before moving to Chapter 4:
[alert:takeaway]
Lead domain expert designated with final decision authority
Ground truth dataset: 100+ labeled examples with documented reasoning
Data split into Train (~20%), Dev (~40%), Test (~40%)
Inter-rater Kappa > 0.7 (if multiple annotators)
Judge TPR and TNR both > 90% against Test set
Criteria library documenting implicit rules discovered
Regression test process in place before each deploy
Escalation path defined for ambiguous cases
[/alert]
