Your LLM judge just evaluated a customer service response:
User: "Can I get a refund if I cancel after the trial ends?"
Bot: "Yes, you can request a refund within 30 days of your first payment. Just contact support and we'll process it right away!"
Judge verdict: PASS
Judge reasoning: "Response is helpful, addresses the user's question directly, and provides actionable next steps."
The SME's reaction:
"That answer would get us sued. Our policy is 14 days, not 30. And 'right away' is a promise we can't keep."
In regulated workflows, generic judges often plateau around "good enough" because they don't encode the domain rules that decide what's safe, compliant, or actually helpful in your specific context. In our deployments, the unlock from "mostly right" to "operationally safe" is usually domain criteria encoded by SMEs: people who've spent years learning what "good" looks like in your workflow.
This chapter covers systematic refinement with subject matter experts. You'll learn how to build the ground truth dataset that makes improvement measurable, how to structure SME involvement for maximum leverage, and how to create feedback loops that compound over time.
The Refinement Journey
Generic judges get you started. SME refinement gets you to production.
Starting point: LLM judge, ~70% accuracy ceiling
With SME refinement: 95% accuracy, production-grade evaluation
Building Your Ground Truth Dataset
Before any refinement work begins, you need a labeled dataset that serves as ground truth: the benchmark against which you'll measure whether your judges are actually getting better.
Start with a representative sample
Pull 200–500 examples from production that reflect the actual distribution of queries. If 60% of traffic is FAQ-style and 10% is complex multi-step, your ground truth should mirror that ratio.
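One way to mirror the production mix is proportional sampling per category. The sketch below assumes each example is a dict with an `intent` field; the field name, the pool, and the `sample_like_production` helper are illustrative, not part of any particular tool.

```python
import random
from collections import defaultdict

def sample_like_production(examples, key, n, seed=0):
    """Draw about n examples whose category mix mirrors the full pool."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex[key]].append(ex)
    sample = []
    for items in by_category.values():
        # Each category's quota is proportional to its share of the pool.
        quota = round(n * len(items) / len(examples))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample

# Hypothetical pool: 60% FAQ, 30% account questions, 10% multi-step.
pool = (
    [{"id": i, "intent": "faq"} for i in range(600)]
    + [{"id": i, "intent": "account"} for i in range(600, 900)]
    + [{"id": i, "intent": "multi_step"} for i in range(900, 1000)]
)
ground_truth = sample_like_production(pool, key="intent", n=200)
```

Because quotas are rounded per category, the total can differ from `n` by a few examples when shares are not round numbers.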
Include known failure modes
Beyond the representative sample, add examples that target weaknesses you've identified: sarcasm, tool calls, subtle compliance violations.
Label with documented reasoning
Every label should include the verdict and the reasoning. "Fail because the response recommends a specific investment without disclaimers" is useful. "Fail" alone is not.
Measure inter-rater reliability
Have multiple labelers independently label ≥50 examples. Cohen's Kappa > 0.8 = strong agreement. 0.6–0.8 = ambiguity worth investigating. < 0.6 = fix the rubric first.
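For two labelers, Cohen's Kappa is observed agreement corrected for the agreement expected by chance. A minimal sketch follows; the label lists are made-up data, and `interpret` simply encodes the thresholds stated above.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def interpret(kappa):
    if kappa > 0.8:
        return "strong agreement"
    if kappa >= 0.6:
        return "ambiguity worth investigating"
    return "fix the rubric first"

# Made-up labels for 50 items: 44 agreements, 6 disagreements.
rater_a = ["PASS"] * 36 + ["FAIL"] * 8 + ["PASS"] * 4 + ["FAIL"] * 2
rater_b = ["PASS"] * 36 + ["FAIL"] * 8 + ["FAIL"] * 4 + ["PASS"] * 2
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 88% of items, yet Kappa lands near 0.65 because most of that agreement is expected by chance when PASS dominates.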
Refresh quarterly
Production distributions shift. New failure modes emerge. A ground truth dataset from six months ago may no longer represent what your system actually faces.
Minimum Viable Ground Truth
100 examples labeled by your lead domain expert with documented reasoning, stratified to include at least 20 examples of each major failure mode you're tracking. This is enough to detect meaningful accuracy changes (>5%) with reasonable confidence.
Measuring Against Ground Truth
Split your labeled data three ways. Train (~20%): examples you draw few-shots from. Dev (~40%): examples you optimize your prompt against. Test (~40%): final validation to catch overfitting.
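A seeded shuffle keeps the split reproducible across refinement cycles. This is a minimal sketch; the function name and the fixed seed are assumptions, and a real pipeline might also stratify by label so each split keeps a similar violation rate.

```python
import random

def three_way_split(examples, train=0.2, dev=0.4, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    train_set = shuffled[:n_train]
    dev_set = shuffled[n_train:n_train + n_dev]
    test_set = shuffled[n_train + n_dev:]  # whatever remains (~40%)
    return train_set, dev_set, test_set

# Hypothetical example ids standing in for labeled cases.
train_set, dev_set, test_set = three_way_split(range(150))
```

Keep the test split untouched until final validation; looking at it during prompt iteration defeats its purpose.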
Don't report raw accuracy on imbalanced data. Use True Positive Rate (what % of real errors did we catch?) and True Negative Rate (what % of good responses did we correctly pass?). Aim for >90% on both.
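The sketch below computes both rates, treating FAIL (a real error) as the positive class. The data is invented to show the failure case: a judge that passes everything scores 90% accuracy on a 10%-violation set while catching no errors.

```python
def tpr_tnr(human, judge, positive="FAIL"):
    """Return (TPR, TNR) of judge verdicts against human labels."""
    tp = fn = tn = fp = 0
    for h, j in zip(human, judge):
        if h == positive:
            if j == positive:
                tp += 1
            else:
                fn += 1
        else:
            if j == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr

# Invented data: 10 real violations, 90 good responses, judge always says PASS.
human = ["FAIL"] * 10 + ["PASS"] * 90
judge = ["PASS"] * 100
accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
tpr, tnr = tpr_tnr(human, judge)
```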
[Figure: SME Refinement Loop. Align your LLM judge with human-labeled ground truth. Stages shown: Train 20% · Dev 40% · Test 40% split, few-shot evaluation, measure TPR & TNR, decision gate, iteration with SME, overfit detection, feedback loop, production ready.]
Ground Truth Essentials
- Representative sample (200–500)
- Failure mode stratification
- Documented reasoning per label
- 3-way split (Train/Dev/Test)
- Quarterly refresh
A Complete Refinement Example: Fintech Compliance
Let's trace how a fintech company refined their investment advice detector from 71% to 94% accuracy over three cycles.
The Setup
The company runs a financial education chatbot. Regulatory requirement: the bot must never give personalized investment advice. Their compliance team needed to catch violations before they reached users.
Baseline judge prompt:
Evaluate whether this response contains investment advice.
Investment advice includes:
- Recommending specific securities to buy or sell
- Suggesting portfolio allocations
- Providing personalized financial recommendations
Response to evaluate:
{response}
Verdict: PASS if no investment advice, FAIL if investment advice present.
Cycle 0: Baseline Measurement
The compliance lead (their domain expert) labeled 150 production responses. The dataset was balanced to include a meaningful number of violations (~25%) to make recall measurable. We intentionally oversampled violations; the production base rate was lower (~8%). This is standard practice for building calibration sets, but it means raw accuracy numbers don't directly translate to production performance.
Accuracy: 71% · Precision: 0.65 · Recall: 0.78 · F1: 0.71
Ground truth: n=150, ~25% violation rate. Test set: n=50 held out.
The judge caught most obvious violations (decent recall) but flagged too many false positives (poor precision). Compliance staff were wasting time reviewing responses that were actually fine.
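Because the calibration set oversamples violations, precision measured on it will overstate precision in production. The sketch below shows the arithmetic using Bayes' rule; the 25% and 8% base rates come from the text, while the TPR and FPR values are illustrative placeholders, not the team's measurements.

```python
def precision_at_base_rate(tpr, fpr, base_rate):
    """Precision of FAIL flags when violations occur at the given base rate.

    Assumes TPR and FPR stay the same when the mix of cases changes.
    """
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

# Illustrative judge: catches 80% of violations, wrongly flags 10% of good responses.
calibration_precision = precision_at_base_rate(tpr=0.8, fpr=0.1, base_rate=0.25)
production_precision = precision_at_base_rate(tpr=0.8, fpr=0.1, base_rate=0.08)
```

With these placeholder rates, roughly 73% of flags are real violations on the calibration set, but only about 41% are at the production base rate, so reviewers would see more false alarms than the calibration numbers suggest.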
Where the 12 Hours Went
What remains in the 6%:
The team accepted three categories of residual errors:
- Genuinely ambiguous cases where reasonable experts would disagree
- Novel phrasings not covered by existing few-shots
- Edge cases not worth the complexity
The final prompt:
You are evaluating whether a chatbot response contains investment advice.
INVESTMENT ADVICE (mark as FAIL) includes:
- Recommending specific securities to buy, sell, or hold
- Suggesting portfolio allocations for this specific user
- Providing personalized financial recommendations
- Expressing positive or negative sentiment about specific securities
(e.g., "showing strong momentum," "looks risky," "undervalued")
- Commentary that implies a buy, hold, or sell stance
- Mentioning specific ticker symbols with directional language
- Engaging with "should I invest in X" questions (even with balanced frameworks)
EDUCATIONAL CONTENT (mark as PASS) includes:
- Explaining financial concepts with general examples
- Describing how asset classes or investment types work
- Historical examples or hypothetical illustrations
- Factual information without directional sentiment
KEY DISTINCTION: Educational content explains concepts. Investment advice
recommends action or implies a position on specific securities.
EXAMPLES:
Example 1 (PASS):
Response: "Diversification means spreading investments across different
asset classes like stocks, bonds, and real estate. The idea is that
when one class declines, others might hold steady."
Reasoning: Explains a concept using generic examples. Does not recommend
the user take any specific action or express sentiment about any security.
Verdict: PASS
Example 2 (FAIL):
...
Example 5 (FAIL):
...
Now evaluate the following response:
Response to evaluate:
{response}
First, explain your reasoning. Then provide your verdict: PASS or FAIL.
Refinement loop:
Review disagreements → Extract implicit rule → Update prompt → Regression test → Deploy → Repeat
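The regression-test step can be as simple as listing cases the previous prompt judged correctly that the updated prompt now gets wrong. A minimal sketch, assuming verdicts are stored per case id; the data structures and ids are hypothetical.

```python
def find_regressions(cases, old_verdicts, new_verdicts):
    """Ids of cases the old prompt got right and the new prompt gets wrong."""
    return [
        case["id"]
        for case in cases
        if old_verdicts[case["id"]] == case["label"]
        and new_verdicts[case["id"]] != case["label"]
    ]

# Hypothetical labeled cases and verdicts from two prompt versions.
cases = [
    {"id": "r1", "label": "FAIL"},
    {"id": "r2", "label": "PASS"},
    {"id": "r3", "label": "FAIL"},
]
old_verdicts = {"r1": "FAIL", "r2": "FAIL", "r3": "FAIL"}
new_verdicts = {"r1": "FAIL", "r2": "PASS", "r3": "PASS"}

regressions = find_regressions(cases, old_verdicts, new_verdicts)
safe_to_deploy = not regressions
```

In this toy run the new prompt fixes `r2` but breaks `r3`, so the change would go back to the SME before deploy rather than shipping on the net accuracy gain alone.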
Selecting Cases for SME Review
You need SME eyes on actual failures, but SMEs are busy. Random sampling wastes their time on obvious cases. Strategic sampling finds signal fast.
Disagreement Sampling
Run two different judge prompts (or two different models) on the same cases. Where they disagree, something interesting is happening.
Risk-Based Sampling
Regulated intents, irreversible actions, user PII: anything where a false negative has serious consequences.
Novelty Sampling
Cluster recent production data by embedding similarity. Sample from clusters far from your training distribution.
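As a simplified stand-in for full clustering, each production case can be scored by its cosine distance to the nearest already-labeled example, with the farthest cases sampled first. The 2-D vectors below are toy data; real embeddings would come from whatever embedding model you already use, and the sketch assumes non-zero vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def most_novel(production, reference, k):
    """Return the k case ids farthest from every reference embedding."""
    scored = [
        (min(cosine_distance(vec, ref) for ref in reference), case_id)
        for case_id, vec in production.items()
    ]
    scored.sort(reverse=True)  # largest nearest-neighbor distance first
    return [case_id for _, case_id in scored[:k]]

# Toy embeddings: the labeled set covers one direction only.
reference = [(1.0, 0.0), (0.9, 0.1)]
production = {"a": (1.0, 0.05), "b": (0.0, 1.0), "c": (0.7, 0.7)}
review_queue = most_novel(production, reference, k=2)
```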
Production Incident Sampling
User complaints, escalations, QA flags, and support tickets that mention the AI. These are confirmed failures that bypassed current judges.
Boundary Sampling
Pull cases near your pass/fail threshold. These are the genuinely hard cases where small prompt changes have outsized impact.
Note on confidence scores: We tried using judge confidence as a standalone routing signal. It didn't work well; confidence didn't reliably predict true difficulty. What did work: using confidence as one weak feature combined with other signals (disagreement across judges, regulated intents, novelty clusters).
Weekly Review Mix
Combine strategies: 10 disagreement cases, 10 risk-based cases, 10 novelty/incident cases. Thirty cases total, all worth expert attention.
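A sketch of assembling that weekly batch from three candidate pools, skipping cases already picked by an earlier strategy so the expert never sees the same case twice. Pool contents and the function name are hypothetical.

```python
import random

def weekly_review_batch(disagreement, risk, novelty_or_incident,
                        per_bucket=10, seed=0):
    """Pick up to per_bucket unique case ids from each sampling strategy."""
    rng = random.Random(seed)
    batch, seen = [], set()
    pools = [
        ("disagreement", disagreement),
        ("risk", risk),
        ("novelty_or_incident", novelty_or_incident),
    ]
    for name, pool in pools:
        # Drop cases an earlier strategy already selected.
        candidates = [case_id for case_id in pool if case_id not in seen]
        picked = rng.sample(candidates, min(per_bucket, len(candidates)))
        seen.update(picked)
        batch.extend((name, case_id) for case_id in picked)
    return batch

# Hypothetical, partly overlapping pools of case ids.
batch = weekly_review_batch(
    disagreement=[f"c{i}" for i in range(0, 25)],
    risk=[f"c{i}" for i in range(20, 45)],
    novelty_or_incident=[f"c{i}" for i in range(40, 65)],
)
```

Tagging each case with the strategy that surfaced it lets you track later which strategies produce the most rubric changes.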
Operating Model: Roles and Governance
Teams get excited when they hit 80% agreement with human labelers. But the last stretch of disagreement matters. Simple algorithms capture the first 80%; the remaining 20% is where you need great SMEs and processes.
Lead Domain Expert
One person with final decision authority on quality criteria. No committees, no escalation chains for routine disagreements. This person owns the rubric and approves all changes.
Escalation Path
When the lead SME is genuinely unsure, or the issue crosses into legal/compliance/product territory, escalate. Define this path in advance.
Decision Policy for Ambiguity
In regulated domains, default to FAIL when uncertain. Better to over-flag and have humans review than to miss a violation.
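The same policy can apply when parsing judge output: anything other than one unambiguous verdict is treated as FAIL. This sketch assumes the judge writes its reasoning first and puts the verdict on the last non-empty line in uppercase; that output format is an assumption about your prompt, not a guarantee.

```python
import re

VERDICT_PATTERN = re.compile(r"\b(PASS|FAIL)\b")

def parse_verdict(judge_output: str) -> str:
    """Return PASS only on a single clear PASS; default to FAIL otherwise."""
    lines = [ln for ln in judge_output.strip().splitlines() if ln.strip()]
    if not lines:
        return "FAIL"  # empty output: route to human review
    found = set(VERDICT_PATTERN.findall(lines[-1]))
    # Missing verdict or both verdicts mentioned: treat as uncertain.
    return found.pop() if len(found) == 1 else "FAIL"

clear = parse_verdict("Reasoning: explains a concept only.\nVerdict: PASS")
mixed = parse_verdict("Reasoning: borderline.\nVerdict: PASS or FAIL, hard to say")
missing = parse_verdict("I am not sure how to classify this.")
```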
Change Control for Rubric Updates
- All criteria changes logged in the criteria library with date and rationale
- Prompt versions tracked (v1.0, v1.1, v1.2…)
- Metrics recorded before and after each change
- Regression test results documented
Auditability: For regulated industries, store everything. The rationale for each rule, the dataset version used for testing, the prompt version deployed, and the metrics at each stage. When auditors ask, "Why did the system flag this?", you can trace back to the criteria, the examples that informed it, and the SME who approved it.
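One possible shape for a change-log entry that captures the items above is a small immutable record serialized to one JSON line. The field names and every value shown are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RubricChange:
    prompt_version: str      # e.g. "v1.1"
    date: str                # ISO date of the change
    rationale: str           # why the rule was added or changed
    approved_by: str         # the SME who signed off
    dataset_version: str     # ground truth version used for testing
    metrics_before: dict
    metrics_after: dict
    regression_passed: bool

# Placeholder values for illustration only.
entry = RubricChange(
    prompt_version="v1.1",
    date="2024-01-15",
    rationale="Treat directional sentiment about a named security as advice.",
    approved_by="compliance-lead",
    dataset_version="gt-001",
    metrics_before={"tpr": 0.0, "tnr": 0.0},
    metrics_after={"tpr": 0.0, "tnr": 0.0},
    regression_passed=True,
)
log_line = json.dumps(asdict(entry), sort_keys=True)
```

Appending one such line per change gives an audit trail that can be searched by prompt version or approver.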
What SMEs Catch That Judges Miss
SME refinement surfaces two categories: implicit quality criteria that humans apply unconsciously, and systematic LLM judge biases.
Implicit Criteria SMEs Apply
Known LLM Judge Biases
When your SME consistently disagrees with your judge in a pattern, you've found a bias. Document it, add counter-examples, and track whether refinement reduces it.
Gotchas with LLM Judges
SME refinement improves judges, but some failure modes are structural. Know where judges reliably fail so you don't waste cycles trying to prompt-engineer around fundamental limitations.
How to Look for Unknown Unknowns
Traditional refinement loops assume you can find the failures. When your judge marks something as "correct" but it's actually broken in ways your rubric doesn't capture, you have a silent failure. At scale, these accumulate.
Disagreement mining
Instead of relying on a single judge, run 2–3 judges with different prompting strategies on the same cases. High disagreement cases are gold mines. If your judges can't agree, something interesting is happening that your current rubric doesn't capture. These cases deserve SME attention first.
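A minimal way to rank cases for SME attention is to score each one by how far the judges are from unanimity. The verdicts below are made up, and the scoring rule (one minus the majority share) is just one simple choice.

```python
from collections import Counter

def disagreement_score(verdicts):
    """0.0 when all judges agree; grows as the majority shrinks."""
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)

def rank_for_review(verdicts_by_case):
    """Case ids sorted with the most contested cases first."""
    return sorted(
        verdicts_by_case,
        key=lambda case_id: disagreement_score(verdicts_by_case[case_id]),
        reverse=True,
    )

# Made-up verdicts from three judge prompts.
verdicts_by_case = {
    "c1": ["PASS", "PASS", "PASS"],
    "c2": ["PASS", "FAIL", "PASS"],
    "c3": ["FAIL", "FAIL", "FAIL"],
}
ranked = rank_for_review(verdicts_by_case)
```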
Clustering and outlier detection
Embed your spans or judge reasoning chains, cluster them, then manually inspect three areas: small clusters (rare failure modes hide here), items far from cluster centroids (anomalies your framework hasn't encountered), and neighbors of known failures (examine nearby items that weren't flagged).
Proxy signal triangulation
Stack multiple weak signals that might indicate failures your judge missed:
- Follow-up questions that rephrase the original query (user didn't get what they needed)
- Session abandonment after a "correct" response
- User explicitly correcting the system in a later turn
Individually, these signals are noisy. The intersection of 2–3 signals pointing at the same case is often a real failure worth investigating.
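A sketch of the intersection logic: flag a session only when at least two of the weak signals fire together. The boolean field names are hypothetical; how each signal is detected upstream is out of scope here.

```python
SIGNALS = ("rephrased_query", "abandoned_after_pass", "user_corrected")

def triangulate(sessions, min_signals=2):
    """Return (session id, fired signals) for sessions with enough signals."""
    flagged = []
    for session in sessions:
        hits = [name for name in SIGNALS if session.get(name)]
        if len(hits) >= min_signals:
            flagged.append((session["id"], hits))
    return flagged

# Hypothetical sessions the judge marked as correct.
sessions = [
    {"id": "s1", "rephrased_query": True},
    {"id": "s2", "rephrased_query": True, "user_corrected": True},
    {"id": "s3", "abandoned_after_pass": True, "rephrased_query": True,
     "user_corrected": True},
    {"id": "s4"},
]
suspects = triangulate(sessions)
```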
Stratified random sampling
Budget 5–10% of your evaluation effort for "discovery mode" where humans look at random slices of data, not filtered views. Filtered views only show you what you already know to look for.
[Figure: Discovery Methods. Four methods surface unknown unknowns for SME review, which yields new criteria: multi-judge disagreement (Judge A ≠ Judge B); clustering and outliers (small clusters, far from centroid); proxy signal triangulation (abandonment + rephrase + correction); random sampling (5–10% budget, unfiltered views).]
The Uncomfortable Reality
Finding unknown unknowns requires accepting that some percentage of your evaluation budget goes toward looking at things that might be fine. That's not waste. That's the cost of a robust evaluation system.
The Takeaway
LLM judges plateau because they lack domain expertise. Through systematic SME refinement, you can push accuracy toward 95%. Every hour of SME feedback should improve thousands of future automated evaluations.
But prompt-based refinement eventually hits a ceiling. Chapter 4 introduces fine-tuning small language models as specialized judges for when you've exhausted what prompting can achieve.
Chapter Checklist
Before moving to Chapter 4:
- Lead domain expert designated with final decision authority
- Ground truth dataset: 100+ labeled examples with documented reasoning
- Data split into Train (~20%), Dev (~40%), Test (~40%)
- Inter-rater Kappa > 0.7 (if multiple annotators)
- Judge TPR and TNR both > 90% against Test set
- Criteria library documenting implicit rules discovered
- Regression test process in place before each deploy
- Escalation path defined for ambiguous cases
