Chapter 1 introduced the Eval Engineering lifecycle. Stage 1 is LLM-as-Judge: using a large language model to evaluate your AI's outputs. This chapter goes deep on how to do it well.
The approach is deceptively simple. Write a prompt that defines what "good" means. Feed it your AI's outputs. Get judgments at scale. In theory, you've automated evaluation. In practice, most teams plateau below 70% accuracy and wonder what went wrong.
The gap between generic LLM judges and production-grade evaluation is a matter of understanding what judges actually do well, what they systematically fail at, and how to configure them for your specific domain. This chapter gives you that understanding.
Two Sides of LLM Evaluation
Generic judges get you started. Custom judges get you to production.
[Figure: a generic judge (64-68% accuracy ceiling) compared with a custom judge (production-grade accuracy).]
Why LLM Judges Work
LLM judges work because LLMs have internalized enormous amounts of human judgment. They've seen millions of examples of good writing, clear explanations, accurate summaries, and helpful responses. When you ask GPT or Claude whether a response is helpful, you're leveraging that internalized sense of quality.
This is why LLM judges can achieve 80%+ agreement with human evaluators on general tasks. Research from the MT-Bench paper showed that strong LLM judges match crowdsourced human preferences at roughly the same level that humans agree with each other. For many evaluation tasks, an LLM judge is as reliable as hiring an annotator.
Key Takeaways
- LLM judges scale infinitely
- They don't get tired, don't have bad days, and don't develop annotation fatigue after reviewing their 500th example
- They provide consistent application of criteria across thousands of evaluations
- They're available 24/7 and can process evaluations in seconds rather than days
- And critically, they're explainable: you can ask them to show their reasoning
There's another advantage that gets less attention: LLM judges improve automatically as models improve. The judge you build today will get more accurate when the next generation of models arrives. You don't need to retrain anything. Just point your evaluation pipeline at a better model and accuracy improves. This is fundamentally different from traditional ML classifiers, which require new training data and retraining cycles to improve.
Key Idea
Your eval infrastructure appreciates over time. Every model upgrade makes your judges smarter for free.
But here's the catch. LLM judges don't know what "good" means for your specific use case, your domain, your users, or your definition of success. And that's where most implementations fail.
The Issue with Generic LLM Judges
Ask GPT-5 to evaluate whether a customer service response is "helpful," and it will give you a reasonable answer. Ask it to evaluate whether that same response follows your company's specific escalation policy, uses approved terminology, maintains the right tone for your brand, and addresses the customer's underlying concern rather than just their stated question, and it will guess.
Generic judges plateau because they're optimizing for generic quality. They catch obvious failures: responses that are clearly wrong, incoherent, or off-topic. They miss subtle failures: responses that are technically correct but violate domain-specific requirements, follow the letter of instructions but not the spirit, or satisfy general criteria while failing specific ones.
Research from UC Berkeley quantified this problem. Off-the-shelf LLM-as-judge setups achieve 64-68% accuracy on domain-specific evaluation tasks. That's the ceiling of the generic approach. Not because the models are incapable of better judgment, but because they lack the domain context to make better judgments.
We know that "good" means different things in different contexts: in legal document review (cites precedents, identifies risks, includes disclaimers), in customer support (acknowledges frustration, provides next steps, stays within policy), in coding assistants (working code, handles edge cases, follows style guide). A generic judge can't evaluate any of these well because it doesn't know the specific criteria that matter.
Building Your First LLM Judge
Start with a single, specific failure mode. Not five. Not ten. One. This constraint forces clarity about what you're actually trying to measure.
But before you write your judge prompt, let me show you what separates a basic prompt from one that actually works.
Basic vs. Refined Eval Prompts
Basic Prompt
Evaluate whether this response is helpful and accurate.
Response: {response}
Is this response good? Answer yes or no
This prompt will achieve roughly 60% agreement with human judgment. Here's why it fails:
Key Takeaways
- "Helpful" and "accurate" are vague and conflated
- No context about the domain or use case
- No examples of what pass/fail looks like
- No reasoning required before verdict
Refined Prompt (75-80% accuracy)
You are an impartial evaluator assessing whether AI responses to financial questions provide accurate information without giving investment advice.

CRITERIA:
A response PASSES if it:
1. Contains only factual, verifiable information about financial instruments or concepts
2. Does NOT recommend buying, selling, or holding any specific investment
3. Does NOT suggest allocation percentages or timing for trades
4. Includes appropriate uncertainty when discussing future performance
A response FAILS if it violates ANY of the above.

EXAMPLES:

Example 1 (FAIL):
User: "Should I invest in tech stocks?"
Response: "Based on current market trends, I'd recommend allocating 60% to tech ETFs like QQQ."
Reasoning: Explicitly recommends a specific allocation and names a specific fund. This is investment advice.
Verdict: FAIL

Example 2 (PASS):
User: "Should I invest in tech stocks?"
Response: "Tech stocks have historically shown higher volatility than the broader market. The NASDAQ-100 has had average annual returns of about 10% over the past 20 years, though past performance doesn't guarantee future results."
Reasoning: Provides historical facts without recommending any action. Includes appropriate uncertainty language.
Verdict: PASS

Example 3 (FAIL):
User: "What do you think about Apple stock?"
Response: "Apple has strong fundamentals. Now would be a good time to buy while it's down."
Reasoning: "Good time to buy" constitutes timing advice.
Verdict: FAIL

NOW EVALUATE:
Response: {response}

First explain your reasoning, then provide your verdict (PASS/FAIL)
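A prompt like the refined one above still needs a thin wrapper that fills in the response and extracts the verdict. The following is a minimal sketch, not a reference implementation: `call_model` is a hypothetical stand-in for whatever LLM client you use, `fill_prompt`, `parse_verdict`, and `judge` are illustrative names, and the template is shortened for space. It assumes the judge writes its reasoning first and ends with an uppercase PASS or FAIL, as the prompt requests.

```python
import re
from typing import Callable, Optional

# Shortened template for illustration; a real one would carry the full criteria and examples.
JUDGE_TEMPLATE = (
    "A response PASSES if it gives only factual information and does NOT "
    "recommend buying, selling, or holding any specific investment.\n"
    "Response: {response}\n"
    "First explain your reasoning, then provide your verdict (PASS/FAIL)."
)


def fill_prompt(template: str, response: str) -> str:
    # str.replace avoids str.format errors if the template contains other literal braces.
    return template.replace("{response}", response)


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if no verdict is found."""
    matches = re.findall(r"\b(PASS|FAIL)\b", judge_output)
    # Reasoning comes first, so the last match is treated as the verdict.
    return None if not matches else matches[-1] == "PASS"


def judge(response: str, call_model: Callable[[str], str]) -> Optional[bool]:
    return parse_verdict(call_model(fill_prompt(JUDGE_TEMPLATE, response)))


# Stubbed model (hypothetical) so the sketch runs without network access.
def fake_model(prompt: str) -> str:
    return "Reasoning: 'good time to buy' is timing advice.\nVerdict: FAIL"


print(judge("Now would be a good time to buy.", fake_model))  # False
```

Returning `None` when no verdict is found lets the caller retry or route the item to review instead of silently counting it as a pass or a fail.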
Prompt Engineering for Judges
Specific Criteria
Vague: "Evaluate if this response is helpful"
Specific: "Answers the question directly, avoids personal opinions, uses clear language, avoids assumptions about expertise."
Clear Terms
Vague: "Uses appropriate tone"
Clear: "Uses formal language, avoids slang, addresses user by name"
Binary Output
Numeric scale: "What's the difference between 3 and 4?"
PASS/FAIL: Unambiguous. No interpretation needed.
Few-Shot Examples
Include examples for cases that caused disagreement. If your team argued about it, the judge needs it.
Your examples are your specifications.
Explicit Decision Rules
"If no citation → FAIL"
"If partial answer without ask → FAIL"
"If correct but wrong tone → FAIL"
Split Compound Criteria
Compound: "Evaluate if helpful AND accurate"
Split: Judge 1: Helpful? Judge 2: Accurate?
The quality of your judge depends almost entirely on the quality of your prompt. This isn't hyperbole. The same underlying model can achieve accuracy rates ranging from 60% to 95%, depending on the prompt used.
These principles are based on hard-won lessons in collaborative annotation. When humans disagree on labels, it's usually because the rubric is ambiguous. The same applies to LLM judges: unclear prompts produce inconsistent judgments.
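The "Split Compound Criteria" and "Explicit Decision Rules" principles can be sketched as one binary judge per criterion, with the per-criterion verdicts reported alongside the overall result. This is only an illustration: the two judge functions are hypothetical keyword stubs standing in for real LLM calls, and `evaluate_criteria` is an assumed name, not part of any library.

```python
from typing import Callable, Dict

Judge = Callable[[str], bool]  # takes a response, returns True for PASS


def evaluate_criteria(response: str, judges: Dict[str, Judge]) -> Dict[str, bool]:
    """Run each single-criterion judge independently and add an overall verdict."""
    results = {name: judge(response) for name, judge in judges.items()}
    # Overall PASS only if every individual criterion passes.
    results["overall"] = all(results.values())
    return results


# Hypothetical stubs; in practice each would be its own LLM judge prompt.
def cites_source(response: str) -> bool:
    return "source:" in response.lower()  # "If no citation -> FAIL"


def avoids_advice(response: str) -> bool:
    return "recommend" not in response.lower()


judges = {"cites_source": cites_source, "avoids_advice": avoids_advice}
report = evaluate_criteria("I recommend QQQ. Source: fund prospectus.", judges)
print(report)
# {'cites_source': True, 'avoids_advice': False, 'overall': False}
```

Keeping the per-criterion results visible shows which requirement failed, which a single compound verdict would hide.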
Multi-Judge Polling: Panels of LLM Evaluators
ChainPoll Algorithm
"One judge is an opinion. Three judges agreeing is evidence."
A single judge call has inherent variance. Ask the same model the same question twice, and you might get different answers. At production scale, this variance becomes noise that obscures real signal.
Why Single Judges Fail
Most LLM-as-Judge systems rely on a single strong evaluator, often GPT-5. This has fundamental problems:
Intra-model bias. Judge models tend to recognize and favor outputs stylistically similar to their own generations. This self-preference effect inflates scores for same-family models.
High variance. Small changes in prompt wording or formatting produce large swings in evaluation outcomes. A judge that was accurate yesterday might be inconsistent today.
Cost at scale. Using a frontier model for every evaluation is expensive and slow.
Aggregation Strategies
Practical Configurations
3 judges. The minimum for meaningful aggregation. Majority vote determines the final verdict. Cost increases 3× but variance drops significantly.
5 judges. Provides more granularity. You can distinguish "5/5 agree it fails" from "3/5 agree it fails," which maps to confidence levels.
Mixed model families. Reduce systematic bias. If you're worried that GPT has blind spots, use a panel of different models (GPT, Claude, Gemini) and aggregate their verdicts.
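Majority voting over a panel, as described in the configurations above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: verdicts are booleans (True means PASS), the panel size is odd so ties cannot occur, and `aggregate_panel` is a hypothetical helper name.

```python
from typing import List, Tuple


def aggregate_panel(verdicts: List[bool]) -> Tuple[bool, float]:
    """Return (majority verdict, fraction of judges agreeing with it)."""
    if not verdicts or len(verdicts) % 2 == 0:
        raise ValueError("use a non-empty, odd-sized panel to avoid ties")
    passes = sum(verdicts)
    majority = passes > len(verdicts) / 2
    agreeing = passes if majority else len(verdicts) - passes
    return majority, agreeing / len(verdicts)


# "5/5 agree it fails" versus "3/5 agree it fails"
print(aggregate_panel([False, False, False, False, False]))  # (False, 1.0)
print(aggregate_panel([False, True, False, True, False]))    # (False, 0.6)
```

Both panels return the same verdict, but the agreement fraction separates a unanimous failure from a contested one.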
The ChainPoll Approach
The ChainPoll approach, developed for hallucination detection, demonstrated this concretely. It combines two techniques to achieve high-accuracy hallucination detection:
First, it uses a carefully engineered chain-of-thought prompt that asks the LLM judge to write out step-by-step reasoning before rendering a verdict. The prompt is designed to elicit systematic, detailed explanations rather than quick judgments.
Second, it polls the model multiple times (typically 3 or 5) and aggregates the binary yes/no verdicts into a confidence score. If two out of three polls say "hallucinated," the score is 2/3, or about 0.67. This aggregation captures uncertainty that a single judgment would miss.
Three design choices make ChainPoll particularly effective: it requests boolean judgments rather than numeric scores (which proved more reliable in testing), it places the reasoning before the verdict (so the answer can leverage the explanation), and it uses a smaller, faster model, making up the accuracy gap through multiple polls.
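The polling step can be sketched as follows. This illustrates the idea described above and is not the published ChainPoll implementation: `poll_once` is a hypothetical callable that sends the chain-of-thought prompt to a model and returns True when that single run concludes "hallucinated", and `poll_score` is an assumed name.

```python
from typing import Callable


def poll_score(poll_once: Callable[[], bool], n_polls: int = 3) -> float:
    """Fraction of independent polls whose verdict was 'hallucinated'."""
    if n_polls < 1:
        raise ValueError("n_polls must be at least 1")
    votes = [poll_once() for _ in range(n_polls)]
    return sum(votes) / n_polls


# Deterministic stub: replays canned verdicts in place of real model calls.
canned = iter([True, False, True])
score = poll_score(lambda: next(canned), n_polls=3)
print(round(score, 2))  # 0.67
```

A threshold on the score turns it back into a binary verdict, and intermediate values are natural candidates for human review.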
The Biases You Must Mitigate
LLM judges have systematic biases. If you don't address them, your evaluations will be systematically wrong in predictable ways. These need to be mitigated with different strategies.
Positional Bias
Verbosity Bias
Self-Preference
Recency Bias
Format Bias
Judge Configurations
LLM judges have multiple configuration dimensions. Each involves tradeoffs.
Model Selection: Reasoning vs. Non-Reasoning
Reasoning models (o1, o3, Claude with extended thinking) produce more thorough evaluations but cost more and take longer. For simple binary judgments, a standard model with chain-of-thought prompting is usually sufficient. Reserve reasoning models for evaluations that require multi-step analysis or complex domain reasoning.
Rule of thumb: if your evaluation criteria can be stated in a single sentence, you don't need a reasoning model. If evaluation requires weighing multiple factors and considering edge cases, reasoning models may help.
Scope: Where to Apply the Judge
Start with trace-level evaluation for most use cases. Move to span-level when you need to diagnose which component of a complex system is failing.
Output Type: Choosing the Right Scale
Boolean is almost always the right starting point. If you find yourself wanting more granularity, that's often a sign you haven't clearly defined your criteria. Two judges with clear binary criteria are better than one judge with a vague 5-point scale.
Reasoning: Chain-of-Thought vs. Direct
Chain-of-thought (CoT) prompting asks the judge to explain its reasoning before the verdict. This improves accuracy and makes evaluations explainable. The cost: more tokens and higher latency.
Rule of Thumb
Use CoT for development; consider direct evaluation in production if you're confident in accuracy.
Precision vs. Recall Tradeoff
For professional AI engineers, "accuracy" is just the starting point.
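To make the tradeoff concrete, the sketch below compares judge verdicts against human labels and reports accuracy, precision, and recall. It assumes the positive class is a flagged failure (True means the judge or the human marked the response as FAIL). The function name and the sample data are made up for illustration.

```python
from typing import Dict, List


def judge_metrics(judge_flags: List[bool], human_flags: List[bool]) -> Dict[str, float]:
    """Score a judge against human labels; True means 'flagged as a failure'."""
    if len(judge_flags) != len(human_flags) or not judge_flags:
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(judge_flags, human_flags))
    tp = sum(j and h for j, h in pairs)      # real failures the judge caught
    fp = sum(j and not h for j, h in pairs)  # false alarms
    fn = sum(h and not j for j, h in pairs)  # real failures the judge missed
    correct = sum(j == h for j, h in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Ten responses: humans found three failures; the judge flagged two, one correctly.
judge_flags = [True, True, False, False, False, False, False, False, False, False]
human_flags = [True, False, True, True, False, False, False, False, False, False]
m = judge_metrics(judge_flags, human_flags)
print(m["accuracy"], m["precision"], round(m["recall"], 2))  # 0.7 0.5 0.33
```

In this made-up sample the judge is 70% accurate yet catches only one of three real failures, which is why accuracy alone can mislead when failures are rare.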
Creating Custom Metrics
One barrier to good evaluation is the perception that custom metrics are hard to build. They're not. Modern evaluation frameworks make it trivial to go from a prompt idea to a running metric.
Galileo Custom Metric API
from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType

metric = create_custom_llm_metric(
    name="Financial Compliance Check",
    user_prompt="""You are an impartial evaluator assessing whether AI responses provide information without giving investment advice.

A response PASSES if it contains only factual information and does NOT recommend buying, selling, or holding any specific investment.
A response FAILS if it recommends specific investments or timing.

Response to evaluate: {output}

First explain your reasoning, then provide your verdict (PASS/FAIL).""",
    node_level=StepType.llm,
    output_type=OutputTypeEnum.BOOLEAN,  # binary PASS/FAIL output
    model_name="gpt-4.1-mini",
    cot_enabled=True,  # chain-of-thought reasoning before the verdict
    num_judges=3,  # multi-judge polling
)
That's it. No infrastructure to set up. No models to fine-tune. No complex configuration.
Implementation Challenges
Building LLM judges that work in demos is easy. Building ones that work in production is harder.
The Takeaway
LLM-as-Judge is Stage 1 of the Eval Engineering lifecycle. It gets you from no evaluation to 60-70% accuracy quickly. The techniques in this chapter, including custom metrics, proper prompt engineering, multi-judge polling, and bias mitigation, can increase accuracy to 80%, which is not yet production-ready for most use cases.
To reach 90%+ accuracy, you need human expertise in the loop. The next stage, SME Refinement, is where generic becomes domain-specific. Chapter 3 shows how to bring subject matter experts into the loop.
"60% to 80% is engineering. 80% to 95% is domain expertise."
Frequently Asked Questions
Use the strongest model you can afford within your latency and cost constraints. For most use cases, GPT-class models or Claude Sonnet provide a good balance. Consider using a fine-tuned smaller model for better accuracy at a lower cost.
Safety evaluation requires specialized approaches. General-purpose LLM judges can miss adversarial inputs and jailbreaks. For safety-critical applications, combine dedicated models with rule-based guardrails and human review.
