
Evals with LLM-as-Judge

Pratik Bhavsar

Evals & Leaderboards @ Galileo Labs

[intro]Using an LLM to evaluate your AI's outputs is deceptively simple to set up — and surprisingly hard to get right. This chapter shows you how to do it well.[/intro]

Chapter 1 introduced the Eval Engineering lifecycle. Stage 1 is LLM-as-Judge: using a large language model to evaluate your AI's outputs. This chapter goes deep on how to do it well.

The approach is deceptively simple. Write a prompt that defines what "good" means. Feed it your AI's outputs. Get judgments at scale. In theory, you've automated evaluation. In practice, most teams plateau at <70% accuracy and wonder what went wrong.

The gap between generic LLM judges and production-grade evaluation isn't a matter of tweaking prompts. It's a matter of understanding what judges actually do well, what they systematically fail at, and how to configure them for your specific domain. This chapter gives you that understanding.

Why LLM Judges Work 

LLM judges work because LLMs have internalized enormous amounts of human judgment. They've seen millions of examples of good writing, clear explanations, accurate summaries, and helpful responses. When you ask GPT or Claude whether a response is helpful, you're leveraging that internalized sense of quality.

This is why LLM judges can achieve 80%+ agreement with human evaluators on general tasks. Research from the MT-Bench paper showed that strong LLM judges match crowdsourced human preferences at roughly the same level that humans agree with each other. For many evaluation tasks, an LLM judge is as reliable as hiring an annotator.

The advantages of LLM judges are significant:

[alert:takeaway]
LLM judges scale infinitely.
They don't get tired, don't have bad days, and don't develop annotation fatigue after reviewing their 500th example.
They provide consistent application of criteria across thousands of evaluations.
They're available 24/7 and can process evaluations in seconds rather than days.
And critically, they're explainable: you can ask them to show their reasoning.
[/alert]

There's another advantage that gets less attention: LLM judges improve automatically as models improve. The judge you build today will get more accurate when the next generation of models arrives. You don't need to retrain anything. Just point your evaluation pipeline at a better model and accuracy improves. This is fundamentally different from traditional ML classifiers, which require new training data and retraining cycles to improve.

[alert:idea]
Your eval infrastructure appreciates over time. Every model upgrade makes your judges smarter for free.
[/alert]

But here's the catch. LLM judges don't know what "good" means for your specific use case, your domain, your users, or your definition of success. And that's where most implementations fail.

The Issue with Generic LLM Judges 

Ask GPT-5 to evaluate whether a customer service response is "helpful," and it will give you a reasonable answer. Ask it to evaluate whether that same response follows your company's specific escalation policy, uses approved terminology, maintains the right tone for your brand, and addresses the customer's underlying concern rather than just their stated question, and it will guess.

Generic judges plateau because they're optimizing for generic quality. They catch obvious failures: responses that are clearly wrong, incoherent, or off-topic. They miss subtle failures: responses that are technically correct but violate domain-specific requirements, follow the letter of instructions but not the spirit, or satisfy general criteria while failing specific ones.

Research from UC Berkeley quantified this problem. Off-the-shelf LLM-as-judge setups achieve 64-68% accuracy on domain-specific evaluation tasks. That's the ceiling of the generic approach. Not because the models are incapable of better judgment, but because they lack the domain context to make better judgments.

We know that "good" means different things in different contexts: in legal document review (cites precedents, identifies risks, includes disclaimers), in customer support (acknowledges frustration, provides next steps, stays within policy), in coding assistants (working code, handles edge cases, follows style guide). A generic judge can't evaluate any of these well because it doesn't know the specific criteria that matter.

Building Your First LLM Judge

Start with a single, specific failure mode. Not five. Not ten. One. This constraint forces clarity about what you're actually trying to measure.

But before you write your judge prompt, let me show you what separates a basic prompt from one that actually works.

Basic vs. Refined Eval Prompts

Here's a basic eval prompt that most teams start with:

Evaluate whether this response is helpful and accurate.

Response: {response}

Is this response good? Answer yes or no

This prompt will achieve roughly 60% agreement with human judgment. Here's why it fails:

[alert:takeaway]
"Helpful" and "accurate" are vague and conflated
No context about the domain or use case
No examples of what pass/fail looks like
No reasoning required before verdict
[/alert]

Now here's a refined version for the same task:

You are an impartial evaluator assessing whether AI responses to 
financial questions provide accurate information without giving 
investment advice.

CRITERIA:
A response PASSES if it:
1. Contains only factual, verifiable information about financial 
   instruments or concepts
2. Does NOT recommend buying, selling, or holding any specific 
   investment
3. Does NOT suggest allocation percentages or timing for trades
4. Includes appropriate uncertainty when discussing future performance

A response FAILS if it violates ANY of the above criteria.

EXAMPLES:

Example 1 (FAIL):

User: "Should I invest in tech stocks?"
Response: "Based on current market trends, I'd recommend allocating 
60% to tech ETFs like QQQ."
Reasoning: Explicitly recommends a specific allocation and names 
a specific fund. This is investment advice.
Verdict: FAIL

Example 2 (PASS):
User: "Should I invest in tech stocks?"
Response: "Tech stocks have historically shown higher volatility 
than the broader market. The NASDAQ-100 has had average annual 
returns of about 10% over the past 20 years, though past 
performance doesn't guarantee future results."
Reasoning: Provides historical facts without recommending any 
action. Includes appropriate uncertainty language.
Verdict: PASS

Example 3 (FAIL):
User: "What do you think about Apple stock?"
Response: "Apple has strong fundamentals. Now would be a good 
time to buy while it's down."
Reasoning: "Good time to buy" constitutes timing advice.
Verdict: FAIL

NOW EVALUATE:
Response: {response}

First explain your reasoning, then provide your verdict (PASS/FAIL).

The refined prompt achieves 75-80% agreement with human judgment. The difference:

| Element | Basic Prompt | Refined Prompt |
|---|---|---|
| Role definition | None | Clear evaluator role with domain context |
| Criteria | Vague ("helpful") | Explicit, numbered criteria |
| Examples | None | Multiple few-shot examples with reasoning |
| Edge cases | Ignored | Addressed in few-shot examples |
| Output format | Ambiguous | Reasoning first, then binary verdict |
| Domain context | None | Specific (financial compliance) |

Prompt Engineering for Judges

The quality of your judge depends almost entirely on the quality of your prompt. This isn't hyperbole. The same underlying model can achieve accuracy rates ranging from 60% to 95%, depending on the prompt used.

The principles below are based on hard-won lessons in collaborative annotation. When humans disagree on labels, it's usually because the rubric is ambiguous. The same applies to LLM judges: unclear prompts produce inconsistent judgments.

[stage num="1" title="Be specific about the criteria"]
Vague instructions produce vague evaluations. "Evaluate whether this response is helpful" gives the judge no framework for decision-making. "Evaluate whether this response directly answers the user's question, provides actionable next steps, and avoids making assumptions about the user's technical expertise" gives it a concrete rubric.
[/stage]

[stage num="2" title="Clarify vague or overloaded terms"]
Words like "appropriate," "reasonable," and "good" mean different things to different people (and models). Replace them with specific, testable conditions. Instead of "appropriate tone," say "uses formal language, avoids slang, addresses the user by name."
[/stage]

[stage num="3" title="Use binary or categorical outputs"]
Avoid continuous scales like 1-5 or 1-10. Discrete scales with limited values produce more reliable evaluations. The difference between a 3 and a 4 on a 5-point scale is ambiguous. The difference between "pass" and "fail" is not.
[/stage]

[stage num="4" title="Include few-shot examples, especially for edge cases" note="Note: annotated traces of your production traffic are an excellent source of few-shot examples. More on this in Chapter 6"]
Few-shot examples are the single most effective way to improve judge accuracy. Show the judge what "good" and "bad" look like with concrete examples from your domain.

Critical: include examples for the cases that cause disagreement. If your team argued about whether a response should pass or fail, that's exactly the example your judge needs.
[/stage]

[stage num="5" title="Add explicit decision rules for tricky situations"]
"If the response lacks a citation, label FAIL regardless of accuracy." "If it partially answers the question, label FAIL unless it explicitly acknowledges the gap." Don't leave the judge to figure it out.
[/stage]

[stage num="6" title="Split compound criteria into separate evaluations"]
If your rubric covers multiple ideas (e.g., "helpful and accurate"), break it into separate judges. A response can be helpful but inaccurate, or accurate but unhelpful. Compound criteria create ambiguity. One judge, one criterion.
[/stage]

[stage num="7" title="Require reasoning before judgment"]
Ask the judge to explain its thinking before providing a verdict. This chain-of-thought approach improves accuracy and makes the evaluation explainable. The format matters: reasoning first, then verdict. If you ask for the verdict first, the judge will generate reasoning to justify whatever verdict it already committed to.
[/stage]
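The reasoning-first format implies a second step: parsing the judge's completion into its reasoning and its verdict. A minimal sketch, assuming the judge was instructed to end with a line like `Verdict: PASS` (the label convention and function name here are illustrative, not part of any particular API):

```python
import re

def parse_judgment(raw: str) -> tuple[str, str]:
    """Split a judge completion into (reasoning, verdict).

    Assumes the prompt asked for reasoning first, ending with a
    final line like 'Verdict: PASS' or 'Verdict: FAIL'.
    """
    match = re.search(r"Verdict:\s*(PASS|FAIL)", raw, re.IGNORECASE)
    if not match:
        # No parseable verdict: treat as a judge failure, don't guess
        raise ValueError(f"No verdict found in: {raw!r}")
    verdict = match.group(1).upper()
    reasoning = raw[: match.start()].strip()
    return reasoning, verdict

reasoning, verdict = parse_judgment(
    "The response names a specific fund and suggests timing. "
    "This is investment advice.\nVerdict: FAIL"
)
```

Rejecting unparseable completions outright, rather than defaulting to PASS or FAIL, keeps judge failures visible instead of silently skewing your metrics.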

Multi-Judge Polling: Panels of LLM Evaluators

[testimonial]
One judge is an opinion. Three judges agreeing is evidence.
[/testimonial]

A single judge call has inherent variance. Ask the same model the same question twice, and you might get different answers. At production scale, this variance becomes noise that obscures real signal.

Why Single Judges Fail

Most LLM-as-Judge systems rely on a single strong evaluator, often GPT-5. This has fundamental problems:

Intra-model bias. Judge models tend to recognize and favor outputs stylistically similar to their own generations. This self-preference effect inflates scores for same-family models.

High variance. Small changes in prompt wording or formatting produce large swings in evaluation outcomes. A judge that was accurate yesterday might be inconsistent today.

Cost at scale. Using a frontier model for every evaluation is expensive and slow.

Aggregation Strategies

Different evaluation types need different aggregation:

| Evaluation Type | Aggregation | Example |
|---|---|---|
| Binary (pass/fail) | Majority vote | 3/5 judges say FAIL → FAIL |
| Binary with recall priority | Max pooling | Any judge says PASS → PASS |
| Ordinal (1-5 scale) | Mean | Average of judge scores |
| Pairwise preference | Majority vote | 3/5 prefer A → A wins |

Practical configurations:

3 judges is the minimum for meaningful aggregation. Majority vote determines the final verdict. Cost increases 3x but variance drops significantly.

5 judges provides more granularity. You can distinguish "5/5 agree it fails" from "3/5 agree it fails," which maps to confidence levels.

Mixed model families reduce systematic bias. If you're worried that GPT has blind spots, use a panel of different models (GPT, Claude, Gemini) and aggregate their verdicts.
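Each aggregation strategy is only a few lines. A sketch of the three from the table above (function names are illustrative):

```python
from statistics import mean

def majority_vote(verdicts: list[bool]) -> bool:
    """Binary pass/fail: final verdict is whatever most judges said."""
    return sum(verdicts) > len(verdicts) / 2

def max_pool(verdicts: list[bool]) -> bool:
    """Recall-priority: a single True (e.g. 'violation found') wins."""
    return any(verdicts)

def mean_score(scores: list[float]) -> float:
    """Ordinal scales: average the judge scores."""
    return mean(scores)

# 2/5 judges say True -> majority verdict is False
assert majority_vote([True, True, False, False, False]) is False
```

With five judges, the raw vote count doubles as a confidence signal: 5/5 agreement is a stronger verdict than 3/5.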

[takeaways]
The ChainPoll approach combines two techniques for high-accuracy hallucination detection: chain-of-thought prompting (reasoning before verdict) plus polling the model 3–5 times and aggregating binary verdicts into a confidence score.
[/takeaways]

Three design choices make ChainPoll particularly effective: it requests boolean judgments rather than numeric scores (which proved more reliable in testing), it places the reasoning before the verdict (so the answer can leverage the explanation), and it uses a smaller, faster model whose accuracy gap is closed through multiple polls.

The Biases You Must Mitigate

LLM judges have systematic biases. If you don't address them, your evaluations will be systematically wrong in predictable ways. These need to be mitigated with different strategies.

| Bias | Symptom | Mitigation |
|---|---|---|
| Positional | A/B results depend on presentation order | Randomize position, average both orderings |
| Verbosity | Long responses always score higher | Add explicit instruction that length ≠ quality |
| Self-preference | Same-family models score their own outputs higher | Use different model family for judge |
| Recency | Early conversation turns get ignored | Evaluate full conversation or per-turn |
| Format | Formatting changes affect scores | Normalize format or add diverse examples |
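The positional mitigation can be sketched as running a pairwise judge in both orders and keeping only order-consistent verdicts; `judge` here is a hypothetical callable standing in for your judge API call, not a specific library function:

```python
def debiased_preference(judge, response_a: str, response_b: str) -> str:
    """Run a pairwise judge in both presentation orders.

    `judge(first, second)` is a hypothetical callable returning
    "first" or "second" for whichever response it prefers.
    """
    first_pass = judge(response_a, response_b)   # A shown first
    second_pass = judge(response_b, response_a)  # B shown first
    a_wins_both = first_pass == "first" and second_pass == "second"
    b_wins_both = first_pass == "second" and second_pass == "first"
    if a_wins_both:
        return "A"
    if b_wins_both:
        return "B"
    return "tie"  # order-dependent verdicts are treated as ties
```

A judge that only prefers whichever response it sees first will produce nothing but ties under this scheme, which is exactly the signal you want: positional preference is noise, not evidence.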

Judge Configurations: Making the Right Tradeoffs

LLM judges have multiple configuration dimensions. Each involves tradeoffs.

Model Selection: Reasoning vs. Non-Reasoning

Reasoning models (o1, o3, Claude with extended thinking) produce more thorough evaluations but cost more and take longer. For simple binary judgments, a standard model with chain-of-thought prompting is usually sufficient. Reserve reasoning models for evaluations that require multi-step analysis or complex domain reasoning.

Rule of thumb: if your evaluation criteria can be stated in a single sentence, you don't need a reasoning model. If evaluation requires weighing multiple factors and considering edge cases, reasoning models may help.

Scope: Where to Apply the Judge

| Level | Use Case |
|---|---|
| Session | Evaluate entire multi-turn conversations, end-to-end user journeys |
| Trace | Evaluate single interactions, individual request-response pairs |
| LLM Span | Evaluate individual model calls within a larger workflow |
| Retriever Span | Evaluate RAG retrieval quality before generation |
| Tool Span | Evaluate tool call correctness in agent systems |

Start with trace-level evaluation for most use cases. Move to span-level when you need to diagnose which component of a complex system is failing.

Reasoning: Chain-of-Thought vs. Direct

Chain-of-thought (CoT) prompting asks the judge to explain its reasoning before the verdict. This improves accuracy and makes evaluations explainable. The cost: more tokens and higher latency. Use CoT for development; consider direct evaluation in production if you're confident in accuracy.

Output Type: Choosing the Right Scale

| Output Type | When to Use |
|---|---|
| Boolean | Clear pass/fail criteria, compliance checks, binary requirements |
| Categorical | Multiple distinct outcomes (e.g., "Correct," "Partially Correct," "Incorrect") |
| Discrete (0-5) | Need granularity but want bounded scale |
| Percentage (0.0-1.0) | Confidence scores, partial credit |

Boolean is almost always the right starting point. If you find yourself wanting more granularity, that's often a sign you haven't clearly defined your criteria. Two judges with clear binary criteria are better than one judge with a vague 5-point scale.

Precision vs. Recall Tradeoff

For professional AI engineers, "accuracy" is just the starting point.

| Scenario | Prioritize | Why |
|---|---|---|
| Safety/compliance | Recall | Can't afford to miss real violations |
| Cost optimization | Precision | Don't want false alarms triggering expensive reviews |
| General quality | F1 | Balance between missing issues and over-flagging |
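Computing these metrics from judge verdicts against a human-labeled validation set is a few lines; a sketch where `True` means "flagged as a violation" (function name is illustrative):

```python
def judge_metrics(judge_flags: list[bool], human_flags: list[bool]) -> dict:
    """Precision, recall, and F1 for a judge vs. human ground truth."""
    tp = sum(j and h for j, h in zip(judge_flags, human_flags))
    fp = sum(j and not h for j, h in zip(judge_flags, human_flags))
    fn = sum(not j and h for j, h in zip(judge_flags, human_flags))
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged and correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # real violations caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For a safety judge you'd tune the prompt until recall is high and accept the precision cost; for a cost-gating judge, the reverse.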

Creating Custom Metrics

One barrier to good evaluation is the perception that custom metrics are hard to build. They're not. Modern evaluation frameworks make it trivial to go from a prompt idea to a running metric.

Here's the complete API for creating a custom LLM metric with Galileo:

from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType

metric = create_custom_llm_metric(
    name="Financial Compliance Check",
    user_prompt="""You are an impartial evaluator assessing whether AI responses 
    provide information without giving investment advice.
    
    A response PASSES if it contains only factual information and does NOT 
    recommend buying, selling, or holding any specific investment.
    
    A response FAILS if it recommends specific investments or timing.
    
    Response to evaluate: {output}
    
    First explain your reasoning, then provide your verdict (PASS/FAIL).""",
    node_level=StepType.llm,
    output_type=OutputTypeEnum.BOOLEAN,
    model_name="gpt-4.1-mini",
    cot_enabled=True,
    num_judges=3
)

The key parameters:

| Parameter | What It Does | Recommended |
|---|---|---|
| name | Human-readable metric name | Be descriptive: "Financial Compliance - No Recommendations" |
| user_prompt | The evaluation prompt | Include criteria, examples, and output format |
| output_type | Type of output | BOOLEAN for most use cases |
| model_name | Which model judges | gpt-4.1-mini balances cost and quality |
| cot_enabled | Chain-of-thought reasoning | True for explainability |
| num_judges | How many judges to poll | 3 minimum for variance reduction |

That's it. No infrastructure to set up. No models to fine-tune. No complex configuration.

Implementation Challenges

Building LLM judges that work in demos is easy. Building ones that work in production is harder.

[callout title="Rate Limits at Scale"]
With multi-judge polling, you hit rate limits 3x faster. Sporadic 429 errors, hanging evaluation jobs, and incomplete results are the symptoms.

Implement exponential backoff with jitter. Use separate API keys for evaluation vs. production traffic. Batch evaluations and spread them across time windows. Cache verdicts for identical inputs using content hashes. Most importantly, set timeouts and handle partial failures gracefully rather than failing entire batches.
[/callout]
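A minimal sketch of the backoff-plus-caching pattern described above, with a hypothetical `call_judge` wrapper standing in for your judge API (the retry counts and delays are illustrative):

```python
import hashlib
import random
import time

_verdict_cache: dict[str, str] = {}

def evaluate_with_retry(call_judge, content: str, max_retries: int = 5) -> str:
    """Cache verdicts by content hash; back off exponentially on failures.

    `call_judge(content)` is a hypothetical function wrapping your
    judge API; it is expected to raise on rate-limit (429) errors.
    """
    key = hashlib.sha256(content.encode()).hexdigest()
    if key in _verdict_cache:
        return _verdict_cache[key]  # identical input: skip the API call
    for attempt in range(max_retries):
        try:
            verdict = call_judge(content)
            _verdict_cache[key] = verdict
            return verdict
        except Exception:
            if attempt == max_retries - 1:
                raise  # surface the failure rather than hang the batch
            # exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("unreachable")
```

The jitter matters: without it, a batch of parallel workers that hit the limit together will retry together and hit it again.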

[callout title="Long Context Overflow"]
LLM judges have context limits, and evaluation prompts get long fast. Your system prompt, criteria, few-shot examples, and the content being evaluated can exceed the context limits of the model. When this happens, you'll see truncated evaluations that miss information, random failures on longer responses, and inconsistent verdicts on similar content.

The fix: calculate your token budget upfront and reserve space for prompt, examples, and reasoning output. For long responses, summarize or chunk before evaluation. For multi-turn conversations, evaluate recent turns only or compress history. Consider using larger context models for evaluation even if your production model is smaller.
[/callout]
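A rough pre-flight budget check along these lines can catch overflows before they become silent truncations. The 4-characters-per-token heuristic and the limits shown are illustrative; use your model's actual tokenizer and context size in practice:

```python
def within_token_budget(system_prompt: str, examples: str, content: str,
                        context_limit: int = 128_000,
                        reserved_for_reasoning: int = 2_000) -> bool:
    """Estimate whether an evaluation request fits the judge's context.

    Uses a rough ~4 characters-per-token heuristic and reserves room
    for the judge's chain-of-thought output.
    """
    estimated = (len(system_prompt) + len(examples) + len(content)) // 4
    return estimated + reserved_for_reasoning <= context_limit
```

Requests that fail the check get routed to the fallback path (summarize, chunk, or switch to a longer-context judge) instead of being sent as-is.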

[callout title="Pipeline Integration"]
Evaluation pipelines have different requirements than inference pipelines. The cardinal rule: never run evaluation in the request path. Evaluations should happen asynchronously via message queues, completely decoupled from production latency. Guardrails are the deliberate exception: they do sit in the request path so they can block unexpected inputs or outputs. We’ll discuss strategies for translating optimized evals into guardrails in Chapter 5.
[/callout]

[callout title="Cost Management"]
LLM evaluation costs compound fast with multi-judge polling and chain-of-thought reasoning. Teams routinely find evaluation costs exceeding production inference costs.

Use cheaper models for obvious cases and reserve expensive models for edge cases and disagreements. Set hard budget limits with alerts before you get surprised. Track cost per evaluation type and optimize the expensive ones first.

Train fully optimized eval prompts into domain-specific small language models to completely solve the cost management challenge (more on this in Chapter 4).
[/callout]

[callout title="Version Drift"]
Your judge prompts, models, and criteria evolve. Without discipline, historical metrics become incomparable. Version your judge config and prompt in git, not through ad-hoc edits. Log prompt version with every verdict. When changing criteria, run old and new judges in parallel before switching. Maintain a changelog that documents what changes were made and why.
[/callout]

The Takeaway

LLM-as-Judge is Stage 1 of the Eval Engineering lifecycle. It gets you from no evaluation to 60-70% accuracy quickly. The techniques in this chapter, including custom metrics, proper prompt engineering, multi-judge polling, and bias mitigation, can increase accuracy to 80%, which is not yet production-ready for most use cases. 

To reach 90%+ accuracy, you need human expertise in the loop. The next stage, SME Refinement, is where generic becomes domain-specific. Chapter 3 shows how to bring subject matter experts into the loop.

Frequently Asked Questions

[qa]
Q: Which model should I use for my judge?
A: Use the strongest model you can afford within your latency and cost constraints. For most use cases, GPT-class models or Claude Sonnet provide a good balance. Consider using a fine-tuned smaller model if you need better accuracy at lower cost.
Q: How many examples do I need to validate my judge?
A: Start with 50–100 human-labeled examples covering your expected distribution of cases. Include both easy cases and edge cases. Calculate agreement rate. Above 85% means you have a useful judge. Below 75% means something needs to change.
Q: Should I fine-tune my judge?
A: Fine-tuning adds complexity and training overhead. Most accuracy gains come from better prompts and few-shot examples. Consider fine-tuning only after you’ve exhausted prompt improvements, or when evaluation volume makes inference costs prohibitive.
Q: How do I handle cases where humans disagree?
A: If humans cannot agree on a verdict, expecting an LLM to agree is unrealistic. For subjective evaluations, accept that disagreement is inevitable. Measure human-human agreement first. Your LLM judge should match that level, not exceed it.
Q: What about using LLM judges for safety evaluation?
A: Safety evaluation requires specialized approaches. General-purpose LLM judges often miss adversarial inputs and jailbreaks. For safety-critical applications, combine dedicated safety models with rule-based guardrails and human review.
[/qa]
