    Chapter 02 · Mar 2, 2026

    Evals with LLM-as-Judge

    Pratik Bhavsar

    Evals & Leaderboards @ Galileo Labs

    Using an LLM to evaluate your AI's outputs is deceptively simple to set up, and surprisingly hard to get right. This chapter shows you how to do it well.

    ~20 min read
    10 sections

    Chapter 1 introduced the Eval Engineering lifecycle. Stage 1 is LLM-as-Judge: using a large language model to evaluate your AI's outputs. This chapter goes deep on how to do it well.

    The approach is deceptively simple. Write a prompt that defines what "good" means. Feed it your AI's outputs. Get judgments at scale. In theory, you've automated evaluation. In practice, most teams plateau at <70% accuracy and wonder what went wrong.

    The gap between generic LLM judges and production-grade evaluation is a matter of understanding what judges actually do well, what they systematically fail at, and how to configure them for your specific domain. This chapter gives you that understanding.

    Two Sides of LLM Evaluation

    Generic judges get you started. Custom judges get you to production.

    What Goes In: Generic Judge (64–68% accuracy ceiling) → What Comes Out: Custom Judge (production-grade accuracy)

    Foundation

    Why LLM Judges Work

    LLM judges work because LLMs have internalized enormous amounts of human judgment. They've seen millions of examples of good writing, clear explanations, accurate summaries, and helpful responses. When you ask GPT or Claude whether a response is helpful, you're leveraging that internalized sense of quality.

    This is why LLM judges can achieve 80%+ agreement with human evaluators on general tasks. Research from the MT-Bench paper showed that strong LLM judges match crowdsourced human preferences at roughly the same level that humans agree with each other. For many evaluation tasks, an LLM judge is as reliable as hiring an annotator.

    Key Takeaways

    • LLM judges scale infinitely
    • They don't get tired, don't have bad days, and don't develop annotation fatigue after reviewing their 500th example
    • They provide consistent application of criteria across thousands of evaluations
    • They're available 24/7 and can process evaluations in seconds rather than days
    • And critically, they're explainable: you can ask them to show their reasoning

    There's another advantage that gets less attention: LLM judges improve automatically as models improve. The judge you build today will get more accurate when the next generation of models arrives. You don't need to retrain anything. Just point your evaluation pipeline at a better model and accuracy improves. This is fundamentally different from traditional ML classifiers, which require new training data and retraining cycles to improve.

    Key Idea

    Your eval infrastructure appreciates over time. Every model upgrade makes your judges smarter for free.

    But here's the catch. LLM judges don't know what "good" means for your specific use case, your domain, your users, or your definition of success. And that's where most implementations fail.

    The Problem

    The Issue with Generic LLM Judges

    Ask GPT-5 to evaluate whether a customer service response is "helpful," and it will give you a reasonable answer. Ask it to evaluate whether that same response follows your company's specific escalation policy, uses approved terminology, maintains the right tone for your brand, and addresses the customer's underlying concern rather than just their stated question, and it will guess.

    Generic judges plateau because they're optimizing for generic quality. They catch obvious failures: responses that are clearly wrong, incoherent, or off-topic. They miss subtle failures: responses that are technically correct but violate domain-specific requirements, follow the letter of instructions but not the spirit, or satisfy general criteria while failing specific ones.

    Research from UC Berkeley quantified this problem. Off-the-shelf LLM-as-judge setups achieve 64-68% accuracy on domain-specific evaluation tasks. That's the ceiling of the generic approach. Not because the models are incapable of better judgment, but because they lack the domain context to make better judgments.

    We know that "good" means different things in different contexts: in legal document review (cites precedents, identifies risks, includes disclaimers), in customer support (acknowledges frustration, provides next steps, stays within policy), in coding assistants (working code, handles edge cases, follows style guide). A generic judge can't evaluate any of these well because it doesn't know the specific criteria that matter.

    Getting Started

    Building Your First LLM Judge

    Start with a single, specific failure mode. Not five. Not ten. One. This constraint forces clarity about what you're actually trying to measure.

    But before you write your judge prompt, let me show you what separates a basic prompt from one that actually works.

    Basic vs. Refined Eval Prompts

    Basic Prompt

    Evaluate whether this response is helpful and accurate.
    
    Response: {response}
    
    Is this response good? Answer yes or no

    This prompt will achieve roughly 60% agreement with human judgment. Here's why it fails:

    Key Takeaways

    • "Helpful" and "accurate" are vague and conflated
    • No context about the domain or use case
    • No examples of what pass/fail looks like
    • No reasoning required before verdict

    Refined Prompt (75–80% accuracy)

    You are an impartial evaluator assessing whether AI
    responses to financial questions provide accurate
    information without giving investment advice.
    
    CRITERIA:
    A response PASSES if it:
    1. Contains only factual, verifiable information about
       financial instruments or concepts
    2. Does NOT recommend buying, selling, or holding any
       specific investment
    3. Does NOT suggest allocation percentages or timing for trades
    4. Includes appropriate uncertainty when discussing
       future performance
    
    A response FAILS if it violates ANY of the above.
    
    EXAMPLES:
    
    Example 1 (FAIL):
    User: "Should I invest in tech stocks?"
    Response: "Based on current market trends, I'd recommend
    allocating 60% to tech ETFs like QQQ."
    Reasoning: Explicitly recommends a specific allocation
    and names a specific fund. This is investment advice.
    Verdict: FAIL
    
    Example 2 (PASS):
    User: "Should I invest in tech stocks?"
    Response: "Tech stocks have historically shown higher
    volatility than the broader market. The NASDAQ-100 has
    had average annual returns of about 10% over the past
    20 years, though past performance doesn't guarantee
    future results."
    Reasoning: Provides historical facts without recommending
    any action. Includes appropriate uncertainty language.
    Verdict: PASS
    
    Example 3 (FAIL):
    User: "What do you think about Apple stock?"
    Response: "Apple has strong fundamentals. Now would be a
    good time to buy while it's down."
    Reasoning: "Good time to buy" constitutes timing advice.
    Verdict: FAIL
    
    NOW EVALUATE:
    Response: {response}
    
    First explain your reasoning, then provide your verdict (PASS/FAIL)

    | Element | Basic | Refined |
    |---|---|---|
    | Role definition | None | Clear evaluator role with domain context |
    | Criteria | Vague ("helpful") | Explicit, numbered criteria |
    | Examples | None | Multiple few-shot examples with reasoning |
    | Edge cases | Ignored | Addressed in few-shot examples |
    | Output format | Ambiguous | Reasoning first, then binary verdict |
    | Domain context | None | Specific (financial compliance) |

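    The "reasoning first, then binary verdict" output format only pays off if your pipeline reads the verdict reliably. Below is a minimal sketch, assuming the judge returns free text that ends with PASS or FAIL. The names `JUDGE_TEMPLATE`, `build_prompt`, and `parse_verdict` are hypothetical helpers for illustration, not part of any library.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical template: a trimmed-down version of the refined prompt layout.
JUDGE_TEMPLATE = """You are an impartial evaluator.

CRITERIA:
A response PASSES if it:
{criteria}

A response FAILS if it violates ANY of the above.

NOW EVALUATE:
Response: {response}

First explain your reasoning, then provide your verdict (PASS/FAIL)"""


def build_prompt(criteria: List[str], response: str) -> str:
    """Number the criteria explicitly and insert the response to evaluate."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(criteria, start=1))
    return JUDGE_TEMPLATE.format(criteria=numbered, response=response)


def parse_verdict(judge_output: str) -> Tuple[str, Optional[bool]]:
    """Split judge output into (reasoning, passed).

    The LAST standalone PASS/FAIL token is treated as the verdict, because
    the reasoning comes first and may itself mention either word.
    passed is None when no verdict token is found.
    """
    matches = list(re.finditer(r"\b(PASS|FAIL)\b", judge_output))
    if not matches:
        return judge_output.strip(), None
    last = matches[-1]
    reasoning = judge_output[: last.start()].strip()
    # Drop a trailing "Verdict:" label so only the explanation remains.
    reasoning = re.sub(r"verdict\s*:?\s*$", "", reasoning, flags=re.IGNORECASE)
    return reasoning.strip(), last.group(1) == "PASS"
```

    Returning `None` for an unparseable output, rather than defaulting to PASS or FAIL, keeps malformed judge responses visible instead of silently counting them.
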
    Technique

    Prompt Engineering for Judges

    The quality of your judge depends almost entirely on the quality of your prompt. This isn't hyperbole. The same underlying model can achieve accuracy rates ranging from 60% to 95%, depending on the prompt used.

    The principles below are based on hard-won lessons in collaborative annotation. When humans disagree on labels, it's usually because the rubric is ambiguous. The same applies to LLM judges: unclear prompts produce inconsistent judgments.

    Specific Criteria

    • ✕ VAGUE: "Evaluate if this response is helpful"
    • ✓ SPECIFIC: "Answers the question directly, avoids personal opinions, uses clear language, avoids assumptions about expertise."

    Clear Terms

    • ✕ AMBIGUOUS: "Uses appropriate tone"
    • ✓ TESTABLE: "Uses formal language, avoids slang, addresses user by name"

    Binary Output

    • ✕ SCALE 1–5: "What's the difference between 3 and 4?"
    • ✓ PASS / FAIL: Unambiguous. No interpretation needed.

    Few-Shot Examples

    • ⚠ CRITICAL: Include examples for cases that caused disagreement. If your team argued about it, the judge needs it.
    • Your examples are your specifications.

    Explicit Decision Rules

    • ✓ EDGE CASE RULES:
      "If no citation → FAIL"
      "If partial answer without ask → FAIL"
      "If correct but wrong tone → FAIL"

    Split Compound Criteria

    • ✕ COMPOUND: "Evaluate if helpful AND accurate"
    • ✓ SEPARATE: Judge 1: Helpful? → Judge 2: Accurate?
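    The "Split Compound Criteria" principle can be sketched as one judge per criterion, with the overall verdict derived from the individual results. This is a minimal illustration under stated assumptions: each judge is modelled as a plain callable returning True for PASS, and the two keyword-based stubs stand in for real LLM calls. All names here are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface: one criterion per judge, True means PASS.
Judge = Callable[[str], bool]


def cites_source(response: str) -> bool:
    """Stub for an LLM judge enforcing the rule 'If no citation -> FAIL'."""
    return "[source:" in response


def uses_formal_language(response: str) -> bool:
    """Stub for an LLM judge enforcing 'uses formal language, avoids slang'."""
    slang = {"gonna", "wanna", "kinda"}
    return not any(word in slang for word in response.lower().split())


def evaluate_separately(response: str, judges: Dict[str, Judge]) -> Dict[str, bool]:
    """Run every single-criterion judge and keep the per-criterion results."""
    return {name: judge(response) for name, judge in judges.items()}


def failed_criteria(results: Dict[str, bool]) -> List[str]:
    """Names of the criteria that failed, in the order they were evaluated."""
    return [name for name, passed in results.items() if not passed]


JUDGES: Dict[str, Judge] = {
    "citation": cites_source,
    "formal_tone": uses_formal_language,
}

results = evaluate_separately("We're gonna refund you [source: policy 4.2]", JUDGES)
overall_pass = all(results.values())  # the response passes only if every criterion passes
```

    Keeping the per-criterion results, instead of a single merged verdict, tells you which requirement failed and lets you measure each judge against human labels on its own.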

    Architecture

    Multi-Judge Polling: Panels of LLM Evaluators

    [Figure: ChainPoll Algorithm. The user's query/prompt and the LLM's completion are combined with a CoT prompt; N = 5 parallelised reasoning chains produce detailed CoT outputs from the LLM; a scorer turns them into a final score in (0, 1) plus an explanation.]

    "One judge is an opinion. Three judges agreeing is evidence."

    A single judge call has inherent variance. Ask the same model the same question twice, and you might get different answers. At production scale, this variance becomes noise that obscures real signal.

    Why Single Judges Fail

    Most LLM-as-Judge systems rely on a single strong evaluator, often GPT-5. This has fundamental problems:

    • Intra-model bias. Judge models tend to recognize and favor outputs stylistically similar to their own generations. This self-preference effect inflates scores for same-family models.
    • High variance. Small changes in prompt wording or formatting produce large swings in evaluation outcomes. A judge that was accurate yesterday might be inconsistent today.
    • Cost at scale. Using a frontier model for every evaluation is expensive and slow.

    Aggregation Strategies

    | Evaluation Type | Aggregation | Example |
    |---|---|---|
    | Binary (pass/fail) | Majority vote | 3/5 judges say FAIL → FAIL |
    | Binary with recall priority | Max pooling | Any judge says FAIL → FAIL |
    | Ordinal (1–5 scale) | Mean | Average of judge scores |
    | Pairwise preference | Majority vote | 3/5 prefer A → A wins |
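
    The aggregation strategies in this table are small functions once the individual verdicts are collected. The sketch below assumes verdicts arrive as plain strings and scores as numbers. The function names are hypothetical, and which label max pooling should catch is left as a parameter because it depends on what your metric treats as the positive class.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def majority_vote(verdicts: Sequence[str]) -> str:
    """Most common label wins. Use an odd number of judges to avoid ties."""
    return Counter(verdicts).most_common(1)[0][0]


def max_pool(verdicts: Sequence[str], flagged: str, default: str) -> str:
    """Return `flagged` if ANY judge produced it, otherwise `default`.

    This favours recall on the flagged label: a single judge raising it
    is enough to decide the final verdict.
    """
    return flagged if flagged in verdicts else default


def mean_score(scores: Sequence[float]) -> float:
    """Average of ordinal judge scores, e.g. on a 1-5 scale."""
    return mean(scores)


binary = majority_vote(["FAIL", "PASS", "FAIL", "FAIL", "PASS"])       # 3/5 say FAIL
strict = max_pool(["PASS", "PASS", "FAIL"], flagged="FAIL", default="PASS")
ordinal = mean_score([4, 5, 3])
pairwise = majority_vote(["A", "A", "B", "A", "B"])                     # 3/5 prefer A
```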

    Practical Configurations

    • 3 judges. The minimum for meaningful aggregation. Majority vote determines the final verdict. Cost increases 3× but variance drops significantly.
    • 5 judges. Provides more granularity. You can distinguish "5/5 agree it fails" from "3/5 agree it fails," which maps to confidence levels.
    • Mixed model families. Reduce systematic bias. If you're worried that GPT has blind spots, use a panel of different models (GPT, Claude, Gemini) and aggregate their verdicts.

    The ChainPoll Approach

    The ChainPoll approach, developed for hallucination detection, demonstrated this concretely. It combines two techniques to achieve high-accuracy hallucination detection:

    First, it uses a carefully engineered chain-of-thought prompt that asks the LLM judge to write out step-by-step reasoning before rendering a verdict. The prompt is designed to elicit systematic, detailed explanations rather than quick judgments.

    Second, it polls the model multiple times (typically 3 or 5) and aggregates the binary yes/no verdicts into a confidence score. If two out of three polls say "hallucinated," the score is about 0.67. This aggregation captures uncertainty that a single judgment would miss.

    Three design choices make ChainPoll particularly effective: it requests boolean judgments rather than numeric scores (which proved more reliable in testing), it places the reasoning before the verdict (so the answer can leverage the explanation), and it uses a smaller, faster model, making up for the accuracy gap through multiple polls.
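
    The two ChainPoll steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not Galileo's implementation: the prompt wording is invented for the example, the judge is modelled as a callable that returns a (reasoning, verdict) pair, and `make_stub_judge` replaces a real LLM call so the example runs offline.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical judge interface: takes the CoT prompt, returns
# (reasoning_text, verdict), where verdict True means "hallucinated".
JudgeCall = Callable[[str], Tuple[str, bool]]

# Illustrative wording only; not the actual ChainPoll prompt.
COT_PROMPT = (
    "Does the completion below contain hallucinations?\n"
    "Think step by step and write out your reasoning first, "
    "then answer yes or no.\n\n"
    "Prompt: {prompt}\nCompletion: {completion}"
)


def chainpoll_score(
    prompt: str, completion: str, judge: JudgeCall, n_polls: int = 5
) -> Tuple[float, List[str]]:
    """Poll the judge n_polls times and return (score, explanations).

    The score is the fraction of polls whose boolean verdict was True.
    """
    cot_prompt = COT_PROMPT.format(prompt=prompt, completion=completion)
    results = [judge(cot_prompt) for _ in range(n_polls)]
    score = sum(verdict for _, verdict in results) / n_polls
    explanations = [reasoning for reasoning, _ in results]
    return score, explanations


def make_stub_judge(verdicts: Iterable[bool]) -> JudgeCall:
    """Stand-in for an LLM call: replays a fixed sequence of verdicts."""
    replay = iter(verdicts)

    def judge(cot_prompt: str) -> Tuple[str, bool]:
        verdict = next(replay)
        return f"stub reasoning, answer: {'yes' if verdict else 'no'}", verdict

    return judge


score, explanations = chainpoll_score(
    "Who wrote Hamlet?",
    "Hamlet was written by Christopher Marlowe.",
    make_stub_judge([True, True, False]),
    n_polls=3,
)
```

    In a real pipeline the polls would run in parallel against a sampling temperature above zero, since identical deterministic calls would add cost without adding information.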

    Risks

    The Biases You Must Mitigate

    LLM judges have systematic biases. If you don't address them, your evaluations will be systematically wrong in predictable ways. These need to be mitigated with different strategies.

    • Positional Bias
    • Verbosity Bias
    • Self-Preference
    • Recency Bias
    • Format Bias
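
    As one illustration, a commonly used mitigation for positional bias in pairwise comparisons is to ask the judge twice with the candidate order swapped and accept a preference only when both orderings agree. The sketch below assumes a hypothetical pairwise judge that answers "first" or "second"; the two stub judges replace real LLM calls.

```python
from typing import Callable

# Hypothetical pairwise judge: sees two responses in order and
# returns "first" or "second" for the one it prefers.
PairwiseJudge = Callable[[str, str], str]


def debiased_preference(a: str, b: str, judge: PairwiseJudge) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    a_wins_forward = judge(a, b) == "first"     # a is shown first
    a_wins_backward = judge(b, a) == "second"   # a is shown second
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the preference flipped with position, so it is not trusted


def position_biased_judge(first: str, second: str) -> str:
    """Stub that always prefers whatever is shown first."""
    return "first"


def content_judge(first: str, second: str) -> str:
    """Stub that prefers the response mentioning next steps, in any position."""
    return "first" if "next steps" in first else "second"


biased_result = debiased_preference("answer one", "answer two", position_biased_judge)
content_result = debiased_preference(
    "Here are your next steps: ...", "Sorry about that.", content_judge
)
```

    The cost is two judge calls per comparison, and the rate of "tie" outcomes is itself a rough measure of how position-sensitive a judge is.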

    Configuration

    Judge Configurations

    LLM judges have multiple configuration dimensions. Each involves tradeoffs.

    Model Selection: Reasoning vs. Non-Reasoning

    Reasoning models (o1, o3, Claude with extended thinking) produce more thorough evaluations but cost more and take longer. For simple binary judgments, a standard model with chain-of-thought prompting is usually sufficient. Reserve reasoning models for evaluations that require multi-step analysis or complex domain reasoning.

    Rule of thumb: if your evaluation criteria can be stated in a single sentence, you don't need a reasoning model. If evaluation requires weighing multiple factors and considering edge cases, reasoning models may help.

    Scope: Where to Apply the Judge

    | Level | Use Case |
    |---|---|
    | Session | Evaluate entire multi-turn conversations, end-to-end user journeys |
    | Trace | Evaluate single interactions, individual request-response pairs |
    | LLM Span | Evaluate individual model calls within a larger workflow |
    | Retriever Span | Evaluate RAG retrieval quality before generation |
    | Tool Span | Evaluate tool call correctness in agent systems |

    Start with trace-level evaluation for most use cases. Move to span-level when you need to diagnose which component of a complex system is failing.

    Output Type: Choosing the Right Scale

    | Output Type | When to Use |
    |---|---|
    | Boolean | Clear pass/fail criteria, compliance checks, binary requirements |
    | Categorical | Multiple distinct outcomes (Correct / Partially Correct / Incorrect) |
    | Discrete (0–5) | Need granularity but want bounded scale |
    | Percentage (0.0–1.0) | Confidence scores, partial credit |

    Boolean is almost always the right starting point. If you find yourself wanting more granularity, that's often a sign you haven't clearly defined your criteria. Two judges with clear binary criteria are better than one judge with a vague 5-point scale.

    Reasoning: Chain-of-Thought vs. Direct

    Chain-of-thought (CoT) prompting asks the judge to explain its reasoning before the verdict. This improves accuracy and makes evaluations explainable. The cost: more tokens and higher latency.

    Rule of Thumb

    Use CoT for development; consider direct evaluation in production if you're confident in accuracy.

    Precision vs. Recall Tradeoff

    For professional AI engineers, "accuracy" is just the starting point.

    | Scenario | Prioritize | Why |
    |---|---|---|
    | Safety / compliance | Recall | Can't afford to miss real violations |
    | Cost optimization | Precision | Don't want false alarms triggering expensive reviews |
    | General quality | F1 | Balance between missing issues and over-flagging |

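    To see which side of this tradeoff a judge sits on, compare its verdicts with human labels on the same examples. The sketch below is a minimal version that assumes True marks a flagged violation (the positive class); `judge_metrics` is a hypothetical helper name and the labels are made-up illustration data.

```python
from typing import Dict, Sequence


def judge_metrics(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Score judge verdicts against human labels. True = violation flagged."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # real violations caught
    fp = sum(1 for h, j in pairs if not h and j)    # false alarms
    fn = sum(1 for h, j in pairs if h and not j)    # missed violations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for h, j in pairs if h == j) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight responses: three real violations, five clean.
human_labels = [True, True, True, False, False, False, False, False]
judge_labels = [True, True, False, True, False, False, False, False]
metrics = judge_metrics(human_labels, judge_labels)
```

    Here the judge agrees with humans on 6 of 8 examples (75% accuracy) yet misses one of three real violations, which is the kind of gap a single accuracy number hides.
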
    Implementation

    Creating Custom Metrics

    One barrier to good evaluation is the perception that custom metrics are hard to build. They're not. Modern evaluation frameworks make it trivial to go from a prompt idea to a running metric.

    Galileo Custom Metric API

    from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType
    
    metric = create_custom_llm_metric(
        name="Financial Compliance Check",
        user_prompt="""You are an impartial evaluator assessing whether AI responses
        provide information without giving investment advice.
    
        A response PASSES if it contains only factual information and does NOT
        recommend buying, selling, or holding any specific investment.
    
        A response FAILS if it recommends specific investments or timing.
    
        Response to evaluate: {output}
    
        First explain your reasoning, then provide your verdict (PASS/FAIL).""",
        node_level=StepType.llm,
        output_type=OutputTypeEnum.BOOLEAN,
        model_name="gpt-4.1-mini",
        cot_enabled=True,
        num_judges=3
    )

    | Parameter | What It Does | Recommended |
    |---|---|---|
    | name | Human-readable metric name | Be descriptive: "Financial Compliance - No Recommendations" |
    | user_prompt | The evaluation prompt | Include criteria, examples, and output format |
    | output_type | Type of output | BOOLEAN for most use cases |
    | model_name | Which model judges | gpt-4.1-mini balances cost and quality |
    | cot_enabled | Chain-of-thought reasoning | True for explainability |
    | num_judges | How many judges to poll | 3 minimum for variance reduction |

    That's it. No infrastructure to set up. No models to fine-tune. No complex configuration.

    Challenges

    Implementation Challenges

    Building LLM judges that work in demos is easy. Building ones that work in production is harder.

    Takeaway

    The Takeaway

    LLM-as-Judge is Stage 1 of the Eval Engineering lifecycle. It gets you from no evaluation to 60–70% accuracy quickly. The techniques in this chapter, including custom metrics, proper prompt engineering, multi-judge polling, and bias mitigation, can increase accuracy to 80%, which is not yet production-ready for most use cases.

    To reach 90%+ accuracy, you need human expertise in the loop. The next stage, SME Refinement, is where generic becomes domain-specific. Chapter 3 shows how to bring subject matter experts into the loop.

    "60% to 80% is engineering. 80% to 95% is domain expertise."

    Frequently Asked Questions

    Use the strongest model you can afford within your latency and cost constraints. For most use cases, GPT-class models or Claude Sonnet provide a good balance. Consider using a fine-tuned smaller model for better accuracy at a lower cost.

    Safety evaluation requires specialized approaches. General-purpose LLM judges can miss adversarial inputs and jailbreaks. For safety-critical applications, combine dedicated models with rule-based guardrails and human review.
