    Chapter 01 · Mar 6, 2026

    What is Eval Engineering?

    Pratik Bhavsar

    Evals & Leaderboards @ Galileo Labs

    Building AI has never been easier. Deploying it reliably has never been harder. Eval Engineering is the discipline that closes that gap.

    Case Study

    In November 2022, a man named Jake Moffatt visited Air Canada's website to book a flight to his grandmother's funeral. He asked the airline's chatbot about bereavement fares. The chatbot told him that he could book a full-price ticket now and apply for a bereavement discount within 90 days. So he did.

    The chatbot was wrong. Air Canada's actual policy was the opposite: bereavement fares had to be requested before travel, not after. When Moffatt applied for his refund, the airline refused. When he pushed back with screenshots, an Air Canada representative admitted the chatbot had used "misleading words" but shrugged it off. The correct policy existed elsewhere on the website. He should have checked.

    The case went to a tribunal. Air Canada's defense was, in the tribunal's own language, "a remarkable submission." The airline argued that its chatbot was essentially a separate legal entity, responsible for its own actions. The tribunal didn't buy it. The ruling was straightforward: a company is responsible for all information on its website, whether it comes from a static page or a chatbot. Air Canada owed Moffatt $812.

    The $812 lesson

    This wasn't a catastrophic financial loss, but that's precisely what made it so damaging. The story went global, covered by the Washington Post, the BBC, CBS, and every major tech publication, because it crystallized a fear that every executive deploying AI already felt in their gut: what happens when the system says something wrong, and nobody catches it until it's too late?

    Air Canada had the right policy. They had it written down, reviewed by lawyers, posted on their website. What they didn't have was a system to ensure their AI actually reflected that policy. No mechanism to catch the hallucination before a grieving customer acted on it. No guardrail between generation and delivery.

    Modern LLMs and agent frameworks have compressed development from months to days. Spinning up a RAG system takes an afternoon. Getting an agent to call tools takes a few hours of prompt engineering. The building part, for the most part, is solved.

    And yet most AI projects never make it to production.

    The teams that are shipping production AI are using the same LLMs as everyone else. The difference is what happens after generation. The evaluation layer. The part the industry treats as an afterthought is actually the whole game.

    Fundamentals

    What Are Evals?

    Every AI project follows the same arc. In the beginning, evals feel like overhead. The team is moving fast, testing by hand, gathering hallway feedback, iterating on intuition. It works. A surprising amount of progress can happen this way.

    Then the product goes live.

    Users scale up. Someone makes a prompt change to improve response quality, and three customers report that the agent "feels worse." Another change fixes hallucinations but breaks formatting. The team debates whether the new version is actually better or just different. Nobody knows for sure. There's no shared definition of "good," no baseline to compare against, no way to tell signal from noise. The team is flying blind.

    This is the moment where organizations stall because they lack certainty. And here's what most technical teams underestimate: the biggest obstacle to shipping AI isn't engineering. It's organizational sign-off.

    No VP of Product is going to greenlight a production deployment they can't explain. No legal team will approve a customer-facing agent without understanding what could go wrong and how often. No CFO will fund the next phase of an AI initiative that can't quantify its own reliability. Organizations don't kill AI projects because the technology failed. They kill them because nobody could prove it worked.

    Evals are the mechanism that creates that proof.

    Evals let a product team say, with evidence, "this works in 97% of cases, here are the 3% where it doesn't, and here's what happens when it fails." They are the system that turns an executive's anxiety into a quantified risk they can actually manage.

    Without that system, the pattern is predictable. Wait for complaints. Try to reproduce the issue. Fix the bug. Hope nothing else broke. You can't distinguish real regressions from noise. You can't test changes against hundreds of scenarios before shipping. You can't measure whether you're actually improving.
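
    To make that kind of evidence concrete, here is a minimal sketch of an eval suite that reports a pass rate alongside the specific failing cases. Everything in it is hypothetical and for illustration only: `EvalCase`, `run_suite`, the `toy_agent` stand-in, and the string-matching checks are assumptions, not part of any particular framework. A real suite would call the deployed agent and use richer checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One scenario plus a check that returns True when the output is acceptable."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent; report the pass rate and failing case names."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    failures = [case.name for case in cases if not case.check(agent(case.prompt))]
    return {
        "pass_rate": (len(cases) - len(failures)) / len(cases),
        "failures": failures,
    }


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    if "bereavement" in prompt.lower():
        return "Bereavement fares must be requested before travel."
    return "I'm not sure."


# Hypothetical cases; each check encodes what "good" means for that scenario.
cases = [
    EvalCase("bereavement_policy", "Can I claim a bereavement fare after my trip?",
             lambda out: "before travel" in out),
    EvalCase("out_of_scope", "What's the weather in Toronto?",
             lambda out: "not sure" in out.lower()),
    EvalCase("refund_window", "How many days do I have to request a refund?",
             lambda out: "days" in out),
]

report = run_suite(toy_agent, cases)
print(f"pass rate: {report['pass_rate']:.0%}, failing: {report['failures']}")
```

    The report carries both halves of the sentence a product team needs: the headline number and the named cases behind the failures.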

    Key Idea

    You can't improve what you can't measure. You can't measure what you haven't defined.

    Evals Force Clarity

    The first benefit of evals isn't measurement. It's specification.

    Two engineers reading the same product spec will interpret edge cases differently. "The agent should be helpful" means different things to different people. An eval suite resolves this ambiguity. Writing test cases forces the team to articulate what success actually looks like. What should the agent do when the user asks something outside its scope? When context is ambiguous? When the request conflicts with policy?

    You can't evaluate what you haven't defined. The act of writing evals exposes gaps in your requirements that would otherwise surface as bugs in production.

    Evals Accelerate Everything

    When a new model drops, teams without evals face weeks of manual testing. Teams with evals run their suite overnight and know by morning which capabilities improved, which regressed, and whether the upgrade is worth it. They tune prompts, validate changes, and ship in days while competitors are still guessing.

    The same acceleration applies to prompt changes, architecture updates, and new features. Every modification can be tested against your full scenario bank before it reaches users. One change fixing a bug while creating another is no longer an acceptable outcome. You can see the tradeoff before you ship.

    Stage               | Without Evals           | With Evals
    Model upgrade       | Weeks of manual testing | Run suite overnight, ship in days
    Prompt change       | Hope nothing broke      | Know exactly what changed
    Bug report          | Guess and check         | Reproduce, fix, add to regression suite
    New hire onboarding | Tribal knowledge        | Read the test cases
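
    Seeing the tradeoff before shipping amounts to diffing two runs of the same scenario bank. The sketch below assumes each run has already been reduced to a mapping from case name to pass/fail; `compare_runs` and the case names are hypothetical, chosen to mirror a change that fixes a hallucination but breaks formatting.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two runs of the same scenario bank."""
    # Only cases present in both runs can be compared.
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "fixed": [name for name in shared if not baseline[name] and candidate[name]],
        "regressed": [name for name in shared if baseline[name] and not candidate[name]],
    }


# Hypothetical results: True means the case passed.
baseline_results = {"no_hallucinated_policy": False, "json_formatting": True, "out_of_scope": True}
candidate_results = {"no_hallucinated_policy": True, "json_formatting": False, "out_of_scope": True}

diff = compare_runs(baseline_results, candidate_results)
print(diff)
```

    A non-empty `regressed` list is the signal to hold the change, even when the overall pass rate is unchanged.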

    Compounding Value

    The costs of evals are visible upfront: time to write cases, infrastructure to run them, effort to maintain them. The benefits accumulate later and are easy to miss.

    Evals give baselines for free. Latency, token usage, cost per task, error rates. All trackable on a static bank of scenarios. You can answer "are we getting better?" with data instead of opinions.
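
    As a rough illustration of those baselines, the sketch below aggregates per-scenario records into the four numbers mentioned above. The `RunRecord` fields and the sample values are assumptions made up for this example; in practice they would come from whatever logging the eval runner already produces.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """Hypothetical per-scenario measurements captured during an eval run."""
    latency_s: float
    tokens: int
    cost_usd: float
    error: bool


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate a run over a static scenario bank into baseline metrics."""
    if not records:
        raise ValueError("cannot summarize an empty run")
    return {
        "mean_latency_s": mean(r.latency_s for r in records),
        "mean_tokens": mean(r.tokens for r in records),
        "cost_per_task_usd": mean(r.cost_usd for r in records),
        "error_rate": sum(r.error for r in records) / len(records),
    }


# Made-up sample data for illustration.
run = [
    RunRecord(latency_s=1.0, tokens=100, cost_usd=0.01, error=False),
    RunRecord(latency_s=2.0, tokens=200, cost_usd=0.02, error=False),
    RunRecord(latency_s=3.0, tokens=300, cost_usd=0.03, error=True),
    RunRecord(latency_s=2.0, tokens=200, cost_usd=0.02, error=False),
]

baseline = summarize(run)
print(baseline)
```

    Because the scenario bank is static, the same summary computed after a change is directly comparable to the one computed before it.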

    Key Idea

    Evals are how you turn opinions into evidence.

    Evals also become the communication channel between product and engineering. Instead of vague requests to "improve quality," product teams can point to specific failing cases. Instead of arguing about whether a change helped, teams can look at the numbers.

    Evals let you say no. When someone proposes a change that "feels better," you can run the suite. If the numbers don't move, you skip it. No more over-engineering. No more unnecessary complexity justified by intuition.

    Over time, your eval suite becomes institutional knowledge. New team members onboard by reading test cases. Edge cases that burned you once are captured forever. The system remembers what individual engineers forget.

    Warning Signs

    What Happens Without Evals

    Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025. The common assumption is that these teams lack talent, or models, or frameworks. But look closer: they have plenty of prototypes. What they lack is the ability to trust what they've built.

    This trust gap shows up in predictable patterns.

    The Endless Pilot Loop

    The Firefighting Trap

    The Rebuild Tax

    The Accuracy Plateau

    These aren't failures of effort. They're symptoms of a missing discipline. The industry built sophisticated tools for training models and deploying inference. It neglected the layer in between: the system that determines whether outputs are actually good.

    "Winners don't differentiate on the models they call. They differentiate on how they know it's working."

    The teams shipping production AI at scale have figured out one thing: you can swap your model in a weekend. You can't swap your understanding of what good looks like.

    This is the insight that led to Eval Engineering.

    Definition

    What Eval Engineering Actually Means

    Most teams think about evaluation as a quality check. Something you do before shipping. A gate to pass through on the way to production.

    Eval Engineering inverts this entirely.

    Eval Engineering is the discipline of building production-grade evaluation systems that govern AI behavior at scale. It treats evaluation not as a checkpoint but as infrastructure. The layer where your domain expertise becomes executable. Where your definition of "good" becomes enforceable at every interaction, not just the ones you sample. Where trust stops being a hope and becomes something you manufacture systematically.

    The shift in mental model looks like this:

    Old Mental Model           | Eval Engineering
    Evals are debugging tools  | Evals are governance systems
    Run once before deployment | Run continuously in production
    Generic accuracy metrics   | Domain-specific precision
    Sample 10% of traffic      | Cover 100% of traffic
    Static test sets           | Adaptive evaluation
    Cost center                | Strategic asset
    Measures behavior          | Governs behavior

    The left column describes how most teams operate today. The right column describes what production AI actually requires.

    Traditional evaluation asks: "Did this work?" Eval Engineering asks a harder question: "How do we ensure this keeps working, at scale, in production, as conditions change, while catching failures before users ever see them?" The first question gets you a report. The second gets you a system.

    This distinction matters because measurement alone changes nothing. Evals tell you that 30% of outputs were bad. They don't stop those outputs from reaching users. They don't improve over time. Eval Engineering transforms measurement into action: if an output fails evaluation, it doesn't ship. If patterns of failure emerge, the system adapts. If domain expertise exists in someone's head, it gets encoded into infrastructure that runs at scale.
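
    The "if an output fails evaluation, it doesn't ship" rule can be sketched as a gate between generation and delivery. This is a minimal illustration under stated assumptions: `guarded_reply`, the `0.8` threshold, the fallback message, and both toy functions are hypothetical, and the keyword check stands in for what would really be an LLM or SLM judge.

```python
from typing import Callable

FALLBACK = "I can't confirm that here. Let me connect you with an agent."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    threshold: float = 0.8,
) -> str:
    """Return the draft only if its eval score clears the threshold; otherwise fall back."""
    draft = generate(prompt)
    score = evaluate(prompt, draft)
    return draft if score >= threshold else FALLBACK


def toy_generate(prompt: str) -> str:
    """Hypothetical model that hallucinates a refund policy."""
    return "Book now and apply for the bereavement discount within 90 days."


def toy_evaluate(prompt: str, draft: str) -> float:
    """Hypothetical judge: scores 0.0 when the draft contradicts the written policy."""
    return 0.0 if "within 90 days" in draft else 1.0


reply = guarded_reply("Can I get a bereavement fare?", toy_generate, toy_evaluate)
print(reply)
```

    Here the hallucinated policy never reaches the user; the evaluator runs on every interaction, not on a sample.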

    Eval Engineering is not...

    Not Testing

    Testing asks whether something works. Eval Engineering ensures it keeps working.

    Not Observability

    Observability shows you what went wrong after it happened. Eval Engineering is proactive. It catches problems before they escape.

    Not Metrics

    A dashboard displaying 70% accuracy doesn't fix the 30% that's broken. Metrics inform. Eval Engineering acts. The goal isn't to know your accuracy. The goal is to improve it and enforce it at runtime.

    Not A Project

    Projects end. Eval Engineering is a capability you build and maintain indefinitely. The lifecycle is continuous because production data drifts, user behavior shifts, and model providers push updates without warning.

    Key Idea

    The goal isn't to measure how much your AI fails. It's to build systems that prevent it from failing in the first place.

    Lifecycle

    The Eval Engineering Lifecycle

    Let's look at the five-stage lifecycle of eval engineering. Each stage builds on the previous one. Skip a stage and the system breaks down. Try to jump straight to production guardrails without SME refinement and you'll encode mediocre judgment at scale. The stages exist because they solve different problems in sequence.

    [Figure: the Eval Engineering lifecycle. Diagram labels: Monitor Production Traffic with Guardrails; Create LLM-as-Judge Eval; Eval Tuning; Score Development Traffic; Fine-tune SLM-as-Judge; LLM; Prompt; Dataset; Identify Production Edge Cases; Add SME Annotated Datasets; High-Accuracy LLM Eval; 97% Inference Cost Savings; Autotune.]

    Key Idea

    If you only remember one thing: Eval Engineering is a lifecycle, not a one-time activity.

    Why Now

    Why Do Evals Matter Now?

    Multiple forces are converging to make Eval Engineering an urgent priority.

    01

    Economics

    The model layer is commoditizing

    The model layer is commoditizing fast. GPT, Claude, Gemini, Llama: they're converging in capability and competing on price. Switching costs are dropping. The model you use matters less every quarter. Differentiation is moving to the application layer: how you deploy, how you customize, how you ensure quality. The teams that win will be the ones who can reliably ship, not just prototype. Eval Engineering is the capability that enables reliable shipping.

    02

    Trust

    Trust has become the bottleneck

    The enterprises falling behind in AI adoption are not lacking models or frameworks or talent. They're lacking trust. Their legal teams won't sign off. Their compliance teams have concerns. Their executives got burned by a demo that failed in production and are now gun-shy. Eval Engineering is about systematically manufacturing trust. It turns "we think it works" into "we can prove it works, and we can prove it keeps working."

    03

    Industry Shift

    The industry is formalizing how agents get built

    The Agent Development Lifecycle (ADLC) has been described as a rethinking of the traditional SDLC for probabilistic systems. Across the ADLC frameworks published by multiple companies, the pattern is consistent: less time on upfront planning, significantly more time on tuning and optimization, and far greater emphasis on automated governance. At the heart of every ADLC framework sits evaluation. ADLC describes how to organize your team and structure your workflow. Eval Engineering is the craft that makes ADLC actually work.

    These forces connect to one final shift worth understanding: the rise of context engineering.

    Cognition AI says context engineering is effectively the number one job of engineers building agents. Andrej Karpathy has said that the real engineering work is deciding what goes into the model's working memory, its context window.

    They're right. But here's what the context engineering conversation often misses: you can assemble the perfect context and still produce garbage outputs. Context engineering determines what goes in. Eval engineering determines whether what comes out is any good. They're two halves of the same system. The engineers who master both will define the next generation of AI development.

    Two Halves of the Same System

    What Goes In

    Context Engineering

    +

    What Comes Out

    Eval Engineering

    Investment

    When to Invest in Eval Engineering

    Not every AI project needs the full lifecycle on day one. Match the investment to the risk.

    [Figure: investment matrix, with stakes on the vertical axis and volume on the horizontal axis.]

    Low stakes + low volume: Internal tools, experimental features, non-customer-facing applications. Basic evals are sufficient. Your time is better spent elsewhere.

    High stakes + any volume: Customer-facing agents, regulated industries, decisions with legal or financial consequences. Start the Eval Engineering lifecycle as soon as possible. The cost of one bad outcome exceeds the cost of building proper evaluation infrastructure.

    High volume + any stakes: At scale, even "minor" failures add up. If 1% of outputs are problematic and you have 100,000 daily interactions, that's 1,000 problems per day. You need cost-effective evaluation, which means SLMs. You need coverage, which means guardrails.

    Pitfalls

    Common Pitfalls in Evals

    Takeaway

    The Takeaway

    [Figure: the eval loop across Development and Production. Diagram labels: Create evals; Run experiments; Improve prompts; Debug; Monitor metrics; Curate; Improve metrics; Run CI/CD tests.]

    In the future, evals will be more valuable than your code. OpenAI's recent guidance to business leaders makes this explicit: "If done well, evals become unique differentiators... robust evals create compounding advantages and institutional know-how as your systems improve."

    Evaluations encode your definition of quality. They capture your domain expertise. They represent thousands of hours of SME judgment, distilled into executable criteria. Models are commodities you can swap. Frameworks are open source. But your evals? Those are yours. They embody how your organization thinks, what your customers expect, and what "good" means in your specific context.

    A company's catalog of evals will become core IP, the same way proprietary data and customer relationships are IP today. The teams that treat eval development as a strategic investment, not as technical overhead, will have assets their competitors can't replicate.

    The industry has shown that companies shipping production AI at scale aren't using better models than you. They've built the infrastructure to manufacture trust. They've turned evaluation from a tax into an asset.

    Eval Engineering transforms AI evaluation from a debugging tool into a governance system. It's about moving from "hope it works" to "prove it works" to "ensure it keeps working."

    "AI without evals is just expensive hope."

    Frequently Asked Questions

    If your AI is customer-facing, handles decisions with consequences, or operates at scale where failures compound: start now. You're already accumulating risk. If you're prototyping with genuinely low stakes and low volume: basic evals are fine. But be honest about what "low stakes" means. Most teams underestimate the stakes of their AI applications.

    You can build the components. LLM-as-judge is just API calls. SME workflows are spreadsheets. SLM fine-tuning is documented. Guardrails are inference-time checks. The question is whether you want your engineers building evaluation infrastructure or building your actual product. The build-vs-buy calculation usually favors buying infrastructure and building domain-specific customization. But we'll cover this in detail in a later chapter.
