What is Eval Engineering?
[intro]Building AI has never been easier. Deploying it reliably has never been harder. Eval Engineering is the discipline that closes that gap.[/intro]
In November 2022, a man named Jake Moffatt visited Air Canada's website to book a flight to his grandmother's funeral. He asked the airline's chatbot about bereavement fares. The chatbot told him, clearly and confidently, that he could book a full-price ticket now and apply for a bereavement discount within 90 days. So he did.
The chatbot was wrong. Air Canada's actual policy was the opposite: bereavement fares had to be requested before travel, not after. When Moffatt applied for his refund, the airline refused. When he pushed back with screenshots, an Air Canada representative admitted the chatbot had used "misleading words" but shrugged it off. The correct policy existed elsewhere on the website. He should have checked.
The case went to a tribunal. Air Canada's defense was, in the tribunal's own language, "a remarkable submission." The airline argued that its chatbot was essentially a separate legal entity, responsible for its own actions. The tribunal didn't buy it. The ruling was straightforward: a company is responsible for all information on its website, whether it comes from a static page or a chatbot. Air Canada owed Moffatt $812.
This wasn't a catastrophic financial loss. But that's precisely what made it so damaging. The story went global. Washington Post, BBC, CBS, every major tech publication. Not because the dollar amount was significant, but because it crystallized a fear that every executive deploying AI already felt in their gut: what happens when the system says something wrong, and nobody catches it until it's too late?
Air Canada had the right policy. They had it written down, reviewed by lawyers, posted on their website. What they didn't have was a system to ensure their AI actually reflected that policy. No mechanism to catch the hallucination before a grieving customer acted on it. No guardrail between generation and delivery.
They had an AI problem. But the solution was never a better model. The solution was evaluation.
Here's the uncomfortable truth about where the industry stands today: building AI has never been easier, and deploying it reliably has never been harder. Modern LLMs and agent frameworks have compressed development from months to days. Spinning up a RAG system takes an afternoon. Getting an agent to call tools takes a few hours of prompt engineering. The building part, for the most part, is solved.
And yet most AI projects never make it to production.
The teams that are shipping production AI at scale won't say this at conferences, but the model barely matters anymore. They're using the same LLMs as everyone else. The difference is what happens after generation. The evaluation layer. The part the industry treats as an afterthought is actually the whole game.
What Are Evals?
Every AI project follows the same arc. In the beginning, evals feel like overhead. The team is moving fast, testing by hand, gathering hallway feedback, iterating on intuition. It works. A surprising amount of progress can happen this way.
Then the product goes live.
Users scale up. Someone makes a prompt change to improve response quality, and three customers report that the agent "feels worse." Another change fixes hallucinations but breaks formatting. The team debates whether the new version is actually better or just different. Nobody knows for sure. There's no shared definition of "good," no baseline to compare against, no way to tell signal from noise. The team is flying blind.
This is the moment where organizations stall. Not because they lack talent or technology, but because they lack certainty. And here's what most technical teams underestimate: the biggest obstacle to shipping AI isn't engineering. It's organizational sign-off.
No VP of Product is going to greenlight a production deployment they can't explain. No legal team will approve a customer-facing agent without understanding what could go wrong and how often. No CFO will fund the next phase of an AI initiative that can't quantify its own reliability. Organizations don't kill AI projects because the technology failed. They kill them because nobody could prove it worked.
Evals are the mechanism that creates that proof.
Not evals as a vague best practice. Not evals as a checkbox before launch. Evals as the system that translates AI behavior into organizational certainty. The system that lets a product team say, with evidence, "this works in 97% of cases, here are the 3% it doesn't, and here's what happens when it fails." The system that turns an executive's anxiety into a quantified risk they can actually manage.
Without that system, the pattern is predictable. Wait for complaints. Try to reproduce the issue. Fix the bug. Hope nothing else broke. You can't distinguish real regressions from noise. You can't test changes against hundreds of scenarios before shipping. You can't measure whether you're actually improving. And critically, you can't walk into a room full of stakeholders and explain why this thing is safe to deploy.
[alert:idea]
You can't improve what you can't measure. You can't measure what you haven't defined.
[/alert]
Evals Force Clarity
The first benefit of evals isn't measurement. It's specification.
Two engineers reading the same product spec will interpret edge cases differently. "The agent should be helpful" means different things to different people. An eval suite resolves this ambiguity. Writing test cases forces the team to articulate what success actually looks like. What should the agent do when the user asks something outside its scope? When context is ambiguous? When the request conflicts with policy?
You can't evaluate what you haven't defined. The act of writing evals exposes gaps in your requirements that would otherwise surface as bugs in production.
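To make the specification point concrete, here is a minimal sketch of eval cases as data, with a toy response classifier. The case IDs, inputs, expected behaviors, and the keyword heuristics in `classify_response` are all illustrative assumptions, not a prescribed schema; a real suite would use an LLM judge or stricter checks:

```python
# Sketch of eval cases as data. Every field here is illustrative:
# the point is that each case forces an explicit decision about
# what the agent should do in an edge case.
EVAL_CASES = [
    # Out of scope: the agent should decline, not improvise.
    {"id": "out-of-scope", "input": "Can you file my taxes?", "expect": "decline"},
    # Ambiguous context: the agent should ask, not guess.
    {"id": "ambiguous", "input": "Cancel it.", "expect": "clarify"},
    # Policy conflict: the agent should escalate, not invent policy.
    {"id": "policy-conflict", "input": "Refund me after the deadline.", "expect": "escalate"},
]

def classify_response(text: str) -> str:
    """Toy classifier mapping an agent response to an expected behavior.
    A production suite would replace these keyword checks with a judge."""
    lowered = text.lower()
    if "sorry" in lowered or "can't help" in lowered:
        return "decline"
    if "which" in lowered or "?" in lowered:
        return "clarify"
    return "escalate"
```

Writing even three cases like these forces the team to decide what "decline," "clarify," and "escalate" actually mean for their product, which is exactly the ambiguity the prose spec left open.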
Evals Accelerate Everything
When a new model drops, teams without evals face weeks of manual testing. Teams with evals run their suite overnight and know by morning which capabilities improved, which regressed, and whether the upgrade is worth it. They tune prompts, validate changes, and ship in days while competitors are still guessing.
The same acceleration applies to prompt changes, architecture updates, and new features. Every modification can be tested against your full scenario bank before it reaches users. One change fixing a bug while creating another is no longer an acceptable outcome. You can see the tradeoff before you ship.
| Stage | Without Evals | With Evals |
|---|---|---|
| Model upgrade | Weeks of manual testing | Run suite overnight, ship in days |
| Prompt change | Hope nothing broke | Know exactly what changed |
| Bug report | Guess and check | Reproduce, fix, add to regression suite |
| New hire onboarding | Tribal knowledge | Read the test cases |
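"Know exactly what changed" can be as simple as diffing per-case results between two runs of the same suite. A sketch, assuming each run is represented as a dict mapping case ID to pass/fail (the representation is an assumption, not a standard):

```python
def regression_report(before, after):
    """Compare per-case pass/fail between two runs of the same suite.
    before/after: dict of case_id -> bool (True = passed)."""
    fixed = [k for k in before if not before[k] and after[k]]
    regressed = [k for k in before if before[k] and not after[k]]
    return {"fixed": fixed, "regressed": regressed}
```

The output makes tradeoffs visible before shipping: a change that fixes three cases and regresses five is no longer invisible.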
Compounding Value
The costs of evals are visible upfront: time to write cases, infrastructure to run them, effort to maintain them. The benefits accumulate later and are easy to miss.
Once evals exist, you get baselines for free. Latency, token usage, cost per task, error rates. All trackable on a static bank of scenarios. You can answer "are we getting better?" with data instead of opinions.
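Those baselines do not require special tooling. A sketch of collecting two of them, latency and error rate, over a static scenario bank, where `run_fn` is a stand-in for "run the system on one scenario and report success":

```python
import statistics
import time

def baseline(run_fn, scenarios):
    """Run a static scenario bank and record simple baseline metrics.
    run_fn(scenario) -> bool is a stand-in for the real system call."""
    latencies, failures = [], 0
    for s in scenarios:
        start = time.perf_counter()
        ok = run_fn(s)
        latencies.append(time.perf_counter() - start)
        failures += (not ok)
    return {
        "p50_latency_s": statistics.median(latencies),
        "error_rate": failures / len(scenarios),
    }
```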
[alert:idea]
Evals are how you turn opinions into evidence.
[/alert]
Evals also become the communication channel between product and engineering. Instead of vague requests to "improve quality," product teams can point to specific failing cases. Instead of arguing about whether a change helped, teams can look at the numbers.
Evals let you say no. When someone proposes a change that "feels better," you can run the suite. If the numbers don't move, you skip it. No more over-engineering. No more unnecessary complexity justified by intuition.
Over time, your eval suite becomes institutional knowledge. New team members onboard by reading test cases. Edge cases that burned you once are captured forever. The system remembers what individual engineers forget.
What Happens Without Evals
Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025. The common assumption is that these teams lack talent, or models, or frameworks. But look closer: they have plenty of prototypes. What they lack is the ability to trust what they've built.
This trust gap shows up in predictable patterns.
[traps title="What happens without evals"]
loop | The endless pilot loop | Projects that demo well but never graduate to production. Stakeholders ask for "just a bit more testing." Months pass. The prototype sits in staging.
fire | The firefighting trap | Teams react to production failures instead of preventing them. Each incident erodes confidence. Leadership gets gun-shy about new deployments.
hammer | The rebuild tax | Every new project reconstructs evaluation infrastructure from scratch. No shared learnings. No compound improvement. Constant reinvention.
trend | The accuracy plateau | 70% accuracy that never improves. Teams try different prompts, switch models, add more test cases. Nothing moves the needle because they're optimizing within a broken paradigm.
[/traps]
These aren't failures of effort; they're symptoms of a missing discipline. The industry built sophisticated tools for training models and deploying inference. It neglected the layer in between: the system that determines whether outputs are actually good.
[testimonial]
Winners don't differentiate on the models they call.
They differentiate on how they know it's working.
[/testimonial]
The teams shipping production AI at scale have figured out one thing: you can swap your model in a weekend. You can't swap your understanding of what good looks like.
This is the insight that led to Eval Engineering.
What Eval Engineering Actually Means
Most teams think about evaluation as a quality check. Something you do before shipping. A gate to pass through on the way to production.
Eval Engineering inverts this entirely.
Eval Engineering is the discipline of building production-grade evaluation systems that govern AI behavior at scale. It treats evaluation not as a checkpoint but as infrastructure. The layer where your domain expertise becomes executable. Where your definition of "good" becomes enforceable at every interaction, not just the ones you sample. Where trust stops being a hope and becomes something you manufacture systematically.
The shift in mental model looks like this:
[compare left="OLD MENTAL MODEL" right="EVAL ENGINEERING"]
Evals are debugging tools || Evals are governance systems
Run once before deployment || Run continuously in production
Generic accuracy metrics || Domain-specific precision
Sample 10% of traffic || Cover 100% of traffic
Static test sets || Adaptive evaluation
Cost center || Strategic asset
Measures behavior || Governs behavior
[/compare]
The left column describes how most teams operate today. The right column describes what production AI actually requires.
Traditional evaluation asks: "Did this work?" Eval Engineering asks a harder question: "How do we ensure this keeps working, at scale, in production, as conditions change, while catching failures before users ever see them?" The first question gets you a report. The second gets you a system.
This distinction matters because measurement alone changes nothing. Evals tell you that 30% of outputs were bad. They don't stop those outputs from reaching users. They don't improve over time.
Eval Engineering transforms measurement into action: if an output fails evaluation, it doesn't ship. If patterns of failure emerge, the system adapts. If domain expertise exists in someone's head, it gets encoded into infrastructure that runs at scale.
Eval Engineering is not testing. Testing asks whether something works. Eval Engineering ensures it keeps working.
Eval Engineering is not observability. Observability shows you what went wrong after it happened. Eval Engineering is proactive. It catches problems before they escape.
Eval Engineering is not metrics. A dashboard displaying 70% accuracy doesn't fix the 30% that's broken. Metrics inform. Eval Engineering acts. The goal isn't to know your accuracy. The goal is to improve it and enforce it at runtime.
Eval Engineering is not a project. Projects end. Eval Engineering is a capability you build and maintain indefinitely. The lifecycle is continuous because production data drifts, user behavior shifts, and model providers push updates without warning.
[alert:idea]
The goal isn't to measure how much your AI fails.
It's to build systems to prevent it from failing in the first place.
[/alert]
The Eval Engineering Lifecycle

Let's look at the five-stage lifecycle of eval engineering. Each stage builds on the previous one. Skip a stage and the system breaks down. Try to jump straight to production guardrails without SME refinement and you'll encode mediocre judgment at scale. The stages exist because they solve different problems in sequence.
[stage num="1" title="Stage 1: LLM-as-Judge" key="60-70% accuracy"]
Start by using a large language model to evaluate your AI's outputs. Write a prompt that defines what "good" means for your use case. Run it against a test dataset. Measure agreement with human judgment.
Researchers found this achieves only 60-70% accuracy. It's fast to set up, sometimes just a few hours of work. It's infinitely better than no evaluation at all. But it's not production-ready. At 70% accuracy, you're wrong 3 times out of 10. That's a starting point, not a destination. Most teams stop here and wonder why they can't ship.
[/stage]
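A minimal sketch of this stage. `call_llm(prompt)` is a stand-in for whatever LLM client you use; the judge prompt, the policy it encodes, and the PASS/FAIL protocol are all illustrative:

```python
# Minimal LLM-as-judge sketch. call_llm stands in for a real client;
# the prompt, policy, and PASS/FAIL protocol are illustrative.
JUDGE_PROMPT = """You are grading a customer-support answer.
Policy: bereavement fares must be requested BEFORE travel.
Answer to grade: {answer}
Reply with exactly one word: PASS or FAIL."""

def judge(answer: str, call_llm) -> str:
    """Ask the LLM to grade one answer against the policy."""
    reply = call_llm(JUDGE_PROMPT.format(answer=answer))
    return "PASS" if "PASS" in reply.upper() else "FAIL"

def agreement(judge_labels, human_labels) -> float:
    """Fraction of cases where the judge matches human judgment:
    the number that tells you how far from production-ready you are."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

The `agreement` number is the one to watch: it is what the 60-70% figure above refers to, and it is the metric the next two stages exist to push upward.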
[stage num="2" title="Stage 2: SME Refinement" key="90-95% accuracy"]
Bring in subject matter experts. Not just any humans: the people who actually know what "good" looks like in your specific domain. Your senior customer service rep who's handled 10,000 tickets. Your compliance officer who knows which phrasings trigger regulatory issues. Your clinical expert who can spot medically questionable advice.
SMEs review the failures your LLM judge missed. They identify patterns. They articulate criteria that seemed obvious to them but weren't captured in your judge prompt. They help you distinguish between "technically correct but unhelpful" and "actually good." This stage pushes accuracy to 90-95%. The gap between 70% and 95% is where domain expertise lives. It cannot be skipped. Generic LLM judges plateau precisely because they lack this expertise.
[/stage]
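One way to structure that SME loop as code: collect the cases where the LLM judge and the SME disagree, then count disagreements by category to surface systematic blind spots. The dict keys (`judge`, `sme`, `category`) are assumptions about how you store review data, not a standard:

```python
from collections import Counter

def disagreements(cases):
    """Cases where the LLM judge and the SME label differ.
    Each case is a dict with 'judge', 'sme', and 'category' keys."""
    return [c for c in cases if c["judge"] != c["sme"]]

def failure_patterns(cases):
    """Count disagreements by category. The clusters that emerge here
    become new criteria in the judge prompt or new labeled examples."""
    return Counter(c["category"] for c in disagreements(cases))
```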
[stage num="3" title="Stage 3: SLM Fine-Tuning" key="Cost reduction + 100% coverage"]
Here's the economic problem with LLM judges: they're expensive. Running GPT or Claude on every production interaction costs serious money. A Fortune 50 company was spending more than $20 million annually on LLM-based evaluation. At those costs, teams make a rational but dangerous choice: sample 10% of traffic and hope it's representative.
Hope is not a strategy. The alternative is to fine-tune a small language model on your refined evaluation data. Take the judgments from Stage 2: the SME-validated labels, the refined criteria, the failure patterns. Use them to train a model that runs 100x cheaper and 10x faster. Suddenly 100% coverage is economically viable. You're not sampling anymore. You're seeing everything.
[/stage]
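The handoff from Stage 2 to Stage 3 is largely data preparation: turn SME-validated judgments into training rows for the small model. A sketch; the JSONL shape varies by training stack, so treat this format as an assumption:

```python
import json

def to_training_rows(cases):
    """Convert SME-validated judgments into JSONL rows for fine-tuning
    a small judge model. Note the label is the SME's verdict, not the
    original LLM judge's: that is the whole point of Stage 2."""
    rows = []
    for c in cases:
        rows.append({
            "input": f"Grade this answer: {c['answer']}",
            "label": c["sme"],
        })
    return [json.dumps(r) for r in rows]
```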
[stage num="4" title="Stage 4: Production Guardrails" key="Inline enforcement"]
This stage contains the insight that separates Eval Engineering from traditional evaluation: why measure bad behavior if you're going to let it through anyway?
Transform your evals into guardrails. Run them inline at inference time, before outputs reach users. When the eval score drops below threshold, block the response. Escalate to a human. Ask a clarifying question. Trigger a fallback behavior. Your evaluation system stops being a measurement tool and becomes a governance system. The same logic that told you "this output is bad" now prevents that output from causing harm.
This is the stage where evaluation becomes valuable to the business, not just the ML team. Executives understand "we catch bad outputs before users see them" in a way they don't understand F1 scores.
[/stage]
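Sketched as code, a guardrail is just the eval run inline with thresholds attached. `score_fn` stands in for your fine-tuned judge, and the threshold values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    action: str   # "ship", "clarify", or "escalate"
    score: float

def guardrail(output: str, score_fn, ship_at=0.9, clarify_at=0.7) -> Verdict:
    """Run the eval before the output reaches the user.
    score_fn(output) -> float stands in for a fine-tuned judge."""
    score = score_fn(output)
    if score >= ship_at:
        return Verdict("ship", score)      # deliver the response
    if score >= clarify_at:
        return Verdict("clarify", score)   # ask a clarifying question
    return Verdict("escalate", score)      # block and hand off to a human
```

The structure matters more than the numbers: the same score that used to land on a dashboard now selects an action.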
[stage num="5" title="Stage 5: Continuous Adaptation" key="Self-improving flywheel"]
Production data is constantly changing. User behavior evolves. Model providers push updates. Competitors launch features that change user expectations. Static evals decay within weeks. A test set that perfectly captured failure modes in January is missing half the failure modes by March.
Close the loop. Monitor your guardrail trigger rates. When they spike, investigate. Feed production failures back to SME review. Identify new patterns. Retrain your SLMs. Update your guardrail thresholds. Evaluation becomes a flywheel that improves itself over time, not a checkpoint you pass once and forget.
[/stage]
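The monitoring half of the flywheel can start very simply: track the guardrail trigger rate and flag spikes for SME review. The 2x spike factor below is an arbitrary illustration; pick one from your own baseline variance:

```python
def trigger_rate(actions):
    """Fraction of outputs the guardrail did not ship as-is."""
    flagged = sum(1 for a in actions if a != "ship")
    return flagged / len(actions)

def spiked(today, baseline, factor=2.0):
    """Flag for investigation when the trigger rate jumps well above
    baseline: new failure patterns are probably emerging."""
    return today > baseline * factor
```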
[alert:idea]
If you only remember one thing: Eval Engineering is a lifecycle, not a one-time activity.
[/alert]
Why Evals Matter Now
Multiple forces are converging to make Eval Engineering an urgent priority.
[callout title="First, the economics of AI are shifting"]
The model layer is commoditizing fast. GPT, Claude, Gemini, Llama: they're converging in capability and competing on price. Switching costs are dropping. The model you use matters less every quarter. Differentiation is moving to the application layer: how you deploy, how you customize, how you ensure quality. The teams that win will be the ones who can reliably ship, not just prototype. Eval Engineering is the capability that enables reliable shipping.
[/callout]
[callout title="Second, trust has become the bottleneck"]
The enterprises falling behind in AI adoption are not lacking models or frameworks or talent. They're lacking trust. Their legal teams won't sign off. Their compliance teams have concerns. Their executives got burned by a demo that failed in production and are now gun-shy. Eval Engineering is about systematically manufacturing trust. It turns "we think it works" into "we can prove it works, and we can prove it keeps working."
[/callout]
[callout title="Third, the industry is formalizing how agents get built"]
The Agent Development Lifecycle (ADLC) is a rethinking of the traditional SDLC for probabilistic systems. The pattern across ADLC frameworks from multiple companies is consistent: less time on upfront planning, significantly more time on tuning and optimization, and far greater emphasis on automated governance. At the heart of every ADLC framework sits evaluation. ADLC describes how to organize your team and structure your workflow. Eval Engineering is the craft that makes ADLC actually work.
[/callout]
These forces connect to one final shift worth understanding: the rise of context engineering.
Cognition AI has called context engineering the number one job of engineers building agents, and Andrej Karpathy has made a similar point: the real engineering work is deciding what goes into the model's working memory, the context window.
They're right. But here's what the context engineering conversation often misses: you can assemble the perfect context and still produce garbage outputs. Context engineering determines what goes in. Eval Engineering determines whether what comes out is any good. They're two halves of the same system. The engineers who master both will define the next generation of AI development.
When to Invest in Eval Engineering
Not every AI project needs the full lifecycle on day one. The investment should match the risk.
[quadrant x-left="Low Volume" x-right="High Volume" y-top="High Stakes" y-bottom="Low Stakes"]
Start now. One bad output in healthcare, finance, or legal can be devastating regardless of volume.
||Full lifecycle is urgent. You're accumulating risk with every unmonitored interaction.
||Basic LLM-as-judge is fine. Sample manually. Fix obvious issues. Don't over-engineer.
||Start the lifecycle. You'll need SLMs for cost efficiency even if failures aren't catastrophic.
[/quadrant]
Low stakes + low volume: Internal tools, experimental features, non-customer-facing applications. Basic evals are sufficient. Your time is better spent elsewhere.
High stakes + any volume: Customer-facing agents, regulated industries, decisions with legal or financial consequences. Start the Eval Engineering lifecycle as soon as possible. The cost of one bad outcome exceeds the cost of building proper evaluation infrastructure.
High volume + any stakes: At scale, even "minor" failures add up. If 1% of outputs are problematic and you have 100,000 daily interactions, that's 1,000 problems per day. You need cost-effective evaluation, which means SLMs. You need coverage, which means guardrails.
Common Pitfalls in Evals
[callout title="Pitfall 1: Starting with too many evals"]
The instinct is comprehensive coverage. Evaluate everything. Measure every dimension of quality. Teams end up with 15 different metrics, conflicting signals, and no clear path forward. Start with your top 3 failure modes. Get those to 95% accuracy. Then add more. Depth beats breadth in early stages.
[/callout]
[callout title="Pitfall 2: Skipping SME involvement"]
LLM judges are seductively convenient. Set up a prompt, run it at scale, get numbers. No meetings required. No SME calendars to coordinate. So teams skip human refinement and plateau at 70% accuracy. They blame the model, try different prompts, switch providers. Nothing works because the problem isn't the LLM. The problem is missing domain expertise. Generic judges produce generic accuracy. The 70% → 95% jump requires SMEs. There is no shortcut.
[/callout]
[callout title="Pitfall 3: Stopping at measurement"]
Teams build elaborate evaluation pipelines. Beautiful dashboards. Real-time accuracy tracking. Executive reports. Then they do nothing with the data. Bad outputs keep reaching users while the dashboard faithfully records each failure. Evals that don't become guardrails are expensive documentation of your problems. Measure to act, not to admire.
[/callout]
[callout title="Pitfall 4: Treating evals as a one-time setup"]
"We built our evals in Q1. We're done." This is how static test sets become fiction. Production data drifts every week. User behavior shifts every month. Model providers update without warning. An eval system built in January is partially obsolete by March and mostly obsolete by June. Eval Engineering is continuous or it's theater.
[/callout]
The Takeaway

In the future, evals will be more valuable than your code. OpenAI's recent guidance to business leaders makes this explicit: "If done well, evals become unique differentiators... robust evals create compounding advantages and institutional know-how as your systems improve."
Evaluations encode your definition of quality. They capture your domain expertise. They represent thousands of hours of SME judgment, distilled into executable criteria. Models are commodities you can swap. Frameworks are open source. But your evals? Those are yours. They embody how your organization thinks, what your customers expect, and what "good" means in your specific context. A company's catalog of evals will become core IP, the same way proprietary data and customer relationships are IP today. The teams that treat eval development as a strategic investment, not a technical overhead, will have assets their competitors can't replicate.
The industry has shown that companies shipping production AI at scale aren't using better models than you. They've built the infrastructure to manufacture trust. They've turned evaluation from a tax into an asset. Eval Engineering transforms AI evaluation from a debugging tool into a governance system. It's about moving from "hope it works" to "prove it works" to "ensure it keeps working."
[testimonial]
AI without evals is just expensive hope.
[/testimonial]
Frequently Asked Questions
[qa]
Q: How is this different from traditional ML evaluation?
A: Traditional ML evals test model accuracy on held-out datasets. You train, you test, you report metrics, you publish. Eval Engineering operates at the application layer, not the model layer. It evaluates system behavior in production with domain-specific criteria at 100% coverage continuously. The unit shifts from "model" to "deployed application." The timeline shifts from "before launch" to "forever."
Q: Do I need to do all five stages?
A: Start where you are. If you have no evals, start with LLM-as-Judge: even 70% accuracy beats 0%. If you're stuck at 70%, add SME refinement: that's where the accuracy jump happens. If costs are blocking scale, fine-tune SLMs: that's where the economics flip. The lifecycle is progressive. Each stage unlocks the next. But partial progress still beats no progress.
Q: How much SME time does this require?
A: Less than teams fear. 50-100 labeled examples can drive significant accuracy gains in Stage 2. The key is structured involvement: clear workflows, focused review sessions, documented criteria. Two hours per week from one SME beats eight hours of unfocused committee meetings. Quality of SME engagement matters more than quantity.
Q: When should I start investing in Eval Engineering?
A: If your AI is customer-facing, handles decisions with consequences, or operates at scale where failures compound: start now. You're already accumulating risk. If you're prototyping with genuinely low stakes and low volume: basic evals are fine. But be honest about what "low stakes" means. Most teams underestimate the stakes of their AI applications.
Q: Can I build this myself or do I need a platform?
A: You can build the components. LLM-as-judge is just API calls. SME workflows are spreadsheets. SLM fine-tuning is documented. Guardrails are inference-time checks. The question is whether you want your engineers building evaluation infrastructure or building your actual product. The build-vs-buy calculation usually favors buying infrastructure and building domain-specific customization. But we'll cover this in detail in a later chapter.
[/qa]
