Introducing
Eval Engineering
Support for the entire lifecycle of your evals, from prompt tuning to production guardrails
Enable low-cost production monitoring and real-time guardrailing for every AI system — without the GPT-sized bill.



Introduction
What is Eval Engineering?
As we've worked with many of the world's most advanced AI teams to build, test and deploy agents, we've noticed an interesting trend:
Data science and software engineering are merging.
LLM-based features are built as software and tested like ML systems. There's a new workflow required to create and maintain the high-accuracy evals needed to test and govern these systems at scale. That workflow is eval engineering.
The Workflow
Eval engineering follows the entire lifecycle of an eval, from creating the initial LLM-as-a-judge prompt to deploying it as a production guardrail. Each phase has its unique challenges. The Galileo platform was designed to accelerate each phase of this lifecycle.





Step 1
Creating the LLM-as-a-Judge
The process starts when you identify an eval required to track a desired behavior from your agent or LLM-system. This eval can be code-based, an off-the-shelf LLM judge from an open source library, or something entirely bespoke for your use case.
Every LLM-as-a-judge has three components:
→
The LLM (e.g. GPT-4o)
→
The prompt used to perform the evaluation
→
A reference dataset to test accuracy against
The output of an LLM-as-a-judge is a metric score that measures performance for a given span according to your criteria (e.g. friendliness, conciseness, etc.).
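For illustration, here is a minimal sketch of those three components in Python, assuming the OpenAI client and a toy conciseness criterion. The model choice, prompt, and dataset are placeholders, not a prescribed setup.

```python
# Minimal LLM-as-a-judge sketch (illustrative; prompt and data are placeholders).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. The LLM
JUDGE_MODEL = "gpt-4o"

# 2. The prompt used to perform the evaluation
JUDGE_PROMPT = """You are grading a chatbot response for conciseness.
Score 1 if the response is concise and 0 if it is not.
Reply with only the digit 0 or 1.

Response to grade:
{output}"""

# 3. A reference dataset to test accuracy against (SME-labeled examples)
reference_dataset = [
    {"output": "Your order ships tomorrow.", "label": 1},
    {"output": "Thank you so much for reaching out! We truly value your message and will get back to you at some point soon with more details about your order status.", "label": 0},
]

def judge(output: str) -> int:
    """Return the judge's 0/1 score for a single span."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=output)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```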




Step 2
Tuning the Eval
An eval is only as good as its tuning: without it, industry research suggests LLM-based evals top out at roughly 66-68% agreement with human subject matter experts. Tuning is achieved via the following process:
→
SME Annotations
Experts review real LLM outputs and score them against the eval criteria, establishing a ground-truth reference for tuning
→
Prompt Optimization
Prompt engineers iterate on the prompt and measure its F1 score against the SME reference dataset to confirm each tweak is an improvement (see the sketch after this list)
→
Few-Shot Examples
Difficult edge cases are added to the prompt as few-shot examples, giving the model more explicit guidance on how to score them
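As a rough sketch of that prompt-optimization loop, the snippet below scores the SME-labeled reference set with a candidate judge and reports F1. It assumes scikit-learn, and `judge` and `reference_dataset` refer to the illustrative definitions in the earlier sketch.

```python
# Hedged sketch: measuring judge agreement against the SME-labeled reference set.
from sklearn.metrics import f1_score

def judge_f1(judge_fn, reference_dataset):
    """Score every SME-labeled example with the candidate judge and report F1."""
    labels = [ex["label"] for ex in reference_dataset]
    predictions = [judge_fn(ex["output"]) for ex in reference_dataset]
    return f1_score(labels, predictions)

# Typical loop: tweak the prompt, re-run, and keep the variant with the best F1.
# print(judge_f1(judge, reference_dataset))
```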
Galileo includes many capabilities to accelerate this tuning process, including our auto-tune feature, which allows engineers to quickly select outliers in their observed behavior and automatically re-tune the eval.



Step 3



Developing with Evals
Once you have an LLM-as-a-judge with >95% accuracy, you can begin scoring your development traffic. Here evals are an effective tool to:
→
Measure App Performance
Benchmark the performance of your app and measure specific improvements with A/B tests.
→
Run CI/CD Tests
Execute evals regularly to catch regressions during normal development and maintenance (sketched after this list).
→
Measure Production Performance
LLM-as-a-judge evals can be used to score production traffic for daily rollups and casual monitoring.
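A CI/CD gate can be as simple as a test that runs the judge over a fixed prompt set and fails on regression. The sketch below assumes a pytest-style test runner plus the illustrative `judge` from earlier; `run_agent` is a stand-in for whatever entry point your app exposes, and the 0.9 threshold is an example.

```python
# Hedged sketch: a regression gate that runs in CI alongside normal unit tests.
import statistics

def test_responses_stay_concise():
    """Fail the build if average conciseness drops below the agreed threshold."""
    prompts = [
        "When will my order arrive?",
        "How do I reset my password?",
    ]
    scores = [judge(run_agent(p)) for p in prompts]  # run_agent: your app's entry point
    assert statistics.mean(scores) >= 0.9
```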
Most LLM judges are both slow and expensive: apps with significant traction often face 2-4 second delays per eval, and costs can quickly climb to $50k/day at millions of spans.
Step 4
Fine-tuning the SLM
To bring eval latency below 150ms and observe 100% of your production traffic, you'll need to fine-tune a small language model (SLM) to run the highly optimized eval prompt you developed in step 2. This is where true data science enters the agent development lifecycle.
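One common way to set up that fine-tune is to distill the tuned LLM judge: export its labeled spans as supervised training pairs for the SLM. The sketch below is illustrative only; the JSONL chat format and field names are assumptions, not a specific platform's schema, and `JUDGE_PROMPT` refers to the earlier example.

```python
# Hedged sketch: exporting judge-labeled spans as supervised fine-tuning data.
import json

def export_sft_dataset(scored_spans, path="judge_sft.jsonl"):
    """Write judge-labeled spans as chat-style supervised fine-tuning examples."""
    with open(path, "w") as f:
        for span in scored_spans:
            example = {
                "messages": [
                    {"role": "user",
                     "content": JUDGE_PROMPT.format(output=span["output"])},
                    {"role": "assistant",
                     "content": str(span["judge_score"])},
                ]
            }
            f.write(json.dumps(example) + "\n")
```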
Galileo's suite of fine-tuning capabilities and team of forward deployed engineers have helped dozens of teams fine-tune SLMs and turn bespoke evals into low-cost evals that run at extreme scale.





Step 5
Deploying the Guardrail
With an SLM-as-a-judge built on Galileo's Luna-2 family of models, teams can use evals as real-time scores that give them the intelligence they need to govern their systems. Imagine being able to make decisions based on trusted scores on every LLM output. This is the only way to achieve trust in AI systems at enterprise scale.
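In code, the guardrail is a fast scoring call sitting between your application and the user. The sketch below is a generic illustration rather than a specific Galileo API: `run_agent`, `slm_judge`, the 0.7 threshold, and the fallback message are all placeholders.

```python
# Hedged sketch: a real-time guardrail built on a fast SLM judge.
def guarded_response(user_message: str) -> str:
    """Score every draft answer in real time and block the ones that fail."""
    draft = run_agent(user_message)   # your application's normal LLM call
    score = slm_judge(draft)          # sub-150ms eval, e.g. a 0-1 quality score
    if score < 0.7:                   # below threshold: don't ship the draft
        return "Sorry, I can't help with that one. Let me route you to a teammate."
    return draft
```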




Step 6




Recycle Learnings
Real-world environments will confound even the most meticulous engineers. Consumer behavior will change. Models will drift. And along the way, you'll discover unanticipated failure modes.
When you do, a system like Galileo that observes 100% of your traffic helps you quickly capture these failure modes into reference datasets, so you can repeat the process of refining your evals (or creating new ones). In this way, you maintain the performance of your app over time.
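A lightweight version of that loop simply folds newly flagged spans, with corrected SME labels, back into the reference dataset from step 2. The sketch below is illustrative; `flagged_spans` stands in for whatever export your observability tooling provides.

```python
# Hedged sketch: recycling production failures into the judge's tuning data.
def recycle_failures(flagged_spans, reference_dataset):
    """Fold SME-corrected production failures back into the tuning dataset."""
    for span in flagged_spans:
        reference_dataset.append({
            "output": span["output"],
            "label": span["sme_label"],  # corrected score from expert review
        })
    return reference_dataset  # re-run the step 2 tuning loop against this set
```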
Trusted by enterprises, loved by developers
Ready to start?
Take our free developer course and learn how to execute eval engineering in your projects today.
Flexible pricing
Start for free and upgrade when you're ready to customize your evaluations and scale your AI applications to production.
Learn more
See how companies like Twilio and Comcast are achieving reliable AI with Galileo, and explore the platform’s capabilities for yourself.