Introducing
Eval Engineering
Support for the entire lifecycle of your evals, from prompt tuning to production guardrails
Enable low-cost production monitoring and real-time guardrailing for every AI system — without the GPT-sized bill.



Introduction
What is Eval Engineering?
As we've worked with many of the world's most advanced AI teams to build, test and deploy agents, we've noticed an interesting trend:
Data science and software engineering are merging.
LLM-based features are built as software and tested like ML systems. There's a new workflow required to create and maintain the high-accuracy evals needed to test and govern these systems at scale. That workflow is eval engineering.
The Workflow
Eval engineering follows the entire lifecycle of an eval, from creating the initial LLM-as-a-judge prompt to deploying it as a production guardrail. Each phase has its unique challenges. The Galileo platform was designed to accelerate each phase of this lifecycle.





Step 1
Creating the LLM-as-a-Judge
The process starts when you identify an eval required to track a desired behavior from your agent or LLM-system. This eval can be code-based, an off-the-shelf LLM judge from an open source library, or something entirely bespoke for your use case.
Every LLM-as-a-judge has three components:
→
The LLM (e.g. GPT-4o)
→
The prompt used to perform the evaluation
→
A reference dataset to test accuracy against
The output of an LLM-as-a-judge is a metric score that measures performance for a given span according to your criteria (e.g. friendliness, conciseness, etc.).
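For illustration, here is a minimal sketch of those three components in Python, assuming the OpenAI client and a toy conciseness criterion. The model choice, prompt, and dataset are placeholders, not a prescribed setup.

```python
# Minimal LLM-as-a-judge sketch (illustrative; prompt and data are placeholders).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. The LLM
JUDGE_MODEL = "gpt-4o"

# 2. The prompt used to perform the evaluation
JUDGE_PROMPT = """You are grading a chatbot response for conciseness.
Score 1 if the response is concise and 0 if it is not.
Reply with only the digit 0 or 1.

Response to grade:
{output}"""

# 3. A reference dataset to test accuracy against (SME-labeled examples)
reference_dataset = [
    {"output": "Your order ships tomorrow.", "label": 1},
    {"output": "Thank you so much for reaching out! We truly value your message and will get back to you at some point soon with more details about your order status.", "label": 0},
]

def judge(output: str) -> int:
    """Return the judge's 0/1 score for a single span."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=output)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```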




Step 2
Tuning the Eval
An eval is only as good as its tuning: without it, industry research suggests LLM-based evals top out at roughly 66-68% agreement with human subject matter experts. Tuning is achieved via the following process:
→
SME Annotations
Experts review real LLM outputs and score them against the eval criteria, establishing a ground-truth reference for tuning
→
Prompt Optimization
Prompt engineers iterate on the prompt and measure its F1 score against the SME reference dataset to confirm each tweak is an improvement (see the sketch after this list)
→
Few-Shot Examples
Difficult edge cases are added to the prompt as few-shot examples, giving the model more explicit guidance on how to score them
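As a rough sketch of that prompt-optimization loop, the snippet below scores the SME-labeled reference set with a candidate judge and reports F1. It assumes scikit-learn, and `judge` and `reference_dataset` refer to the illustrative definitions in the earlier sketch.

```python
# Hedged sketch: measuring judge agreement against the SME-labeled reference set.
from sklearn.metrics import f1_score

def judge_f1(judge_fn, reference_dataset):
    """Score every SME-labeled example with the candidate judge and report F1."""
    labels = [ex["label"] for ex in reference_dataset]
    predictions = [judge_fn(ex["output"]) for ex in reference_dataset]
    return f1_score(labels, predictions)

# Typical loop: tweak the prompt, re-run, and keep the variant with the best F1.
# print(judge_f1(judge, reference_dataset))
```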
Galileo includes many capabilities to accelerate this tuning process, including our auto-tune feature, which allows engineers to quickly select outliers in their observed behavior and automatically re-tune the eval.



Step 3



Developing with Evals
Once you have an LLM-as-a-judge with >95% accuracy, you can begin scoring your development traffic. Here evals are an effective tool to:
→
Measure App Performance
Benchmark the performance of your app and measure specific improvements with A/B tests.
→
Run CI/CD Tests
Execute evals regularly to catch regressions during normal development and maintenance (sketched after this list).
→
Measure Production Performance
LLM-as-a-judge evals can be used to score production traffic for daily rollups and casual monitoring.
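A CI/CD gate can be as simple as a test that runs the judge over a fixed prompt set and fails on regression. The sketch below assumes a pytest-style test runner plus the illustrative `judge` from earlier; `run_agent` is a stand-in for whatever entry point your app exposes, and the 0.9 threshold is an example.

```python
# Hedged sketch: a regression gate that runs in CI alongside normal unit tests.
import statistics

def test_responses_stay_concise():
    """Fail the build if average conciseness drops below the agreed threshold."""
    prompts = [
        "When will my order arrive?",
        "How do I reset my password?",
    ]
    scores = [judge(run_agent(p)) for p in prompts]  # run_agent: your app's entry point
    assert statistics.mean(scores) >= 0.9
```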
Most LLM judges are both slow and expensive: apps with significant traction often face 2-4 second delays per eval, and costs can quickly climb to $50k/day at millions of spans.
Step 4
Fine-tuning the SLM
To bring eval latency below 150ms and observe 100% of your production traffic, you'll need to fine-tune a small language model (SLM) to run the highly optimized eval prompt you developed in step 2. This is where true data science enters the agent development lifecycle.
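One common way to set up that fine-tune is to distill the tuned LLM judge: export its labeled spans as supervised training pairs for the SLM. The sketch below is illustrative only; the JSONL chat format and field names are assumptions, not a specific platform's schema, and `JUDGE_PROMPT` refers to the earlier example.

```python
# Hedged sketch: exporting judge-labeled spans as supervised fine-tuning data.
import json

def export_sft_dataset(scored_spans, path="judge_sft.jsonl"):
    """Write judge-labeled spans as chat-style supervised fine-tuning examples."""
    with open(path, "w") as f:
        for span in scored_spans:
            example = {
                "messages": [
                    {"role": "user",
                     "content": JUDGE_PROMPT.format(output=span["output"])},
                    {"role": "assistant",
                     "content": str(span["judge_score"])},
                ]
            }
            f.write(json.dumps(example) + "\n")
```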
Galileo's suite of fine-tuning capabilities and team of forward deployed engineers have helped dozens of teams fine-tune SLMs and turn bespoke evals into low-cost evals that run at extreme scale.





Step 5
Deploying the Guardrail
With an SLM-as-a-judge built on Galileo's Luna-2 family of models, teams can use evals as real-time scores that give them the intelligence they need to govern their systems. Imagine being able to make decisions based on trusted scores on every LLM output. This is the only way to achieve trust in AI systems at enterprise scale.
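In code, the guardrail is a fast scoring call sitting between your application and the user. The sketch below is a generic illustration rather than a specific Galileo API: `run_agent`, `slm_judge`, the 0.7 threshold, and the fallback message are all placeholders.

```python
# Hedged sketch: a real-time guardrail built on a fast SLM judge.
def guarded_response(user_message: str) -> str:
    """Score every draft answer in real time and block the ones that fail."""
    draft = run_agent(user_message)   # your application's normal LLM call
    score = slm_judge(draft)          # sub-150ms eval, e.g. a 0-1 quality score
    if score < 0.7:                   # below threshold: don't ship the draft
        return "Sorry, I can't help with that one. Let me route you to a teammate."
    return draft
```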




Step 6




Recycle Learnings
Real-world environments will confound even the most meticulous engineers. Consumer behavior will change. Models will drift. And along the way, you'll discover unanticipated failure modes.
When you do, a system like Galileo that observes 100% of your traffic helps you quickly capture these failure modes into reference datasets, so you can repeat the process of refining your evals (or creating new ones). In this way, you maintain the performance of your app over time.
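A lightweight version of that loop simply folds newly flagged spans, with corrected SME labels, back into the reference dataset from step 2. The sketch below is illustrative; `flagged_spans` stands in for whatever export your observability tooling provides.

```python
# Hedged sketch: recycling production failures into the judge's tuning data.
def recycle_failures(flagged_spans, reference_dataset):
    """Fold SME-corrected production failures back into the tuning dataset."""
    for span in flagged_spans:
        reference_dataset.append({
            "output": span["output"],
            "label": span["sme_label"],  # corrected score from expert review
        })
    return reference_dataset  # re-run the step 2 tuning loop against this set
```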
Trusted by enterprises, loved by developers
Ready to start?
Take our free developer course and learn how to execute eval engineering in your projects today.
Flexible pricing
Start for free and upgrade when you're ready to customize your evaluations and scale your AI applications to production.
Learn more
See how companies like Twilio and Comcast are achieving reliable AI with Galileo, and explore the platform’s capabilities for yourself.