Evals You Can Trust Without the Bill: How We Built Luna Studio

Joyal Palackel

Technical Marketer

Paul Lacey

Head of Product Marketing

Observability and guardrails you can trust, without the bill.

The bill nobody budgets for

Offline, evals cost basically nothing. You run a few thousand traces through GPT-4.1 as a judge, you score them, you tweak the prompt, you do it again. Across a quarter, maybe a few thousand dollars. The eval line item is a rounding error next to the LLM bill of the workflow you're actually testing, and nobody from finance is going to ask about a rounding error.

Then you ship.

The math at production scale is a different conversation. A modest agent generates several thousand traces per hour, and each trace is running through five or six metrics. Toxicity. Prompt injection. Context adherence. Tool-call quality. Policy compliance. Every token in every trace is getting scanned three or four times over, and at a dollar per trace per metric, the spreadsheet starts to look unfriendly fast.

Do the arithmetic on your phone. Five metrics, five cents an eval call on GPT-4.1, a thousand traces an hour. That's $250 an hour. Running continuously, $180,000 a month. And that's before you add the metric to catch the failure mode that's actually keeping your VP up at night.

We've watched this play out at a couple of customers now. The conversation that follows is always the same. Someone from finance walks over, very politely, and asks what exactly this line item is supposed to be. Then they ask whether you're aware it's growing faster than the agent itself.

That's when you find out your eval bill can outgrow your agent bill. Not rival it. Outgrow it.

Two unpalatable choices

So now the eval line item is real. You have two relief valves, and both of them are ugly.

Switch to a cheaper judge model. Easy on paper. In practice, you've just torched your eval engineering cycle. The new judge has different biases, different failure modes, and different calibration. Every prompt you'd tuned to ~90% accuracy needs to be re-tuned against a different baseline. Different few-shot examples. Different disagreement patterns. The Dropbox engineering team walked through what this actually looks like when they swapped judges on Dropbox Dash, and the short version is: it's a multi-week recalibration project, not a flag you flip on a Friday.

Start sampling and accept the blind spots. This one is cheaper still, and it's the path most teams quietly take. It works fine for deterministic data feeds where the failure modes are uniformly distributed. It doesn't work for nondeterministic agents, where the bad thing you're trying to catch is rare by definition. If you're sampling at 5% and your worst failure case shows up in 0.1% of traffic, you won't see that failure until it has already happened many times over. Probably to real users. Possibly to your most important customer.

The sampling option is the one we want to flag the hardest, because on the surface it looks like a reasonable cost-coverage tradeoff, and it isn't. The agent that says the wrong thing to a customer, the one that confidently fabricates a number on an earnings call, the one that mishandles a refund flow at 2am: those are exactly the failures that live below your sampling threshold. They're invisible until they're not.

Anthropic's write-up on building effective agents gets at the same shape of problem from a different angle. Agent failures don't concentrate cleanly in a single answer the way completion-model failures do. They smear across multi-step traces, which means the failure surface gets bigger and stranger as the agent becomes more capable. The smarter the agent, the more places things can go wrong, and the more invisible those wrong things tend to be.

So the cheap options aren't actually cheap. They just push the bill somewhere harder to see.

SLM to the rescue

Which is why everyone with this problem eventually arrives at the same answer: small language models.

A small evaluator (3 to 8 billion parameters, running on a single GPU) does for roughly two cents what a frontier model does for a dollar. Same job. Read a trace, decide whether it passed or failed the metric. Same accuracy ceiling on the right kind of metric. The bill, however, is a different bill entirely.

The catch is on your specific data.

Out of the box, a small model evaluating your specific domain is going to underperform a frontier judge. Your specific policy. Your specific voice agent. Your specific definition of "good." Small models don't have the same generalization headroom as a frontier LLM, and the smaller the model, the more the gap shows up on domain-specific work. If you want the SLM economics, you have to fine-tune the SLM on your data. There's no free version of this.

Which puts you in fine-tuning territory. LoRA adapters, labeled training set, held-out test set, GPU pipeline, evaluation harness. None of it is new. The LoRA paper has been out since 2021, and most data science teams have done some version of this before. The complication is that most teams haven't fine-tuned an evaluator before. They've fine-tuned a generator, which is a related problem and a different one. Tuning a model to write a paragraph and tuning a model to say true or false about a paragraph look the same from the outside. They aren't the same job, and the recipe for one isn't quite the recipe for the other.

The actually interesting trick is downstream of the architecture choice anyway. Once you've decided to use an SLM, the question becomes how to stop it from generating.

A standard 8B language model has a vocabulary of around 128,000 tokens. When it answers, it samples across all of them. We train a Luna evaluator to put almost all of its probability mass on two tokens, true and false. Then, a post-processor pulls those two probabilities and normalizes them into a confidence score.

# Algorithm 1 from the Luna 2 paper: Binary-class confidence normalization.
probs = softmax(logits)
p_true  = probs[true_token_id]
p_false = probs[false_token_id]
confidence = p_true / (p_true + p_false)
Diagram: One forward pass. A small language model with a 128,000-token vocabulary collapses to two tokens — true and false — via softmax, and a post-processor normalizes the probabilities into a single confidence score at roughly 150ms on one GPU.

One forward pass. One token. One normalization. No decoding loop, no chain-of-thought, no multi-token generation tax. That's where the 98% cost reduction and the 150-millisecond latency come from, and it's worth saying explicitly: those numbers aren't a function of the model being small. They're a function of the inference path being short.

The math problem is solved. The harder problem is the data.

The fine-tuning gap

Here's where most teams stall.

Fine-tuning works when you have data, and the classic recipe wants tens of thousands of labeled examples spread evenly across the failure modes you care about. It assumes you have an annotation team. It assumes you have enough production traffic to mine. For foundational evals (tone, coherence, basic safety, the things that fire on common events), those assumptions hold, and you can usually scrape together what you need from any production log.

The trouble starts as soon as you go past foundational.

The evals that actually move the needle catch rare things. A specific policy violation that occurs in 0.3% of conversations. A tool-call adherence failure that only shows up when the user phrases a request a certain way. A hallucination pattern specific to your domain that nobody outside your company has even seen. By the nature of being valuable, those events are rare in production. Which means they're rare in your annotation queue. Which means by the time you're ready to fine-tune a metric to catch them, your training set is a few hundred labeled examples, and a few hundred is not enough.

So the question is: how do you stretch a few hundred labeled examples into the thousands of training examples a fine-tune actually needs, without losing the signal that made you care about those examples in the first place?

The answer is synthetic data generation. We'd already built most of this for a different product. Generating realistic, distribution-matched test data is what powers Galileo's agent testing, where customers don't have years of production traffic to hand us, and we have to manufacture the test cases. Same shape of problem, different target. For Luna Studio, we point that pipeline at training sets instead of test sets.

That's the engine underneath what comes next.

What Luna Studio does

Luna Studio is a turnkey workflow for training custom SLM judges inside your environment. You bring 300 to 500 labeled samples, and you walk out with a deployable, registered evaluator. Days, not weeks, and no vendor team sitting in your data once handover is done.

Four pieces under the hood.

The architecture

A Luna evaluator is a small base model (3 to 8B parameters), a LoRA adapter, and the post-processor we already walked through. Fine-tuned, it behaves like a classifier, not a generator. One forward pass, one token, one normalization. The full architecture is documented in our Luna 2 paper.

We chose to fine-tune into the same architecture as the preset Luna metrics on purpose. The practical effect is that a trained model from Luna Studio shows up in your Galileo console the same way a preset metric does. Same dropdown. Same eval surface. Same runtime guardrail layer. No second integration project, no second dashboard, no juggling two different ways of looking at the same kind of thing.

The synthetic data pipeline

You bring 300 to 500 labeled examples. Luna Studio uses 10% to 20% of them as seed data and holds the rest out for evaluation. If we trained on the full set, the eval signal would leak into the model, and you'd have no way to tell whether the metric was actually generalizing or just memorizing what you handed us. So our synthetic data pipeline never sees more than about fifty samples of the real test set. From those fifty seeds, it synthesizes thousands of training examples.

Diagram: Luna Studio synthetic data pipeline. A few hundred labeled examples are split into seeds, a synthetic data generator expands the seeds, and the LLM-as-judge prompt supervises the output, producing thousands of fine-tune-ready training examples.

The supervisor is the LLM-as-judge prompt that defined the metric in the first place. It scores the synthetic outputs, filters the ones that don't match the metric definition, and shapes the distribution so it actually resembles the production traffic the metric will see in the wild.

A couple of things about this pipeline that are worth saying out loud.

First, fine-tuning is engineering, and generating training data is research. The fine-tuning takes hours once the data exists. The pipeline that produces the data took months to build, and most of that time wasn't pushing gradients around. It was figuring out how to generate plausible negative examples for any metric a customer might describe, from boolean policy compliance to PCI redaction to tool-call adherence on a voice agent. Each metric has its own shape of "wrong" and the pipeline has to handle all of them. The pipeline itself is the part we don't open-source. The architecture underneath it is in the paper.

Second, the judge prompt is the ceiling. A Luna model fine-tuned on synthetic data inherits whatever errors are baked into the judge that supervised that data. Which is why Luna Studio sits downstream of Galileo's Auto-Tune in the eval engineering loop, not in place of it. You tune the prompt first, you get it where you want it, and then Luna distills it.

Where it runs

Inside your environment. Always. Vertex AI on Google Cloud, Azure ML, SageMaker on AWS, or your own Kubernetes cluster if you'd rather host the pipeline yourself. The base model gets pulled into your environment, the synthetic data is generated there, the LoRA adapter is trained there, and the trained adapter weights live in your object store. We don't see any of it.

This part of the design was harder than it should have been. We initially built Luna fine-tuning as a feature inside Galileo, which meant it ran on the same cluster as the rest of the platform, which meant our customers' platform teams had to procure new GPUs to run it. That conversation stalled, often. We spent longer than we'd like to admit before figuring out that fine-tuning GPUs and platform GPUs are funded out of different budgets at most enterprises, and the right move was to ship Luna Studio as a separate product that runs on the data science team's existing fine-tuning compute. Once we accepted that the model needed to go to the data and not the other way around, the deployment topology fell out naturally.

Two surfaces

A UI for AI engineers who just want to push a button. The hyperparameters that aren't safe to fiddle with are abstracted, the defaults are reasonable, the workflow is linear. Most users live here.

An SDK for data scientists who want every knob. Python package, YAML config, source code that's readable, a fine-tuning run that's reproducible from a file in git. The hyperparameters the UI hides (seed sampling rate, learning rate, LoRA rank, epoch count, the prompt template itself) all live in the config.

training:
  model:
    name: "mistralai/Mistral-7B-Instruct-v0.3"
  training:
    num_train_epochs: 1
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8
    max_seq_length: 4096
  prompt_template: |
    You will be given a text. Determine if the text is toxic or not.
    Message: {input}
    Respond with "true" if toxic, "false" if not.
  output:
    model_name: "toxicity-luna-model"

Both surfaces produce the same artifact, and both register it into the same Galileo eval surface. The choice between them is mostly about which team is driving the workflow and how much they want to see under the hood.

The shorter version

Evals aren't the optional layer on top of an AI program. They are the program. Without them, you can't tell whether your agent is getting better, getting worse, or about to land you in a Bloomberg headline.

But evals at production scale have been a bill problem and a coverage problem at the same time, and neither side has been comfortable to live with. You either pay frontier-model rates on every call until finance stops you, or you sample and pray the failures you most need to catch happen to live in the slice you're looking at. Neither is sustainable once the agent is doing real work.

SLMs solve the bill side of that. Custom SLMs trained on your data solve the accuracy side. Luna Studio is how you actually get from "we should do this" to "we have a working evaluator in production" without building a six-month platform engineering project to get there.

What you get:

  • A way past the data problem. A few hundred labeled examples is enough.

  • A way past the scaling problem. SLM economics let you evaluate every trace, not just samples.

  • A way past the accuracy problem. The model is trained on your data, your metric, your definition of "good."

If you're running production agents and your eval bill is starting to look like your agent bill, talk to us. The Luna 2 architecture is documented in our research paper.

Observability and guardrails you can trust, without the bill.

The bill nobody budgets for

Offline, evals cost basically nothing. You run a few thousand traces through GPT-4.1 as a judge, you score them, you tweak the prompt, you do it again. Across a quarter, maybe a few thousand dollars. The eval line item is a rounding error next to the LLM bill of the workflow you're actually testing, and nobody from finance is going to ask about a rounding error.

Then you ship.

The math at production scale is a different conversation. A modest agent generates several thousand traces per hour, and each trace is running through five or six metrics. Toxicity. Prompt injection. Context adherence. Tool-call quality. Policy compliance. Every token in every trace is getting scanned three or four times over, and at a dollar per trace per metric, the spreadsheet starts to look unfriendly fast.

Do the arithmetic on your phone. Five metrics, five cents an eval call on GPT-4.1, a thousand traces an hour. That's $250 an hour. Running continuously, $180,000 a month. And that's before you add the metric to catch the failure mode that's actually keeping your VP up at night.

We've watched this play out at a couple of customers now. The conversation that follows is always the same. Someone from finance walks over, very politely, and asks what exactly this line item is supposed to be. Then they ask whether you're aware it's growing faster than the agent itself.

That's when you find out your eval bill can outgrow your agent bill. Not rival it. Outgrow it.

Two unpalatable choices

So now the eval line item is real. You have two relief valves, and both of them are ugly.

Switch to a cheaper judge model. Easy on paper. In practice, you've just torched your eval engineering cycle. The new judge has different biases, different failure modes, and different calibration. Every prompt you'd tuned to ~90% accuracy needs to be re-tuned against a different baseline. Different few-shot examples. Different disagreement patterns. The Dropbox engineering team walked through what this actually looks like when they swapped judges on Dropbox Dash, and the short version is: it's a multi-week recalibration project, not a flag you flip on a Friday.

Start sampling and accept the blind spots. This one is cheaper still, and it's the path most teams quietly take. It works fine for deterministic data feeds where the failure modes are uniformly distributed. It doesn't work for nondeterministic agents, where the bad thing you're trying to catch is rare by definition. If you're sampling at 5% and your worst failure case shows up in 0.1% of traffic, you won't see that failure until it has already happened many times over. Probably to real users. Possibly to your most important customer.

The sampling option is the one we want to flag the hardest, because on the surface it looks like a reasonable cost-coverage tradeoff, and it isn't. The agent that says the wrong thing to a customer, the one that confidently fabricates a number on an earnings call, the one that mishandles a refund flow at 2am: those are exactly the failures that live below your sampling threshold. They're invisible until they're not.

Anthropic's write-up on building effective agents gets at the same shape of problem from a different angle. Agent failures don't concentrate cleanly in a single answer the way completion-model failures do. They smear across multi-step traces, which means the failure surface gets bigger and stranger as the agent becomes more capable. The smarter the agent, the more places things can go wrong, and the more invisible those wrong things tend to be.

So the cheap options aren't actually cheap. They just push the bill somewhere harder to see.

SLM to the rescue

Which is why everyone with this problem eventually arrives at the same answer: small language models.

A small evaluator (3 to 8 billion parameters, running on a single GPU) does for roughly two cents what a frontier model does for a dollar. Same job. Read a trace, decide whether it passed or failed the metric. Same accuracy ceiling on the right kind of metric. The bill, however, is a different bill entirely.

The catch is on your specific data.

Out of the box, a small model evaluating your specific domain is going to underperform a frontier judge. Your specific policy. Your specific voice agent. Your specific definition of "good." Small models don't have the same generalization headroom as a frontier LLM, and the smaller the model, the more the gap shows up on domain-specific work. If you want the SLM economics, you have to fine-tune the SLM on your data. There's no free version of this.

Which puts you in fine-tuning territory. LoRA adapters, labeled training set, held-out test set, GPU pipeline, evaluation harness. None of it is new. The LoRA paper has been out since 2021, and most data science teams have done some version of this before. The complication is that most teams haven't fine-tuned an evaluator before. They've fine-tuned a generator, which is a related problem and a different one. Tuning a model to write a paragraph and tuning a model to say true or false about a paragraph look the same from the outside. They aren't the same job, and the recipe for one isn't quite the recipe for the other.

The actually interesting trick is downstream of the architecture choice anyway. Once you've decided to use an SLM, the question becomes how to stop it from generating.

A standard 8B language model has a vocabulary of around 128,000 tokens. When it answers, it samples across all of them. We train a Luna evaluator to put almost all of its probability mass on two tokens, true and false. Then, a post-processor pulls those two probabilities and normalizes them into a confidence score.

# Algorithm 1 from the Luna 2 paper: Binary-class confidence normalization.
probs = softmax(logits)
p_true  = probs[true_token_id]
p_false = probs[false_token_id]
confidence = p_true / (p_true + p_false)
Diagram: One forward pass. A small language model with a 128,000-token vocabulary collapses to two tokens — true and false — via softmax, and a post-processor normalizes the probabilities into a single confidence score at roughly 150ms on one GPU.

One forward pass. One token. One normalization. No decoding loop, no chain-of-thought, no multi-token generation tax. That's where the 98% cost reduction and the 150-millisecond latency come from, and it's worth saying explicitly: those numbers aren't a function of the model being small. They're a function of the inference path being short.

The math problem is solved. The harder problem is the data.

The fine-tuning gap

Here's where most teams stall.

Fine-tuning works when you have data, and the classic recipe wants tens of thousands of labeled examples spread evenly across the failure modes you care about. It assumes you have an annotation team. It assumes you have enough production traffic to mine. For foundational evals (tone, coherence, basic safety, the things that fire on common events), those assumptions hold, and you can usually scrape together what you need from any production log.

The trouble starts as soon as you go past foundational.

The evals that actually move the needle catch rare things. A specific policy violation that occurs in 0.3% of conversations. A tool-call adherence failure that only shows up when the user phrases a request a certain way. A hallucination pattern specific to your domain that nobody outside your company has even seen. By the nature of being valuable, those events are rare in production. Which means they're rare in your annotation queue. Which means by the time you're ready to fine-tune a metric to catch them, your training set is a few hundred labeled examples, and a few hundred is not enough.

So the question is: how do you stretch a few hundred labeled examples into the thousands of training examples a fine-tune actually needs, without losing the signal that made you care about those examples in the first place?

The answer is synthetic data generation. We'd already built most of this for a different product. Generating realistic, distribution-matched test data is what powers Galileo's agent testing, where customers don't have years of production traffic to hand us, and we have to manufacture the test cases. Same shape of problem, different target. For Luna Studio, we point that pipeline at training sets instead of test sets.

That's the engine underneath what comes next.

What Luna Studio does

Luna Studio is a turnkey workflow for training custom SLM judges inside your environment. You bring 300 to 500 labeled samples, and you walk out with a deployable, registered evaluator. Days, not weeks, and no vendor team sitting in your data once handover is done.

Four pieces under the hood.

The architecture

A Luna evaluator is a small base model (3 to 8B parameters), a LoRA adapter, and the post-processor we already walked through. Fine-tuned, it behaves like a classifier, not a generator. One forward pass, one token, one normalization. The full architecture is documented in our Luna 2 paper.

We chose to fine-tune into the same architecture as the preset Luna metrics on purpose. The practical effect is that a trained model from Luna Studio shows up in your Galileo console the same way a preset metric does. Same dropdown. Same eval surface. Same runtime guardrail layer. No second integration project, no second dashboard, no juggling two different ways of looking at the same kind of thing.

The synthetic data pipeline

You bring 300 to 500 labeled examples. Luna Studio uses 10% to 20% of them as seed data and holds the rest out for evaluation. If we trained on the full set, the eval signal would leak into the model, and you'd have no way to tell whether the metric was actually generalizing or just memorizing what you handed us. So our synthetic data pipeline never sees more than about fifty samples of the real test set. From those fifty seeds, it synthesizes thousands of training examples.

Diagram: Luna Studio synthetic data pipeline. A few hundred labeled examples are split into seeds, a synthetic data generator expands the seeds, and the LLM-as-judge prompt supervises the output, producing thousands of fine-tune-ready training examples.

The supervisor is the LLM-as-judge prompt that defined the metric in the first place. It scores the synthetic outputs, filters the ones that don't match the metric definition, and shapes the distribution so it actually resembles the production traffic the metric will see in the wild.

A couple of things about this pipeline that are worth saying out loud.

First, fine-tuning is engineering, and generating training data is research. The fine-tuning takes hours once the data exists. The pipeline that produces the data took months to build, and most of that time wasn't pushing gradients around. It was figuring out how to generate plausible negative examples for any metric a customer might describe, from boolean policy compliance to PCI redaction to tool-call adherence on a voice agent. Each metric has its own shape of "wrong" and the pipeline has to handle all of them. The pipeline itself is the part we don't open-source. The architecture underneath it is in the paper.

Second, the judge prompt is the ceiling. A Luna model fine-tuned on synthetic data inherits whatever errors are baked into the judge that supervised that data. Which is why Luna Studio sits downstream of Galileo's Auto-Tune in the eval engineering loop, not in place of it. You tune the prompt first, you get it where you want it, and then Luna distills it.

Where it runs

Inside your environment. Always. Vertex AI on Google Cloud, Azure ML, SageMaker on AWS, or your own Kubernetes cluster if you'd rather host the pipeline yourself. The base model gets pulled into your environment, the synthetic data is generated there, the LoRA adapter is trained there, and the trained adapter weights live in your object store. We don't see any of it.

This part of the design was harder than it should have been. We initially built Luna fine-tuning as a feature inside Galileo, which meant it ran on the same cluster as the rest of the platform, which meant our customers' platform teams had to procure new GPUs to run it. That conversation stalled, often. We spent longer than we'd like to admit before figuring out that fine-tuning GPUs and platform GPUs are funded out of different budgets at most enterprises, and the right move was to ship Luna Studio as a separate product that runs on the data science team's existing fine-tuning compute. Once we accepted that the model needed to go to the data and not the other way around, the deployment topology fell out naturally.

Two surfaces

A UI for AI engineers who just want to push a button. The hyperparameters that aren't safe to fiddle with are abstracted, the defaults are reasonable, the workflow is linear. Most users live here.

An SDK for data scientists who want every knob. Python package, YAML config, source code that's readable, a fine-tuning run that's reproducible from a file in git. The hyperparameters the UI hides (seed sampling rate, learning rate, LoRA rank, epoch count, the prompt template itself) all live in the config.

training:
  model:
    name: "mistralai/Mistral-7B-Instruct-v0.3"
  training:
    num_train_epochs: 1
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8
    max_seq_length: 4096
  prompt_template: |
    You will be given a text. Determine if the text is toxic or not.
    Message: {input}
    Respond with "true" if toxic, "false" if not.
  output:
    model_name: "toxicity-luna-model"

Both surfaces produce the same artifact, and both register it into the same Galileo eval surface. The choice between them is mostly about which team is driving the workflow and how much they want to see under the hood.

The shorter version

Evals aren't the optional layer on top of an AI program. They are the program. Without them, you can't tell whether your agent is getting better, getting worse, or about to land you in a Bloomberg headline.

But evals at production scale have been a bill problem and a coverage problem at the same time, and neither side has been comfortable to live with. You either pay frontier-model rates on every call until finance stops you, or you sample and pray the failures you most need to catch happen to live in the slice you're looking at. Neither is sustainable once the agent is doing real work.

SLMs solve the bill side of that. Custom SLMs trained on your data solve the accuracy side. Luna Studio is how you actually get from "we should do this" to "we have a working evaluator in production" without building a six-month platform engineering project to get there.

What you get:

  • A way past the data problem. A few hundred labeled examples is enough.

  • A way past the scaling problem. SLM economics let you evaluate every trace, not just samples.

  • A way past the accuracy problem. The model is trained on your data, your metric, your definition of "good."

If you're running production agents and your eval bill is starting to look like your agent bill, talk to us. The Luna 2 architecture is documented in our research paper.

Observability and guardrails you can trust, without the bill.

The bill nobody budgets for

Offline, evals cost basically nothing. You run a few thousand traces through GPT-4.1 as a judge, you score them, you tweak the prompt, you do it again. Across a quarter, maybe a few thousand dollars. The eval line item is a rounding error next to the LLM bill of the workflow you're actually testing, and nobody from finance is going to ask about a rounding error.

Then you ship.

The math at production scale is a different conversation. A modest agent generates several thousand traces per hour, and each trace is running through five or six metrics. Toxicity. Prompt injection. Context adherence. Tool-call quality. Policy compliance. Every token in every trace is getting scanned three or four times over, and at a dollar per trace per metric, the spreadsheet starts to look unfriendly fast.

Do the arithmetic on your phone. Five metrics, five cents an eval call on GPT-4.1, a thousand traces an hour. That's $250 an hour. Running continuously, $180,000 a month. And that's before you add the metric to catch the failure mode that's actually keeping your VP up at night.

We've watched this play out at a couple of customers now. The conversation that follows is always the same. Someone from finance walks over, very politely, and asks what exactly this line item is supposed to be. Then they ask whether you're aware it's growing faster than the agent itself.

That's when you find out your eval bill can outgrow your agent bill. Not rival it. Outgrow it.

Two unpalatable choices

So now the eval line item is real. You have two relief valves, and both of them are ugly.

Switch to a cheaper judge model. Easy on paper. In practice, you've just torched your eval engineering cycle. The new judge has different biases, different failure modes, and different calibration. Every prompt you'd tuned to ~90% accuracy needs to be re-tuned against a different baseline. Different few-shot examples. Different disagreement patterns. The Dropbox engineering team walked through what this actually looks like when they swapped judges on Dropbox Dash, and the short version is: it's a multi-week recalibration project, not a flag you flip on a Friday.

Start sampling and accept the blind spots. This one is cheaper still, and it's the path most teams quietly take. It works fine for deterministic data feeds where the failure modes are uniformly distributed. It doesn't work for nondeterministic agents, where the bad thing you're trying to catch is rare by definition. If you're sampling at 5% and your worst failure case shows up in 0.1% of traffic, you won't see that failure until it has already happened many times over. Probably to real users. Possibly to your most important customer.

The sampling option is the one we want to flag the hardest, because on the surface it looks like a reasonable cost-coverage tradeoff, and it isn't. The agent that says the wrong thing to a customer, the one that confidently fabricates a number on an earnings call, the one that mishandles a refund flow at 2am: those are exactly the failures that live below your sampling threshold. They're invisible until they're not.

Anthropic's write-up on building effective agents gets at the same shape of problem from a different angle. Agent failures don't concentrate cleanly in a single answer the way completion-model failures do. They smear across multi-step traces, which means the failure surface gets bigger and stranger as the agent becomes more capable. The smarter the agent, the more places things can go wrong, and the more invisible those wrong things tend to be.

So the cheap options aren't actually cheap. They just push the bill somewhere harder to see.

SLM to the rescue

Which is why everyone with this problem eventually arrives at the same answer: small language models.

A small evaluator (3 to 8 billion parameters, running on a single GPU) does for roughly two cents what a frontier model does for a dollar. Same job. Read a trace, decide whether it passed or failed the metric. Same accuracy ceiling on the right kind of metric. The bill, however, is a different bill entirely.

The catch is on your specific data.

Out of the box, a small model evaluating your specific domain is going to underperform a frontier judge. Your specific policy. Your specific voice agent. Your specific definition of "good." Small models don't have the same generalization headroom as a frontier LLM, and the smaller the model, the more the gap shows up on domain-specific work. If you want the SLM economics, you have to fine-tune the SLM on your data. There's no free version of this.

Which puts you in fine-tuning territory. LoRA adapters, labeled training set, held-out test set, GPU pipeline, evaluation harness. None of it is new. The LoRA paper has been out since 2021, and most data science teams have done some version of this before. The complication is that most teams haven't fine-tuned an evaluator before. They've fine-tuned a generator, which is a related problem and a different one. Tuning a model to write a paragraph and tuning a model to say true or false about a paragraph look the same from the outside. They aren't the same job, and the recipe for one isn't quite the recipe for the other.

The actually interesting trick is downstream of the architecture choice anyway. Once you've decided to use an SLM, the question becomes how to stop it from generating.

A standard 8B language model has a vocabulary of around 128,000 tokens. When it answers, it samples across all of them. We train a Luna evaluator to put almost all of its probability mass on two tokens, true and false. Then, a post-processor pulls those two probabilities and normalizes them into a confidence score.

# Algorithm 1 from the Luna 2 paper: Binary-class confidence normalization.
probs = softmax(logits)
p_true  = probs[true_token_id]
p_false = probs[false_token_id]
confidence = p_true / (p_true + p_false)
Diagram: One forward pass. A small language model with a 128,000-token vocabulary collapses to two tokens — true and false — via softmax, and a post-processor normalizes the probabilities into a single confidence score at roughly 150ms on one GPU.

One forward pass. One token. One normalization. No decoding loop, no chain-of-thought, no multi-token generation tax. That's where the 98% cost reduction and the 150-millisecond latency come from, and it's worth saying explicitly: those numbers aren't a function of the model being small. They're a function of the inference path being short.

The math problem is solved. The harder problem is the data.

The fine-tuning gap

Here's where most teams stall.

Fine-tuning works when you have data, and the classic recipe wants tens of thousands of labeled examples spread evenly across the failure modes you care about. It assumes you have an annotation team. It assumes you have enough production traffic to mine. For foundational evals (tone, coherence, basic safety, the things that fire on common events), those assumptions hold, and you can usually scrape together what you need from any production log.

The trouble starts as soon as you go past foundational.

The evals that actually move the needle catch rare things. A specific policy violation that occurs in 0.3% of conversations. A tool-call adherence failure that only shows up when the user phrases a request a certain way. A hallucination pattern specific to your domain that nobody outside your company has even seen. By the nature of being valuable, those events are rare in production. Which means they're rare in your annotation queue. Which means by the time you're ready to fine-tune a metric to catch them, your training set is a few hundred labeled examples, and a few hundred is not enough.

So the question is: how do you stretch a few hundred labeled examples into the thousands of training examples a fine-tune actually needs, without losing the signal that made you care about those examples in the first place?

The answer is synthetic data generation. We'd already built most of this for a different product. Generating realistic, distribution-matched test data is what powers Galileo's agent testing, where customers don't have years of production traffic to hand us, and we have to manufacture the test cases. Same shape of problem, different target. For Luna Studio, we point that pipeline at training sets instead of test sets.

That's the engine underneath what comes next.

What Luna Studio does

Luna Studio is a turnkey workflow for training custom SLM judges inside your environment. You bring 300 to 500 labeled samples, and you walk out with a deployable, registered evaluator. Days, not weeks, and no vendor team sitting in your data once handover is done.

Four pieces under the hood.

The architecture

A Luna evaluator is a small base model (3 to 8B parameters), a LoRA adapter, and the post-processor we already walked through. Fine-tuned, it behaves like a classifier, not a generator. One forward pass, one token, one normalization. The full architecture is documented in our Luna 2 paper.

We chose to fine-tune into the same architecture as the preset Luna metrics on purpose. The practical effect is that a trained model from Luna Studio shows up in your Galileo console the same way a preset metric does. Same dropdown. Same eval surface. Same runtime guardrail layer. No second integration project, no second dashboard, no juggling two different ways of looking at the same kind of thing.

The synthetic data pipeline

You bring 300 to 500 labeled examples. Luna Studio uses 10% to 20% of them as seed data and holds the rest out for evaluation. If we trained on the full set, the eval signal would leak into the model, and you'd have no way to tell whether the metric was actually generalizing or just memorizing what you handed us. So our synthetic data pipeline never sees more than about fifty samples of the real test set. From those fifty seeds, it synthesizes thousands of training examples.

Diagram: Luna Studio synthetic data pipeline. A few hundred labeled examples are split into seeds, a synthetic data generator expands the seeds, and the LLM-as-judge prompt supervises the output, producing thousands of fine-tune-ready training examples.

The supervisor is the LLM-as-judge prompt that defined the metric in the first place. It scores the synthetic outputs, filters the ones that don't match the metric definition, and shapes the distribution so it actually resembles the production traffic the metric will see in the wild.

A couple of things about this pipeline that are worth saying out loud.

First, fine-tuning is engineering, and generating training data is research. The fine-tuning takes hours once the data exists. The pipeline that produces the data took months to build, and most of that time wasn't pushing gradients around. It was figuring out how to generate plausible negative examples for any metric a customer might describe, from boolean policy compliance to PCI redaction to tool-call adherence on a voice agent. Each metric has its own shape of "wrong" and the pipeline has to handle all of them. The pipeline itself is the part we don't open-source. The architecture underneath it is in the paper.

Second, the judge prompt is the ceiling. A Luna model fine-tuned on synthetic data inherits whatever errors are baked into the judge that supervised that data. Which is why Luna Studio sits downstream of Galileo's Auto-Tune in the eval engineering loop, not in place of it. You tune the prompt first, you get it where you want it, and then Luna distills it.

Where it runs

Inside your environment. Always. Vertex AI on Google Cloud, Azure ML, SageMaker on AWS, or your own Kubernetes cluster if you'd rather host the pipeline yourself. The base model gets pulled into your environment, the synthetic data is generated there, the LoRA adapter is trained there, and the trained adapter weights live in your object store. We don't see any of it.

This part of the design was harder than it should have been. We initially built Luna fine-tuning as a feature inside Galileo, which meant it ran on the same cluster as the rest of the platform, which meant our customers' platform teams had to procure new GPUs to run it. That conversation stalled, often. We spent longer than we'd like to admit before figuring out that fine-tuning GPUs and platform GPUs are funded out of different budgets at most enterprises, and the right move was to ship Luna Studio as a separate product that runs on the data science team's existing fine-tuning compute. Once we accepted that the model needed to go to the data and not the other way around, the deployment topology fell out naturally.

Two surfaces

A UI for AI engineers who just want to push a button. The hyperparameters that aren't safe to fiddle with are abstracted, the defaults are reasonable, the workflow is linear. Most users live here.

An SDK for data scientists who want every knob. Python package, YAML config, source code that's readable, a fine-tuning run that's reproducible from a file in git. The hyperparameters the UI hides (seed sampling rate, learning rate, LoRA rank, epoch count, the prompt template itself) all live in the config.

training:
  model:
    name: "mistralai/Mistral-7B-Instruct-v0.3"
  training:
    num_train_epochs: 1
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8
    max_seq_length: 4096
  prompt_template: |
    You will be given a text. Determine if the text is toxic or not.
    Message: {input}
    Respond with "true" if toxic, "false" if not.
  output:
    model_name: "toxicity-luna-model"

Both surfaces produce the same artifact, and both register it into the same Galileo eval surface. The choice between them is mostly about which team is driving the workflow and how much they want to see under the hood.

The shorter version

Evals aren't the optional layer on top of an AI program. They are the program. Without them, you can't tell whether your agent is getting better, getting worse, or about to land you in a Bloomberg headline.

But evals at production scale have been a bill problem and a coverage problem at the same time, and neither side has been comfortable to live with. You either pay frontier-model rates on every call until finance stops you, or you sample and pray the failures you most need to catch happen to live in the slice you're looking at. Neither is sustainable once the agent is doing real work.

SLMs solve the bill side of that. Custom SLMs trained on your data solve the accuracy side. Luna Studio is how you actually get from "we should do this" to "we have a working evaluator in production" without building a six-month platform engineering project to get there.

What you get:

  • A way past the data problem. A few hundred labeled examples is enough.

  • A way past the scaling problem. SLM economics let you evaluate every trace, not just samples.

  • A way past the accuracy problem. The model is trained on your data, your metric, your definition of "good."

If you're running production agents and your eval bill is starting to look like your agent bill, talk to us. The Luna 2 architecture is documented in our research paper.

Observability and guardrails you can trust, without the bill.

The bill nobody budgets for

Offline, evals cost basically nothing. You run a few thousand traces through GPT-4.1 as a judge, you score them, you tweak the prompt, you do it again. Across a quarter, maybe a few thousand dollars. The eval line item is a rounding error next to the LLM bill of the workflow you're actually testing, and nobody from finance is going to ask about a rounding error.

Then you ship.

The math at production scale is a different conversation. A modest agent generates several thousand traces per hour, and each trace is running through five or six metrics. Toxicity. Prompt injection. Context adherence. Tool-call quality. Policy compliance. Every token in every trace is getting scanned three or four times over, and at a dollar per trace per metric, the spreadsheet starts to look unfriendly fast.

Do the arithmetic on your phone. Five metrics, five cents an eval call on GPT-4.1, a thousand traces an hour. That's $250 an hour. Running continuously, $180,000 a month. And that's before you add the metric to catch the failure mode that's actually keeping your VP up at night.

We've watched this play out at a couple of customers now. The conversation that follows is always the same. Someone from finance walks over, very politely, and asks what exactly this line item is supposed to be. Then they ask whether you're aware it's growing faster than the agent itself.

That's when you find out your eval bill can outgrow your agent bill. Not rival it. Outgrow it.

Two unpalatable choices

So now the eval line item is real. You have two relief valves, and both of them are ugly.

Switch to a cheaper judge model. Easy on paper. In practice, you've just torched your eval engineering cycle. The new judge has different biases, different failure modes, and different calibration. Every prompt you'd tuned to ~90% accuracy needs to be re-tuned against a different baseline. Different few-shot examples. Different disagreement patterns. The Dropbox engineering team walked through what this actually looks like when they swapped judges on Dropbox Dash, and the short version is: it's a multi-week recalibration project, not a flag you flip on a Friday.

Start sampling and accept the blind spots. This one is cheaper still, and it's the path most teams quietly take. It works fine for deterministic data feeds where the failure modes are uniformly distributed. It doesn't work for nondeterministic agents, where the bad thing you're trying to catch is rare by definition. If you're sampling at 5% and your worst failure case shows up in 0.1% of traffic, you won't see that failure until it has already happened many times over. Probably to real users. Possibly to your most important customer.

The sampling option is the one we want to flag the hardest, because on the surface it looks like a reasonable cost-coverage tradeoff, and it isn't. The agent that says the wrong thing to a customer, the one that confidently fabricates a number on an earnings call, the one that mishandles a refund flow at 2am: those are exactly the failures that live below your sampling threshold. They're invisible until they're not.

Anthropic's write-up on building effective agents gets at the same shape of problem from a different angle. Agent failures don't concentrate cleanly in a single answer the way completion-model failures do. They smear across multi-step traces, which means the failure surface gets bigger and stranger as the agent becomes more capable. The smarter the agent, the more places things can go wrong, and the more invisible those wrong things tend to be.

So the cheap options aren't actually cheap. They just push the bill somewhere harder to see.

SLM to the rescue

Which is why everyone with this problem eventually arrives at the same answer: small language models.

A small evaluator (3 to 8 billion parameters, running on a single GPU) does for roughly two cents what a frontier model does for a dollar. Same job. Read a trace, decide whether it passed or failed the metric. Same accuracy ceiling on the right kind of metric. The bill, however, is a different bill entirely.

The catch is on your specific data.

Out of the box, a small model evaluating your specific domain is going to underperform a frontier judge. Your specific policy. Your specific voice agent. Your specific definition of "good." Small models don't have the same generalization headroom as a frontier LLM, and the smaller the model, the more the gap shows up on domain-specific work. If you want the SLM economics, you have to fine-tune the SLM on your data. There's no free version of this.

Which puts you in fine-tuning territory. LoRA adapters, labeled training set, held-out test set, GPU pipeline, evaluation harness. None of it is new. The LoRA paper has been out since 2021, and most data science teams have done some version of this before. The complication is that most teams haven't fine-tuned an evaluator before. They've fine-tuned a generator, which is a related problem and a different one. Tuning a model to write a paragraph and tuning a model to say true or false about a paragraph look the same from the outside. They aren't the same job, and the recipe for one isn't quite the recipe for the other.

The actually interesting trick is downstream of the architecture choice anyway. Once you've decided to use an SLM, the question becomes how to stop it from generating.

A standard 8B language model has a vocabulary of around 128,000 tokens. When it answers, it samples across all of them. We train a Luna evaluator to put almost all of its probability mass on two tokens, true and false. Then, a post-processor pulls those two probabilities and normalizes them into a confidence score.

# Algorithm 1 from the Luna 2 paper: Binary-class confidence normalization.
probs = softmax(logits)
p_true  = probs[true_token_id]
p_false = probs[false_token_id]
confidence = p_true / (p_true + p_false)
Diagram: One forward pass. A small language model with a 128,000-token vocabulary collapses to two tokens — true and false — via softmax, and a post-processor normalizes the probabilities into a single confidence score at roughly 150ms on one GPU.

One forward pass. One token. One normalization. No decoding loop, no chain-of-thought, no multi-token generation tax. That's where the 98% cost reduction and the 150-millisecond latency come from, and it's worth saying explicitly: those numbers aren't a function of the model being small. They're a function of the inference path being short.

The math problem is solved. The harder problem is the data.

The fine-tuning gap

Here's where most teams stall.

Fine-tuning works when you have data, and the classic recipe wants tens of thousands of labeled examples spread evenly across the failure modes you care about. It assumes you have an annotation team. It assumes you have enough production traffic to mine. For foundational evals (tone, coherence, basic safety, the things that fire on common events), those assumptions hold, and you can usually scrape together what you need from any production log.

The trouble starts as soon as you go past foundational.

The evals that actually move the needle catch rare things. A specific policy violation that occurs in 0.3% of conversations. A tool-call adherence failure that only shows up when the user phrases a request a certain way. A hallucination pattern specific to your domain that nobody outside your company has even seen. By the nature of being valuable, those events are rare in production. Which means they're rare in your annotation queue. Which means by the time you're ready to fine-tune a metric to catch them, your training set is a few hundred labeled examples, and a few hundred is not enough.

So the question is: how do you stretch a few hundred labeled examples into the thousands of training examples a fine-tune actually needs, without losing the signal that made you care about those examples in the first place?

The answer is synthetic data generation. We'd already built most of this for a different product. Generating realistic, distribution-matched test data is what powers Galileo's agent testing, where customers don't have years of production traffic to hand us, and we have to manufacture the test cases. Same shape of problem, different target. For Luna Studio, we point that pipeline at training sets instead of test sets.

That's the engine underneath what comes next.

What Luna Studio does

Luna Studio is a turnkey workflow for training custom SLM judges inside your environment. You bring 300 to 500 labeled samples, and you walk out with a deployable, registered evaluator. Days, not weeks, and no vendor team sitting in your data once handover is done.

Four pieces under the hood.

The architecture

A Luna evaluator is a small base model (3 to 8B parameters), a LoRA adapter, and the post-processor we already walked through. Fine-tuned, it behaves like a classifier, not a generator. One forward pass, one token, one normalization. The full architecture is documented in our Luna 2 paper.

We chose to fine-tune into the same architecture as the preset Luna metrics on purpose. The practical effect is that a trained model from Luna Studio shows up in your Galileo console the same way a preset metric does. Same dropdown. Same eval surface. Same runtime guardrail layer. No second integration project, no second dashboard, no juggling two different ways of looking at the same kind of thing.

The synthetic data pipeline

You bring 300 to 500 labeled examples. Luna Studio uses 10% to 20% of them as seed data and holds the rest out for evaluation. If we trained on the full set, the eval signal would leak into the model, and you'd have no way to tell whether the metric was actually generalizing or just memorizing what you handed us. So our synthetic data pipeline never sees more than about fifty samples of the real test set. From those fifty seeds, it synthesizes thousands of training examples.

Diagram: Luna Studio synthetic data pipeline. A few hundred labeled examples are split into seeds, a synthetic data generator expands the seeds, and the LLM-as-judge prompt supervises the output, producing thousands of fine-tune-ready training examples.

The supervisor is the LLM-as-judge prompt that defined the metric in the first place. It scores the synthetic outputs, filters the ones that don't match the metric definition, and shapes the distribution so it actually resembles the production traffic the metric will see in the wild.

A couple of things about this pipeline that are worth saying out loud.

First, fine-tuning is engineering, and generating training data is research. The fine-tuning takes hours once the data exists. The pipeline that produces the data took months to build, and most of that time wasn't pushing gradients around. It was figuring out how to generate plausible negative examples for any metric a customer might describe, from boolean policy compliance to PCI redaction to tool-call adherence on a voice agent. Each metric has its own shape of "wrong" and the pipeline has to handle all of them. The pipeline itself is the part we don't open-source. The architecture underneath it is in the paper.

Second, the judge prompt is the ceiling. A Luna model fine-tuned on synthetic data inherits whatever errors are baked into the judge that supervised that data. Which is why Luna Studio sits downstream of Galileo's Auto-Tune in the eval engineering loop, not in place of it. You tune the prompt first, you get it where you want it, and then Luna distills it.

Where it runs

Inside your environment. Always. Vertex AI on Google Cloud, Azure ML, SageMaker on AWS, or your own Kubernetes cluster if you'd rather host the pipeline yourself. The base model gets pulled into your environment, the synthetic data is generated there, the LoRA adapter is trained there, and the trained adapter weights live in your object store. We don't see any of it.

This part of the design was harder than it should have been. We initially built Luna fine-tuning as a feature inside Galileo, which meant it ran on the same cluster as the rest of the platform, which meant our customers' platform teams had to procure new GPUs to run it. That conversation stalled, often. We spent longer than we'd like to admit before figuring out that fine-tuning GPUs and platform GPUs are funded out of different budgets at most enterprises, and the right move was to ship Luna Studio as a separate product that runs on the data science team's existing fine-tuning compute. Once we accepted that the model needed to go to the data and not the other way around, the deployment topology fell out naturally.

Two surfaces

A UI for AI engineers who just want to push a button. The hyperparameters that aren't safe to fiddle with are abstracted, the defaults are reasonable, the workflow is linear. Most users live here.

An SDK for data scientists who want every knob. Python package, YAML config, source code that's readable, a fine-tuning run that's reproducible from a file in git. The hyperparameters the UI hides (seed sampling rate, learning rate, LoRA rank, epoch count, the prompt template itself) all live in the config.

training:
  model:
    name: "mistralai/Mistral-7B-Instruct-v0.3"
  training:
    num_train_epochs: 1
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8
    max_seq_length: 4096
  prompt_template: |
    You will be given a text. Determine if the text is toxic or not.
    Message: {input}
    Respond with "true" if toxic, "false" if not.
  output:
    model_name: "toxicity-luna-model"

Both surfaces produce the same artifact, and both register it into the same Galileo eval surface. The choice between them is mostly about which team is driving the workflow and how much they want to see under the hood.

The shorter version

Evals aren't the optional layer on top of an AI program. They are the program. Without them, you can't tell whether your agent is getting better, getting worse, or about to land you in a Bloomberg headline.

But evals at production scale have been a bill problem and a coverage problem at the same time, and neither side has been comfortable to live with. You either pay frontier-model rates on every call until finance stops you, or you sample and pray the failures you most need to catch happen to live in the slice you're looking at. Neither is sustainable once the agent is doing real work.

SLMs solve the bill side of that. Custom SLMs trained on your data solve the accuracy side. Luna Studio is how you actually get from "we should do this" to "we have a working evaluator in production" without building a six-month platform engineering project to get there.

What you get:

  • A way past the data problem. A few hundred labeled examples is enough.

  • A way past the scaling problem. SLM economics let you evaluate every trace, not just samples.

  • A way past the accuracy problem. The model is trained on your data, your metric, your definition of "good."

If you're running production agents and your eval bill is starting to look like your agent bill, talk to us. The Luna 2 architecture is documented in our research paper.

Joyal Palackel