Scaling Evaluation Infrastructure

Pratik Bhavsar

Evals & Leaderboards @ Galileo Labs

[intro]LLM judges work in development. They break at production scale — on cost, on latency, on coverage. Small language models are how you fix that.[/intro]

In late 2024, Meta shipped a safety classifier that runs on a Motorola Razr. Not a safety classifier that calls an API. Not one that routes traffic to a GPU cluster. A model that lives on the phone itself, in 440 megabytes of storage, classifying content as safe or unsafe at 30 tokens per second.

That model, Llama Guard 3-1B-INT4, outperforms GPT-4 on the MLCommons safety taxonomy. Read that again. A model compressed to fit on a budget smartphone beats a trillion-parameter frontier model at safety classification. Not by being smarter in general. By being smarter at one thing.

This is the SLM insight, and it rewrites the economics of everything covered in previous chapters.

SLMs close the gap between generalist capability and specialist economics by specializing. When a team runs GPT-4 as a toxicity judge, it's paying for the model's knowledge of Shakespeare, organic chemistry, and Baroque architecture. It's paying for a Swiss Army knife to do the work of a scalpel. An SLM trained specifically for toxicity detection dedicates its entire capacity to that single judgment. NVIDIA Research puts the cost reduction at 10-30×. The latency improvement is even more dramatic: 15-150 milliseconds instead of 1-3 seconds, which is the difference between "not feasible for real-time guardrails" and "runs before the user sees the response."

The fintech compliance team from earlier chapters illustrates the shift. Their 94% accurate LLM judge costs $30,000 per day at full coverage. An SLM judge trained on the same SME-refined criteria costs $600 per day and runs faster. More importantly, it can evaluate every conversation, not 500 out of a million. 

This chapter is about making that transition: taking evaluation systems that work in development and making them work at production scale.

When LLM Judges Stop Working

Three forces conspire against LLM judges at production scale.

[callout title="The cost ceiling"]
A single evaluation with chain-of-thought reasoning consumes 1,000-2,000 tokens. At $0.01-0.05 per evaluation, a customer service operation handling 1 million conversations monthly faces $10,000-50,000 in evaluation costs—assuming one metric. Most applications need multiple metrics: safety, accuracy, tone, and compliance. The costs compound fast.

The natural response is to evaluate 1% instead of 100%. But sampling creates blind spots. If failures cluster in specific user segments or query types, random sampling misses entire categories of problems.
[/callout]

[callout title="The latency barrier"]
LLM inference takes 1-3 seconds per evaluation. For batch processing, this is fine. For real-time guardrails that must evaluate a response before showing it to users, it's not. Users won't wait 3 seconds for every chatbot response.
[/callout]

[callout title="The prompting ceiling"]
Chapter 3 showed how SME refinement pushes accuracy from 70% to 94%. But the last 5-6% often resist further prompting. You add more examples, more criteria, and more edge case handling. The prompt grows. Latency increases. And accuracy plateaus as the model's attention fragments across too many instructions.

Prompting asks a trillion-parameter general-purpose model to simulate a specialist. At some point, you need an actual specialist.
[/callout]

[callout title="The subsampling trap"]
Teams facing this wall resort to sampling and extrapolation. "Evaluate 1% and multiply by 100." The math breaks for three reasons.

First, rare failure modes don't survive sampling. If a specific failure occurs in 0.1% of traffic, a 1% sample contains only ~10 examples—not enough to detect reliably, let alone characterize.

Second, failures aren't randomly distributed. They correlate with user segments, times, query types, and conversation length. Random sampling breaks these correlations. You might sample heavily from simple queries while missing complex ones where failures concentrate.

Third, the worst failures are the rarest. Compliance violations, safety incidents, and catastrophic errors are tail events by definition. Random sampling systematically undersamples tails.

For mission-critical applications, you need 100% coverage. The only way to get there economically is to change the underlying cost structure.
[/callout]
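
The first point is just arithmetic. A quick check of the numbers quoted above:

```python
# How many rare-failure examples does a 1% random sample actually catch?
# Figures from the text: 1M conversations, 0.1% failure rate, 1% sample.
traffic = 1_000_000
failure_rate = 0.001    # rare failure mode: 0.1% of traffic
sample_rate = 0.01      # "evaluate 1% and multiply by 100"

failures_in_traffic = traffic * failure_rate            # 1,000 real failures
failures_in_sample = failures_in_traffic * sample_rate  # ~10 land in the sample

print(int(failures_in_sample))  # 10
```

Ten examples is enough to notice something odd, not enough to characterize it, and far too few if the failures also cluster by segment.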

The SLM Unlock

SLMs solve the scaling problem through specialization. Instead of prompting an LLM to act like a compliance detector, you train a 3-8B parameter model to actually be one.

An LLM judge running a toxicity check uses perhaps 0.1% of GPT-4's capabilities; the rest of the model's knowledge sits idle on every request. The trade-off is intentional: SLMs sacrifice generalization for speed and accuracy on narrow tasks.

NVIDIA Research shows serving a 7B SLM costs 10-30× less in latency, energy, and compute than 70-175B LLMs. The fintech team's $30,000/day evaluation budget drops to $600/day. Suddenly, 100% coverage is cheaper than 1% sampling with LLMs.

| Metric | LLM Judge | SLM Judge |
|---|---|---|
| Cost per 1M evals | $10,000-50,000 | $200-1,000 |
| Latency | 1-3 seconds | 15-150ms |
| Real-time guardrails | Not feasible | Feasible |
| 100% coverage | Economically prohibitive | Standard practice |
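
The cost rows follow directly from per-evaluation pricing; a back-of-envelope check, with rates taken from the figures above:

```python
# Monthly evaluation cost at 1M evals, using the per-eval ranges quoted above.
evals_per_month = 1_000_000

llm_low, llm_high = evals_per_month * 0.01, evals_per_month * 0.05      # $0.01-0.05/eval
slm_low, slm_high = evals_per_month * 0.0002, evals_per_month * 0.001   # $200-1,000 per 1M

print(f"LLM judge: ${llm_low:,.0f}-${llm_high:,.0f}")  # $10,000-$50,000
print(f"SLM judge: ${slm_low:,.0f}-${slm_high:,.0f}")  # $200-$1,000
```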

Fine-tuning often improves accuracy, too. The model dedicates its entire capacity to your task rather than encoding knowledge about poetry, history, and millions of irrelevant topics. Teams can get 5-10% accuracy gains over prompted LLM judges after fine-tuning on domain data.

Where SLMs excel: Binary and multi-class classification with clear criteria (Is this toxic? Does this contain PII?), high-volume, consistent evaluation, real-time guardrails requiring sub-100ms latency, privacy-sensitive deployments where data cannot leave your infrastructure.

Where SLMs struggle: Subjective quality assessment where "good" depends on context, novel failure modes they weren't trained to catch, rapidly evolving criteria that would require constant retraining, complex multi-factor judgments with context-dependent trade-offs.

Industrial SLMs for Evals

The hypothesis that fine-tuned SLMs can match or beat frontier LLMs on narrow evaluation tasks isn't theoretical. Multiple production systems prove it daily.

Llama Guard (Meta)

Built on Llama 3.1-8B, Llama Guard 3 outperforms GPT-4 on safety classification while achieving significantly lower false positive rates.

| Model | Size | Key Result |
|---|---|---|
| Llama Guard 3 | 8B | Beats GPT-4 on MLCommons safety taxonomy |
| Llama Guard 3-1B-INT4 | 440MB | 7× compression, runs on mobile devices |
| Llama Guard 3 Vision | 11B | Multimodal safety for image+text |

The architectural insight: Llama Guard turns generation into classification by reading the probability of the first output token and treating it as the "unsafe" class probability. This is why an 8B model beats a trillion-parameter one: it's not trying to be general-purpose.

The 1B quantized version demonstrates compression potential. Distillation from the 8B teacher yields only a 1.3% F1 drop while reducing model size from 2.8GB to 440MB. The model runs on smartphones via ExecuTorch.
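
The first-token trick is small enough to show directly. A minimal sketch of the math, with toy logits standing in for a real model's output (token ids and values are illustrative, not Llama Guard's actual vocabulary):

```python
import math

def unsafe_probability(first_token_logits, safe_id, unsafe_id):
    """Llama Guard-style classification: softmax over just the 'safe' and
    'unsafe' token logits of the first generated token, read off as a
    class probability instead of generating any text."""
    s = first_token_logits[safe_id]
    u = first_token_logits[unsafe_id]
    m = max(s, u)  # subtract the max for numerical stability
    e_s, e_u = math.exp(s - m), math.exp(u - m)
    return e_u / (e_s + e_u)

# Toy logits over a 5-token vocab; ids 0/1 play "safe"/"unsafe".
logits = [2.0, 0.5, -1.0, 0.3, 0.0]
p_unsafe = unsafe_probability(logits, safe_id=0, unsafe_id=1)
print(round(p_unsafe, 3))  # 0.182 -> well below a 0.5 block threshold
```

One forward pass, one probability, no decoding loop, which is where the latency advantage comes from.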

Luna (Galileo)

Galileo's Luna takes the evaluation-specific SLM approach further. Built on fine-tuned Llama and Mistral models (3B and 8B variants), Luna outputs normalized log-probabilities rather than generated text, enabling sub-50ms evaluations.

Key architectural choices: single-token output mode for maximum throughput, LoRA adapters for multi-metric evaluation on a shared base model, calibrated confidence scores that correlate with actual accuracy, and optimization for evaluation-specific tasks like hallucination detection and context adherence.

Luna demonstrates that purpose-built beats general-purpose. A model trained specifically for "does this response contradict the source?" outperforms GPT-5 prompted with the same question, at 1/50th the cost and 20× the speed.

PHUDGE

PHUDGE, a fine-tuned Phi-3 model (3.8B parameters), achieved state-of-the-art results in 2024 on 4 evaluation benchmarks, surpassing every existing model in both accuracy and throughput.

The key insight: causal modeling (text generation) is often the wrong approach for evaluation tasks. Converting evaluation from generation to classification improves both speed and accuracy. The model's entire capacity focuses on judgment rather than explanation.

Prometheus 2

Prometheus 2 provides an open-weight alternative. The 7B variant achieves 0.6-0.7 Pearson correlation with GPT-4 evaluations while requiring only 16GB VRAM—runnable on consumer GPUs.

Key techniques that transfer to custom SLMs: merged training on both direct assessment and pairwise ranking produces unified evaluators, swap augmentation (reversing response order) reduces position bias, and reference support + reference drop during training improves robustness.

Anatomy of an SLM Judge

SLM judges are decoder-only transformer models in the 1-8B parameter range, fine-tuned on task-specific data. The base model provides language understanding; fine-tuning adds task-specific judgment. Unlike LLM judges that produce reasoning chains, SLM judges often output only the minimal tokens needed for classification.

Context levels determine what gets evaluated. Span-level evaluates a single LLM call or retrieval result ("Is this response toxic?"). Trace-level evaluates the entire request-response cycle, catching compounding errors in agentic workflows. Session-level evaluates multiple conversation turns for boundary violations or goal completion, but requires long context and is most expensive.

Most production systems use span-level models for speed, with trace or session-level evaluation on sampled traffic.

Output modes trade latency for interpretability. Single-token mode (15-50ms) extracts logits for True/False only, which is ideal for real-time guardrails. Verdict-only mode (50-100ms) returns PASS/FAIL without explanation, used for routing decisions. Reasoning mode (200-500ms) adds explanations for audit trails and debugging.

Building an SLM Judge

Training Data

Your SME-labeled examples from Chapter 3 become training data. The ground truth dataset, disagreement analyses, and criteria library translate directly.

[callout title="Test set first"]
Before training, create 300-500 manually labeled examples with diverse coverage and at least 100 examples of each class. This must be SME-labeled, not generated. This is your ground truth for measuring whether training succeeded.
[/callout]

[callout title="Training set requirements"]
Target 1,000-10,000 labeled examples with balanced class distribution (roughly 50/50 for binary classification), even if production is skewed. Imbalanced training produces models that default to the majority class.
[/callout]
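
A minimal way to get that balance is to oversample the minority class; a sketch (the helper is mine, not from any library):

```python
import random

def balance_binary(examples, seed=0):
    """Oversample the minority class until the split is 50/50.
    `examples` is a list of (text, label) pairs with labels 0/1."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

skewed = [("ok", 0)] * 95 + [("violation", 1)] * 5  # 95/5 production skew
balanced = balance_binary(skewed)
print(sum(1 for _, y in balanced if y == 1), len(balanced))  # 95 190
```

Duplicating minority examples is the crudest option; synthetic augmentation (below) is the better one once you can verify quality.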

[callout title="Expand through augmentation"]
Synthetic examples generated by LLMs can expand 1,000 manually labeled examples to 10,000. The key is SME verification: generate candidates, have experts label a sample, and keep only those that align with ground truth. Synthetic data without verification encodes the same biases you're trying to correct.
[/callout]
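
The generate-then-verify loop can be sketched with stand-in callables; `generate_fn` and `verify_fn` here are placeholders for an LLM call and the SME-aligned check described above, not real APIs:

```python
def augment_with_verification(seeds, generate_fn, verify_fn, target_n):
    """Expand a labeled set with synthetic candidates, keeping only those
    the verifier accepts. Rejected candidates are simply discarded."""
    out = list(seeds)
    attempts, i = 0, 0
    while len(out) < target_n and attempts < 50 * target_n:
        candidate = generate_fn(seeds[i % len(seeds)])  # vary which seed we riff on
        if verify_fn(candidate):
            out.append(candidate)
        i += 1
        attempts += 1
    return out

# Toy run: a trivial "paraphraser" and a verifier that rejects empty text.
seeds = [("My SSN is 123-45-6789", 1)]
gen = lambda ex: (ex[0] + " (rephrased)", ex[1])
verify = lambda ex: len(ex[0]) > 0
data = augment_with_verification(seeds, gen, verify, target_n=5)
print(len(data))  # 5
```

The attempt cap matters in practice: a strict verifier against a weak generator should terminate with a short dataset, not loop forever.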

[callout title="Include hard negatives"]
Examples that look like violations but aren't are as important as actual violations. "I'd recommend diversifying across asset classes" looks like investment advice but is educational content. Without hard negatives, models learn superficial patterns rather than meaningful distinctions.
[/callout]

[callout title="Document everything"]
Track the source of each example (production, synthetic, SME-created), the labeler, the date, and the criteria version. When you retrain months later, you'll need this provenance to understand what the model learned and why.
[/callout]

Model Selection

| Size | Latency (L4 GPU) | Best For |
|---|---|---|
| 3B | 15-60ms | Real-time guardrails, cost-sensitive |
| 8B | 50-150ms | Async monitoring, strict accuracy (>95%) |

Base model selection matters less than fine-tuning quality. Llama 3.1 variants are well-supported. The differences between base models shrink after task-specific fine-tuning.

Fine-Tuning

Full fine-tuning updates all model weights. Highest accuracy but requires more compute (8-16 GPU hours for an 8B model) and produces complete checkpoints (15-30GB). Use when maximum accuracy matters.

LoRA fine-tuning updates only 1-5% of weights while keeping the base model frozen. Training is faster (1-2 GPU hours), produces smaller adapter files (10-50 MB), and allows stacking multiple adapters on a single base model. Use when you have multiple evaluation tasks.

Training configuration: learning rate 1e-5 to 5e-5, batch size 8-32, 3-5 epochs. Watch for validation loss diverging from training loss—that signals overfitting. Stop when validation metrics plateau.
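
The plateau check can be automated as a standard early-stopping rule; a minimal sketch (thresholds are illustrative):

```python
def should_stop(val_losses, patience=3, min_delta=1e-3):
    """Stop when validation loss hasn't improved by at least min_delta
    in the last `patience` epochs -- the plateau signal described above."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta

print(should_stop([1.0, 0.8, 0.7, 0.60, 0.55, 0.50]))  # False: still improving
print(should_stop([1.0, 0.8, 0.7, 0.70, 0.70, 0.70]))  # True: plateaued
```

Pair this with a divergence check (validation loss rising while training loss falls) to catch overfitting early rather than at epoch 5.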

A single A100 or L4 GPU can fine-tune an 8B model in hours. Cloud spot instances offer 60-80% discounts. Compute cost typically ranges from $50-500.

Validation

Validate on production samples. Hold out 500+ examples from actual production traffic, labeled by SMEs. If production F1 differs from test F1 by more than 3-5%, your training data doesn't represent production.

Check calibration. A model outputting 0.95 confidence on wrong predictions is dangerous. Predictions with 80% confidence should be correct 80% of the time.
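
A simple reliability-diagram-style check makes this concrete; the helper below is a sketch, not a library function:

```python
def calibration_report(preds, n_bins=5):
    """Bucket (confidence, was_correct) pairs and compare mean confidence
    to empirical accuracy per bucket; large gaps mean miscalibration."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    report = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            report.append((round(mean_conf, 2), round(accuracy, 2), len(b)))
    return report

# Toy predictions: the high-confidence bucket is only 67% accurate.
preds = [(0.9, True), (0.9, True), (0.9, False), (0.1, False)]
print(calibration_report(preds))  # [(0.1, 0.0, 1), (0.9, 0.67, 3)]
```

On real validation data you'd want hundreds of examples per bucket; temperature scaling or Platt scaling are the usual fixes when the gaps are large.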

Run shadow mode. Route all traffic through the new SLM while keeping your existing system authoritative. Compare outputs. Investigate every disagreement. Shadow mode reveals production edge cases no test set captures.
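
The shadow-mode comparison reduces to counting disagreements; a minimal sketch:

```python
def shadow_compare(authoritative, shadow):
    """Agreement rate plus the indices where the shadow SLM disagrees
    with the current system; each disagreement is a case to review."""
    disagreements = [i for i, (a, b) in enumerate(zip(authoritative, shadow))
                     if a != b]
    agreement = 1 - len(disagreements) / len(authoritative)
    return agreement, disagreements

prod = ["pass", "fail", "pass", "pass"]
slm  = ["pass", "fail", "fail", "pass"]
rate, to_review = shadow_compare(prod, slm)
print(rate, to_review)  # 0.75 [2]
```

The disagreement indices, not the agreement rate, are the valuable output: each one is either an SLM error to fix or an LLM error the SLM just caught.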

Serving at Scale

[callout title="Deployment options"]
Self-hosted inference using vLLM, TGI, or Triton offers maximum control and the lowest per-inference cost at high volume. Managed inference through Modal, Replicate, or Baseten abstracts infrastructure complexity. On-premise deployment handles air-gapped environments or strict data residency.
[/callout]

[callout title="Optimization techniques"]
Quantization reduces precision from FP16 to INT8, doubling inference speed with 1-2% accuracy loss. Batching groups 8-32 requests for simultaneous processing, tripling throughput for async monitoring. Request-aware load balancing matters more than most teams realize—naive round-robin wastes 40-60% of GPU capacity because inference time varies dramatically with input size.
[/callout]

[callout title="Multi-metric serving with adapters"]
Real applications need multiple evaluation metrics. Running separate models quadruples infrastructure costs. The adapter pattern solves this: one base model stays loaded in GPU memory, multiple LoRA adapters (10-50MB each) swap in as needed. A single GPU serves PII detection, toxicity, compliance, and quality metrics through adapter switching.
[/callout]
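
The routing shape of the adapter pattern can be sketched in a few lines; the base model and adapters here are plain callables standing in for a loaded model and LoRA weights (real stacks, e.g. vLLM's multi-LoRA serving, do the swap on the GPU):

```python
class AdapterRouter:
    """Toy sketch of multi-metric serving: one shared 'base model' stays
    resident, and lightweight per-metric adapters are selected by name."""

    def __init__(self, base):
        self.base = base        # shared forward-pass stand-in
        self.adapters = {}      # metric name -> adapter head

    def register(self, metric, adapter):
        self.adapters[metric] = adapter

    def evaluate(self, metric, text):
        features = self.base(text)          # computed once per request
        return self.adapters[metric](features)

router = AdapterRouter(base=str.lower)
router.register("pii", lambda t: "ssn" in t)
router.register("toxicity", lambda t: "hate" in t)
print(router.evaluate("pii", "My SSN is 123-45-6789"))  # True
```

The point of the shape: adding a fifth metric means registering one small adapter, not provisioning another GPU.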

[callout title="Integration patterns"]
Synchronous evaluation blocks until complete—use for guardrails that must prevent harmful content. Asynchronous evaluation runs after response—use for monitoring and analytics. Batch evaluation processes accumulated data on schedule—use for daily quality reports.
[/callout]

Most production systems combine patterns: synchronous SLM evaluation for safety-critical checks, async evaluation for quality assessment, batch processing for aggregate analytics. The combination captures different time horizons—immediate intervention, operational monitoring, and strategic insights.

Why SLMs Are Harder Than They Look

Training Pitfalls

[callout title="Data quality dominates model quality"]
A perfectly trained model on noisy labels produces noisy predictions. Inconsistent labeling (different SMEs applying different standards) produces a model that mimics that inconsistency. Verify inter-rater reliability before training. If humans don't agree, the model can't learn a coherent pattern.
[/callout]

[callout title="Class imbalance sabotages learning"]
If 95% of examples are "pass," the model learns to always predict "pass" to achieve high accuracy. Balance training sets, but recognize this creates miscalibrated models that predict a 50% failure rate when the production rate is 5%. Post-training calibration is required.
[/callout]
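
That post-training correction has a standard closed form for prior shift; a sketch, assuming the model's scores are otherwise well calibrated:

```python
def prior_shift_correct(p, train_prior, prod_prior):
    """Map a probability learned under one class prior (e.g. a balanced
    50/50 training set) to the true production prior."""
    num = p * prod_prior / train_prior
    den = num + (1 - p) * (1 - prod_prior) / (1 - train_prior)
    return num / den

# A model trained 50/50 that says "0.5 fail" really means the base rate:
corrected = prior_shift_correct(0.5, train_prior=0.5, prod_prior=0.05)
print(round(corrected, 4))  # 0.05
```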

[callout title="Domain shift is invisible until deployment"]
Training data comes from historical traffic. But production evolves—new products launch, user behavior shifts, edge cases emerge. The model's accuracy on test data doesn't predict accuracy on future data. Continuous monitoring for distribution shift is essential.
[/callout]

[callout title="Hard negatives are hard to find"]
Examples that look like positives but aren't don't occur naturally in proportion to their importance. Deliberate curation is required, which takes SME time and domain expertise.
[/callout]

Serving Pitfalls

[callout title="Inference infrastructure is a different skill set"]
Training a model is machine learning. Serving it reliably at scale is systems engineering—GPU memory management, request batching, load balancing, and failover handling. Teams with strong ML skills often underestimate the complexity of serving.
[/callout]

[callout title="GPU utilization is surprisingly hard to optimize"]
Naive deployments waste 40-60% of GPU capacity. Request sizes vary dramatically, but standard load balancers treat all requests equally. Without request-aware routing, some GPUs sit idle while others queue requests.
[/callout]

[callout title="Reliability engineering is non-negotiable"]
What happens when the SLM service fails? If evaluation is in the critical path, you need graceful degradation: fall back to LLM evaluation, rule-based heuristics, or let traffic through with monitoring. Design these fallbacks before failures occur.
[/callout]
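
The fallback chain can be sketched as an ordered try; the evaluator callables are stand-ins:

```python
def evaluate_with_fallback(text, slm_eval, llm_eval, rule_eval):
    """Graceful degradation: try the SLM, then an LLM judge, then rules;
    if everything is down, fail open and flag the gap for monitoring."""
    for name, fn in (("slm", slm_eval), ("llm", llm_eval), ("rules", rule_eval)):
        try:
            return name, fn(text)
        except Exception:
            continue                       # this tier is down; try the next
    return "fail-open", True               # let traffic through, record the outage

def down(_):
    raise RuntimeError("service unavailable")

source, verdict = evaluate_with_fallback("hello", down, lambda t: True, lambda t: True)
print(source, verdict)  # llm True
```

Whether the last resort fails open or closed is a product decision (a safety guardrail should likely fail closed); the point is deciding it before the outage.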

[callout title="Retraining pipelines need automation"]
Models drift. Data distributions shift. Periodic retraining is necessary. Manual retraining is error-prone and often neglected. Automated pipelines that monitor performance, trigger retraining when metrics degrade, validate new models, and deploy safely require significant engineering investment.
[/callout]

Common Failures

Distribution mismatch: Training came from customer support, but production includes sales, legal, and HR conversations.

Spurious pattern learning: All PII examples in training happened to be long messages, so the model learned "long = PII" rather than detecting actual personal information.

Confidence miscalibration: High confidence on wrong predictions erodes user trust. Fix with explicit calibration using held-out data.

Catastrophic forgetting: Retraining on new data causes the model to forget previously learned patterns. Accuracy improves on new cases but regresses on previously working cases. Maintain comprehensive regression tests.

When to Use What

SLMs shine in one regime: high volume with stable, well-defined criteria.

Use LLM judges when: Volume below 10K/day (operational overhead exceeds savings), criteria still evolving (prompts iterate in minutes; models retrain in hours), reasoning chains needed for auditability, exploring new failure modes that SLMs weren't trained to detect.

Use SLM judges when: Volume above 10K/day, real-time guardrails required (sub-100ms latency), clear classification criteria that have been stable for 2+ months, privacy requires local processing, and tasks have clear ground truth where humans consistently agree.

Use hybrid approaches when: 100% SLM coverage with LLM escalation for low-confidence cases, different metrics have different requirements (SLMs for PII/toxicity, LLMs for quality), transitioning between approaches via shadow mode validation.
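
The confidence-escalation hybrid is a few lines; `slm` and `llm` here are stand-in callables returning (verdict, confidence) and a verdict respectively:

```python
def hybrid_judge(text, slm, llm, threshold=0.8):
    """SLM handles every request; low-confidence verdicts escalate to an
    LLM judge. 100% coverage at SLM cost, LLM quality on the hard cases."""
    verdict, confidence = slm(text)
    if confidence >= threshold:
        return verdict, "slm"
    return llm(text), "llm"

confident = lambda t: ("pass", 0.95)
unsure    = lambda t: ("pass", 0.55)
llm_judge = lambda t: "fail"

print(hybrid_judge("x", confident, llm_judge))  # ('pass', 'slm')
print(hybrid_judge("x", unsure, llm_judge))     # ('fail', 'llm')
```

Tune the threshold against the calibration data from validation: it directly sets the fraction of traffic that pays LLM prices.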

The Transition Path

The transition path follows NVIDIA Research's LLM-to-SLM conversion approach: log all LLM calls, curate task-specific data, cluster by task type, fine-tune SLMs, iterate. The key insight is that agentic interactions themselves generate the training data you need for specialization.

Teams that have built one SLM judge typically build subsequent ones in 2-4 weeks. The infrastructure and expertise transfer.

The Takeaway

LLM judges hit a scaling wall. Cost and latency force sampling, and sampling misses failures. SLMs break through by trading generalization for efficiency.

The path: refine with LLMs until criteria stabilize (Chapter 3), accumulate labeled data through SME cycles, train an SLM when volume or latency requirements exceed LLM capabilities, deploy with monitoring for drift, maintain hybrid systems where SLMs handle volume and LLMs handle edge cases.

The mistakes that AIs make are unpredictable and unintuitive. The only defense is comprehensive monitoring. SLMs make comprehensive monitoring economically feasible.
