
    Chapter 04 · Apr 8, 2026

    Scaling Evals with SLMs

    Pratik Bhavsar

    Evals & Leaderboards @ Galileo Labs

    "The mistakes that AIs make are unpredictable (same AI makes different mistakes at different points) but also unintuitive (we don't have a good model of what mistakes they will make)."

    — Dwarkesh Patel

    In late 2024, Meta shipped a safety classifier that runs on a Motorola Razr. Not a safety classifier that calls an API. Not one that routes traffic to a GPU cluster. A model that lives on the phone itself, in 440 megabytes of storage, classifying content as safe or unsafe at 30 tokens per second.

    That model, Llama Guard 3-1B-INT4, outperforms GPT-4 on the MLCommons safety taxonomy. Read that again. A model compressed to fit on a budget smartphone beats a trillion-parameter frontier model at safety classification. Not by being smarter in general. By being smarter at one thing.

    This is the SLM insight, and it rewrites the economics of everything covered in previous chapters.

    SLMs win by specializing. When a team runs GPT-4 as a toxicity judge, it's paying for the model's knowledge of Shakespeare, organic chemistry, and Baroque architecture. It's paying for a Swiss Army knife to do the work of a scalpel. An SLM trained specifically for toxicity detection dedicates its entire capacity to that single judgment. NVIDIA Research reports that serving costs drop by 10-30×. The latency improvement is even more dramatic: 15 to 150 milliseconds instead of 1 to 3 seconds, which is the difference between "not feasible for real-time guardrails" and "runs before the user sees the response."

    The fintech compliance team from earlier chapters illustrates the shift. Their 94% accurate LLM judge costs $30,000 per day at full coverage. An SLM judge trained on the same SME-refined criteria costs $600 per day and runs faster. More importantly, it can evaluate every conversation, not 500 out of a million.

    This chapter is about making that transition: taking evaluation systems that work in development and making them work at production scale.

    The Specialization Trade-off

    Swiss Army knife vs. scalpel. General-purpose vs. purpose-built.

    LLM Judge: $30,000/day · 1-3s latency · 1% sampling
    SLM Judge: $600/day · 15-150ms latency · 100% coverage

    The Wall

    When LLM Judges Stop Working

    Three forces conspire against LLM judges at production scale.

    01. The cost ceiling

    A single evaluation with chain-of-thought reasoning consumes 1,000-2,000 tokens. At $0.01-0.05 per evaluation, a customer service operation handling 1 million conversations monthly faces $10,000-50,000 in evaluation costs—assuming one metric. Most applications need multiple metrics: safety, accuracy, tone, and compliance. The costs compound fast. The natural response is to evaluate 1% instead of 100%. But sampling creates blind spots. If failures cluster in specific user segments or query types, random sampling misses entire categories of problems.

    02. The latency barrier

    LLM inference takes 1-3 seconds per evaluation. For batch processing, this is fine. For real-time guardrails that must evaluate a response before showing it to users, it's not. Users won't wait 3 seconds for every chatbot response.

    03. The prompting ceiling

    Chapter 3 showed how SME refinement pushes accuracy from 70% to 94%. But the last 5-6% often resist further prompting. You add more examples, more criteria, and more edge case handling. The prompt grows. Latency increases. And accuracy plateaus as the model's attention fragments across too many instructions. Prompting asks a trillion-parameter general-purpose model to simulate a specialist. At some point, you need an actual specialist.

    The subsampling trap

    Teams facing this wall resort to sampling and extrapolation. "Evaluate 1% and multiply by 100." The math breaks for three reasons.

    1. Rare failure modes don't survive sampling. If a specific failure occurs in 0.1% of 1 million conversations, a 1% sample contains only about 10 examples: not enough to detect reliably, let alone characterize.

    2. Failures aren't randomly distributed. They correlate with user segments, times, query types, and conversation length. A uniform random sample mirrors the overall traffic mix, so it draws mostly from simple, common queries and only a handful from the complex ones where failures concentrate.

    3. The worst failures are the rarest. Compliance violations, safety incidents, and catastrophic errors are tail events by definition, and a small random sample contains too few of them to measure.

    For mission-critical applications, you need 100% coverage. The only way to get there economically is to change the underlying cost structure.
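
    The arithmetic behind the sampling argument can be checked directly. The sketch below assumes the chapter's running example of 1 million conversations and treats each conversation as sampled independently; the function names and the 50-failure segment are illustrative, not taken from any library or dataset.

```python
def expected_sampled_failures(total: int, failure_rate: float, sample_rate: float) -> float:
    """Expected number of failing conversations that land in a uniform random sample."""
    return total * failure_rate * sample_rate


def prob_segment_missed(failures_in_segment: int, sample_rate: float) -> float:
    """Probability that none of a segment's failures are sampled (independent sampling)."""
    return (1.0 - sample_rate) ** failures_in_segment


# 1M conversations, a failure mode in 0.1% of traffic, 1% sampling: about 10 examples.
print(expected_sampled_failures(1_000_000, 0.001, 0.01))

# Hypothetical niche segment with only 50 failures: missed entirely more often than not.
print(round(prob_segment_missed(50, 0.01), 3))
```

    With only 50 failures in a segment, a 1% sample has roughly a 60% chance of containing none of them, which is the blind spot the list above describes.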

    The Scaling Wall

    Daily evaluation costs as volume increases from 10K to 1M conversations

    [Chart: daily evaluation cost ($0 to $30K) versus daily evaluation volume (0 to 1M), with the break-even point marked]
    LLM Judge (GPT-4 class) — $0.03/eval
    SLM Judge (8B fine-tuned) — $500/day infra + $0.0006/eval

    LLM Economics

    Pure variable cost. Every evaluation costs the same. No infrastructure to manage, but costs scale linearly forever.

    SLM Economics

    Fixed infrastructure cost (GPU hosting) + low marginal cost per evaluation. High initial investment, but dramatically cheaper at scale.
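
    The two cost structures can be compared with a few lines of arithmetic. This sketch uses only the per-evaluation and infrastructure figures from the chart legend above; the function names are illustrative, and real pricing will differ by provider and hardware.

```python
LLM_COST_PER_EVAL = 0.03      # GPT-4 class judge, from the chart legend
SLM_FIXED_PER_DAY = 500.0     # GPU hosting, from the chart legend
SLM_COST_PER_EVAL = 0.0006    # marginal cost per evaluation, from the chart legend


def llm_daily_cost(evals_per_day: int) -> float:
    """Pure variable cost: scales linearly with volume."""
    return LLM_COST_PER_EVAL * evals_per_day


def slm_daily_cost(evals_per_day: int) -> float:
    """Fixed infrastructure plus a small marginal cost."""
    return SLM_FIXED_PER_DAY + SLM_COST_PER_EVAL * evals_per_day


def break_even_volume() -> float:
    """Daily volume at which the two cost curves intersect."""
    return SLM_FIXED_PER_DAY / (LLM_COST_PER_EVAL - SLM_COST_PER_EVAL)


for volume in (10_000, 100_000, 1_000_000):
    print(volume, round(llm_daily_cost(volume)), round(slm_daily_cost(volume)))
print("break-even:", round(break_even_volume()))
```

    Under these figures the curves cross at roughly 17,000 evaluations per day. Below that volume the fixed GPU cost dominates and the LLM judge is cheaper; above it the gap widens with every additional evaluation.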

    The Unlock

    The SLM Unlock

    SLMs solve the scaling problem through specialization. Instead of prompting an LLM to act like a compliance detector, you train a 3-8B parameter model to actually be one.

    An LLM judge running a toxicity check uses perhaps 0.1% of GPT-4's capabilities. You're paying for a Swiss Army knife when you need a scalpel. The trade-off is intentional: SLMs sacrifice generalization for speed and accuracy on narrow tasks.

    NVIDIA Research shows serving a 7B SLM costs 10-30× less in latency, energy, and compute than 70-175B LLMs. The fintech team's $30,000/day evaluation budget drops to $600/day. Suddenly, 100% coverage is cheaper than 1% sampling with LLMs.

    Metric               | LLM Judge                | SLM Judge
    Cost per 1M evals    | $10,000-50,000           | $200-1,000
    Latency              | 1-3 seconds              | 15-150ms
    Real-time guardrails | Not feasible             | Feasible
    100% coverage        | Economically prohibitive | Standard practice

    Fine-tuning often improves accuracy, too. The model dedicates its entire capacity to your task rather than encoding knowledge about poetry, history, and millions of irrelevant topics. Teams can get 5-10% accuracy gains over prompted LLM judges after fine-tuning on domain data.

    Where SLMs Excel

    Binary and multi-class classification with clear criteria (Is this toxic? Does this contain PII?)

    High-volume, consistent evaluation

    Privacy-sensitive deployments where data cannot leave your infrastructure

    Where SLMs Struggle

    Subjective quality assessment where "good" depends on context

    Novel failure modes they weren't trained to catch

    Rapidly evolving criteria that would require constant retraining

    Complex multi-factor judgments with context-dependent trade-offs

    Production Systems

    Industrial SLMs for Evals

    The hypothesis that fine-tuned SLMs can match or beat frontier LLMs on narrow evaluation tasks isn't theoretical. Multiple production systems prove it daily.

    Llama Guard (Meta)

    Built on Llama 3.1-8B, Llama Guard 3 outperforms GPT-4 on safety classification while achieving significantly lower false positive rates.

    Model                 | Size  | Key Result
    Llama Guard 3         | 8B    | Beats GPT-4 on MLCommons safety taxonomy
    Llama Guard 3-1B-INT4 | 440MB | 7× compression, runs on mobile devices
    Llama Guard 3 Vision  | 11B   | Multimodal safety for image+text

    The architectural insight: Llama Guard produces its classification from the probability of the first output token, which serves as the "unsafe" class probability. This is why an 8B model beats a trillion-parameter one: it's not trying to be general-purpose.
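
    The first-token idea can be illustrated without loading a model. The sketch below assumes we already have logits for the two candidate first tokens and normalizes over just those two; the token strings, the logit values, and the 0.5 threshold are hypothetical, and this is not Llama Guard's actual inference code.

```python
import math


def unsafe_probability(first_token_logits: dict[str, float]) -> float:
    """Softmax over the candidate first tokens, returning P("unsafe")."""
    peak = max(first_token_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in first_token_logits.items()}
    return exps["unsafe"] / sum(exps.values())


def classify(first_token_logits: dict[str, float], threshold: float = 0.5) -> str:
    """Turn the first-token probability into a verdict (threshold is an assumption)."""
    return "unsafe" if unsafe_probability(first_token_logits) >= threshold else "safe"


# Hypothetical logits for the first generated token
logits = {"safe": 2.0, "unsafe": 0.0}
print(round(unsafe_probability(logits), 3), classify(logits))
```

    Because the score is a probability rather than generated text, it can be thresholded differently per use case, and the model only needs a single forward pass to produce it.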

    The 1B quantized version demonstrates compression potential. Distillation from the 8B teacher yields only a 1.3% F1 drop while reducing model size from 2.8GB to 440MB. The model runs on smartphones via ExecuTorch.

    Luna (Galileo)

    Galileo's Luna takes the evaluation-specific SLM approach further. Built on fine-tuned Llama and Mistral models (3B and 8B variants), Luna outputs normalized log-probabilities rather than generated text, enabling sub-50ms evaluations.

    Key architectural choices: single-token output mode for maximum throughput, LoRA adapters for multi-metric evaluation on a shared base model, calibrated confidence scores that correlate with actual accuracy, and optimization for evaluation-specific tasks like hallucination detection and context adherence.

    Luna demonstrates that purpose-built beats general-purpose. A model trained specifically for "does this response contradict the source?" outperforms GPT-5 prompted with the same question, at 1/50th the cost and 20× the speed.

    Multi-Metric Serving with LoRA Adapters

    One base model, multiple evaluation metrics, single GPU

    [Diagram: incoming requests are routed to one base model (Llama 3.1 8B, ~16GB, loaded once on a single GPU at 60% VRAM) with hot-swappable LoRA adapters (~5ms per swap)]

    Metric     | Share of traffic | Adapter size | Example result  | Latency
    PII        | 35%              | 12MB         | 0.92 confidence | ~25ms
    Toxicity   | 30%              | 15MB         | SAFE            | ~30ms
    Compliance | 25%              | 18MB         | PASS            | ~35ms
    Quality    | 10%              | 10MB         | 4.2/5           | ~40ms

    Total adapters: ~55MB (vs 64GB for 4 separate models)

    Without adapters (4 models): 4 GPUs × 16GB = 64GB VRAM
    With LoRA adapters: 1 GPU × 16GB + 55MB = ~16GB VRAM

    PHUDGE

    PHUDGE, a fine-tuned Phi-3 model (3.8B parameters), achieved state-of-the-art results in 2024 on 4 evaluation benchmarks, surpassing every existing model in both accuracy and throughput.

    The key insight: causal modeling (text generation) is often the wrong approach for evaluation tasks. Converting evaluation from generation to classification improves both speed and accuracy. The model's entire capacity focuses on judgment rather than explanation.

    Prometheus 2

    Prometheus 2 provides an open-weight alternative. The 7B variant achieves 0.6-0.7 Pearson correlation with GPT-4 evaluations while requiring only 16GB VRAM—runnable on consumer GPUs.

    Key techniques that transfer to custom SLMs: merged training on both direct assessment and pairwise ranking produces unified evaluators, swap augmentation (reversing response order) reduces position bias, and reference support + reference drop during training improves robustness.
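
    Swap augmentation is simple to apply to a custom pairwise dataset. The following is a minimal sketch assuming each example stores two responses and a preferred label; the `PairwiseExample` type and its field names are hypothetical and do not come from the Prometheus 2 codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseExample:
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"


def swap_augment(examples: list[PairwiseExample]) -> list[PairwiseExample]:
    """Return the originals plus a copy of each with response order and label reversed."""
    augmented = list(examples)
    for ex in examples:
        flipped = "B" if ex.preferred == "A" else "A"
        augmented.append(PairwiseExample(ex.prompt, ex.response_b, ex.response_a, flipped))
    return augmented


# Illustrative data only
augmented = swap_augment([PairwiseExample("What is 2 + 2?", "4", "5", "A")])
print(len(augmented), augmented[1].preferred)
```

    Every judgment now appears once in each position, so a model that simply favors the first response gets half of the augmented set wrong.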

    Architecture

    Anatomy of an SLM Judge

    SLM judges are decoder-only transformer models in the 1-8B parameter range, fine-tuned on task-specific data. The base model provides language understanding; fine-tuning adds task-specific judgment. Unlike LLM judges that produce reasoning chains, SLM judges often output only the minimal tokens needed for classification.

    Context Levels

    Context levels determine what gets evaluated.

    Span-level

    Evaluates a single LLM call or retrieval result ("Is this response toxic?").

    Fastest. Used by most production systems.

    Trace-level

    Evaluates the entire request-response cycle, catching compounding errors in agentic workflows.

    Moderate cost. Good for agentic systems.

    Session-level

    Evaluates multiple conversation turns for boundary violations or goal completion, but requires long context and is most expensive.

    Highest cost. Used on sampled traffic.

    Most production systems use span-level models for speed, with trace or session-level evaluation on sampled traffic.

    Output Modes

    Output modes trade latency for interpretability.

    Mode         | Latency   | Use Case
    Single-token | 15-50ms   | Extracts logits for True/False only; ideal for real-time guardrails.
    Verdict-only | 50-100ms  | Returns PASS/FAIL without explanation; used for routing decisions.
    Reasoning    | 200-500ms | Adds explanations for audit trails and debugging.

    Implementation

    Building an SLM Judge

    From labeled examples to production deployment

    1. Data prep. Test set first: 300-500 SME-labeled examples with diverse edge cases. Training set: 1K-10K examples, 50/50 class balance, hard negatives included, synthetic + verified.

    2. Model select. 3B model: 15-60ms, 90-93% F1. 8B model: 50-150ms, 93-96% F1. Base: Llama, Phi, Qwen. Or start from: Luna, Prometheus.

    3. Fine-tune. LoRA adapters: train 1-5% of weights, 2-5 days training, 1-2 days validation. Output modes: single-token 15-50ms, verdict-only 50-100ms, with reasoning 200ms+.

    4. Deploy. Serving: vLLM / TGI, Modal / Replicate, self-hosted GPU. Optimize: INT8 quantize (2×), batch requests, multi-LoRA serving.

    Target metrics: F1 score >95% · P95 latency <50ms · False positive rate <2% · Cost reduction vs LLM 50× · Coverage 100%

    Training Data

    Your SME-labeled examples from Chapter 3 become training data. The ground truth dataset, disagreement analyses, and criteria library translate directly.

    01. Test set first

    Before training, create 300-500 manually labeled examples with diverse coverage and at least 100 examples of each class. This must be SME-labeled, not generated. This is your ground truth for measuring whether training succeeded.

    02. Training set requirements

    Target 1,000-10,000 labeled examples with balanced class distribution (roughly 50/50 for binary classification), even if production is skewed. Imbalanced training produces models that default to the majority class.

    03. Expand through augmentation

    Synthetic examples generated by LLMs can expand 1,000 manually labeled examples to 10,000. The key is SME verification: generate candidates, have experts label a sample, and keep only those that align with ground truth. Synthetic data without verification encodes the same biases you're trying to correct.

    04. Include hard negatives

    Examples that look like violations but aren't are as important as actual violations. "I'd recommend diversifying across asset classes" looks like investment advice but is educational content. Without hard negatives, models learn superficial patterns rather than meaningful distinctions.

    05. Document everything

    Track the source of each example (production, synthetic, SME-created), the labeler, the date, and the criteria version. When you retrain months later, you'll need this provenance to understand what the model learned and why.
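
    One lightweight way to keep that provenance is to attach it to every example as structured fields. The record below is a sketch of what such a schema might look like; the class name, field names, and the `stale_examples` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Literal


@dataclass(frozen=True)
class LabeledExample:
    text: str
    label: str
    source: Literal["production", "synthetic", "sme_created"]
    labeler: str
    labeled_on: date
    criteria_version: str


def stale_examples(examples: list[LabeledExample], current_version: str) -> list[LabeledExample]:
    """Examples labeled under a different criteria version than the current one."""
    return [ex for ex in examples if ex.criteria_version != current_version]


# Illustrative records only
dataset = [
    LabeledExample("I'd recommend diversifying across asset classes", "pass",
                   "production", "sme_01", date(2026, 1, 15), "v2"),
    LabeledExample("Buy this stock today", "fail",
                   "synthetic", "sme_02", date(2025, 11, 3), "v1"),
]
print([ex.labeler for ex in stale_examples(dataset, "v2")])
```

    With the criteria version stored per example, a later retraining run can tell which labels predate the current criteria instead of guessing.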

    Model Selection

    Size | Latency (L4 GPU) | Best For
    3B   | 15-60ms          | Real-time guardrails, cost-sensitive
    8B   | 50-150ms         | Async monitoring, strict accuracy (>95%)

    Base model selection matters less than fine-tuning quality. Llama 3.1 variants are well-supported. The differences between base models shrink after task-specific fine-tuning.

    Fine-Tuning

    Full fine-tuning

    Updates all model weights. Highest accuracy but requires more compute (8-16 GPU hours for an 8B model) and produces complete checkpoints (15-30GB). Use when maximum accuracy matters.

    LoRA fine-tuning

    Updates only 1-5% of weights while keeping the base model frozen. Training is faster (1-2 GPU hours), produces smaller adapter files (10-50 MB), and allows stacking multiple adapters on a single base model. Use when you have multiple evaluation tasks.

    Training configuration: learning rate 1e-5 to 5e-5, batch size 8-32, 3-5 epochs. Watch for validation loss diverging from training loss—that signals overfitting. Stop when validation metrics plateau.
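
    The "stop when validation metrics plateau" rule is usually implemented as patience-based early stopping. Here is a framework-agnostic sketch; the config values are picked from within the ranges above for illustration, and the `patience` and `min_delta` defaults are assumptions rather than recommendations.

```python
# Illustrative values chosen from the stated ranges; not tuned recommendations.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,  # within 1e-5 to 5e-5
    "batch_size": 16,       # within 8 to 32
    "max_epochs": 5,        # within 3 to 5
}


def should_stop(val_losses: list[float], patience: int = 2, min_delta: float = 1e-3) -> bool:
    """True when validation loss has not improved by min_delta over the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


# Validation loss bottoms out at epoch 3, then drifts upward: stop.
print(should_stop([0.90, 0.70, 0.60, 0.61, 0.62]))
# Still improving: keep training.
print(should_stop([0.90, 0.70, 0.60]))
```

    Checking the best loss in the recent window against the best loss before it tolerates a single noisy epoch while still catching a sustained divergence.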

    A single A100 or L4 GPU can fine-tune an 8B model with LoRA in hours. Cloud spot instances offer 60-80% discounts. Compute cost typically ranges from $50 to $500.

    Validation

    Validate on production samples

    Hold out 500+ examples from actual production traffic, labeled by SMEs. If production F1 differs from test F1 by more than 3-5 percentage points, your training data doesn't represent production.

    Check calibration

    A model outputting 0.95 confidence on wrong predictions is dangerous. Predictions with 80% confidence should be correct 80% of the time.
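
    A common way to quantify this is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence with its actual accuracy. The sketch below is a plain-Python version assuming equal-width bins; the function name and the bin count are choices made for illustration.

```python
def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy across equal-width bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Well calibrated: 80% confidence, 4 of 5 correct.
well_calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Dangerous: 95% confidence, all wrong.
overconfident = expected_calibration_error([0.95] * 4, [False] * 4)
print(round(well_calibrated, 3), round(overconfident, 3))
```

    A value near zero means stated confidence can be used directly for thresholding and escalation; a large value means the scores need recalibration on held-out data first.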

    Run shadow mode

    Route all traffic through the new SLM while keeping your existing system authoritative. Compare outputs. Investigate every disagreement. Shadow mode reveals production edge cases no test set captures.
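
    In code, shadow mode amounts to calling both judges, returning only the authoritative verdict, and recording every disagreement. This is a minimal sketch with stand-in judges; the `Judge` type, the function name, and the log format are hypothetical.

```python
from typing import Callable

Judge = Callable[[str], str]  # maps a response to a verdict such as "PASS" or "FAIL"


def shadow_evaluate(text: str, authoritative: Judge, candidate: Judge,
                    disagreements: list[dict]) -> str:
    """Return the authoritative verdict; record the case when the candidate differs."""
    verdict = authoritative(text)
    try:
        shadow_verdict = candidate(text)
    except Exception as exc:  # a shadow failure must never affect production
        shadow_verdict = f"ERROR: {exc}"
    if shadow_verdict != verdict:
        disagreements.append(
            {"text": text, "authoritative": verdict, "candidate": shadow_verdict}
        )
    return verdict


# Stand-in judges for illustration
disagreement_log: list[dict] = []
first = shadow_evaluate("reply one", lambda t: "PASS", lambda t: "FAIL", disagreement_log)
second = shadow_evaluate("reply two", lambda t: "PASS", lambda t: "PASS", disagreement_log)
print(first, second, len(disagreement_log))
```

    The disagreement log is the work queue for SME review. In a real deployment the candidate call would typically run asynchronously so it adds no latency to the authoritative path.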

    Infrastructure

    Serving at Scale

    Deployment Options

    01. Self-hosted inference. Running vLLM, TGI, or Triton yourself offers maximum control and the lowest per-inference cost at high volume.
    02. Managed inference. Services such as Modal, Replicate, or Baseten abstract away infrastructure complexity.
    03. On-premise. On-premise deployment handles air-gapped environments and strict data residency requirements.

    Optimization Techniques

    Quantization

    Quantization reduces precision from FP16 to INT8, doubling inference speed with 1-2% accuracy loss.

    Batching

    Groups 8-32 requests for simultaneous processing, tripling throughput for async monitoring.

    Request-aware load balancing

    Matters more than most teams realize—naive round-robin wastes 40-60% of GPU capacity because inference time varies dramatically with input size.
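
    One simple form of request-aware balancing is to send each request to the worker with the least pending work rather than the next worker in rotation. The sketch below uses input token count as a proxy for inference time, which is an assumption; the `TokenAwareRouter` class is hypothetical and not part of any serving framework.

```python
class TokenAwareRouter:
    """Routes each request to the worker with the fewest pending input tokens."""

    def __init__(self, n_workers: int) -> None:
        self.pending_tokens = [0] * n_workers

    def route(self, request_tokens: int) -> int:
        # Pick the least-loaded worker; ties go to the lowest index.
        worker = min(range(len(self.pending_tokens)), key=self.pending_tokens.__getitem__)
        self.pending_tokens[worker] += request_tokens
        return worker

    def complete(self, worker: int, request_tokens: int) -> None:
        # Call when a request finishes so the worker's load is released.
        self.pending_tokens[worker] -= request_tokens


router = TokenAwareRouter(n_workers=2)
assignments = [router.route(tokens) for tokens in (4000, 100, 100)]
print(assignments, router.pending_tokens)
```

    Round-robin would have sent the third request to worker 0, queuing it behind the 4,000-token request while worker 1 sat nearly idle. Here both short requests land on worker 1.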

    Multi-metric serving with adapters

    Real applications need multiple evaluation metrics. Running separate models quadruples infrastructure costs. The adapter pattern solves this: one base model stays loaded in GPU memory, multiple LoRA adapters (10-50MB each) swap in as needed. A single GPU serves PII detection, toxicity, compliance, and quality metrics through adapter switching.

    Integration Patterns

    Synchronous

    Blocks until complete—use for guardrails that must prevent harmful content.

    Asynchronous

    Runs after response—use for monitoring and analytics.

    Batch

    Processes accumulated data on schedule—use for daily quality reports.

    Most production systems combine patterns: synchronous SLM evaluation for safety-critical checks, async evaluation for quality assessment, batch processing for aggregate analytics. The combination captures different time horizons—immediate intervention, operational monitoring, and strategic insights.

    Pitfalls

    Why SLMs Are Harder Than They Look

    Training Pitfalls

    Data quality dominates model quality

    A perfectly trained model on noisy labels produces noisy predictions. Inconsistent labeling (different SMEs applying different standards) produces a model that mimics that inconsistency. Verify inter-rater reliability before training. If humans don't agree, the model can't learn a coherent pattern.

    Class imbalance sabotages learning

    If 95% of examples are "pass," the model learns to always predict "pass" to achieve high accuracy. Balance training sets, but recognize this creates miscalibrated models that predict a 50% failure rate when the production rate is 5%. Post-training calibration is required.
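
    One standard correction, assuming only the class frequencies differ between training and production (label shift), is to rescale the model's probability by the ratio of the two base rates. The source does not specify a calibration method, so treat this as one option among several; the function name is illustrative and the defaults mirror the 50% and 5% rates above.

```python
def adjust_for_prior(p_train: float, train_rate: float = 0.5, prod_rate: float = 0.05) -> float:
    """Rescale a probability from the training base rate to the production base rate."""
    pos = p_train * (prod_rate / train_rate)
    neg = (1.0 - p_train) * ((1.0 - prod_rate) / (1.0 - train_rate))
    return pos / (pos + neg)


# A score of 0.5 from the balanced model maps back to the 5% production base rate.
print(round(adjust_for_prior(0.5), 3))
# A score of 0.95 corresponds to even odds once the production prior is applied.
print(round(adjust_for_prior(0.95), 3))
```

    The adjustment preserves the ranking of examples while shifting the scores, so thresholds chosen on the adjusted values reflect the failure rate the model will actually see.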

    Domain shift is invisible until deployment

    Training data comes from historical traffic. But production evolves—new products launch, user behavior shifts, edge cases emerge. The model's accuracy on test data doesn't predict accuracy on future data. Continuous monitoring for distribution shift is essential.

    Hard negatives are hard to find

    Examples that look like positives but aren't don't occur naturally in proportion to their importance. Deliberate curation is required, which takes SME time and domain expertise.

    Serving Pitfalls

    Inference infrastructure is a different skill set

    Training a model is machine learning. Serving it reliably at scale is systems engineering—GPU memory management, request batching, load balancing, and failover handling. Teams with strong ML skills often underestimate the complexity of serving.

    GPU utilization is surprisingly hard to optimize

    Naive deployments waste 40-60% of GPU capacity. Request sizes vary dramatically, but standard load balancers treat all requests equally. Without request-aware routing, some GPUs sit idle while others queue requests.

    Reliability engineering is non-negotiable

    What happens when the SLM service fails? If evaluation is in the critical path, you need graceful degradation: fall back to LLM evaluation, rule-based heuristics, or let traffic through with monitoring. Design these fallbacks before failures occur.
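
    The fallback order described above can be expressed as a chain of evaluators tried in sequence. This is a sketch with stand-in evaluators; the function name, the evaluator names, and the `ALLOW_AND_FLAG` default are hypothetical.

```python
from typing import Callable


def evaluate_with_fallback(text: str,
                           evaluators: list[tuple[str, Callable[[str], str]]],
                           default: str = "ALLOW_AND_FLAG") -> tuple[str, str]:
    """Try each evaluator in order; return (evaluator_name, verdict) from the first that succeeds."""
    for name, evaluator in evaluators:
        try:
            return name, evaluator(text)
        except Exception:
            continue  # in production, log the failure before moving on
    return "default", default


def broken_slm(text: str) -> str:
    """Stand-in for an SLM service that is down."""
    raise RuntimeError("SLM service unavailable")


def rule_based(text: str) -> str:
    """Stand-in heuristic: flag one obviously non-compliant phrase."""
    return "FAIL" if "guaranteed returns" in text.lower() else "PASS"


result = evaluate_with_fallback("We offer guaranteed returns!",
                                [("slm", broken_slm), ("rules", rule_based)])
print(result)
```

    Returning the evaluator name alongside the verdict lets monitoring track how often traffic is served by a fallback, which is itself a signal that the primary service needs attention.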

    Retraining pipelines need automation

    Models drift. Data distributions shift. Periodic retraining is necessary. Manual retraining is error-prone and often neglected. Automated pipelines that monitor performance, trigger retraining when metrics degrade, validate new models, and deploy safely require significant engineering investment.

    SLM Failure Modes & Mitigations

    Where small models break down and how to handle it

    Unknown languages. Input: "नमस्ते, मुझे मदद चाहिए". SLM output: ??? (random scores).

    Long context. Training window: 2,048 – 4,096 tokens. 20-turn conversation: 8,000+ tokens (fails).

    Format mismatch. Trained on: User: {query}\n... Production uses: <user>{query}</user>

    Novel failures. Trained on: historical jailbreaks. New attack variant: completely invisible.

    Mitigation Strategies

    Failure Type     | Detection                     | Mitigation
    Unknown language | Language detection classifier | Route to multilingual LLM
    Long context     | Token count check             | Truncate or escalate to LLM
    Format mismatch  | Template validation           | Strict preprocessing pipeline
    Novel failures   | Confidence thresholding       | Escalate low-confidence to LLM

    Key pattern: SLMs handle the known distribution efficiently; LLMs catch everything that falls outside.
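
    That pattern can be sketched as a thin routing layer in front of the SLM. The example below covers the language, context-length, and low-confidence checks (format validation is left to preprocessing); it assumes language and token count are computed upstream, and the function name, the 0.8 confidence floor, and the stand-in judges are hypothetical.

```python
from typing import Callable


def judge_with_escalation(text: str, language: str, token_count: int,
                          slm: Callable[[str], tuple[str, float]],
                          llm: Callable[[str], str],
                          max_tokens: int = 4096,
                          supported_languages: frozenset = frozenset({"en"}),
                          min_confidence: float = 0.8) -> tuple[str, str]:
    """Return (verdict, route). Escalate to the LLM when input falls outside the SLM's known distribution."""
    if language not in supported_languages:
        return llm(text), "llm:unknown_language"
    if token_count > max_tokens:
        return llm(text), "llm:long_context"
    verdict, confidence = slm(text)
    if confidence < min_confidence:
        return llm(text), "llm:low_confidence"
    return verdict, "slm"


# Stand-in judges for illustration
confident_slm = lambda t: ("PASS", 0.97)
unsure_slm = lambda t: ("PASS", 0.55)
llm_judge = lambda t: "FAIL"

routes = [
    judge_with_escalation("short reply", "en", 120, confident_slm, llm_judge),
    judge_with_escalation("short reply", "hi", 120, confident_slm, llm_judge),
    judge_with_escalation("long session", "en", 8000, confident_slm, llm_judge),
    judge_with_escalation("odd request", "en", 120, unsure_slm, llm_judge),
]
print(routes)
```

    The cheap checks run before the SLM is invoked, so out-of-distribution inputs never produce a misleading SLM score in the first place.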

    Common Failures

    Distribution mismatch

    Training came from customer support, but production includes sales, legal, and HR conversations.

    Spurious pattern learning

    All PII examples in training happened to be long messages, so the model learned "long = PII" rather than detecting actual personal information.

    Confidence miscalibration

    High confidence on wrong predictions erodes user trust. Fix with explicit calibration using held-out data.

    Catastrophic forgetting

    Retraining on new data causes the model to forget previously learned patterns. Accuracy improves on new cases but regresses on previously working cases. Maintain comprehensive regression tests.

    Decision Framework

    When to Use What

    The bottom-right quadrant is where SLMs shine: high volume with stable, well-defined criteria.

    Decision framework for choosing between LLM and SLM judges

    Criteria | Low Volume (<10K/day) | High Volume (>10K/day)
    Evolving | LLM Judges: iterate via prompts, fast experimentation | Pain zone: LLM + sampling; accept coverage gaps, or retrain SLMs frequently
    Stable   | LLM Judges: SLM not worth investment; operational overhead exceeds savings | Maximum ROI zone: SLM Judges; 100% coverage at 1/50th cost, real-time guardrails enabled

    Decision rule: Volume >10K/day + Criteria stable for 2+ months = SLM judges

    Use LLM judges when:

    Volume below 10K/day (operational overhead exceeds savings)

    Criteria still evolving (prompts iterate in minutes; models retrain in hours)

    Reasoning chains needed for auditability

    Exploring new failure modes that SLMs weren't trained to detect

    Use SLM judges when:

    Volume above 10K/day

    Real-time guardrails required (sub-100ms latency)

    Clear classification criteria that have been stable for 2+ months

    Privacy requires local processing

    Tasks have clear ground truth where humans consistently agree

    Use hybrid approaches when:

    100% SLM coverage with LLM escalation for low-confidence cases

    Different metrics have different requirements (SLMs for PII/toxicity, LLMs for quality)

    Transitioning between approaches via shadow mode validation
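
    The volume and stability thresholds in the decision rule reduce to a few comparisons. The function below is a sketch of that rule only; it ignores the other factors listed above (auditability, privacy, latency), and its name and return strings are illustrative.

```python
def recommend_judge(evals_per_day: int, months_criteria_stable: float) -> str:
    """Apply the chapter's decision rule: >10K evals/day and criteria stable for 2+ months."""
    high_volume = evals_per_day > 10_000
    stable = months_criteria_stable >= 2
    if high_volume and stable:
        return "SLM judges"
    if high_volume:
        return "LLM + sampling (or frequent SLM retraining)"
    return "LLM judges"


for volume, months in [(5_000, 6), (250_000, 0.5), (250_000, 3)]:
    print(volume, months, "->", recommend_judge(volume, months))
```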

    Roadmap

    The Transition Path

    This timeline aligns with NVIDIA Research's LLM-to-SLM conversion approach: log all LLM calls, curate task-specific data, cluster by task type, fine-tune SLMs, iterate. The key insight is that agentic interactions themselves generate the training data you need for specialization.

    The Transition Path: LLM to SLM

    A 7-month journey from prompt-based judges to production SLMs

    Phase          | Timeline   | Output
    LLM baseline   | Months 1–2 | Baseline accuracy, initial labels
    SME refinement | Months 3–4 | 1,000+ labeled examples, stable criteria
    Data prep      | Month 5    | Training-ready dataset
    Training       | Month 6    | Validated SLM in shadow mode
    Production     | Month 7+   | 100% coverage, SLM + LLM hybrid

    Teams that have built one SLM judge typically build subsequent ones in 2-4 weeks. The infrastructure and expertise transfer.

    Takeaway

    The Takeaway

    LLM judges hit a scaling wall. Cost and latency force sampling, and sampling misses failures. SLMs break through by trading generalization for efficiency.

    The path:

    1. Refine with LLMs until criteria stabilize (Chapter 3)

    2. Accumulate labeled data through SME cycles

    3. Train an SLM when volume or latency requirements exceed LLM capabilities

    4. Deploy with monitoring for drift

    5. Maintain hybrid systems where SLMs handle volume and LLMs handle edge cases

    "The mistakes that AIs make are unpredictable and unintuitive. The only defense is comprehensive monitoring. SLMs make comprehensive monitoring economically feasible."
