Luna eval cost at scale

Per-token inference is 10–375× cheaper than LLM-as-judge depending on the model. Watch what happens to your eval bill as agents and metrics multiply.

Inputs

Agent archetype
Evaluator model$3.25/1M
Effective rate at 90% input / 10% output:  $3.25/1M
Number of agents5
150
Traces per agent / day40,000
1K10M
Baseline shared metrics5
020
Custom metrics per agent3
010
Metric scope
Annual savings with Luna
$45.73M
LLM-as-judge costs 27× more than Luna at this scale
Per-eval response latency
Luna 2
80ms
GPT-4o
2.5s
Inline guardrail viable on Luna · GPT-4o is offline-only
31× faster
Luna is faster per evaluation
Tokens scanned / mo
1.22T
6M traces × 10K tok × 20 metrics
Trace volume
6M traces/mo
5 agents · 40,000 traces/day each
LLM-as-judge cost
$3.96M/mo
$47.48M annual
Luna cost
$146.1K/mo
$1.75M annual
Same budget. Different coverage.
Luna 2
100%
GPT-4o
3.69%
If you pin the budget at Luna's cost ($146.1K/mo)
96%
Blind spot with judge at equal budget
Methodology & sources
ParameterNotesSource
Simple · 3K tokSingle-turn or short RAG. Anchored to ~3,700 tok/ticket.Anthropic support agent
Tool-Using · 10K tokReAct loop, real tool surface, 3–8 iterations.τ-bench · τ²-bench leaderboard
Code/Research · 50K tokMulti-step code or browsing. Claude Code 33K, Cursor 188K on SWE-bench.Cognition SWE-bench · HAL · GAIA
Judge ratesPick a model in the inputs panel. Effective rate uses 90% input / 10% output blend.Anthropic · OpenAI
F1 accuracy hitSmaller judges sacrifice eval accuracy. Mini-class −5% F1, Nano-class −10%.Patronus LLM-Judge Leaderboard
Luna 2 rateLuna 2 $0.12/1M (self-hosted). Fine-tuned per eval task.Galileo Observability
Per-eval latencyLuna 2 ≈ 80ms p50; frontier judges 2.5–4s.Luna 2 paper
Metric scopePooled: quadratic in agents. Per-agent: linear.Modeling choice