Luna eval cost at scale
Luna's per-token inference is 10–375× cheaper than LLM-as-judge, depending on the judge model. Watch what happens to your eval bill as agents and metrics multiply.
Inputs
Annual savings with Luna
$45.73M
LLM-as-judge costs 27× more than Luna at this scale
Per-eval response latency
Inline guardrail viable on Luna · GPT-4o is offline-only
31× faster
Luna is faster per evaluation
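The 31× figure appears to follow from the latency numbers in the methodology table: about 80 ms p50 for Luna 2 against 2.5–4 s for frontier judges. A minimal sketch, assuming the headline uses the low end of the judge range (the function and constant names are illustrative, not part of the calculator):

```python
def latency_speedup(judge_seconds: float, luna_seconds: float) -> float:
    """How many times faster Luna is per evaluation."""
    return judge_seconds / luna_seconds


LUNA_P50_S = 0.080          # Luna 2 p50, from the methodology table
JUDGE_RANGE_S = (2.5, 4.0)  # frontier judge range, from the methodology table

low, high = (latency_speedup(j, LUNA_P50_S) for j in JUDGE_RANGE_S)
print(f"{low:.0f}x to {high:.0f}x faster")
```

At the slow end of the judge range the same ratio comes out near 50×, so 31× is the conservative reading.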
Tokens scanned / mo
1.22T
6M traces × 10K tok × 20 metrics
Trace volume
6M traces/mo
5 agents · 40,000 traces/day each
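The token volume can be reproduced from the inputs shown here. A rough sketch, assuming the calculator uses an average month of 365.25 / 12 days (the page does not state this, but it is the value that lines up with the displayed 1.22T); `monthly_tokens` is a hypothetical helper:

```python
# Assumption: an average month of 365.25 / 12 days (about 30.44).
DAYS_PER_MONTH = 365.25 / 12


def monthly_tokens(agents: int, traces_per_day: int,
                   tokens_per_trace: int, metrics: int) -> float:
    """Tokens scanned per month: each metric reads each trace once."""
    traces_per_month = agents * traces_per_day * DAYS_PER_MONTH
    return traces_per_month * tokens_per_trace * metrics


tokens = monthly_tokens(agents=5, traces_per_day=40_000,
                        tokens_per_trace=10_000, metrics=20)
print(f"{tokens / 1e9:,.1f}B tokens/mo")
```

With these inputs the result is roughly 1.22T tokens per month, from about 6.09M traces.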
LLM-as-judge cost
$3.96M/mo
$47.48M annual
Luna cost
$146.1K/mo
$1.75M annual
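Both monthly costs are token volume times a per-million-token rate. The sketch below assumes the judge is GPT-4o at list prices of $2.50 input and $10.00 output per 1M tokens, blended 90/10 as described in the methodology table; those prices and the unrounded token count are assumptions chosen because they reproduce the displayed figures.

```python
def blended_rate(input_rate: float, output_rate: float,
                 input_share: float = 0.9) -> float:
    """Effective $/1M tokens for a 90% input / 10% output mix."""
    return input_share * input_rate + (1 - input_share) * output_rate


def monthly_cost(tokens: float, rate_per_million: float) -> float:
    """Dollar cost of scanning `tokens` at a $/1M-token rate."""
    return tokens / 1e6 * rate_per_million


TOKENS_PER_MONTH = 1.2175e12            # assumption: the displayed 1.22T before rounding
LUNA_RATE = 0.12                        # $/1M tokens, from the methodology table
JUDGE_RATE = blended_rate(2.50, 10.00)  # assumption: GPT-4o list prices, $/1M tokens

judge = monthly_cost(TOKENS_PER_MONTH, JUDGE_RATE)
luna = monthly_cost(TOKENS_PER_MONTH, LUNA_RATE)
annual_savings = 12 * (judge - luna)

print(f"judge ${judge / 1e6:.2f}M/mo, Luna ${luna / 1e3:.1f}K/mo")
print(f"annual savings ${annual_savings / 1e6:.2f}M, ratio {judge / luna:.0f}x")
```

Under these assumptions the blended judge rate is $3.25 per 1M tokens, which gives the 27× cost ratio and the $45.73M annual difference shown above.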
Same budget. Different coverage.
If you pin the budget at Luna's cost ($146.1K/mo)
96%
Blind spot with judge at equal budget (share of eval volume left unscored)
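The blind spot is the share of evaluation volume a judge cannot afford once its budget is capped at Luna's cost. A minimal sketch, assuming coverage scales inversely with the per-token rate and using the $3.25 per 1M blended judge rate implied by the 27× ratio (an inferred value, not one stated on the page):

```python
def blind_spot(luna_rate: float, judge_rate: float) -> float:
    """Fraction of tokens left unevaluated by the judge at Luna's budget."""
    coverage = luna_rate / judge_rate  # tokens the judge can afford / tokens Luna scans
    return 1 - coverage


LUNA_RATE = 0.12   # $/1M tokens, from the methodology table
JUDGE_RATE = 3.25  # $/1M tokens, assumption: inferred blended judge rate

share = blind_spot(LUNA_RATE, JUDGE_RATE)
print(f"{share:.0%} of eval volume goes unscored")
```

Equivalently, a judge on the same budget covers about 1 in 27 of the tokens Luna scans.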
Methodology & sources
| Parameter | Notes | Source |
|---|---|---|
| Simple · 3K tok | Single-turn or short RAG. Anchored to ~3,700 tok/ticket. | Anthropic support agent |
| Tool-Using · 10K tok | ReAct loop, real tool surface, 3–8 iterations. | τ-bench · τ²-bench leaderboard |
| Code/Research · 50K tok | Multi-step code or browsing. Claude Code 33K, Cursor 188K on SWE-bench. | Cognition SWE-bench · HAL · GAIA |
| Judge rates | Pick a model in the inputs panel. Effective rate is a blend of 90% input-token and 10% output-token pricing. | Anthropic · OpenAI |
| F1 accuracy hit | Smaller judges sacrifice eval accuracy. Mini-class −5% F1, Nano-class −10%. | Patronus LLM-Judge Leaderboard |
| Luna 2 rate | Luna 2 costs $0.12 per 1M tokens (self-hosted). Fine-tuned per eval task. | Galileo Observability |
| Per-eval latency | Luna 2 ≈ 80ms p50; frontier judges 2.5–4s. | Luna 2 paper |
| Metric scope | Pooled metrics: eval volume grows quadratically with the number of agents. Per-agent metrics: linearly. | Modeling choice |
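The metric-scope row can be illustrated with a small sketch. It assumes one particular reading of the two modes, which the page does not spell out: under "pooled", every trace is scored against the combined metrics of all agents; under "per-agent", each trace is scored only against its own agent's metrics. The function name and the example inputs are hypothetical.

```python
def evals_per_day(agents: int, traces_per_agent: int,
                  metrics_per_agent: int, pooled: bool) -> int:
    """Metric evaluations per day under the two scoping choices.

    Assumption: 'pooled' scores each trace against every agent's metrics;
    'per-agent' scores it against its own agent's metrics only.
    """
    metrics_per_trace = metrics_per_agent * agents if pooled else metrics_per_agent
    return agents * traces_per_agent * metrics_per_trace


for n in (1, 2, 4, 8):
    per_agent = evals_per_day(n, 1_000, 4, pooled=False)
    pooled = evals_per_day(n, 1_000, 4, pooled=True)
    print(f"{n} agents: per-agent {per_agent:,} evals, pooled {pooled:,} evals")
```

Under this reading, doubling the number of agents doubles the per-agent workload and quadruples the pooled one.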