Luna eval cost at scale

Luna's per-token inference is 10–375× cheaper than LLM-as-judge, depending on the judge model. Watch what happens to your eval bill as agents and metrics multiply.

Inputs

Agent archetype

Evaluator model (effective rate: $3.25/1M)

Model	Input / 1M	Output / 1M	Notes
GPT-5 Mini	$0.25	$2.00	Popular; −5% F1 accuracy
GPT-4o-mini	$0.15	$0.60	Popular; −5% F1 accuracy
Gemini 3 Flash	$0.50	$3.00	
GPT-4o	$2.50	$10.00	
Claude Sonnet 4.6	$3.00	$15.00	
Claude Opus 4.7	$5.00	$25.00	
GPT-5.5	$5.00	$30.00	
Custom	Edit rates below	Edit rates below	

Effective rate at 90% input / 10% output: $3.25/1M
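The effective rate is a weighted blend of the input and output prices. A minimal sketch of that arithmetic follows; the function name `effective_rate` is hypothetical, and the 90/10 split is the one stated on the page. With GPT-4o's listed rates ($2.50 / $10.00) it reproduces the $3.25/1M shown above.

```python
def effective_rate(input_per_m: float, output_per_m: float,
                   input_share: float = 0.9) -> float:
    """Blend input and output prices (USD per 1M tokens) by token share."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


# GPT-4o rates from the table: $2.50 input, $10.00 output per 1M tokens.
print(f"${effective_rate(2.50, 10.00):.2f}/1M")  # prints "$3.25/1M"
```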

Number of agents: 5 (range 1–50)

Traces per agent / day: 40,000 (range 1K–10M)

Baseline shared metrics: 5 (range 0–20)

Custom metrics per agent: 3 (range 0–10)

Metric scope

Annual savings with Luna

$45.73M

LLM-as-judge costs 27× more than Luna at this scale

Per-eval response latency

Luna 2: 80ms

GPT-4o: 2.5s

Luna is 31× faster per evaluation. An inline guardrail is viable on Luna; GPT-4o is offline-only.

Tokens scanned / mo

1.22T

6M traces × 10K tok × 20 metrics

Trace volume

6M traces/mo

5 agents · 40,000 traces/day each
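The trace and token volumes can be reproduced roughly as sketched below. The page does not state its days-per-month convention; 365.25 / 12 is an assumption chosen because it lands on the displayed 1.22T tokens. The function names are hypothetical.

```python
# Assumption: average month length; the page does not state its convention.
DAYS_PER_MONTH = 365.25 / 12


def monthly_traces(agents: int, traces_per_agent_per_day: int) -> float:
    """Total traces per month across all agents."""
    return agents * traces_per_agent_per_day * DAYS_PER_MONTH


def monthly_tokens(traces: float, tokens_per_trace: int,
                   metrics_per_trace: int) -> float:
    """Each metric re-reads the full trace, so tokens scale with metric count."""
    return traces * tokens_per_trace * metrics_per_trace


traces = monthly_traces(agents=5, traces_per_agent_per_day=40_000)
tokens = monthly_tokens(traces, tokens_per_trace=10_000, metrics_per_trace=20)
print(f"{traces / 1e6:.2f}M traces/mo, {tokens / 1e12:.2f}T tokens/mo")
```

The page rounds the trace count to 6M for display; the unrounded figure under this assumption is slightly higher, which is why the token total reads 1.22T rather than 1.20T.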

LLM-as-judge cost

$3.96M/mo

$47.48M annual

Luna cost

$146.1K/mo

$1.75M annual
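The monthly and annual costs follow from tokens scanned times the per-1M rate. The sketch below assumes a 365.25 / 12 day month (not stated on the page, but it reproduces the displayed figures), the $3.25/1M blended judge rate, and the $0.12/1M Luna 2 rate from the methodology table. All names are hypothetical.

```python
DAYS_PER_MONTH = 365.25 / 12   # assumption, not stated on the page
JUDGE_RATE_PER_M = 3.25        # USD per 1M tokens, blended GPT-4o rate
LUNA_RATE_PER_M = 0.12         # USD per 1M tokens, Luna 2 (self-hosted)


def monthly_cost(tokens: float, rate_per_m: float) -> float:
    """Cost in USD of scanning `tokens` at `rate_per_m` dollars per 1M tokens."""
    return tokens / 1_000_000 * rate_per_m


# 5 agents x 40,000 traces/day x 10K tokens x 20 metrics
tokens_per_month = 5 * 40_000 * DAYS_PER_MONTH * 10_000 * 20

judge_monthly = monthly_cost(tokens_per_month, JUDGE_RATE_PER_M)
luna_monthly = monthly_cost(tokens_per_month, LUNA_RATE_PER_M)
annual_savings = (judge_monthly - luna_monthly) * 12

print(f"Judge: ${judge_monthly / 1e6:.2f}M/mo")
print(f"Luna:  ${luna_monthly / 1e3:.1f}K/mo")
print(f"Annual savings: ${annual_savings / 1e6:.2f}M")
print(f"Cost ratio: {JUDGE_RATE_PER_M / LUNA_RATE_PER_M:.0f}x")
```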

Same budget. Different coverage.

Luna 2: 100% coverage

GPT-4o: 3.69% coverage

If you pin the budget at Luna's cost ($146.1K/mo), the judge evaluates about 96% less traffic.
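At a fixed budget, the share of traffic a judge can evaluate is the ratio of the two per-token rates. A minimal sketch, assuming coverage scales linearly with tokens scanned (the page does not spell out its formula); the function name is hypothetical.

```python
def coverage_at_equal_budget(luna_rate_per_m: float,
                             judge_rate_per_m: float) -> float:
    """Fraction of traffic the judge can score on the budget that buys
    100% coverage with Luna, capped at 1.0."""
    return min(1.0, luna_rate_per_m / judge_rate_per_m)


coverage = coverage_at_equal_budget(0.12, 3.25)
print(f"Judge coverage: {coverage * 100:.2f}%")
print(f"Traffic left unevaluated: {(1 - coverage) * 100:.0f}%")
```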

Methodology & sources

Parameter	Notes	Source
Simple · 3K tok	Single-turn or short RAG. Anchored to ~3,700 tok/ticket.	Anthropic support agent
Tool-Using · 10K tok	ReAct loop, real tool surface, 3–8 iterations.	τ-bench · τ²-bench leaderboard
Code/Research · 50K tok	Multi-step code or browsing. Claude Code 33K, Cursor 188K on SWE-bench.	Cognition SWE-bench · HAL · GAIA
Judge rates	Pick a model in the inputs panel. Effective rate uses 90% input / 10% output blend.	Anthropic · OpenAI
F1 accuracy hit	Smaller judges sacrifice eval accuracy. Mini-class −5% F1, Nano-class −10%.	Patronus LLM-Judge Leaderboard
Luna 2 rate	Luna 2 $0.12/1M (self-hosted). Fine-tuned per eval task.	Galileo Observability
Per-eval latency	Luna 2 ≈ 80ms p50; frontier judges 2.5–4s.	Luna 2 paper
Metric scope	Pooled: quadratic in agents. Per-agent: linear.	Modeling choice
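The metric-scope row can be read as follows. This is an interpretation, not a documented formula: under pooled scope every trace is scored against the shared metrics plus every agent's custom metrics, and under per-agent scope only against its own agent's custom metrics. That reading yields the 20 metrics shown for 5 agents, 5 baseline metrics, and 3 custom metrics per agent. All names are hypothetical.

```python
def metrics_per_trace(scope: str, agents: int, baseline: int,
                      custom_per_agent: int) -> int:
    """Number of metrics applied to each trace under the given scope."""
    if scope == "pooled":
        # Every agent's custom metrics run on every trace.
        return baseline + custom_per_agent * agents
    if scope == "per-agent":
        # Only the owning agent's custom metrics run on its traces.
        return baseline + custom_per_agent
    raise ValueError(f"unknown scope: {scope!r}")


def daily_metric_evals(scope: str, agents: int, traces_per_agent_per_day: int,
                       baseline: int, custom_per_agent: int) -> int:
    """Total metric evaluations per day across all agents."""
    per_trace = metrics_per_trace(scope, agents, baseline, custom_per_agent)
    return agents * traces_per_agent_per_day * per_trace


for n in (5, 10):
    pooled = daily_metric_evals("pooled", n, 40_000, 5, 3)
    per_agent = daily_metric_evals("per-agent", n, 40_000, 5, 3)
    print(f"{n} agents: pooled={pooled:,} per-agent={per_agent:,}")
```

Doubling the agent count doubles the per-agent total but more than triples the pooled total, because the pooled per-trace metric count itself grows with the number of agents.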