Jun 18, 2025

Introducing Luna-2: Purpose-Built Models for Reliable AI Evaluations & Guardrailing

Conor Bronsdon

Head of Developer Awareness

Promotional banner for Galileo's Luna 2 models, featuring the tagline “Introducing our Luna 2 Models for Reliable AI Evals & Guardrailing.” Includes a stylized planet with numerical overlays and an "<error> detected" label, symbolizing real-time error detection and guardrailing for AI agents.

AI agents are already everywhere, from customer support chatbots that handle millions of conversations to financial services agents making decisions with real money. But as these agents get more autonomous and complex, evaluation and real-time guardrailing become the bottleneck. You can’t have your AI systems selling a car for $1.

Traditional LLM-based evaluation is too slow and expensive for production agent workflows or real-time guardrailing. GPT-4 evals that cost $50+ per thousand evaluations and take multiple seconds to complete won’t work when you need to evaluate 10-20 agent metrics simultaneously in real time. You need more customization and better price-performance to build reliable AI systems at scale.

That’s why we're excited to introduce Luna-2, our next generation of small language models, which already powers customized evaluations and guardrails at multiple Fortune 50 companies. Luna-2 is purpose-built for customized real-time guardrailing based on low-latency, low-cost evaluations, and it is designed with complex multi-agent systems in mind.

The Agent Evaluation Challenge

Enterprise teams building AI agents thus far have faced a fundamental tradeoff: comprehensive evaluation or efficient agents.

When your financial services agent is about to initiate a $10M transfer, you need real-time checks for:

Tool selection quality (Is this the right action?)
Flow adherence (Is the agent following proper procedures?)
Unsafe actions (Does this require additional authorization?)
Response groundedness (Is the agent making up information?)

Running 10+ evaluation metrics through LLM judges like GPT-4.1 on every agent interaction would cost a fortune and introduce unacceptable latency. So most teams either skip comprehensive evaluation or settle for basic heuristics that miss critical edge cases. Even worse, this trust and reliability blocker is stopping many exciting AI use cases from making it into production.

Comparison chart showing Luna-2 vs GPT-4o, GPT-4o mini, and Azure Content Safety across cost, accuracy, latency, and token capacity. Luna-2 leads with 0.88 accuracy, 152ms latency, 128k token support, and just $0.02 per million tokens—making it the most cost-effective, low-latency option.

Luna-2 eliminates this tradeoff.

Luna-2: Built for Production Agent Workflows

Luna-2 delivers enterprise-grade evaluation with the speed and cost efficiency that production systems demand. Luna models are hosted on Galileo’s proprietary optimized inference engine and run on modern GPU hardware for low-cost, low-latency evaluations. They can adapt to hundreds of metrics while remaining deterministic, and they offer superior out-of-the-box metrics for agentic evaluation and reliability. This approach enables organizations to leverage both custom fine-tuned LLMs and SLMs as judges. Luna-2 offers:

Adaptability: Easily customizable with minimal data
Efficiency: Low latency when running multiple metrics (10-20) simultaneously
Cost-effectiveness: Lower cost compared to traditional LLM-based evaluation
Sophisticated agentic metrics: Tool error rate, context adherence, tool selection quality, etc.

Diagram of Luna 2 evaluation workflow showing user prompt sent to an LLM via an application with GalileoLogger, followed by response scoring through a fine-tuned small language model (SLM) on the Galileo Inference Engine. The output is evaluated by a Scorer to produce a final score between 0 and 1.

Let’s dig into some of the specifics of the Luna-2 models. Key elements include:

Ultra-Low Latency at Scale

Sub-200ms latency even when running 10-20 metrics simultaneously
Millisecond-level verdicts that don't slow down user interactions
Real-time guardrails that can intercept risky agent actions before execution
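To make the idea of intercepting an action before execution concrete, here is a minimal sketch of a guardrail gate that runs several metrics concurrently under a fixed latency budget. Everything in it is an assumption for illustration: the function names, the 0.5 threshold, the fail-closed policy, and the toy scoring logic are hypothetical stand-ins, not Galileo's API or Luna-2's actual behavior. Only the metric names and the 200ms budget come from this post.

```python
import asyncio

# Metric names from the post; the scorer below is a hypothetical stand-in.
METRICS = ["tool_selection_quality", "flow_adherence", "unsafe_action", "groundedness"]


async def score_metric(name: str, action: dict) -> float:
    """Toy evaluator: returns a score in [0, 1] after a short simulated delay."""
    await asyncio.sleep(0.01)  # stands in for model inference time
    if name == "unsafe_action" and action.get("amount_usd", 0) > 1_000_000:
        return 0.1  # low score means the action looks unsafe
    return 0.9


async def guard(action: dict, threshold: float = 0.5, budget_s: float = 0.2):
    """Run every metric concurrently and allow the action only if all scores
    meet the threshold and all verdicts arrive within the latency budget."""
    tasks = [score_metric(name, action) for name in METRICS]
    try:
        scores = await asyncio.wait_for(asyncio.gather(*tasks), timeout=budget_s)
    except asyncio.TimeoutError:
        return False, {}  # assumed policy: fail closed if the budget is exceeded
    by_metric = dict(zip(METRICS, scores))
    allowed = all(score >= threshold for score in by_metric.values())
    return allowed, by_metric


def check(action: dict):
    """Synchronous convenience wrapper around the async gate."""
    return asyncio.run(guard(action))


if __name__ == "__main__":
    allowed, scores = check({"tool": "transfer", "amount_usd": 10_000_000})
    print(allowed, scores)
```

Because the metrics run concurrently, the total wait is close to the slowest single metric rather than the sum of all of them, which is what makes a shared latency budget workable when 10-20 checks run at once.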

Cost-Efficient by Design

~$0.02 per million tokens—97% cheaper than GPT-4 evaluation
Shared infrastructure across all metrics reduces hosting overhead
Always-on monitoring without breaking the budget
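As a rough back-of-the-envelope illustration of what these prices mean at volume, the sketch below multiplies out the two figures quoted in this post: about $0.02 per million tokens for Luna-2 and $50+ per thousand evaluations for GPT-4-style judges. The workload numbers (one million interactions, 10 metrics each, 2,000 tokens per evaluation) are assumptions chosen for the example, and the sketch assumes each metric is billed on the full token count, which may not match actual pricing.

```python
LUNA2_USD_PER_MILLION_TOKENS = 0.02       # figure quoted in this post
LLM_JUDGE_USD_PER_THOUSAND_EVALS = 50.0   # "$50+" figure quoted in this post (a lower bound)


def luna2_cost(interactions: int, metrics: int, tokens_per_eval: int) -> float:
    """Token-priced cost, assuming every metric processes tokens_per_eval tokens."""
    total_tokens = interactions * metrics * tokens_per_eval
    return total_tokens / 1_000_000 * LUNA2_USD_PER_MILLION_TOKENS


def llm_judge_cost(interactions: int, metrics: int) -> float:
    """Per-evaluation cost, treating each metric on each interaction as one evaluation."""
    total_evals = interactions * metrics
    return total_evals / 1_000 * LLM_JUDGE_USD_PER_THOUSAND_EVALS


if __name__ == "__main__":
    # Assumed workload: 1M interactions, 10 metrics each, 2,000 tokens per evaluation.
    print(f"Luna-2:    ${luna2_cost(1_000_000, 10, 2_000):,.2f}")   # $400.00
    print(f"LLM judge: ${llm_judge_cost(1_000_000, 10):,.2f}")      # $500,000.00
```

The two quoted prices use different units (per token versus per evaluation), so any comparison depends heavily on the assumed tokens per evaluation. Plug in your own traffic numbers before drawing conclusions.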

Purpose-Built for Agents

Luna-2 includes out-of-the-box metrics explicitly designed for agentic systems:

Tool selection quality - Is the agent choosing the right tools?
Flow adherence - Is the agent following the intended workflow?
Unsafe action detection - Will this action cause harm or policy violations?
Multi-turn conversation quality - Is the agent maintaining context across turns?

Feature comparison table of Galileo Luna 2, Azure Content Safety, and NVIDIA NeMo. Highlights agentic and safety metric support including tool error rate, unsafe action detection, bias, PII leak, and prompt injection. Luna 2 covers all agentic and safety metrics, outperforming alternatives.

Beyond Out-of-the-Box: Customization at Speed

While Luna-2 comes pre-optimized for common agentic metrics, every enterprise has unique evaluation needs. Luna-2 adapts to your specific requirements with minimal data, often fewer than 50 labeled examples.

Whether you need custom definitions of "appropriate customer tone" or industry-specific compliance checks, Luna-2 can be fine-tuned to your domain in minutes, not months.

Technical Innovation: One Model, Hundreds of Metrics

Luna-2's architecture also enables unprecedented efficiency:

Multi-Headed Design: Lightweight adapters on a shared core model let one base model power hundreds of different metrics without multiplying infrastructure costs.
Optimized Inference Engine: Hosted on Galileo's proprietary inference layer with L4 GPUs for consistent performance at massive scale.
Intelligent Context Handling: Dynamic windowing ensures comprehensive evaluation even for long agent conversations and complex tool sequences.
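The multi-headed idea can be pictured with a small toy: one expensive shared step produces features once, and each metric is a cheap head applied to those same features. The sketch below is purely illustrative. The hash-based featurizer, the linear-plus-sigmoid heads, and all names and weights are hypothetical, and it says nothing about how Luna-2's adapters are actually built or trained.

```python
import math


def shared_core(text: str, dim: int = 8) -> list[float]:
    """Stand-in for the shared base model: one feature vector per input.
    Here it is just a toy character-hash featurizer, normalized to unit length."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class MetricHead:
    """Lightweight per-metric adapter: a linear layer followed by a sigmoid."""

    def __init__(self, weights: list[float], bias: float = 0.0):
        self.weights = weights
        self.bias = bias

    def score(self, features: list[float]) -> float:
        z = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))  # squashes to a score in (0, 1)


def evaluate(text: str, heads: dict[str, MetricHead]) -> dict[str, float]:
    """Run the shared core once, then apply every head to the same features."""
    features = shared_core(text)
    return {name: head.score(features) for name, head in heads.items()}


if __name__ == "__main__":
    heads = {
        "tool_selection_quality": MetricHead([0.5] * 8),
        "flow_adherence": MetricHead([-0.5] * 8, bias=1.0),
    }
    print(evaluate("agent called transfer_funds", heads))
```

In this toy, adding a metric means adding one small set of weights rather than another full model, which is the cost argument behind sharing a core across hundreds of metrics.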

This isn't just faster evaluation; it's a fundamentally different approach that makes comprehensive AI monitoring in production economically viable.
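This post does not describe how the dynamic windowing mentioned under Intelligent Context Handling works, so the following is only one plausible reading, sketched under stated assumptions: split a long conversation into overlapping windows, score each window, and keep the most conservative (lowest) score. The window size, the overlap, the min-aggregation rule, and the toy scorer are all hypothetical choices made for the example.

```python
from typing import Callable


def windows(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into overlapping windows that together cover every token."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    return [tokens[start:start + size]
            for start in range(0, max(len(tokens) - overlap, 1), step)]


def score_long(tokens: list[str], scorer: Callable[[list[str]], float],
               size: int = 4, overlap: int = 1) -> float:
    """Score each window and return the minimum, so one bad span fails the whole input."""
    chunks = windows(tokens, size, overlap)
    if not chunks:
        return 1.0  # assumed convention: nothing to flag in empty input
    return min(scorer(chunk) for chunk in chunks)


if __name__ == "__main__":
    # Toy scorer: flags any window containing the literal token "BAD".
    def toy(chunk: list[str]) -> float:
        return 0.0 if "BAD" in chunk else 1.0

    convo = "the agent looked up the balance then BAD transfer happened".split()
    print(windows(convo, 4, 1))
    print(score_long(convo, toy))
```

The overlap keeps a problematic span from being split across a window boundary and missed, at the cost of scoring some tokens twice.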

Animated demo of a telecom AI agent powered by Luna 2 responding in real time with guardrails in place. Showcases live interaction handling with ultra-low latency and accurate decision-making, ideal for customer service and enterprise AI workflows.

Available Now

Luna-2 is available today in two sizes, 3B and 8B parameters, and powers evaluation across the entire Galileo platform.

For enterprise teams ready to move beyond basic monitoring and deploy truly reliable AI agents, Luna-2 represents a fundamental shift in what's possible.

As AI agents become more autonomous and handle higher-stakes decisions, evaluation can't remain an afterthought. Luna-2 makes comprehensive agent evaluation and in-production guardrails not just feasible but economical, enabling the next generation of reliable agentic AI.

Ready to see Luna-2 in action? Contact our team to learn how Fortune 50 companies are using Luna-2 to scale reliable agent workflows.

Luna-2 is available now for Galileo enterprise customers. Learn more about our agent reliability platform or schedule a demo to see how Luna-2 can power your agentic workflows and real-time guardrails.
