Jun 18, 2025
Introducing Luna 2: Purpose-Built Models for Reliable AI Evaluations & Guardrailing


Conor Bronsdon
Head of Developer Awareness


AI agents are already everywhere, from customer support chatbots that handle millions of conversations to financial services agents making decisions with real money. But as these agents get more autonomous and complex, evaluation and real-time guardrailing become the bottleneck. You can’t have your AI systems selling a car for $1.
Traditional LLM-based evaluation is too slow and expensive for production agent workflows or real-time guardrailing. GPT-4 evals that cost $50+ per thousand evaluations and take multiple seconds to complete won’t work when you need to evaluate 10-20 agent metrics simultaneously in real time. You need more customization and better price-performance to build reliable AI systems at scale.
That’s why we're excited to introduce Luna 2—our next generation of small language models already powering customized evaluations and guardrails at multiple Fortune 50 companies. Luna 2 is purpose-built for customized real-time guardrailing based on low-latency, low-cost evaluations, designed with complex multi-agent systems in mind.
The Agent Evaluation Challenge
Enterprise teams building AI agents have so far faced a fundamental tradeoff: comprehensive evaluation or efficient agents.
When your financial services agent is about to initiate a $10M transfer, you need real-time checks for:
Tool selection quality (Is this the right action?)
Flow adherence (Is the agent following proper procedures?)
Unsafe actions (Does this require additional authorization?)
Response groundedness (Is the agent making up information?)
Running 10+ evaluation metrics through LLM judges like GPT-4.1 on every agent interaction would cost a fortune and introduce unacceptable latency. So most teams either skip comprehensive evaluation or settle for basic heuristics that miss critical edge cases. Even worse, this trust and reliability blocker is stopping many exciting AI use cases from making it into production.

Luna 2 eliminates this tradeoff.
Luna 2: Built for Production Agent Workflows
Luna 2 delivers enterprise-grade evaluation with the speed and cost efficiency that production systems demand. The models are hosted on Galileo’s proprietary optimized inference engine and run on modern GPU hardware for low-cost, low-latency evaluations. They can adapt to hundreds of metrics while remaining deterministic, and they offer superior out-of-the-box metrics for agentic evaluation and reliability. This approach enables organizations to use both custom fine-tuned LLMs and SLMs as judges. Luna 2 offers:
Adaptability: Easily customizable with minimal data
Efficiency: Low latency when running multiple metrics (10-20) simultaneously
Cost-effectiveness: Lower cost compared to traditional LLM-based evaluation
Sophisticated agentic metrics: Tool error rate, context adherence, tool selection quality, etc.

Let’s dig into some of the specifics of the Luna 2 models. Key elements include:
Ultra-Low Latency at Scale
Sub-200ms latency even when running 10-20 metrics simultaneously
Millisecond-level verdicts that don't slow down user interactions
Real-time guardrails that can intercept risky agent actions before execution
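To make the idea of running several metrics at once and intercepting an action before execution concrete, here is a minimal Python sketch. It is not Galileo's API: the scorer functions, the 0.5 threshold, and the choice to block on timeout are all assumptions made for illustration, and the `asyncio.sleep` calls stand in for calls to an evaluation model. Only the 200 ms budget comes from the figures above.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical scorer type: takes a proposed agent action, returns a score in [0, 1].
Scorer = Callable[[str], Awaitable[float]]


async def tool_selection_quality(action: str) -> float:
    await asyncio.sleep(0.01)  # stand-in for a call to an evaluation model
    return 0.9


async def unsafe_action(action: str) -> float:
    await asyncio.sleep(0.01)  # stand-in for a call to an evaluation model
    # Toy rule for illustration only: treat transfers as low-safety.
    return 0.1 if "transfer" in action else 0.95


async def evaluate_all(
    action: str, scorers: Dict[str, Scorer], budget_s: float = 0.2
) -> Dict[str, float]:
    """Run every scorer concurrently and enforce one latency budget for the batch."""
    names = list(scorers)
    results = await asyncio.wait_for(
        asyncio.gather(*(scorers[name](action) for name in names)),
        timeout=budget_s,
    )
    return dict(zip(names, results))


async def guarded_execute(
    action: str, scorers: Dict[str, Scorer], threshold: float = 0.5
) -> str:
    """Block the action if any metric scores below the threshold or times out."""
    try:
        scores = await evaluate_all(action, scorers)
    except asyncio.TimeoutError:
        return "blocked: evaluation timed out"
    failing = sorted(name for name, score in scores.items() if score < threshold)
    if failing:
        return "blocked: " + ", ".join(failing)
    return "executed"
```

Because the scorers run concurrently, the wall-clock time of the batch is roughly that of the slowest metric rather than the sum of all of them. Blocking when the budget is exceeded is one possible policy; a lower-stakes workflow might prefer to let the action through and log the miss.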
Cost-Efficient by Design
~$0.02 per million tokens—97% cheaper than GPT-4 evaluation
Shared infrastructure across all metrics reduces hosting overhead
Always-on monitoring without breaking the budget
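A quick back-of-the-envelope calculation shows what per-token pricing means for always-on monitoring. The sketch below uses only the two figures quoted above ($0.02 per million tokens and 97% savings); the traffic volume, metrics per interaction, and tokens per evaluation are hypothetical inputs you would replace with your own.

```python
LUNA_PRICE_PER_MILLION_TOKENS = 0.02  # USD, figure quoted above
CLAIMED_SAVINGS = 0.97                # "97% cheaper", figure quoted above


def implied_baseline_price(price_per_million: float, savings: float) -> float:
    """Baseline price per million tokens implied by a fractional savings claim."""
    return price_per_million / (1.0 - savings)


def monthly_cost(
    interactions: int,
    metrics_per_interaction: int,
    tokens_per_evaluation: int,
    price_per_million: float,
) -> float:
    """Total monthly evaluation spend in USD for a given traffic profile."""
    total_tokens = interactions * metrics_per_interaction * tokens_per_evaluation
    return total_tokens / 1_000_000 * price_per_million
```

For example, assuming one million interactions a month, 10 metrics per interaction, and 500 tokens per evaluation, `monthly_cost(1_000_000, 10, 500, LUNA_PRICE_PER_MILLION_TOKENS)` comes to about $100, while the same workload at the baseline price implied by the 97% figure (roughly $0.67 per million tokens) comes to about $3,333.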
Purpose-Built for Agents
Luna 2 includes out-of-the-box metrics explicitly designed for agentic systems:
Tool selection quality - Is the agent choosing the right tools?
Flow adherence - Is the agent following the intended workflow?
Unsafe action detection - Will this action cause harm or policy violations?
Multi-turn conversation quality - Is the agent maintaining context across turns?

Beyond Out-of-the-Box: Customization at Speed
While Luna 2 comes pre-optimized for common agentic metrics, every enterprise has unique evaluation needs. Luna 2 adapts to your specific requirements with minimal data, often fewer than 50 labeled examples.
Whether you need custom definitions of "appropriate customer tone" or industry-specific compliance checks, Luna 2 can be fine-tuned to your domain in minutes, not months.
Technical Innovation: One Model, Hundreds of Metrics
Luna 2's architecture also enables unprecedented efficiency:
Multi-Headed Design: Lightweight adapters on a shared core model let one base model power hundreds of different metrics without multiplying infrastructure costs.
Optimized Inference Engine: Hosted on Galileo's proprietary inference layer with L4 GPUs for consistent performance at massive scale.
Intelligent Context Handling: Dynamic windowing ensures comprehensive evaluation even for long agent conversations and complex tool sequences.
This isn't just faster evaluation; it's a fundamentally different approach that makes comprehensive AI monitoring in production economically viable.
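The multi-headed idea can be illustrated with a toy Python sketch: the expensive shared computation runs once per input, and each metric adds only a small head on top. This is not Luna 2's actual architecture. The `shared_encoder` features, the `Head` class, and the weights are hypothetical stand-ins, and a simple linear head is used here in place of real adapters.

```python
import math
from typing import Dict, List


def shared_encoder(text: str) -> List[float]:
    """Stand-in for the shared core model: runs once per input.

    Produces three toy features instead of real model activations.
    """
    words = text.lower().split()
    risky = {"transfer", "delete", "refund"}
    return [
        len(words) / 10.0,                     # rough length signal
        float(sum(w in risky for w in words)),  # count of risky verbs
        1.0,                                    # bias feature
    ]


class Head:
    """Lightweight per-metric head: one weight vector over the shared features."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = weights

    def score(self, features: List[float]) -> float:
        z = sum(w * f for w, f in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))  # squash to (0, 1)


def evaluate(text: str, heads: Dict[str, Head]) -> Dict[str, float]:
    """Encode once, then score every metric from the same features."""
    features = shared_encoder(text)
    return {name: head.score(features) for name, head in heads.items()}
```

Adding a new metric in this sketch means adding one more entry to `heads`, for example `Head([0.0, 2.0, -1.0])` for an unsafe-action score. The encoder cost stays constant no matter how many heads are attached, which is the property the shared-core design relies on.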

Available Now
Luna 2 is available today in two sizes, 3B and 8B parameters, and powers evaluation across the entire Galileo platform.
For enterprise teams ready to move beyond basic monitoring and deploy truly reliable AI agents, Luna 2 represents a fundamental shift in what's possible.
As AI agents become more autonomous and handle higher-stakes decisions, evaluation can't remain an afterthought. Luna 2 makes comprehensive agent evaluation and in-production guardrails not just feasible but economical, enabling the next generation of reliable agentic AI.
Ready to see Luna 2 in action? Contact our team to learn how Fortune 50 companies are using Luna 2 to scale reliable agent workflows.
Luna 2 is available now for Galileo enterprise customers. Learn more about our agent reliability platform or schedule a demo to see how Luna 2 can power your agentic workflows and real-time guardrails.