Introducing Agentic Evaluations

Quique Lores, Product
Murtaza Khomusi, Head of Product Marketing
Introducing Agentic Evaluations: Everything developers need to build, ship, and scale best-in-class AI agents.
3 min read · January 23, 2025

Join Galileo and CrewAI to see agentic evaluations in action. Register here!

Agents. Agents. Agents. It’s safe to say that agentic applications are dominating AI developer airwaves, but are they really ready to be deployed in real-world contexts? Without a robust framework for evaluating agents, they’re only as good as the latest science experiments—promising in theory but challenging to implement or rely on in the real world.

Today, Galileo is excited to announce the release of Agentic Evaluations, empowering developers to rapidly deploy reliable and resilient agentic applications. Core capabilities include:

  • Agent-specific metrics: Leverage proprietary, research-backed metrics powered by LLM-as-a-Judge for measuring the success of individual spans as well as overall task advancement and completion.
  • Visibility into LLM planning and tool use: Log every step, from input to final action, in actionable visualizations that make it easy to find areas for further optimization.
  • Tracking of cost, latency, and errors: Optimize agent performance, empowering builders to strike the balance between efficiency and effectiveness in agent deployments.

What makes agent evaluations challenging?

Agents introduce groundbreaking capabilities to GenAI applications, including the ability to autonomously plan and act. However, agents introduce novel challenges for developers, which traditional evaluation tools fail to address:

  • Non-deterministic paths: LLM planners can choose more than one sequence of actions to respond to a user request, a complexity that traditional LLM-as-a-Judge frameworks aren’t built to capture.
  • Increased failure points: Complex workflows require visibility across multiple steps and parallel processes, so evaluation needs the context of the entire session.
  • Cost and latency management: Agents rely on multiple calls to different LLMs. This can make it difficult to balance cost versus performance.

As agents take on complex and impactful workflows, the stakes—and the potential impact of errors—grow significantly. These risks, coupled with growing demand for agentic applications, increase the need for precise and actionable insights. Galileo’s Agentic Evaluations tackle these challenges head on with agent-specific metrics, updated tracing, and granular cost and error tracking.

Agent-Specific Metrics

Traditional GenAI metrics focus on evaluating the final response of an LLM, measuring things like factuality, PII leakage, toxicity, bias, and more. With agents, there’s a lot more happening under the hood before arriving at a final action. Agents use an LLM Planner to interpret a user’s intent and goals, make decisions, and determine which tools to call on the way to a final action.

Figure 1: AI builders need to evaluate their agent performance at multiple steps.

What this means in practice is that, under the hood, agents take multiple steps to get from user input to final action, and an agent may take any number of paths to get there. While this flexibility is a key value proposition of agents, it also multiplies the potential points of failure. Galileo has built a set of proprietary LLM-as-a-Judge metrics specifically for developers building agents (a simplified sketch of the pattern follows the list below). The metrics in our Agentic Evaluations include:

  • Tool Selection Quality: Did the LLM Planner select the correct tool and arguments?
  • Tool Errors: Did any individual tool error out?
  • Action Advancement: Does each trace reflect progress toward the ultimate goal?
  • Action Completion: Does the final action align with the agent’s original instructions?
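
The exact prompts and scoring logic behind these metrics are proprietary, but the general LLM-as-a-Judge pattern behind a metric like Tool Selection Quality can be sketched in a few lines. The example below is a minimal illustration, not Galileo’s implementation: it assumes a generic `judge_llm` callable (any function that sends a prompt to a model and returns text) and a hypothetical tool schema.

```python
import json
from typing import Callable

# Hypothetical judge prompt; real metrics use proprietary, research-backed prompts and scoring.
JUDGE_PROMPT = """You are evaluating an AI agent's tool selection.

User request: {user_request}
Available tools: {tools}
Tool the planner chose: {chosen_tool}
Arguments passed: {arguments}

Did the planner select the correct tool with appropriate arguments?
Answer with a JSON object: {{"correct": true or false, "explanation": "..."}}"""


def tool_selection_quality(
    judge_llm: Callable[[str], str],  # any function that sends a prompt to an LLM and returns text
    user_request: str,
    tools: list[dict],
    chosen_tool: str,
    arguments: dict,
) -> dict:
    """Score a single planner step with an LLM-as-a-Judge prompt (illustrative only)."""
    prompt = JUDGE_PROMPT.format(
        user_request=user_request,
        tools=json.dumps(tools),
        chosen_tool=chosen_tool,
        arguments=json.dumps(arguments),
    )
    verdict = json.loads(judge_llm(prompt))  # assumes the judge model returns valid JSON
    return {"score": 1.0 if verdict["correct"] else 0.0, "explanation": verdict["explanation"]}
```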

Figure 2: New agent-specific metrics with aggregate scores shown across multiple runs in the Galileo Evaluation Platform.

These metrics are tested and refined by our research team to perform well on a series of popular benchmark datasets; the two metrics shown in Figure 3 reach AUCs of 0.93 and 0.97, respectively. In addition, these metrics are refined through customer learnings, incorporating best practices from leading real-world AI teams and their agentic applications.
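
As a refresher on what an AUC number like this means, the calculation compares a metric’s scores against ground-truth labels on a benchmark set. The snippet below uses scikit-learn’s `roc_auc_score` with made-up labels and scores purely to illustrate the computation; it is not Galileo’s benchmark data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical benchmark: 1 = the planner's tool selection was actually correct, 0 = it was not.
# The scores are what an LLM-as-a-Judge metric predicted for each example.
ground_truth_labels = [1, 0, 1, 1, 0, 1, 0, 1]
metric_scores = [0.92, 0.15, 0.88, 0.75, 0.70, 0.95, 0.20, 0.66]

# An AUC of 1.0 means the metric ranks every correct selection above every incorrect one;
# 0.5 is no better than chance.
print(f"AUC: {roc_auc_score(ground_truth_labels, metric_scores):.2f}")
```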

Figure 3: ROC curve of Galileo’s Tool Selection Quality and Tool Error metrics, showing high area under the curve (AUC) when tested on benchmark datasets.

Visibility into LLM planning and tool use

The multi-span nature of agent completions makes sifting through individual logs to pinpoint issues challenging. To provide both a high-level overview of an entire agentic completion and granular insight into individual nodes, Galileo automatically groups entire traces and provides a tool-use overview in a single expandable visualization.

Figure 4: A multi-span trace in the Galileo Evaluation platform, showing tool and LLM calls all part of a single completion.

Galileo makes it simple to get overviews of agentic completions as well as granular insights into the performance of individual steps in the chain. Developers no longer have to manually sift through rows of logs; they can rapidly pinpoint areas for improvement and measure overall application health.
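
As a mental model for what gets grouped into one of these traces, think of a completion as a trace holding an ordered list of spans, each recording a planner LLM call or a tool call together with its input, output, latency, and cost. The schema below is an illustrative sketch, not Galileo’s actual data model or SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside an agent completion: an LLM planner call or a tool call."""
    name: str                 # e.g. "planner" or "search_flights" (hypothetical names)
    span_type: str            # "llm" or "tool"
    input: str
    output: str
    latency_ms: float
    cost_usd: float = 0.0
    error: str | None = None  # populated when a tool errors out


@dataclass
class Trace:
    """A full agent completion, from user input to final action."""
    user_input: str
    final_action: str
    spans: list[Span] = field(default_factory=list)


# A toy flight-booking trace: a planner call followed by a tool call.
trace = Trace(
    user_input="Book me the cheapest flight to Denver on Friday",
    final_action="Booked UA482 for $212",
    spans=[
        Span("planner", "llm", "Book cheapest flight...", "call search_flights(...)", 820.0, 0.004),
        Span("search_flights", "tool", '{"dest": "DEN", "date": "Friday"}', "[UA482 $212, ...]", 310.0),
    ],
)
```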

By the way, we just published our comprehensive book on Agents that will take you from 0 -> 1. It is completely free and available now for download.

Tracking of cost, latency, and errors

With so much happening under the hood, can agents remain cost-effective and performant in real-world settings? As we’ve learned across engagements, what sets the best GenAI apps apart is that every node in the chain has been rigorously tested to optimize for cost and latency.

Figure 5: Using Galileo to measure the cost and performance of an agent during refinement.

Galileo aggregates the cost and latency of end-to-end traces while letting users drill down to find which node is causing cost spikes or slowdowns. Developers can experiment and A/B test runs side by side, making it easier to ship reliable, efficient apps.
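
Drilling into the slowest or most expensive node is conceptually straightforward once spans carry latency and cost. Continuing the hypothetical `Trace`/`Span` schema from the earlier sketch (again, an illustration rather than Galileo’s implementation):

```python
def summarize_trace(trace: Trace) -> dict:
    """Aggregate end-to-end cost/latency and flag the heaviest spans (illustrative only)."""
    total_cost = sum(s.cost_usd for s in trace.spans)
    total_latency = sum(s.latency_ms for s in trace.spans)
    slowest = max(trace.spans, key=lambda s: s.latency_ms)
    priciest = max(trace.spans, key=lambda s: s.cost_usd)
    return {
        "total_cost_usd": round(total_cost, 4),
        "total_latency_ms": total_latency,
        "slowest_span": slowest.name,
        "most_expensive_span": priciest.name,
        "errors": [s.name for s in trace.spans if s.error],
    }


# Using the toy trace from the earlier sketch:
print(summarize_trace(trace))
# e.g. {'total_cost_usd': 0.004, 'total_latency_ms': 1130.0, 'slowest_span': 'planner', ...}
```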

Conclusion

This marks a significant milestone for anyone building AI agents. Builders are already using these tools to accelerate time-to-production of reliable and scalable agentic apps, and we can’t wait to see where they go next.

To try it for yourself, register for access to the Galileo Evaluation Platform, including our latest Agentic Evaluations, and speak with a Galileo expert. For a deep dive into best practices for evaluating agents, be sure to tune into our webinar on agentic evaluations.
