Agents. Agents. Agents. It’s safe to say that agentic applications are dominating AI developer airwaves, but are they really ready to be deployed in real-world contexts? Without a robust framework for evaluating agents, they’re only as good as the latest science experiments: promising in theory, but hard to implement or rely on in the real world.
Today, Galileo is excited to announce the release of Agentic Evaluations, empowering developers to rapidly deploy reliable and resilient agentic applications. Core capabilities include agent-specific metrics, trace-level visualizations, and granular cost and error tracking.
Agents introduce groundbreaking capabilities to GenAI applications, including the ability to autonomously plan and act. However, they also create novel challenges for developers that traditional evaluation tools fail to address: multi-step execution, many possible paths from user input to final action, and new points of failure at each step.
As agents take on complex and impactful workflows, the stakes—and the potential impact of errors—grow significantly. These risks, coupled with growing demand for agentic applications, increase the need for precise and actionable insights. Galileo’s Agentic Evaluations tackle these challenges head-on with agent-specific metrics, updated tracing, and granular cost and error tracking.
Traditional GenAI metrics focus on evaluating the final response of an LLM, measuring things like factuality, PII leakage, toxicity, bias, and more. With agents, there’s a lot more happening under the hood before arriving at a final action. Agents use an LLM planner to make decisions and determine which tools to call to reach a final action, based on their understanding of the user’s intent and goals.
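To make this concrete, here is a minimal, hypothetical sketch of such a loop: a planner LLM chooses a tool at each step, so one user request fans out into several spans before a final action. The `call_planner` and `tools` arguments and the `Trace`/`Span` shapes are illustrative assumptions, not any particular framework’s API.

```python
# Minimal, illustrative agent loop: a planner LLM picks a tool at each step,
# so one user request produces several spans before the final action.
from dataclasses import dataclass, field

@dataclass
class Span:
    tool: str    # which tool (or "final_answer") the planner chose
    input: str   # what the planner passed to the tool
    output: str  # what the tool returned

@dataclass
class Trace:
    user_input: str
    spans: list = field(default_factory=list)

def run_agent(user_input: str, call_planner, tools: dict, max_steps: int = 5) -> Trace:
    """call_planner(history) -> (tool_name, tool_input); tools maps names to callables."""
    trace = Trace(user_input=user_input)
    history = [("user", user_input)]
    for _ in range(max_steps):
        tool_name, tool_input = call_planner(history)
        if tool_name == "final_answer":
            trace.spans.append(Span("final_answer", tool_input, tool_input))
            break
        output = tools[tool_name](tool_input)
        trace.spans.append(Span(tool_name, tool_input, output))
        history.append((tool_name, output))
    return trace
```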
In practice, this means agents take multiple steps under the hood to get from user input to final action, and there are any number of paths an agent may take. While this flexibility is a key value proposition of agents, it also increases potential points of failure. Galileo has built a set of proprietary LLM-as-a-Judge metrics specifically for developers building agents, available today in our Agentic Evaluations.
These metrics have been tested and refined by our research team and perform well on a series of popular benchmark datasets (AUCs of 0.93 and 0.97; see Figure 3). In addition, they are continuously refined through customer learnings, incorporating best practices from leading AI teams and their real-world agentic applications.
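As a rough illustration of how an LLM-as-a-Judge metric can score a single agent step, the sketch below asks a judge model whether the planner selected an appropriate tool for the user’s goal. The prompt wording and the `judge_llm` callable are assumptions for illustration, not Galileo’s internal implementation.

```python
# Illustrative LLM-as-a-Judge scorer for tool selection quality.
# `judge_llm` is any callable that takes a prompt string and returns a model reply.

JUDGE_PROMPT = """You are grading an AI agent's tool choice.
User goal: {goal}
Available tools: {tools}
Tool the agent selected: {selected}
Arguments passed: {arguments}

Answer with a single word, "correct" or "incorrect"."""

def tool_selection_quality(judge_llm, goal, tools, selected, arguments) -> float:
    prompt = JUDGE_PROMPT.format(
        goal=goal, tools=", ".join(tools), selected=selected, arguments=arguments
    )
    verdict = judge_llm(prompt).strip().lower()
    # Map the judge's verdict to a 0/1 score for the step.
    return 1.0 if verdict.startswith("correct") else 0.0
```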
The multi-span nature of agent completions makes sifting through individual logs to pinpoint issues challenging. To enable both a high-level overview of entire agentic completions and granular insights into individual nodes, Galileo automatically groups entire traces and provides a tool-use overview in a single expandable visualization.
This makes it simple to get an overview of each agentic completion as well as granular insight into the performance of individual steps in the chain. Developers no longer have to manually sift through rows of logs; they can rapidly pinpoint areas for improvement and measure overall application health.
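To picture what a grouped trace looks like in data, the sketch below rolls every span of a completion into one summary row while keeping per-node detail available for drill-down. It reuses the hypothetical `Trace`/`Span` shapes from the earlier sketch rather than any real Galileo schema.

```python
# Group a multi-span trace into one summary row plus expandable per-node detail.

def summarize_trace(trace) -> dict:
    tool_counts = {}
    for span in trace.spans:
        tool_counts[span.tool] = tool_counts.get(span.tool, 0) + 1
    return {
        "user_input": trace.user_input,
        "num_steps": len(trace.spans),
        "tools_used": tool_counts,                      # high-level overview
        "nodes": [vars(span) for span in trace.spans],  # granular drill-down
    }
```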
By the way, we just published our comprehensive book on Agents that will take you from 0 -> 1. It is completely free and available now for download.
With so much happening under the hood, can agents remain cost-effective and performant in real-world settings? As we’ve learned across engagements, what sets the best GenAI apps apart is that every node in the chain has been rigorously tested to optimize for cost and latency.
Galileo aggregates the cost and latency of end-to-end traces while letting users drill down to find which node is causing cost spikes or slowdowns. Developers can experiment and A/B test runs side by side, making it easier to ship reliable, efficient apps.
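As a rough sketch of the bookkeeping involved, the snippet below totals cost and latency per node and compares two runs side by side. The `cost_usd` and `latency_ms` field names are assumed placeholders for whatever your logging layer records.

```python
# Aggregate cost and latency per node so spikes are easy to localize,
# then compare two runs (e.g. prompt A vs. prompt B) side by side.
from collections import defaultdict

def per_node_totals(spans):
    """spans: iterable of dicts with 'tool', 'cost_usd', 'latency_ms' keys (assumed schema)."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "latency_ms": 0.0, "calls": 0})
    for s in spans:
        node = totals[s["tool"]]
        node["cost_usd"] += s["cost_usd"]
        node["latency_ms"] += s["latency_ms"]
        node["calls"] += 1
    return dict(totals)

def compare_runs(spans_a, spans_b):
    """Side-by-side per-node totals for an A/B comparison of two runs."""
    a, b = per_node_totals(spans_a), per_node_totals(spans_b)
    return {tool: {"A": a.get(tool), "B": b.get(tool)} for tool in set(a) | set(b)}
```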
This marks a significant milestone for anyone building AI agents. Builders are already using these tools to accelerate time-to-production of reliable and scalable agentic apps, and we can’t wait to see where they go next.
To try it for yourself, register for access to the Galileo Evaluation Platform, including our latest Agentic Evaluations, and speak with a Galileo expert. For a deep dive into best practices for evaluating agents, be sure to tune in to our webinar on agentic evaluations.