Jan 22, 2025

Introducing Agentic Evaluations

Quique Lores

Head of Product


Join Galileo and CrewAI to see agentic evaluations in action. Register here!

Agents. Agents. Agents. It’s safe to say that agentic applications are dominating AI developer airwaves, but are they really ready to be deployed in real-world contexts? Without a robust framework for evaluating agents, they’re only as good as the latest science experiments: promising in theory but challenging to implement or rely on in practice.

Today, Galileo is excited to announce the release of Agentic Evaluations, empowering developers to rapidly deploy reliable and resilient agentic applications. Core capabilities include:

  • Agent-specific metrics: Leverage proprietary, research-backed metrics powered by LLM-as-a-Judge for measuring the success of individual spans as well as overall task advancement and completion.

  • Visibility into LLM planning and tool use: Log every step, from input to final action, in actionable visualizations that make it easy to find areas for further optimization.

  • Tracking of cost, latency, and errors: Optimize agent performance, empowering builders to strike the balance between efficiency and effectiveness in agent deployments.

What makes agent evaluations challenging?

Agents introduce groundbreaking capabilities to GenAI applications, including the ability to autonomously plan and act. However, agents introduce novel challenges for developers, which traditional evaluation tools fail to address:

  • Non-deterministic paths: LLM planners can choose more than one sequence of actions to respond to a user request, a flexibility that traditional LLM-as-a-Judge frameworks were never designed to evaluate (see the sketch after this list).

  • Increased failure points: Complex workflows require visibility across multiple steps and parallel processes, so evaluation needs the context of the entire session.

  • Cost and latency management: Agents rely on multiple calls to different LLMs, which makes it difficult to balance cost against performance.
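To make the non-determinism point concrete, here is a small hypothetical illustration (the request and tool names are invented for this sketch): two different tool-call orders that both satisfy the same user request, which is why an evaluator needs the full trace rather than just the final answer.

```python
# Hypothetical example: two valid paths for the same request,
# "What will my commute cost tomorrow if it rains?"
path_a = [
    {"tool": "get_weather", "args": {"city": "SF", "date": "tomorrow"}},
    {"tool": "get_transit_fare", "args": {"route": "home->office"}},
]
path_b = [
    {"tool": "get_transit_fare", "args": {"route": "home->office"}},
    {"tool": "get_weather", "args": {"city": "SF", "date": "tomorrow"}},
]
# Both orders can produce the same correct answer, so judging only the final
# response says nothing about whether each individual step was reasonable.
```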

As agents take on complex and impactful workflows, the stakes—and the potential impact of errors—grow significantly. These risks, coupled with growing demand for agentic applications, increase the need for precise and actionable insights. Galileo’s Agentic Evaluations tackle these challenges head on with agent-specific metrics, updated tracing, and granular cost and error tracking.

Agent-Specific Metrics

Traditional GenAI metrics focus on evaluating the final response of an LLM, measuring things like factuality, PII leakage, toxicity, bias, and more. With agents, there’s a lot more happening under the hood before arriving at a final action. Agents use an LLM Planner that, based on its understanding of the user’s intent and goals, decides which tools to call on the way to a final action.
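As a rough mental model, here is a minimal sketch of that loop. It is an illustration, not Galileo’s implementation or any specific framework’s API; the planner is assumed to be any callable that returns either a tool call or a final answer.

```python
from typing import Callable

def run_agent(user_request: str, tools: dict[str, Callable], planner: Callable) -> str:
    """Minimal agent loop: the planner repeatedly picks a tool (or decides it
    is done) until the agent can commit to a final action."""
    history = [{"role": "user", "content": user_request}]
    while True:
        # The planner sees the conversation so far plus the tool catalog and
        # returns either {"type": "tool_call", ...} or {"type": "final_answer", ...}.
        decision = planner(history, list(tools))
        if decision["type"] == "final_answer":
            return decision["content"]
        # Execute the chosen tool and feed the observation back for the next step.
        result = tools[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "name": decision["tool"], "content": result})
```

Every step in this loop, from the planner call to each tool call to the final action, is a candidate failure point, which is exactly what the metrics below target.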

What this means in practice is that, under the hood, agents involve multiple steps to get from user input to final action, and there are any number of paths an agent may take. While this flexibility is a key value proposition of agents, it also increases the potential points of failure. Galileo has built a set of proprietary LLM-as-a-Judge metrics specifically for developers building agents. The metrics in our Agentic Evaluations include the following (a generic sketch of the judge pattern follows the list):

  • Tool Selection Quality: Did the LLM Planner select the correct tool and arguments?

  • Tool Errors: Did any individual tool error out?

  • Action Advancement: Does each trace reflect progress toward the ultimate goal?

  • Action Completion: Does the final action align with the agent’s original instructions?
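To give a feel for the pattern, a span-level check like Tool Selection Quality can be framed roughly as follows. This is a generic LLM-as-a-Judge sketch, not Galileo’s proprietary prompts or API; `judge_llm` stands in for any chat-completion call.

```python
import json

# Hypothetical judge prompt: show the user request, the tool catalog, and the
# planner's chosen call, then ask for a structured verdict.
JUDGE_PROMPT = """You are evaluating an AI agent's tool selection.
User request: {request}
Available tools: {tools}
Planner's call: {call}
Answer in JSON: {{"correct_tool": true/false, "correct_arguments": true/false, "explanation": "..."}}"""

def tool_selection_quality(request: str, tool_schemas: list, call: dict, judge_llm) -> dict:
    prompt = JUDGE_PROMPT.format(
        request=request,
        tools=json.dumps(tool_schemas),
        call=json.dumps(call),
    )
    # The judge's structured verdict becomes the span-level metric value.
    return json.loads(judge_llm(prompt))
```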

Our research team has tested and refined these metrics to perform well (AUC scores of 0.93 and 0.97; see Figure 3) on a series of popular benchmark datasets. They are further tuned through customer learnings, incorporating best practices from leading real-world AI teams and their agentic applications.

Visibility into LLM planning and tool use

The multi-span nature of agent completions makes it challenging to sift through individual logs to pinpoint issues. To enable both a high-level overview of entire agentic completions and granular insights into individual nodes, Galileo automatically groups entire traces and provides a tool use overview in a single expandable visualization.

Galileo makes it simple to get overviews of agentic completions as well as granular insights into the performance of individual steps in the chain. Developers no longer have to manually sift through rows of logs. They can rapidly pinpoint areas for improvement and measure overall application health.
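Conceptually, and only as an illustration rather than Galileo’s internal schema, a grouped agent trace can be thought of as a tree of spans carrying the fields evaluation needs:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent run: an LLM planner call, a tool call, etc."""
    name: str
    kind: str                # e.g. "llm", "tool", "retriever"
    input: str
    output: str
    latency_ms: float
    cost_usd: float = 0.0
    error: str | None = None
    children: list["Span"] = field(default_factory=list)  # nested / parallel steps

@dataclass
class Trace:
    """A full agent completion: the user request, final action, and all spans."""
    request: str
    final_action: str
    spans: list[Span] = field(default_factory=list)
```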

By the way, we just published our comprehensive book on Agents that will take you from 0 -> 1. It is completely free and available now for download.

Tracking of cost, latency, and errors

With so much happening under the hood, can agents remain cost-effective and performant in real-world settings? As we’ve learned across engagements, what sets the best GenAI apps apart is that every node in the chain has been rigorously tested to optimize for cost and latency.

Galileo aggregates the cost and latency of end-to-end traces while letting users drill down to find which node is causing cost spikes or slowdowns. This enables developers to experiment and run A/B tests side by side, making it easier to ship reliable, efficient apps.
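As a quick illustration of the kind of roll-up this enables (the span fields and numbers here are made up, not Galileo’s export format):

```python
# Aggregate cost and latency per trace, then find the node responsible for
# the biggest spike.
spans = [
    {"name": "planner", "latency_ms": 820, "cost_usd": 0.0031},
    {"name": "search_tool", "latency_ms": 2400, "cost_usd": 0.0},
    {"name": "summarizer", "latency_ms": 1350, "cost_usd": 0.0094},
]

total_cost = sum(s["cost_usd"] for s in spans)
total_latency = sum(s["latency_ms"] for s in spans)
slowest = max(spans, key=lambda s: s["latency_ms"])
priciest = max(spans, key=lambda s: s["cost_usd"])

print(f"trace cost ${total_cost:.4f}, latency {total_latency} ms")
print(f"slowest node: {slowest['name']}, most expensive node: {priciest['name']}")
```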

Conclusion

This marks a significant milestone for anyone building AI agents. Builders are already using these tools to accelerate time-to-production of reliable and scalable agentic apps, and we can’t wait to see where they go next.

Sign up for free access to the Galileo Evaluation Platform, including our latest Agentic Evaluations, to try it for yourself. For a deep dive into best practices for evaluating agents, be sure to tune into our webinar on agentic evaluations.
