AI agents promise to revolutionize how enterprises automate complex workflows and tasks—but only if they can consistently meet the high standards for accuracy and reliability that business-critical operations demand. The gap between experimental AI and production-ready agentic systems remains significant, with many organizations struggling to confidently deploy agents that interact with real-world systems. In this article, we explore how NVIDIA NeMo™ microservices, part of NVIDIA AI Enterprise, and the Galileo platform create a powerful toolchain enabling developers to achieve the high degree of accuracy and reliability necessary to truly de-risk agentic AI in production environments.
We'll demonstrate the tangible impact of using a purpose-built Small Language Model (SLM) as a judge, with 10x lower latency, by implementing the AI data flywheel on a real-world agentic application built at Outshift by Cisco as part of its partnership with Galileo on AGNTCY.org.
At its core, an AI data flywheel is a systematic process that creates a virtuous cycle of continuous improvement for AI systems. Galileo has built an AI data flywheel for agents using NVIDIA NeMo microservices and the Galileo platform.
NVIDIA NeMo is an end-to-end platform for building data flywheels, enabling enterprises to develop and continuously optimize their AI agents with the latest information. NeMo helps enterprise AI developers easily curate data at scale, customize large language models (LLMs) with popular fine-tuning techniques, consistently evaluate models on industry and custom benchmarks, and guardrail them for appropriate and grounded outputs.
Evaluating models and AI workflows represents a crucial stage within the data flywheel process. This step becomes particularly critical when building agentic applications, for several key reasons:
To solve this challenge, Galileo evaluates critical pieces of an agentic workflow. For example, it enables developers to measure whether an AI agent is calling the right tools and whether it has followed the expected planning steps. It also surfaces which trajectories in an agent’s execution are unexpected. Furthermore, Galileo integrates with NVIDIA NeMo Evaluator to provide additional agent-specific evaluation scenarios, so teams can comprehensively measure the performance of their AI applications.
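To make the idea of checking planning steps and spotting unexpected trajectories concrete, here is a minimal sketch in plain Python. It is not Galileo's or NeMo Evaluator's implementation or API: the step names and both helper functions are hypothetical, and a trajectory is assumed to be nothing more than an ordered list of step names.

```python
from typing import Iterable, List, Sequence


def follows_plan(trajectory: Iterable[str], expected_steps: Sequence[str]) -> bool:
    """True if every expected step occurs in the trajectory, in order.

    Extra steps in between are tolerated; only ordering is enforced.
    """
    remaining = iter(trajectory)
    # `step in remaining` consumes the iterator up to the first match,
    # so each expected step must appear after the previous one.
    return all(step in remaining for step in expected_steps)


def unexpected_steps(trajectory: Iterable[str], expected_steps: Sequence[str]) -> List[str]:
    """Steps the agent took that are not part of the expected plan."""
    allowed = set(expected_steps)
    return [step for step in trajectory if step not in allowed]


# Hypothetical plan and two hypothetical observed trajectories.
expected = ["plan", "retrieve_context", "call_tool", "respond"]
observed = ["plan", "retrieve_context", "call_tool", "call_tool", "web_search", "respond"]
reordered = ["retrieve_context", "plan", "call_tool", "respond"]

print(follows_plan(observed, expected))      # True
print(unexpected_steps(observed, expected))  # ['web_search']
print(follows_plan(reordered, expected))     # False
```

A rule-based check like this only covers trajectories with a known expected plan; judging open-ended paths is where model-based evaluation becomes necessary.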
The data flywheel is implemented through five key stages:
The process begins with NVIDIA NeMo Curator working together with Galileo's Dataset Analysis tools to produce high-quality data. This systematic curation of interaction data from AI applications includes:
This comprehensive dataset becomes the foundation for all subsequent improvements.
The next stage is model customization. It involves fine-tuning models with NVIDIA NeMo Customizer while iteratively fixing the problematic data flagged by Galileo's DEP score in its LLM FineTune Studio. This targeted approach ensures that models are optimized specifically for their intended use cases.
Stage 3 combines NVIDIA NeMo Evaluator benchmarks and LLM-as-a-Judge with additional custom metrics and diagnostic insights in Galileo Evaluate. Under the hood, agents generate and execute multiple steps to get from user input to final action. During this process, an agent may take any number of paths. While this flexibility is considered a key value proposition for agents, it also increases potential points of failure due to the non-determinism added by the LLM’s planning steps.
Galileo has built a set of proprietary LLM-as-a-Judge metrics for developers building agents. The out-of-the-box metrics in Galileo's Agentic Evaluations include:
Stage 4 in the flywheel involves implementing protective guardrails. Galileo Protect, paired with NVIDIA NeMo Guardrails, safeguards your AI application's behavior, mitigating risk and improving compliance in production. These guardrails operate with low latency and computational cost, making them practical for production environments.
This combination helps guard against hallucinations, prompt injections, toxic language, and jailbreak attacks, all of which are critical safeguards for production deployments.
The final stage of the flywheel is model deployment and observability. Galileo delivers comprehensive agentic observability powered by your model running on NVIDIA NIM microservices. This creates a continuous feedback loop, as the deployed model generates new interaction data that feeds back into Stage 1 (Data Curation), starting the cycle again.
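The five stages and the feedback loop can be sketched as a small simulation. This is an illustration of the control flow only, under stated assumptions: every function below is a hypothetical placeholder that records its stage, and none of them calls NeMo Curator, Customizer, Evaluator, Guardrails, NIM, or any Galileo product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlywheelState:
    dataset: List[str] = field(default_factory=list)  # curated interaction records
    model_version: int = 0                            # bumped on each customization
    log: List[str] = field(default_factory=list)      # stages executed, in order


def curate(state: FlywheelState, interactions: List[str]) -> None:
    """Stage 1: add new interactions to the dataset, dropping exact duplicates."""
    seen = set(state.dataset)
    for record in interactions:
        if record not in seen:
            seen.add(record)
            state.dataset.append(record)
    state.log.append("1: data curation")


def customize(state: FlywheelState) -> None:
    """Stage 2: stand-in for fine-tuning; here it only bumps the version."""
    state.model_version += 1
    state.log.append("2: model customization")


def evaluate(state: FlywheelState) -> None:
    """Stage 3: stand-in for benchmark and judge-based evaluation."""
    state.log.append("3: evaluation")


def add_guardrails(state: FlywheelState) -> None:
    """Stage 4: stand-in for configuring guardrails."""
    state.log.append("4: guardrails")


def deploy_and_observe(state: FlywheelState) -> List[str]:
    """Stage 5: stand-in for deployment; returns newly observed interactions."""
    state.log.append("5: deployment and observability")
    return [f"interaction observed on model v{state.model_version}"]


def run_flywheel(seed: List[str], turns: int) -> FlywheelState:
    state = FlywheelState()
    incoming = list(seed)
    for _ in range(turns):
        curate(state, incoming)
        customize(state)
        evaluate(state)
        add_guardrails(state)
        # Output of Stage 5 becomes the input to Stage 1 of the next turn.
        incoming = deploy_and_observe(state)
    return state


state = run_flywheel(["seed interaction A", "seed interaction B", "seed interaction A"], turns=2)
print(state.model_version, len(state.dataset))  # 2 3
```

The point of the sketch is the last line of the loop body: data observed in production is the input to the next round of curation, which is what makes the process a cycle and not a one-off pipeline.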
To demonstrate the effectiveness of this approach, let's examine how Galileo and Outshift, Cisco’s emerging technologies team, applied the AI data flywheel to improve a Pull Request (PR) Coach Agent—an agent that automatically reviews code changes and helps developers apply best practices to improve software development efficiency.
The PR Coach Agent needs to:
Getting this right requires the agent to make accurate decisions about which tools to use at each process step—from code analysis to recommendation generation.
Here we look at how the agent is measured on tool selection: the Tool Selection Quality metric detects any invalid tool selections made by the model.
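As a rough illustration of what a tool selection check measures, the sketch below scores a trace against hand-labeled valid tools. It is a simplified, rule-based stand-in and not Galileo's metric, which is judge-based. The `Step` type, the tool names, and the labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Step:
    """One agent step: the tool it chose and the tools a reviewer deems valid."""
    chosen_tool: str
    valid_tools: FrozenSet[str]


def invalid_selections(steps: List[Step]) -> List[int]:
    """Indices of steps where the chosen tool is not among the valid ones."""
    return [i for i, step in enumerate(steps) if step.chosen_tool not in step.valid_tools]


def tool_selection_quality(steps: List[Step]) -> float:
    """Fraction of steps with a valid tool choice; 1.0 for an empty trace."""
    if not steps:
        return 1.0
    return 1.0 - len(invalid_selections(steps)) / len(steps)


# Hypothetical trace for a PR review agent; tool names are illustrative only.
trace = [
    Step("fetch_diff", frozenset({"fetch_diff"})),
    Step("run_linter", frozenset({"run_linter", "static_analysis"})),
    Step("post_comment", frozenset({"suggest_fix"})),  # invalid choice
    Step("suggest_fix", frozenset({"suggest_fix"})),
]

print(invalid_selections(trace))      # [2]
print(tool_selection_quality(trace))  # 0.75
```

In practice the "valid tools" label is the hard part: it depends on context and intent, which is why this judgment is typically delegated to a judge model.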
Galileo and Cisco teams partnered to create highly accurate evaluations and guardrails to de-risk this multi-agentic system. They implemented the AI data flywheel using Galileo software integrated with NVIDIA AI for this use case:
The optimized approach delivered remarkable improvements:
The newly fine-tuned model can also run as an accurate guardrail on NVIDIA NIM microservices with Galileo Protect at 400 ms latency, compared with the 5 to 7 seconds taken by an LLM-as-a-Judge on this agentic use case.
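The latency figures above imply the following speedup range. The snippet is plain arithmetic on the numbers quoted in the text; the constant and function names are illustrative.

```python
SLM_LATENCY_MS = 400                   # fine-tuned model as a guardrail
LLM_JUDGE_LATENCY_MS = (5_000, 7_000)  # LLM-as-a-Judge, low and high end


def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the baseline."""
    return baseline_ms / new_ms


low, high = (speedup(ms, SLM_LATENCY_MS) for ms in LLM_JUDGE_LATENCY_MS)
print(f"{low:.1f}x to {high:.1f}x lower latency")  # 12.5x to 17.5x lower latency
```

Both ends of the range are above the 10x figure cited earlier in the article.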
The Galileo platform, integrated with NVIDIA NeMo microservices, demonstrates how the AI data flywheel can turn agentic system development into a systematic engineering practice of iterative improvement. By implementing this approach, organizations can:
As enterprises increasingly look to deploy agent-based systems for business-critical tasks, this kind of systematic, measurement-driven approach will be essential for success, particularly when creating multi-agent systems.
To learn more about building trustworthy AI agents with Galileo and NVIDIA NeMo, visit galileo.ai or contact us at [email protected]. Additionally, download and get started with NVIDIA NeMo microservices from the NVIDIA NGC Catalog.