A Powerful Data Flywheel for De-Risking Agentic AI

Yash Sheth, COO
[Figure: AI data flywheel diagram showing the continuous improvement cycle for agentic AI: Step 1, Dataset Curation (NeMo Curator and Galileo's analysis tools); Step 2, Model Customization (NeMo Customizer); Step 3, Model Evaluation (NeMo Evaluator and Galileo metrics); Step 4, Guardrailing (NeMo Guardrails and Galileo Protect); Step 5, Deploy Custom Model (NVIDIA NIM integration). The diagram includes performance metrics showing improvements in context adherence and tool selection quality.]
5 min read · April 23, 2025

AI agents promise to revolutionize how enterprises automate complex workflows and tasks—but only if they can consistently meet the high standards for accuracy and reliability that business-critical operations demand. The gap between experimental AI and production-ready agentic systems remains significant, with many organizations struggling to confidently deploy agents that interact with real-world systems. In this article, we explore how NVIDIA NeMo™ microservices, part of NVIDIA AI Enterprise, and the Galileo platform create a powerful toolchain enabling developers to achieve the high degree of accuracy and reliability necessary to truly de-risk agentic AI in production environments.

We'll demonstrate the tangible impact of having a purpose-built Small Language Model (SLM) as a judge with 10x lower latency by implementing the AI data flywheel on a real-world agentic application built at Outshift by Cisco as part of its partnership with Galileo on AGNTCY.org.

AI Data Flywheel to Continuously Improve Agents

At its core, an AI data flywheel is a systematic process that creates a virtuous cycle of continuous improvement for AI systems. Galileo has built an AI data flywheel for agents using NVIDIA NeMo microservices and the Galileo platform.

NVIDIA NeMo is an end-to-end platform for building data flywheels, enabling enterprises to develop and continuously optimize their AI agents with the latest information. NeMo helps enterprise AI developers easily curate data at scale, customize large language models (LLMs) with popular fine-tuning techniques, consistently evaluate models on industry and custom benchmarks, and guardrail them for appropriate and grounded outputs.

Evaluating models and AI workflows represents a crucial stage within the data flywheel process. This step becomes particularly critical when building agentic applications, for several key reasons:

  • Agents make complex, multi-step decisions that are difficult to evaluate holistically.
  • Agents interact with real-world systems where errors can have significant consequences.
  • Traditional testing approaches fall short when dealing with the open-ended nature of agent behaviors.

To solve this challenge, Galileo evaluates critical pieces of an agentic workflow. For example, it enables developers to measure whether an AI agent is calling the right tools and whether it has followed the expected planning steps. It even provides powerful insights on which trajectories are unexpected in an agent's execution. Furthermore, Galileo integrates with NVIDIA NeMo Evaluator to provide additional agent-specific evaluation scenarios that empower teams to comprehensively measure the performance of their AI applications.

The data flywheel is implemented through five key stages:

1. Data Curation

The process begins with NVIDIA NeMo Curator working together with Galileo's Dataset Analysis tools to produce high-quality data. This systematic curation of interaction data from AI applications includes:

  • Agent planning processes
  • Tool selection decisions
  • API calls and responses
  • Final outputs and user feedback

This comprehensive dataset becomes the foundation for all subsequent improvements.

2. Model Customization

The next stage is model customization. It involves fine-tuning models using NVIDIA NeMo Customizer while iteratively fixing problematic data identified by Galileo's DEP score within its LLM FineTune Studio. This targeted approach ensures that models are optimized specifically for their intended use cases.

3. Comprehensive Evaluation

Stage 3 combines NVIDIA NeMo Evaluator benchmarks and LLM-as-a-Judge with additional custom metrics and diagnostic insights in Galileo Evaluate. Under the hood, agents generate and execute multiple steps to get from user input to final action. During this process, an agent may take any number of paths. While this flexibility is considered a key value proposition for agents, it also increases potential points of failure due to the non-determinism added by the LLM’s planning steps.

Galileo has built a set of proprietary LLM-as-a-Judge metrics for developers building agents. The out-of-the-box metrics in Galileo's Agentic Evaluations include:

  • Tool Selection Quality: Did the LLM Planner select the correct tool and arguments?
  • Tool Errors: Did any individual tool error out?
  • Action Advancement: Does each trace reflect progress toward the ultimate goal?
  • Action Completion: Does the final action align with the agent's original instructions?
  • Context Adherence: How well does the agent adhere to the provided context?
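To make the first of these metrics concrete, here is a minimal sketch of a Tool Selection Quality check, assuming a reference list of expected tool calls per planning step is available. It illustrates the idea only and is not Galileo's implementation:

```python
# Hypothetical Tool Selection Quality (TSQ) scorer: fraction of planning steps
# where the planner picked the expected tool AND arguments.
def tool_selection_quality(selected, expected):
    """selected/expected: lists of (tool_name, args_dict), one per planning step."""
    if not expected:
        return 1.0
    hits = sum(1 for s, e in zip(selected, expected) if s == e)
    return hits / len(expected)
```

In production, Galileo's LLM-as-a-Judge metrics score these decisions without requiring a ground-truth reference for every step; an exact-match scorer like this is the simplest offline analogue.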

4. Protective Guardrails

Stage 4 in the flywheel involves implementing protective guardrails. Galileo Protect, when paired with NVIDIA NeMo Guardrails, serves as the ultimate tool for safeguarding your AI application's behavior and mitigating risk while increasing compliance in production. These guardrails operate with low latency and computational cost, making them practical for production environments.

It helps guard against hallucinations, prompt injections, harmful or toxic language, and jailbreak attacks, providing critical safeguards for production deployments.
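A minimal sketch of what such a low-latency check can look like, assuming simple pattern rules. The regexes and blocklist below are illustrative placeholders, not NeMo Guardrails or Galileo Protect rules:

```python
import re

# Placeholder patterns for two of the risk classes named above (PII leakage
# and jailbreak attempts); real guardrails use far richer detectors.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
JAILBREAK_PHRASES = ("ignore previous instructions", "disregard your system prompt")

def guardrail_verdict(text):
    """Return (allowed, reasons) for a candidate agent input or output."""
    reasons = []
    if EMAIL.search(text) or SSN.search(text):
        reasons.append("pii")
    lowered = text.lower()
    if any(phrase in lowered for phrase in JAILBREAK_PHRASES):
        reasons.append("jailbreak")
    return (not reasons, reasons)
```

Because checks like these are cheap pattern matches, they can sit inline on every agent turn without adding meaningful latency.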

5. Model Deployment and Observability

The final stage of the flywheel is model deployment and observability. Galileo delivers comprehensive agentic observability powered by your model running on NVIDIA NIM microservices. This creates a continuous feedback loop, as the deployed model generates new interaction data that feeds back into Stage 1 (Data Curation), starting the cycle again.
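The feedback loop can be sketched as a single flywheel iteration, where `run_agent` and `curate_for_finetuning` are hypothetical stand-ins for the deployed NIM-backed agent and the Stage 1 curation step:

```python
# One turn of the flywheel: serve requests, capture every interaction as a
# trace (observability), and hand the traces back to curation for the next
# customization cycle.
def flywheel_iteration(requests, run_agent, curate_for_finetuning):
    traces = [run_agent(r) for r in requests]   # deployment + observability
    return curate_for_finetuning(traces)        # feeds Stage 1 of the next cycle
```

The key point is that deployment is not the end of the pipeline: every production interaction becomes candidate training data for the next iteration.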

Real-World Impact: The PR Coach Agent

To demonstrate the effectiveness of this approach, let's examine how Galileo and Outshift, Cisco’s emerging technologies team, applied the AI data flywheel to improve a Pull Request (PR) Coach Agent—an agent that automatically reviews code changes and helps developers apply best practices to improve software development efficiency.

The Challenge

The PR Coach Agent needs to:

  • Analyze code changes across multiple files and languages
  • Understand development best practices in context
  • Provide actionable, specific feedback
  • Integrate seamlessly with existing development workflows

Getting this right requires the agent to make accurate decisions about which tools to use at each process step—from code analysis to recommendation generation.

Here, the agent is measured on Tool Selection Quality, with the metric detecting any invalid tool selections made by the model.

Applying the Data Flywheel

Galileo and Cisco teams partnered to create highly accurate evaluations and guardrails to de-risk this multi-agentic system. They implemented the AI data flywheel using Galileo software integrated with NVIDIA AI for this use case:

  1. Data Collection: They gathered thousands of real PR reviews, capturing both the code changes and expert feedback.
  2. Evaluation Framework: They defined critical metrics for success, emphasizing Tool Selection Quality (TSQ), a measure of how accurately the agent selects the appropriate tool based on its plan and instructions. This was extended by leveraging metrics from NVIDIA NeMo Evaluator.
  3. Model Refinement: Using NeMo Customizer and benchmarking with NeMo Evaluator, a Llama 3B model deployed as NVIDIA NIM was iteratively fine-tuned and evaluated specifically for this use case, focusing on improving tool selection accuracy while reducing latency.
  4. Optimization: Initial evaluations revealed that while LLMs could achieve reasonable accuracy in tool selection, they were:
    • Receiving feedback too slowly for real-time use (5-7 seconds per decision)
    • Inconsistent with more nuanced tool selection tasks
  5. De-risking with Guardrailing: The low-latency use-case-specific model can now be used as a real-time guardrail within Galileo Protect, which also leverages NVIDIA NeMo Guardrails to detect a host of known risks like Personally Identifiable Information (PII) being leaked.
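Used as a real-time gate, the fast SLM judge can be wrapped with a latency budget. In the sketch below, `slm_judge` is a hypothetical callable returning a 0-to-1 quality score for a proposed tool call; this illustrates the gating pattern, not the actual Galileo Protect integration:

```python
import time

def gate_tool_call(tool_call, slm_judge, threshold=0.5, budget_s=0.5):
    """Allow a tool call only if the SLM judge scores it above threshold
    within the latency budget (the fine-tuned SLM responds in ~400ms)."""
    start = time.perf_counter()
    score = slm_judge(tool_call)
    elapsed = time.perf_counter() - start
    if elapsed > budget_s:
        return False, "latency budget exceeded"
    return (score >= threshold), f"score={score:.2f}"
```

A judge that answers in 5 to 7 seconds cannot sit on this path; the 400ms SLM makes per-decision gating feasible in production.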

Measured Results

The optimized approach delivered remarkable improvements:

  • 10x reduction in latency: Scoring of tool selection decisions with our fine-tuned SLM (Llama 3B) now takes just 400ms, compared to 4-6 seconds with LLMs (Llama 70B or GPT-4o). This was achieved only because of targeted model customization through the data flywheel.
  • Higher accuracy: Fine-tuning the Llama 3B model on task-specific data, with instructions drawn from an annotation exercise, raised the smaller model's accuracy above that of the baseline Llama 70B model and brought it very close to the accuracy of the much larger GPT-4o model.

This newly fine-tuned model can now run as an accurate guardrail on top of NVIDIA NIM microservices with Galileo Protect at 400ms latency, instead of the 5 to 7 seconds taken by an LLM-as-a-Judge on this agentic use case.

Conclusion: De-risking Agent Development

The Galileo platform, integrated with NVIDIA NeMo microservices, demonstrates how the AI data flywheel can transform agentic system development into a systematic engineering practice of iterative improvement. By implementing this approach, organizations can:

  1. Gain confidence in agent behavior through comprehensive, granular evaluation
  2. Reduce risks with practical, low-latency guardrails
  3. Improve efficiency by targeting specific aspects of agent performance
  4. Accelerate the deployment of agents in production environments

As enterprises increasingly look to deploy agent-based systems for business-critical tasks, this kind of systematic, measurement-driven approach will be essential for success, particularly when creating multi-agent systems.


To learn more about building trustworthy AI agents with Galileo and NVIDIA NeMo, visit galileo.ai or contact us at [email protected]. Additionally, download and get started with NVIDIA NeMo microservices from the NVIDIA NGC Catalog.