NIST AI Risk Management Framework 1.0 in Practice with Govern, Map, Measure, and Manage

Jackson Wells
Integrated Marketing

Your autonomous agent just approved a transaction that violates sanctions policy. The audit committee wants answers: which control failed, who owns it, and how you guarantee it will not happen again.
You are scrambling through logs, pinging three teams, and manually reconstructing a decision path that spanned four tool calls and two agent handoffs. The NIST AI Risk Management Framework exists to prevent this fire drill.
NIST AI RMF 1.0 is structured around four core functions (Govern, Map, Measure, and Manage) that organize activities to identify and manage AI risks across the lifecycle. Each function addresses a distinct dimension of AI risk, and together they form a continuous loop that keeps pace with the autonomous agents you are deploying. This walkthrough covers how each function works in practice for agentic systems.
TLDR:
Govern, Map, Measure, and Manage form a continuous AI risk management loop.
Agentic AI breaks legacy governance, risk, and compliance approaches through non-determinism and multi-step action risk.
Production-grade measurement requires agent-specific metrics running on 100% of traffic.
Centralized, hot-reloadable policies prevent governance from bottlenecking engineering.
The eval-to-guardrail lifecycle connects offline testing to runtime enforcement.
Understanding the NIST AI Risk Management Framework 1.0
Published by the National Institute of Standards and Technology (NIST), AI RMF 1.0 is a voluntary risk management framework for building and operating trustworthy AI systems. The official NIST document structures it around four core functions: Govern, Map, Measure, and Manage.
Its design is deliberately non-prescriptive: you select the categories and subcategories that apply to your context rather than following a rigid checklist. The core framework remains AI RMF 1.0, supplemented by companion profiles published since 2024, including the Generative AI Profile (NIST AI 600-1) and a concept note for a Critical Infrastructure Profile released in April 2026.
Although the NIST AI RMF is voluntary, it already serves as a practical risk-management reference in regulated sectors: the U.S. Treasury adapted it into a Financial Services AI RMF for banking, and FDA advisory committee materials cite its Generative AI Profile.
Traditional model risk management frameworks like SR 11-7 assume a model produces outputs for human review. NIST AI RMF treats AI as a socio-technical system with risk distributed across stakeholders, use contexts, and the entire lifecycle; this framing becomes critical when you are governing autonomous agents that independently perceive, decide, and act. Here is how each function works in practice.
Governing AI Risk Across Teams and Systems
Govern establishes the culture, policies, accountability structures, and risk tolerance that make the other three functions possible. Without Govern, the other three functions operate in a vacuum: teams measure different things, respond to incidents inconsistently, and lack clear ownership when something breaks.
A common failure mode appears when your team hardcodes policies into individual autonomous agents. Every policy update requires an engineering redeploy, and your compliance team cannot respond to incidents without filing a ticket and waiting for the next release cycle. Two implementation moves address this.
Establishing Policies and Accountability for AI Systems
Someone has to sign off before an autonomous agent goes live. Specify who approves launches, who monitors ongoing behavior, and who has authority to deactivate a non-compliant autonomous agent. The Govern function calls for defined risk tolerance levels, transparent policies, and mechanisms for ongoing monitoring and review.
Risk-tier classification determines how much scrutiny each autonomous agent receives. An internal summarization agent warrants lighter governance than an autonomous agent executing financial transactions.
Documenting acceptable use boundaries, including which tools an autonomous agent can call, which data sources it can access, and which downstream systems it can write to, creates the audit trail you need.
All of this collapses when policies live only in application code. Your compliance team cannot inspect, update, or enforce rules without engineering involvement, and every change becomes a deployment risk.
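One way to keep these boundaries inspectable outside application code is to record them as data. The following is a minimal sketch, not a NIST-defined schema: the `AgentPolicy` record, the two tier names, and the example agent are all hypothetical and would need to match your own inventory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class RiskTier(Enum):
    LOW = "low"    # e.g. an internal summarization agent
    HIGH = "high"  # e.g. an agent executing financial transactions


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical acceptable-use record for one autonomous agent."""
    agent_id: str
    tier: RiskTier
    launch_approver: str         # who signs off before go-live
    behavior_monitor: str        # who watches ongoing behavior
    deactivation_authority: str  # who may switch the agent off
    allowed_tools: FrozenSet[str] = frozenset()
    readable_sources: FrozenSet[str] = frozenset()
    writable_systems: FrozenSet[str] = frozenset()

    def permits_tool(self, tool: str) -> bool:
        return tool in self.allowed_tools

    def permits_write(self, system: str) -> bool:
        return system in self.writable_systems


# Illustrative entry: a low-tier agent with read-only access.
summarizer = AgentPolicy(
    agent_id="internal-summarizer",
    tier=RiskTier.LOW,
    launch_approver="team-lead",
    behavior_monitor="platform-oncall",
    deactivation_authority="platform-oncall",
    allowed_tools=frozenset({"search_docs"}),
    readable_sources=frozenset({"wiki"}),
)
```

Because the record is plain data, a compliance reviewer can read or diff it without opening the agent's codebase, and the same record can double as the audit trail entry for that agent.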
Centralizing Policy Enforcement with Runtime Protection
A centralized policy plane lets your compliance team update controls across fleets of autonomous agents without engineering redeploys. Rather than embedding safety rules in each codebase, you define policies on a centralized server. The server evaluates autonomous agent behavior against assigned policies at runtime.
When a policy fires, the system responds with one of four actions: Allow; Deny, which blocks execution and raises a ControlViolationError; Steer, which surfaces corrective guidance without halting the autonomous agent; or Warn. Policies are hot-reloadable: change a rule on the server and every connected autonomous agent picks it up within seconds, with no redeploy, no code edits, and no downtime.
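The runtime behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's API: `PolicyServer`, `Policy`, and the first-match-wins ordering are hypothetical, and only the four action names and `ControlViolationError` come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    STEER = "steer"
    DENY = "deny"


class ControlViolationError(Exception):
    """Raised when a Deny policy blocks execution."""


@dataclass
class Policy:
    name: str
    matches: Callable[[Dict], bool]  # predicate over a proposed agent step
    action: Action
    guidance: str = ""               # corrective text for Steer or Warn


class PolicyServer:
    """In-memory stand-in for a central policy server (hypothetical)."""

    def __init__(self) -> None:
        self._policies: List[Policy] = []

    def publish(self, policies: List[Policy]) -> None:
        # Swapping the list models a hot reload: the next check sees the new rules.
        self._policies = list(policies)

    def check(self, step: Dict) -> Tuple[Action, str]:
        # Assumption: the first matching policy decides the outcome.
        for policy in self._policies:
            if policy.matches(step):
                if policy.action is Action.DENY:
                    raise ControlViolationError(policy.name)
                return policy.action, policy.guidance
        return Action.ALLOW, ""
```

An agent would call `check` before each tool invocation. Publishing a new policy list changes the outcome of the very next call, which is the property that lets compliance respond without a redeploy.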
Mapping AI Systems and Risk Surfaces
Map is where you catalog AI systems in scope, identify risk surfaces, and document the context that informs how risk is measured and managed. For a traditional ML model, the risk surface is well-defined: inputs go in, predictions come out.
For autonomous agents, the risk surface is the full decision path, including every tool call, every data source read, every downstream system written to, and every handoff between autonomous agents. A small misinterpretation early in a long workflow can compound into a materially wrong outcome before anyone notices.
Cataloging Production Agents and Their Decision Paths
Inventory hygiene starts with basic questions you may not be able to answer confidently. Which autonomous agents are running in production? Which tools does each one have access to? Which data sources does it read from? Which downstream systems can it write to?
The MAP function expects traceability and lineage documentation sufficient to make go or no-go deployment decisions. For autonomous agents, that means logging not just inputs and outputs but the full execution trace: every span, meaning an individual large language model (LLM) call, tool invocation, or retrieval step; every trace, meaning a sequence of spans from a single logical operation; and every session, meaning a complete multi-turn interaction.
This three-tier hierarchy of sessions, traces, and spans provides the lineage documentation MAP requires. It is also the foundation of agent observability. Without it, you are reconstructing decision paths from fragments after an incident, not maintaining a living catalog.
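As a rough picture of the hierarchy, the three tiers can be modeled as nested records. The class and field names below are assumptions chosen for illustration and do not reflect any particular tracing library's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    kind: str    # "llm", "tool", or "retrieval"
    name: str    # model, tool, or index that was called
    input: str
    output: str


@dataclass
class Trace:
    trace_id: str
    spans: List[Span] = field(default_factory=list)


@dataclass
class Session:
    session_id: str
    traces: List[Trace] = field(default_factory=list)

    def tools_used(self) -> List[str]:
        """Lineage query: every tool invoked in this session, in order."""
        return [
            span.name
            for trace in self.traces
            for span in trace.spans
            if span.kind == "tool"
        ]
```

With records shaped like this, "which tools did this agent touch in that conversation?" becomes a lookup rather than a log-reconstruction exercise.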
Visualizing Multi-Agent Workflows for Risk Identification
A catalog in a spreadsheet does not help you identify where controls should sit. You need to see the decision paths autonomous agents actually take.
Agent Graph provides three complementary views that make mapping tractable at scale: Graph View renders every branch, decision, and tool call as an interactive visualization; Timeline View steps through execution paths and surfaces latency bottlenecks; and Conversation View shows exactly what the user sees, enabling debugging from the user's perspective.
For multi-agent orchestration, the Aggregate Graph View surfaces the most common paths autonomous agents take across sessions, helping you identify broader patterns that individual trace inspection may miss. When you can see tool-selection paths and autonomous agent handoffs visually, you can pinpoint exactly where a control should sit, at the tool call boundary, at the handoff, or at the output stage, rather than guessing.
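The idea behind an aggregate view can be approximated by counting step-to-step transitions across sessions. The sketch below is only an illustration of that idea, not how Agent Graph is implemented; the function name and the step names in the example are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def aggregate_edges(sessions: Dict[str, List[str]]) -> Counter:
    """Count how often each (step, next_step) transition occurs across sessions."""
    edges: Counter = Counter()
    for steps in sessions.values():
        # zip pairs each step with its successor; single-step sessions add nothing
        edges.update(zip(steps, steps[1:]))
    return edges


# Illustrative data: ordered agent and tool steps per session.
example_sessions = {
    "s1": ["planner", "search", "responder"],
    "s2": ["planner", "search", "payments_agent", "responder"],
    "s3": ["planner", "search", "responder"],
}
edge_counts = aggregate_edges(example_sessions)
```

Rare transitions, such as the single hop into `payments_agent` here, are often the ones worth a control at the handoff, because they are easy to miss when inspecting traces one at a time.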

Measuring AI Risk with Production-Grade Evaluation
Measure operationalizes the abstract risks identified in Map by attaching quantitative signals to them. The MEASURE function calls for documented assessments of trustworthy characteristics using test, evaluation, verification, and validation methods.
For autonomous agents, traditional ML metrics miss the failure modes that matter most. An autonomous agent can have low latency and high throughput while consistently selecting the wrong tool, failing to complete the user's goal, or reasoning incoherently across a multi-step plan.
Defining Metrics That Match Regulatory Expectations
A production metrics stack for autonomous agents needs three layers. Safety and compliance metrics, including personally identifiable information (PII) detection, prompt injection detection, toxicity, and sexism or bias, address baseline expectations. Agentic metrics capture failure modes unique to autonomous systems: Action Completion, Tool Selection Quality, Reasoning Coherence, and Action Advancement.
Response quality metrics like Context Adherence and Instruction Adherence round out the picture. Custom metrics extend this taxonomy to domain-specific risks.
A financial services workflow might require regulatory citation accuracy, a healthcare workflow might track clinical safety language adherence, a SaaS support workflow might measure policy-consistent resolution quality, and a developer tooling workflow might check whether code actions follow approved instructions. The MEASURE function's emphasis on domain expert validation maps directly to this customization requirement.
Running Continuous Evaluation at Production Scale
Point-in-time benchmarking cannot characterize systems whose behavior varies stochastically across executions. You need continuous evals running on 100% of production traffic.
Galileo's Luna-2 models, fine-tuned Llama variants in 3B and 8B configurations, achieve an F1 score of 0.95 versus GPT-4o's 0.94 according to Galileo's benchmark methodology, at a fraction of the cost. A multi-headed architecture lets a single model run 10 to 20 metrics simultaneously while staying under 200 ms of latency.
Raw metric accuracy is not enough if the metrics do not reflect your domain's standards. An Autotune feedback workflow closes this gap by letting domain experts provide feedback on metric outputs, correcting results and explaining their reasoning in natural language.
Those corrections become few-shot examples appended to the metric's prompt, improving accuracy by 20-30% from as few as two annotated examples. If you spot a false negative on a PII detection metric, you can refine it through that workflow rather than manually editing prompt templates.
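The mechanism of turning expert corrections into few-shot examples can be sketched as simple prompt assembly. This is a generic illustration and not the Autotune implementation; `Correction`, `build_metric_prompt`, and the prompt layout are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Correction:
    text: str          # the content the metric judged
    expert_label: str  # what the domain expert says the answer should be
    reasoning: str     # the expert's natural-language explanation


def build_metric_prompt(base_prompt: str, corrections: List[Correction]) -> str:
    """Append expert corrections to a metric prompt as few-shot examples."""
    if not corrections:
        return base_prompt
    examples = []
    for i, c in enumerate(corrections, start=1):
        examples.append(
            f"Example {i}\n"
            f"Input: {c.text}\n"
            f"Correct label: {c.expert_label}\n"
            f"Why: {c.reasoning}"
        )
    return base_prompt + "\n\nExpert-corrected examples:\n\n" + "\n\n".join(examples)
```

The base prompt stays untouched, so corrections can be added, reviewed, or rolled back independently of the metric definition itself.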
Managing Surfaced Risks in Live AI Systems
Manage is where measurement translates into action: prioritizing risks, responding to incidents, and preventing recurrence.
The MANAGE function covers risk treatment plans, response and recovery procedures, and communication protocols. For autonomous agents, the distinction between reactive and proactive management determines whether you catch a problem after 10,000 affected interactions or after 10.
Detecting Failure Patterns Before They Cascade
Your autonomous agents fail in ways you did not predict, which means you cannot write evals for failures you have not imagined. Production trace analysis can surface unknown unknowns such as security leaks, policy drift, and cascading failures that no manual search or predefined eval would catch. Issues can then be prioritized by severity so you know where to focus first.
Each detected pattern should carry linked evidence pointing to the exact trace or span where the issue occurred. Pattern discrimination should distinguish new failures from known bugs, building institutional memory across runs.
One useful workflow is converting an identified failure pattern into an automated evaluator that catches that pattern in future traces. A failure that was invisible yesterday becomes a permanently monitored metric today, which shortens debugging cycles, improves rollout confidence, and reduces the odds that the same issue returns in your next release.
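Converting a failure pattern into an evaluator can be as small as a named predicate over a trace. The sketch below assumes a trace is reduced to an ordered list of tool names; the registry, the decorator, and the refund example are hypothetical.

```python
from typing import Callable, Dict, List

# An evaluator returns True when the trace exhibits the failure pattern.
Evaluator = Callable[[List[str]], bool]
REGISTRY: Dict[str, Evaluator] = {}


def register(name: str) -> Callable[[Evaluator], Evaluator]:
    """Add an evaluator to the registry under a stable pattern name."""
    def decorator(fn: Evaluator) -> Evaluator:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("refund_without_identity_check")
def refund_without_identity_check(tool_calls: List[str]) -> bool:
    # Pattern found in a past incident: a refund issued before identity was verified.
    for i, tool in enumerate(tool_calls):
        if tool == "issue_refund" and "verify_identity" not in tool_calls[:i]:
            return True
    return False


def scan(tool_calls: List[str]) -> List[str]:
    """Return the names of all registered failure patterns present in a trace."""
    return [name for name, evaluator in REGISTRY.items() if evaluator(tool_calls)]
```

Each new incident adds one registered function, so the set of monitored patterns grows with what production has actually taught you.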
Enforcing Runtime Controls on Risky Outputs
Detection without enforcement is observation, not governance. Real-time interception blocks unsafe outputs in under 200 ms, before they reach users.
The enforcement model operates through three structural layers: Rules, which are individual metric checks; Rulesets, which are collections of rules evaluated in parallel with AND logic; and Stages, which are collections of rulesets triggered by OR logic that can run at different points in your workflow.
Two action types satisfy different audit requirements. Passthrough delegates response handling to your application logic, giving your engineering team control. Override allows the ruleset to define fixed responses managed centrally, separate from application code. This distinction lets you standardize policy responses where you need consistency while preserving local flexibility where your product teams need it.
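The AND/OR structure and the two action types can be sketched as follows. This is a simplified model under stated assumptions, not the product's API: the class names mirror the text, but the fields, thresholds, and sequential evaluation (the text describes parallel evaluation) are illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Rule:
    metric: str
    trips: Callable[[float], bool]  # True when the metric value breaches the rule


@dataclass
class Ruleset:
    rules: List[Rule]
    # A fixed response means Override; None means Passthrough to the application.
    override_response: Optional[str] = None

    def triggered(self, scores: Dict[str, float]) -> bool:
        # AND logic: every rule in the set must trip. Assumes all metrics are scored.
        return bool(self.rules) and all(r.trips(scores[r.metric]) for r in self.rules)


@dataclass
class Stage:
    rulesets: List[Ruleset]

    def evaluate(self, scores: Dict[str, float], draft: str) -> Tuple[bool, str]:
        # OR logic: any single triggered ruleset fires the stage.
        for ruleset in self.rulesets:
            if ruleset.triggered(scores):
                if ruleset.override_response is not None:
                    return True, ruleset.override_response
                return True, draft  # passthrough: the caller decides what to do
        return False, draft


output_stage = Stage([
    Ruleset(
        [Rule("pii", lambda v: v > 0.8), Rule("toxicity", lambda v: v > 0.5)],
        override_response="This response was withheld by policy.",
    ),
    Ruleset([Rule("prompt_injection", lambda v: v > 0.9)]),
])
```

Returning the triggered flag alongside the response lets a passthrough ruleset signal the violation while leaving the final wording to application code.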
Operationalizing NIST AI RMF Across the Agent Development Lifecycle
The four functions form a continuous loop, not a linear sequence. Govern sets the rules. Map identifies where they apply. Measure quantifies adherence. Manage acts on the gaps. Outputs from Manage feed back into Govern to update policies, and the cycle repeats.
The eval-to-guardrail lifecycle is the connective tissue that wires this loop to production systems. Evaluation criteria defined during development become the logical basis for guardrail rules in production, operating over the same trace and session identity. Offline evals do not sit in a report; they become production controls without glue code.
The Agent Development Lifecycle provides the broader operating model. Progressive staged rollouts, automatic kill switches tied to error-rate thresholds, and continuous feedback loops between production monitoring and development iteration replace the static validate-then-deploy approach that traditional model risk management assumes.
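A kill switch tied to an error-rate threshold can be sketched with a sliding window of recent outcomes. The class name, window size, threshold, and minimum sample count below are all assumptions to tune for your own rollout stages.

```python
from collections import deque


class KillSwitch:
    """Trips when the error rate over a sliding window exceeds a threshold."""

    def __init__(self, window: int = 100, max_error_rate: float = 0.05,
                 min_samples: int = 20) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples        # avoid tripping on a tiny sample
        self.tripped = False

    def record(self, ok: bool) -> bool:
        """Record one outcome and return True if the agent should be disabled."""
        self.outcomes.append(ok)
        if len(self.outcomes) >= self.min_samples:
            error_rate = self.outcomes.count(False) / len(self.outcomes)
            if error_rate > self.max_error_rate:
                # Latch: once tripped, stay tripped until a human resets it.
                self.tripped = True
        return self.tripped
```

Latching is a deliberate choice in this sketch: an agent that recovers on its own after a burst of errors still warrants review before it resumes traffic.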
Turning NIST AI RMF Into Operational Control
NIST AI RMF 1.0 is most effective when Govern, Map, Measure, and Manage connect to operational monitoring, logging, and runtime controls throughout deployment and use.
Documented policies that cannot enforce themselves, metrics that run only in staging, and incident response plans that require manual log reconstruction all fail at the speed autonomous agents operate.
The framework works because it is iterative. You need shared trace identity, shared metrics, and shared enforcement so the four functions reinforce one another instead of becoming separate checklists.
Galileo delivers the agent observability and guardrails infrastructure that operationalizes each function of NIST AI RMF:
Runtime Protection: Sub-200ms interception of unsafe outputs using rules, rulesets, and stages that satisfy Manage and Govern requirements with full audit trails.
Luna-2 evaluation models: Purpose-built small language models running continuous Measure-function evals on 100% of production traffic at a fraction of LLM cost.
Hierarchical tracing: Sessions, traces, and spans provide the lineage documentation the MAP function requires for autonomous agents that make multi-step decisions.
Signals: Automatic detection of failure patterns, security leaks, and policy drift across production traces, surfacing risks the MANAGE function must respond to before they cascade.
Agent Reliability Platform: Unified observability, evaluation, and intervention so the four NIST functions reinforce each other across the agent development lifecycle.
Book a demo to see how Galileo can operationalize NIST AI RMF across your agent fleet, turning Govern, Map, Measure, and Manage into a continuous loop instead of four disconnected checklists.
Frequently Asked Questions
What Is the NIST AI Risk Management Framework 1.0
The NIST AI Risk Management Framework 1.0 is a voluntary framework from NIST for identifying, measuring, and managing AI risk across the lifecycle. It is organized around four core functions: Govern, Map, Measure, and Manage.
Though voluntary at the federal level, many teams use it as a practical governance and risk-management framework, and in some procurement or policy contexts it can function as an operational baseline.
Is the NIST AI RMF Mandatory for AI Teams
NIST AI RMF 1.0 carries no direct enforcement mechanism at the federal level. In practice, the U.S. Treasury adapted it into a Financial Services AI RMF for banking, and FDA advisory committee materials cite its Generative AI Profile. You can also use it as a reference point alongside other governance frameworks and regulatory requirements.
How Does the NIST AI RMF Apply to Autonomous Agents
Traditional model risk management assumes bounded inputs, deterministic outputs, and human decision-makers reviewing results.
Autonomous agents violate all three assumptions: they make multi-step decisions, invoke tools with real-world consequences, and behave non-deterministically across executions. The four NIST AI RMF functions must extend to cover action risk, multi-agent coordination dynamics, and continuous behavioral monitoring rather than periodic validation.
How Do You Measure AI Risk Under the NIST AI RMF
Measurement requires a metrics taxonomy covering three areas: safety (PII, prompt injection, and toxicity), agentic performance (Action Completion, Tool Selection Quality, and Reasoning Coherence), and response quality (Context Adherence and Instruction Adherence).
Continuous evals at production scale replace point-in-time benchmarking because behavior that varies stochastically across runs cannot be characterized by a single validation pass. Domain expert feedback on metric outputs further calibrates accuracy to your specific risk definitions.
How Do You Turn Evals into Runtime Controls
You start by defining eval criteria during development, then apply those same criteria as runtime enforcement rules in production, creating continuity between testing and live controls so the issues you measure offline can trigger action when they appear in real traffic. For autonomous agents, this is what closes the loop between Measure and Manage.
