From Logs to Decisions and the New Telemetry Model for Autonomous Agents

Jackson Wells

Integrated Marketing

Your autonomous agent passed every log check. Latency stayed within SLO. Error rates held at zero. And it still chose the wrong tool repeatedly overnight. Your observability stack captured every request and response, but without agent telemetry, it never recorded why the system made the decisions it made. 

The three pillars of classical observability (logs, metrics, and traces) were engineered for deterministic services where the same input reliably produces the same output. Autonomous agents often violate that assumption because their inference and execution can be non-deterministic.

For you and your team, this gap creates a category of failure that is structurally invisible: autonomous agents that appear healthy by every infrastructure measure while quietly making bad decisions at the cognition layer. The cost surfaces in postmortems that cannot answer the first question your VP asks: "Why did it do that?"

TLDR:

  • Classical observability assumes deterministic behavior that autonomous agents violate

  • Decision and tool-call data are now first-class observability signals

  • Session-level traces surface failures invisible in single turns

  • Observability must feed real-time evals and runtime intervention

  • A cognition-aware stack extends the three pillars

What Is Agent Telemetry

Agent telemetry is the structured capture of an autonomous agent's decisions, tool selections, reasoning steps, and session-level behavior. Traditional application telemetry records requests, responses, latency, and resource utilization. 

This new telemetry layer records what your autonomous agent chose to do, why it chose that path, which alternatives it considered, and whether its reasoning remained coherent across multiple turns.

Traditional observability answers: "Did the service respond within SLO?" Agent observability answers: "Did your autonomous agent make the right decision, and can you prove it?"

The gap between those two questions is where production accountability breaks down. Your infrastructure dashboards show green. Your team reports corrupted data. The ACM paper defines this precisely: "existing tools observe either an agent's high-level intent (via LLM prompts) or its low-level actions (e.g., system calls), but cannot correlate these two views." For any team that owns autonomous agent reliability, that missing bridge is agent telemetry.

Why Classical Observability Falls Short for Autonomous Agents

Logs, metrics, and traces remain essential for the infrastructure surrounding your autonomous agents. Your load balancers, API gateways, and databases still need traditional monitoring. But the decision layer of agentic systems generates failure modes these pillars were never designed to surface. 

Three specific gaps explain why you can have mature observability practices and still get blindsided by production failures: logs discard reasoning, metrics smooth over bad decisions, and traces cannot model non-deterministic branching.

Logs Lose The Reasoning Behind Each Agent Action

Autonomous agent debugging stalls when your team can trace what happened but not why. An autonomous customer support agent tells users their order does not exist when it clearly does. Your logs show "[INFO] LLM call completed in 2.3s" followed by "[ERROR] Database query failed".

Technically complete by APM standards. Operationally useless in a postmortem. Most teams respond by increasing log verbosity, adding more structured fields, and extending retention windows. None of this captures the missing layer: which tool your autonomous agent selected, whether the prompt was well formed, or whether the model hallucinated a different order number.

The structural fix is to record intermediate reasoning, tool selection rationale, and confidence signals alongside inputs and outputs. Without reasoning-level capture, postmortems stall because logs cannot answer the why question your executives ask first.
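As a rough illustration of that fix, the sketch below emits one structured log line that keeps the rationale and a confidence signal next to the tool input and output. The field names, the lookup_order tool, and the sample values are all hypothetical, chosen to mirror the missing-order example above rather than any particular logging schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReasoningLogRecord:
    """One log entry that keeps the 'why' next to the 'what'.

    Every field name here is a hypothetical choice for illustration.
    """
    step: str                 # what the agent did, e.g. "tool_call"
    selected_tool: str        # which tool the agent picked
    rationale: str            # the agent's stated reason for that pick
    confidence: float         # self-reported or evaluator-assigned, 0.0 to 1.0
    tool_input: dict = field(default_factory=dict)
    tool_output: str = ""


def to_log_line(record: ReasoningLogRecord) -> str:
    """Serialize the record as one JSON line so a log pipeline can ingest it."""
    return json.dumps(asdict(record), sort_keys=True)


record = ReasoningLogRecord(
    step="tool_call",
    selected_tool="lookup_order",
    rationale="User supplied an order number and asked for its status.",
    confidence=0.62,
    tool_input={"order_id": "A-1043"},
    tool_output="not_found",
)
line = to_log_line(record)
print(line)
```

With a record like this, a postmortem can compare the order number the user typed against the one in tool_input, which is the check the two-line log excerpt could not support.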

Metrics Aggregate Away The Decisions That Matter

How do you know when your production agent decisions go wrong if every dashboard indicator stays green? Your production agents make thousands of tool selections daily. 

Latency percentiles, error rates, and throughput counters aggregate these individual events into summary statistics. For deterministic services, this aggregation is acceptably lossy because the abstracted events are not decision-relevant. For autonomous agents, the abstracted events are the decisions.

A non-deterministic autonomous agent can hit every SLO while quietly choosing the wrong tool on a subset of requests. The affected input class gets diluted across all invocations in the aggregation window. An autonomous agent returning a plausible but wrong answer produces a 200 OK response, zero exceptions, and normal latency. 

No signal reaches any error counter. When metric-driven dashboards show reliability at 99.5%, they can give you false confidence because the 0.5% they miss may represent an entire class of decisions where the autonomous agent consistently selects the wrong reasoning path. Standard SLO frameworks assume a stable error rate that can be projected forward, but an autonomous agent's error rate on a given input class is not stable across invocations.
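To make the dilution concrete, here is a small sketch with synthetic numbers. It assumes each event already carries a decision-level label (was_wrong) from an evaluator or human review, because HTTP status alone would mark every one of these requests as a success. The input classes "routine" and "refund" are invented for the example.

```python
from collections import defaultdict


def error_rates_by_class(events):
    """Return (overall_rate, {input_class: rate}) from (input_class, was_wrong) pairs."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for input_class, was_wrong in events:
        totals[input_class] += 1
        if was_wrong:
            wrong[input_class] += 1
    overall = sum(wrong.values()) / sum(totals.values())
    per_class = {c: wrong[c] / totals[c] for c in totals}
    return overall, per_class


# Synthetic traffic: 995 routine requests decided correctly,
# 5 refund requests where the agent picked the wrong tool every time.
events = [("routine", False)] * 995 + [("refund", True)] * 5

overall, per_class = error_rates_by_class(events)
print(f"overall wrong-decision rate: {overall:.1%}")
for input_class, rate in per_class.items():
    print(f"{input_class}: {rate:.0%} wrong")
```

The aggregate works out to 0.5% wrong decisions, which a dashboard would report as 99.5% reliability, while the refund class is wrong on every single request. Slicing by input class only helps if decisions are recorded and scored individually in the first place.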

Distributed Traces Cannot Capture Non-Deterministic Branching

Span trees were built for deterministic microservice calls where the call tree topology is fixed at deployment time. Traces in agentic systems branch dynamically based on model output. 

The same prompt with identical inputs can produce meaningfully different tool call sequences, reasoning paths, and outputs across invocations. A latency spike might indicate a bug, or it might indicate the model chose a more computationally expensive reasoning path that produced a better result.

The current semantic conventions for generative AI tracing are still evolving, with session-level tracking not yet fully standardized. Even with complete span coverage, standard span trees primarily capture operational details such as timing, call sequences, and exceptions. 

A span tree with full operational coverage still does not contain the data needed to determine whether your autonomous agent's decisions were correct. The policy and reasoning layer above the spans remains invisible.

The New Telemetry Primitives For Agentic Systems

Production-grade autonomous agent observability adds four primitives on top of classical observability. These primitives address specific layers of the cognition-execution stack that traditional APM leaves invisible. 

They extend your existing infrastructure monitoring rather than replace it. New generative AI telemetry fields build on standard span, metric, and log primitives while capturing what those primitives were never designed to record.

Decision Spans Capture Intent And Reasoning

A decision span is a structured record of what your autonomous agent chose to do and why. Where a conventional span captures call latency and payload size, a decision span captures the intent behind an action, the reasoning chain that produced it, and the depth of reasoning expended. In emerging generative AI tracing conventions, the primary operation type for this is invoke_agent, with span names such as invoke_agent {gen_ai.agent.name}.

A request span shows that your agent called search_inventory with a 200ms response. A decision span shows that your agent chose search_inventory over query_catalog because the user asked about availability rather than browsing, and that the choice was reached after a single reasoning step rather than a multi-step deliberation. The first tells you what happened on the wire. The second tells you why your agent took that path.

When your production agent fails, the root cause can arise at multiple layers, including cognition, tool use, memory, coordination, or retrieval. Conventional spans record only the downstream consequence of the agent's choice. Decision spans record the choice itself, unlocking root-cause analysis where most agent failures actually originate.
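The contrast between the two span types can be sketched with plain data classes. This is an illustration, not a schema from any tracing library: the "decision.*" attribute keys are hypothetical, and "gen_ai.operation.name" is assumed to be the key that carries the invoke_agent operation type in the emerging conventions. The agent name and rationale text are made up to match the search_inventory example above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSpan:
    """What a conventional span records: the call on the wire."""
    name: str
    duration_ms: float
    status_code: int


@dataclass
class DecisionSpan:
    """What a decision span adds: the choice and the reasoning behind it."""
    agent_name: str
    intent: str
    chosen_tool: str
    alternatives: List[str]
    rationale: str
    reasoning_steps: int

    @property
    def name(self) -> str:
        # Follows the "invoke_agent {gen_ai.agent.name}" naming pattern.
        return f"invoke_agent {self.agent_name}"

    def attributes(self) -> Dict[str, object]:
        return {
            "gen_ai.operation.name": "invoke_agent",  # assumed convention key
            "gen_ai.agent.name": self.agent_name,
            # Hypothetical keys, not part of any published convention:
            "decision.intent": self.intent,
            "decision.chosen_tool": self.chosen_tool,
            "decision.alternatives": list(self.alternatives),
            "decision.rationale": self.rationale,
            "decision.reasoning_steps": self.reasoning_steps,
        }


request = RequestSpan(name="search_inventory", duration_ms=200.0, status_code=200)
decision = DecisionSpan(
    agent_name="inventory_assistant",
    intent="check availability",
    chosen_tool="search_inventory",
    alternatives=["query_catalog"],
    rationale="User asked about availability rather than browsing.",
    reasoning_steps=1,
)
print(request)
print(decision.name, decision.attributes())
```

In a real pipeline these attributes would most likely be attached to spans through your tracing SDK. The data classes here only show which fields a request span lacks.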

Tool-Call Traces Expose Selection Logic And Parameter Accuracy

Tool-call telemetry must record three things: whether the right tool was chosen, whether the arguments were correct, and whether the tool returned a usable result. In emerging GenAI tracing conventions, the execute_tool span captures the tool name and type, and it can also carry the full arguments and results as defined attributes. Conventions such as OpenTelemetry's treat those full payloads as opt-in, a designation that reflects a privacy and cost tradeoff, but the data behind it is the most common debugging surface for teams running production agents.

Systematic parameter analysis can reveal subtle but damaging tool-use errors that ordinary infrastructure telemetry misses. Metrics like Tool Selection Quality and Tool Error operate as the eval layer that transforms raw tool-call traces into actionable signals. They score whether your autonomous agent selected the correct tool with correct parameters and whether tools executed without errors.
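The three checks can be sketched as a small function over a recorded tool call. This is a deliberately simple, rule-based stand-in and not how Tool Selection Quality or Tool Error are computed. The expected tool and the required argument types are assumed to come from a labeled reference or an evaluator, and every name in the example is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolCallRecord:
    """The parts of a tool call worth keeping for later analysis."""
    tool_name: str
    arguments: Dict[str, Any]
    result: Optional[Any]
    error: Optional[str] = None


def check_tool_call(
    record: ToolCallRecord,
    expected_tool: str,
    required_args: Dict[str, type],
) -> Dict[str, bool]:
    """Answer the three questions: right tool, correct arguments, usable result."""
    right_tool = record.tool_name == expected_tool
    # Arguments pass only if every required name is present with the expected type.
    args_ok = all(
        name in record.arguments and isinstance(record.arguments[name], expected)
        for name, expected in required_args.items()
    )
    # A result is usable if the tool did not error and returned something non-empty.
    usable_result = record.error is None and record.result not in (None, "", [], {})
    return {"right_tool": right_tool, "args_ok": args_ok, "usable_result": usable_result}


record = ToolCallRecord(
    tool_name="search_inventory",
    arguments={"sku": "SKU-19", "quantity": "2"},  # quantity arrived as a string
    result=[{"sku": "SKU-19", "in_stock": True}],
)
checks = check_tool_call(
    record,
    expected_tool="search_inventory",
    required_args={"sku": str, "quantity": int},
)
print(checks)
```

The call above picked the right tool and got a usable result, yet the quantity argument has the wrong type. That is the kind of parameter error that never shows up as an exception or a failed request.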

Session-Level Signals Surface Emergent Multi-Turn Failures

Two autonomous agents. Same input. Completely different outputs. Single-turn telemetry cannot explain the divergence because the failure accumulated across turns, not within one. Context drift, the gradual degradation or distortion of the conversational state a model uses to generate responses, is invisible when you evaluate each turn in isolation. Multi-turn performance can drop sharply even when single-turn capability looks strong.

Session-level telemetry stitches individual traces into complete interaction arcs, revealing four failure modes that single-turn telemetry misses: faulty tool calls that compound across turns, infinite loops where autonomous agents re-plan without convergence, false task completion claims, and instruction drift away from the original user intent. 

This is the layer that matters most because your customer-visible failures often compound across turns. A single bad tool selection in turn three becomes corrupted data by turn seven.
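A minimal sketch of session stitching, assuming each recorded span carries a session_id, a timestamp, a tool name, and its arguments (all hypothetical field names). It groups spans into sessions and flags any identical tool call repeated several times, which is one cheap signal for the non-converging loops described above. The threshold of three is arbitrary.

```python
import json
from collections import Counter, defaultdict


def stitch_sessions(spans):
    """Group spans by session_id and order each group by timestamp."""
    sessions = defaultdict(list)
    for span in spans:
        sessions[span["session_id"]].append(span)
    for turns in sessions.values():
        turns.sort(key=lambda s: s["ts"])
    return dict(sessions)


def repeated_calls(turns, threshold=3):
    """Return (tool, serialized_args) pairs seen at least `threshold` times."""
    counts = Counter(
        (t["tool"], json.dumps(t["args"], sort_keys=True)) for t in turns
    )
    return [signature for signature, n in counts.items() if n >= threshold]


# Spans arrive interleaved and out of order, as they would from a real collector.
spans = [
    {"session_id": "s1", "ts": 3, "tool": "search_inventory", "args": {"sku": "SKU-19"}},
    {"session_id": "s1", "ts": 1, "tool": "search_inventory", "args": {"sku": "SKU-19"}},
    {"session_id": "s2", "ts": 1, "tool": "query_catalog", "args": {"q": "lamps"}},
    {"session_id": "s1", "ts": 2, "tool": "search_inventory", "args": {"sku": "SKU-19"}},
    {"session_id": "s1", "ts": 4, "tool": "create_order", "args": {"sku": "SKU-19"}},
]

sessions = stitch_sessions(spans)
loops = {sid: repeated_calls(turns) for sid, turns in sessions.items()}
print(loops)
```

Each search_inventory call in session s1 looks healthy on its own. Only the stitched view shows the same call issued three times in a row before the agent moved on.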

Reasoning Coherence Telemetry Ties Steps To Goals

The highest-value primitive measures whether each step in your autonomous agent's trajectory advances your actual intent. A step can succeed at the execution layer while failing at the coherence layer. Your autonomous agent calls the right API, gets the right response, and then uses the result in a way that contradicts its own prior reasoning.

Goal-aligned telemetry evaluates the cognition layer directly: are reasoning steps logically consistent, non-contradictory, and aligned with the intended plan? Metrics like Action Advancement, Action Completion, and Reasoning Coherence operate on this layer. 

Building Your Agent Telemetry Stack

The implementation challenge is an architecture decision, not just a tooling selection. The new telemetry model is the foundation for governance, incident response, and the speed at which your team ships reliable autonomous agents. Getting this right determines whether your team treats production failures as solvable engineering problems or as unpredictable crises.

Moving From Sampling To Full-Traffic Evaluation

Sampling a fraction of traces, often around 10%, is common practice in classical observability. For autonomous agents, it breaks. Classical sampling works because failures in deterministic systems are reproducible. A bad code path fails consistently, so observing 10% of traffic will eventually surface it.

Production autonomous agent failures are non-deterministic: the same input can produce different execution paths, and a failure absent from the sampled 10% cannot be assumed absent from the unsampled 90%. The statistical inference that makes sampling viable simply does not hold.
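A quick calculation shows how much a sampler can miss. The sketch assumes uniform, independent sampling of each trace at a fixed rate, which is a simplification of real samplers. Under that assumption, the chance that none of k occurrences of a failure is captured is (1 - rate) raised to the power k.

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Probability that uniform random sampling captures none of the occurrences.

    Assumes each trace is kept independently with probability `sample_rate`.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return (1.0 - sample_rate) ** occurrences


for k in (1, 5, 20):
    p = miss_probability(0.10, k)
    print(f"failure occurs {k} time(s): {p:.1%} chance the sample contains none of them")
```

At a 10% sample rate, a failure that happens five times has roughly a 59% chance of never appearing in the sampled data. A deterministic bug keeps recurring until the sampler catches it. A non-deterministic bad decision may not recur on the same input at all.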

Sampling, metric rollups, and retention limits eliminate the very signals agentic systems generate. Full-traffic evals are the architectural requirement, but they are only economically feasible when evaluator cost drops far enough to justify them. Purpose-built eval models like Luna-2 can change the math: at 96% lower cost and sub-200ms latency compared to frontier LLM-based evaluators, full-traffic evals become economically feasible in production.

Closing The Loop From Telemetry To Runtime Intervention

Telemetry is the foundation, not the endpoint. Decision spans, tool-call traces, and session signals become input to runtime guardrails that can block, override, or route risky actions before users see them. The architecture follows a closed loop: observe autonomous agent behavior through telemetry primitives, evaluate each action against quality and safety policies, and intervene when that eval detects a violation.

Pre-execution interception is architecturally mandatory for irreversible actions: database writes, outbound API calls, financial transactions, and any tool call that modifies external state. Each of these is an external side effect that cannot be recalled once it executes, so the eval has to run before the call rather than after it. The observe-evaluate-intervene loop is the durable architecture for autonomous agent reliability because it transforms passive visibility into active control.

Your observability stack captures what your autonomous agent decided. The eval layer scores that decision against quality and safety policies. When a violation fires, intervention blocks or overrides the action before users see it.
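The loop can be sketched as a wrapper that sits between the agent's proposed action and the tool that would carry it out. Everything here is a hypothetical stand-in: the issue_refund tool, the refund limit of 100, and the policy function represent whatever evaluators and tools your stack actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ProposedAction:
    tool: str
    arguments: Dict[str, Any]


@dataclass
class Verdict:
    allowed: bool
    reason: str


def refund_limit_policy(action: ProposedAction) -> Verdict:
    """Hypothetical policy: refunds above 100 need approval."""
    if action.tool == "issue_refund" and action.arguments.get("amount", 0) > 100:
        return Verdict(False, "refund exceeds 100 without approval")
    return Verdict(True, "ok")


def guarded_execute(
    action: ProposedAction,
    tools: Dict[str, Callable[..., Any]],
    policies: List[Callable[[ProposedAction], Verdict]],
    audit_log: List[dict],
) -> Optional[Any]:
    """Observe the proposed action, evaluate it, and execute only if every policy allows it."""
    audit_log.append({"event": "proposed", "tool": action.tool})        # observe
    for policy in policies:                                             # evaluate
        verdict = policy(action)
        if not verdict.allowed:                                         # intervene
            audit_log.append(
                {"event": "blocked", "tool": action.tool, "reason": verdict.reason}
            )
            return None
    result = tools[action.tool](**action.arguments)  # side effect happens only here
    audit_log.append({"event": "executed", "tool": action.tool})
    return result


executed = []  # stands in for the external system that receives refunds


def issue_refund(order_id: str, amount: float) -> str:
    executed.append((order_id, amount))
    return f"refunded {amount} on {order_id}"


tools = {"issue_refund": issue_refund}
audit_log: List[dict] = []
small = guarded_execute(
    ProposedAction("issue_refund", {"order_id": "A-1043", "amount": 40}),
    tools, [refund_limit_policy], audit_log,
)
large = guarded_execute(
    ProposedAction("issue_refund", {"order_id": "A-1044", "amount": 900}),
    tools, [refund_limit_policy], audit_log,
)
print(small, large, [entry["event"] for entry in audit_log])
```

The design choice that matters is the ordering: the policy check runs before the tool is invoked, so the blocked refund never reaches the external system, and the audit log still records that it was proposed.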

Building Accountability Into Autonomous Agent Decisions

Classical observability was built to tell you whether infrastructure is healthy. Autonomous agents require one layer more: visibility into whether decisions were correct, how those decisions were made, and how failures compound across multi-turn workflows. 

Decision spans, tool-call traces, session-level signals, and reasoning coherence telemetry close that gap by making the cognition layer observable. When you connect that visibility to evals and runtime control, autonomous agent reliability becomes a manageable engineering discipline instead of a postmortem guessing game.

For teams that want observability, evals, and intervention connected in one workflow, Galileo is the agent observability and guardrails platform that helps engineers ship reliable AI agents with visibility, evaluation, and control:

  • Agent Graph visualization: Interactive exploration of multi-step decision paths, tool interactions, and autonomous agent reasoning chains

  • Agentic eval metrics: Purpose-built metrics scoring decision quality, tool selection accuracy, and reasoning coherence across production traces

  • Luna-2 evaluation models: Fine-tuned Small Language Models (SLMs) delivering 98% lower cost than LLM-based evaluation at sub-200ms latency for full-traffic evals

  • Session-level tracing: Multi-turn, multi-agent trace stitching that reveals compounding failures invisible in single-turn telemetry

  • Runtime Protection guardrails: Real-time intervention that blocks unsafe autonomous agent actions before they reach users

  • Automated Signals detection: AI-powered failure pattern analysis that surfaces unknown unknowns without manual search

Book a demo to see how agent observability can turn postmortem guesswork into structured root-cause analysis.

FAQs

What is agent telemetry, and how does it differ from traditional application telemetry? 

Agent telemetry is the structured capture of an autonomous agent's decisions, reasoning steps, tool selections, and session-level behavior. Traditional application telemetry records requests, responses, latency, and error rates for deterministic services. Agent telemetry adds the cognition layer: why the autonomous agent chose a specific action, which alternatives it considered, and whether its reasoning remained coherent across turns.

What are decision spans in agent telemetry? 

Decision spans are structured records that capture what an autonomous agent chose to do and why, including the reasoning chain, alternatives considered, and confidence level. 

They differ from conventional distributed tracing spans, which only record call latency and payload metadata. Decision spans operate at the cognition layer, recording the intent behind each action so your team can perform root-cause analysis where most production autonomous agent failures originate.

How do you instrument an autonomous agent for production-grade telemetry? 

Start by adopting emerging GenAI semantic conventions, which define span types for agent invocation (invoke_agent) and tool execution (execute_tool). Layer purpose-built eval metrics on top of these spans to score decision quality, tool selection accuracy, and reasoning coherence. Stitch individual traces into sessions for multi-turn analysis, and connect your telemetry pipeline to runtime guardrails for closed-loop intervention.

Do you still need logs and metrics if you have agent telemetry? 

Yes. Agent telemetry extends classical observability rather than replacing it. Your infrastructure still requires traditional logs, metrics, and traces for load balancers, databases, API gateways, and compute resources. 

The decision and reasoning layer that this new telemetry captures sits on top of that foundation. The two layers complement each other: infrastructure telemetry confirms the system is healthy, while the cognition layer confirms your autonomous agent's decisions are correct. 

How does Galileo capture agent telemetry beyond traditional observability tools? 

Galileo captures decision-level telemetry through agent-specific observability views and can evaluate production traffic at 100% sampling rates using proprietary agentic metrics.

Signals automatically surfaces failure patterns without manual search, while Runtime Protection closes the loop by blocking unsafe actions before they reach users, powered by Luna-2 evaluation models.
