Agent Observability: Tracing Multi-Step Reasoning in Production

Q: Do I need this if my agent is simple, one LLM call plus one tool?

Even simple agents surprise you in production. But the ROI scales with complexity. Multi-step chains with multiple tools and session state need it. Single-step calls benefit but can live without it.

Q: How much does tracing slow agents down?

Current platforms (AgentOps, Langfuse, and similar) report 12-15% overhead. Acceptable given that agents already incur multi-second LLM latencies. The overhead is dwarfed by the cost savings from detecting inefficiencies and preventing reasoning loops.

Q: Can I use my existing APM tool?

Partially, if it supports OpenTelemetry. But it likely lacks agent-specific features like reasoning chain visualization, token-cost breakdowns, and quality evaluation overlays. Most teams use a dedicated agent observability platform alongside existing APM, connected by shared trace IDs.

Q: What should I instrument first?

LLM calls and tool calls. Highest diagnostic value per unit of effort. Capture inputs, outputs, latency, tokens, cost, and success/failure. Add reasoning-step capture once you have baselines. Session-level tracing comes last, once you're operating agents that maintain state across multiple turns.

Q: What's the difference between agent observability and LLM monitoring?

LLM monitoring focuses on the model layer: token counts, latency, error rates, cost per call. Agent observability operates one level up, capturing the full reasoning chain across multiple LLM calls, tool invocations, and decision points. An agent might make five LLM calls in a single run; LLM monitoring sees five independent calls, while agent observability sees one coherent decision tree with causal relationships between the steps.

Monitoring Limits

Why Traditional Monitoring Falls Short

Traditional application monitoring was built for deterministic software. Same input, same code path, same output. You monitor latency, error rates, throughput. When something breaks, the metrics point you to the cause.

AI agents break every one of those assumptions. An agent receiving the same prompt twice may follow entirely different reasoning paths, call different tools in different orders, and produce different outputs. The execution isn't a straight line through a call stack. It's a directed acyclic graph (a DAG) that varies in depth and width from one run to the next. Traditional monitoring sees a request that took 14 seconds and cost 12,000 tokens. We had a case where an agent spent 9 of those seconds stuck in a retrieval step that returned irrelevant documents, triggering three reformulation attempts before it got anything useful. Our dashboards showed the request completed normally. It took us two days of manual log spelunking to figure out why that particular query class was consistently slow. That's what made us invest in tracing.

Agent failures rarely look like traditional errors. A 500 status code is obvious. An agent that confidently returns a plausible but wrong answer because it retrieved stale data in step three of a seven-step chain looks like success to every traditional metric. Only a trace that captures the full decision path reveals the moment the reasoning went sideways. Traditional monitoring is like tracking an airline's on-time arrival rate. Agent observability is the full flight data recorder: altitude, airspeed, heading, every course correction, every decision the autopilot made. The arrival rate tells you whether flights land on time. The recorder tells you why they did or didn't.

Seventy-nine percent of organizations have adopted AI agents in some capacity, yet the majority cannot trace failures through multi-step workflows or measure quality systematically. Over half of organizations using AI have experienced negative consequences from AI inaccuracy. The gap between adoption and observability is not only technical debt. It becomes a credibility problem with the business.

Trace Anatomy

What an Agent Trace Actually Looks Like

A trace is the complete record of everything an agent did from input to output. The building block is the span: one LLM call, one tool invocation, one retrieval query, one reasoning step. Each span carries attributes (model used, prompt sent, tokens consumed, latency, response) and knows its parent. This parent-child relationship transforms a flat event list into a structured tree mirroring the agent's actual decision path.

At the top sits the root span representing the entire invocation. Beneath it, child spans branch across planning, tool calls, retrieval, and synthesis. You see branching paths, nested operations, timing bars, and where the run diverged.
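To make the parent-child structure concrete, here is a minimal sketch of a span record and a function that rebuilds the tree from a flat list. The `Span` fields, the span names, and the durations are illustrative assumptions, not the schema of any particular platform.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of agent work: an LLM call, tool call, retrieval, or reasoning step."""
    span_id: str
    name: str
    kind: str                    # e.g. "agent", "llm", "retrieval" (assumed labels)
    parent_id: Optional[str]     # None marks the root span
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def render_tree(spans: list) -> list:
    """Turn a flat span list into indented lines, parents before children."""
    children: dict = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} [{span.kind}] {span.duration_ms:.0f} ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical run: a slow retrieval step with a nested reformulation call.
spans = [
    Span("1", "answer_question", "agent", None, 14000),
    Span("2", "plan", "llm", "1", 1800),
    Span("3", "search_docs", "retrieval", "1", 9000),
    Span("4", "reformulate_query", "llm", "3", 1200),
    Span("5", "synthesize", "llm", "1", 2500),
]
tree_lines = render_tree(spans)
print("\n".join(tree_lines))
```

The only thing that turns the flat list into a tree is `parent_id`. Drop it and you are back to a chronological log.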

This is what separates agent observability from log aggregation. Logs give you chronology. Traces give you causality. You can see that a tool call happened because the previous reasoning step identified a gap. You can see the final answer was wrong because retrieval pulled from the wrong index. Parent-child relationships between spans are the connective tissue that turns data points into a diagnostic story.

The richest traces also capture reasoning itself: the chain-of-thought, intermediate plans, moments where the agent considered multiple paths and chose one. This is the layer that turns "the agent called a tool" into "the agent called this tool because it determined from the previous step that it needed numerical data it couldn't derive from context alone." If a traditional trace is a flight data recorder, this layer is the cockpit voice recorder: what the pilot was thinking when it happened.

Instrumentation

What to Capture at Each Layer

Not all telemetry is equally valuable. Capturing everything is expensive and noisy. Capturing too little leaves you blind when it matters. The practical question is what to record at each layer of the stack. Most post-incident investigations don't start at the wreckage. They start with the trace, jump to the moment things diverged from expectations, and read the data there.

The practical challenge is volume. A single complex run might generate dozens of spans with kilobytes of prompt text each. At thousands of runs per day, naive capture-everything strategies overwhelm storage. Production teams often use tiered capture: full detail for a sampled percentage, metadata-only for all runs, and on-demand full capture when investigating specific issues. Sampling thresholds depend on team and workload, but starting at 10% with on-demand escalation has worked in the cases we've seen.
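One way to implement tiered capture is to decide per trace, deterministically, whether to keep full payloads. The sketch below assumes a 10% rate and an in-memory set of escalated trace IDs; `wants_full_capture`, `build_record`, and `forced_full_capture` are hypothetical names, and a real system would likely escalate by query class or user rather than by single trace ID.

```python
import hashlib

FULL_CAPTURE_RATE = 0.10             # assumption: the 10% starting point
forced_full_capture: set = set()     # trace IDs escalated on demand (hypothetical)


def wants_full_capture(trace_id: str) -> bool:
    """Decide from the trace ID alone so every span in a run gets the same answer."""
    if trace_id in forced_full_capture:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < FULL_CAPTURE_RATE


def build_record(trace_id: str, span_name: str, prompt: str, completion: str,
                 tokens: int, latency_ms: float) -> dict:
    """Always keep metadata; keep prompt and completion text only for full-capture traces."""
    record = {
        "trace_id": trace_id,
        "span": span_name,
        "tokens": tokens,
        "latency_ms": latency_ms,
        "prompt_chars": len(prompt),
    }
    if wants_full_capture(trace_id):
        record["prompt"] = prompt
        record["completion"] = completion
    return record
```

Hashing the trace ID instead of calling a random number generator per span keeps a trace from ending up half-captured.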

Layer	Capture	Question it answers
LLM calls	Model name and version, full prompt or prompt reference, completion, token counts, latency, generation parameters, cost.	What did the model see, and what did it say?
Tool calls	Which tool, inputs, outputs, duration, success or failure.	What information did the agent gather, and was it correct?
Reasoning steps	Intermediate plans, relevance evaluation, retry decisions, and other internal pivots.	Why did the agent do what it did?
Sessions	Multi-turn aggregation tying traces into a longer interaction narrative.	How did the agent's behavior evolve over time?
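As a sketch of the tool-call row, the context manager below records which tool ran, its inputs and output, its duration, and whether it succeeded. It uses only the standard library; `recorded_spans` stands in for whatever exporter you use, and `lookup_price` is a made-up tool for illustration.

```python
import time
from contextlib import contextmanager

recorded_spans: list = []   # stand-in for an exporter (hypothetical)


@contextmanager
def tool_span(tool_name: str, inputs: dict):
    """Record which tool ran, with what inputs, how long it took, and whether it failed."""
    span = {"layer": "tool", "tool": tool_name, "inputs": inputs,
            "output": None, "success": False, "error": None}
    start = time.perf_counter()
    try:
        yield span                     # the caller fills in span["output"]
        span["success"] = True
    except Exception as exc:
        span["error"] = repr(exc)
        raise                          # tracing must not swallow the failure
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        recorded_spans.append(span)


def lookup_price(sku: str) -> float:
    """Hypothetical tool used only for this example."""
    prices = {"A-100": 19.99}
    return prices[sku]


with tool_span("lookup_price", {"sku": "A-100"}) as span:
    span["output"] = lookup_price("A-100")

try:
    with tool_span("lookup_price", {"sku": "missing"}) as span:
        span["output"] = lookup_price("missing")
except KeyError:
    pass  # the failed call is still recorded as a span
```

The same shape works for LLM calls if you add model, token counts, and cost to the span dictionary.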

Failure Modes

The Five Failures You Can't See Without Traces

The five failure modes below share one property: each is invisible at the input-output level. The symptoms live in the middle of the run. Two deserve a note up front. The OpenTelemetry community has flagged silent retries as a major reason retries must be traced as distinct spans rather than swallowed by the framework. Context overflow is deceptive because traditional monitoring can still show consistent latency and token counts while answer quality drops.

The Confident Wrong Answer

Every traditional metric says "success." The trace reveals retrieval pulled from the wrong index, or the model hallucinated facts not present in any retrieved context. This is the most dangerous failure because it looks exactly like the best outcome.

The Reasoning Loop

The agent calls a tool, gets ambiguous results, retries with slightly different parameters, gets ambiguous results again. Each individual step looks reasonable. The circular pattern is only visible when you zoom out across the span tree. We have seen loops run for hundreds of iterations before hitting a token limit, each iteration looking perfectly rational in isolation.
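A rough way to surface loops is to normalize each tool call in a trace and count repeats. The sketch below is deliberately simple: `call_signature` and `find_loops` are hypothetical helpers, the threshold of three is an assumption, and the normalization only catches case and whitespace variations. Retries that change parameters more substantially would need similarity-based matching.

```python
import json
from collections import Counter


def call_signature(tool: str, args: dict) -> str:
    """Normalize a tool call so trivially different retries compare as equal."""
    normalized = {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in args.items()
    }
    return f"{tool}:{json.dumps(normalized, sort_keys=True)}"


def find_loops(tool_calls: list, max_repeats: int = 3) -> dict:
    """Return signatures that occur more than max_repeats times within one trace."""
    counts = Counter(call_signature(tool, args) for tool, args in tool_calls)
    return {sig: n for sig, n in counts.items() if n > max_repeats}


# Hypothetical trace: four near-identical searches and one unrelated call.
calls = [
    ("search", {"q": "Q3 revenue"}),
    ("search", {"q": "q3 revenue "}),
    ("search", {"q": "Q3 Revenue"}),
    ("search", {"q": " q3 revenue"}),
    ("calculator", {"expr": "2+2"}),
]
loops = find_loops(calls)
```

Run per trace, this flags the pattern long before a token limit does.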

The Silent Retry Cascade

The framework automatically retries rate-limited LLM calls. Each retry is a cost event, but if retries are not surfaced as separate spans, the team sees only the final successful call and wonders why a 2-second operation took 30 seconds at five times the expected cost.
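The fix is to record every attempt, not just the one that succeeded. A minimal sketch, assuming a hypothetical `RateLimited` exception and a plain list as the span sink; the backoff delay defaults to zero here only so the example runs instantly.

```python
import time


class RateLimited(Exception):
    """Hypothetical error standing in for a provider's rate-limit response."""


def call_with_traced_retries(fn, attempt_spans: list, max_attempts: int = 5,
                             base_delay_s: float = 0.0):
    """Call fn, appending one span per attempt so retries stay visible in the trace."""
    for attempt in range(1, max_attempts + 1):
        span = {"name": "llm_call", "attempt": attempt, "status": "error"}
        start = time.perf_counter()
        try:
            result = fn()
            span["status"] = "ok"
            return result
        except RateLimited:
            span["status"] = "rate_limited"
            if attempt == max_attempts:
                raise
        finally:
            # Runs on success, retry, and failure alike: every attempt is recorded.
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            attempt_spans.append(span)
        time.sleep(base_delay_s * 2 ** (attempt - 1))   # back off before the next attempt


calls_made = {"n": 0}


def flaky_llm() -> str:
    """Fails twice with a rate limit, then succeeds."""
    calls_made["n"] += 1
    if calls_made["n"] < 3:
        raise RateLimited()
    return "done"


attempts: list = []
answer = call_with_traced_retries(flaky_llm, attempts)
```

With three spans in the trace instead of one, the 30-second "2-second operation" explains itself.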

Tool Misselection

The agent calls a general web search when it should query an internal knowledge base. The tool returns a result, just a useless one. Without traces showing the agent's selection reasoning, the team can only observe occasionally mediocre outputs without knowing where the mediocrity originates.

Context Overflow

As the agent accumulates context through multiple steps, relevant information gets diluted with noise from earlier steps. The agent's performance degrades gradually across a session: early answers are sharp, later answers are vague.

Unit Economics

Agent Economics Need Span-Level Cost Attribution

Agent economics are unlike traditional software. Traditional software has near-zero marginal cost per request once deployed. Agents incur real per-token costs on every run, and those costs compound through multi-step reasoning. A seven-step chain does not cost seven times a single call. Each step passes accumulated context, so every call re-sends what came before and total token counts grow faster than linearly with the number of steps.
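A back-of-the-envelope calculation shows the effect. Under the simplifying assumption that every step resends the full context and adds a fixed number of tokens to it, cumulative input tokens grow roughly with the square of the step count. The token figures below are illustrative, not measurements.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Total input tokens when each step resends the base prompt plus all earlier additions."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # this step's input is everything accumulated so far
        context += added_per_step   # its output and tool results join the context
    return total


single_call = cumulative_input_tokens(1, base_tokens=1000, added_per_step=1500)
seven_steps = cumulative_input_tokens(7, base_tokens=1000, added_per_step=1500)
print(single_call, seven_steps, seven_steps / single_call)
```

With these assumed numbers, the seven-step chain consumes 38,500 input tokens against 1,000 for the single call. Prompt caching and context trimming change the picture, which is one more reason to measure per span instead of estimating.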

We assumed our retrieval-augmented agent would cost roughly $0.12 per run based on the model pricing. The actual number was $0.41. When we finally got span-level cost breakdowns working, we found that 58% of the spend was in retrieval steps that were re-fetching documents the agent had already seen two steps earlier. We added a simple retrieval cache keyed on query similarity and cut per-run cost to $0.19 in a week. We never would have found that without span-level attribution.

More teams are asking about unit economics. Not "what is our total AI spend?" but "what is our cost per outcome?" Answering that requires traces. You need to decompose an agent run into its constituent spans, attach cost data to each span, and aggregate across thousands of runs. Maybe 60% of your spend is in retrieval steps that could be cached. Maybe your agent calls a premium model for simple classification tasks that a smaller model handles just as well. You cannot optimize what you cannot measure.
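The decomposition described above can be sketched in a few lines: price each span, group by step, and compute each step's share of the total. The model names and per-million-token prices are placeholders, and the span fields (`step`, `model`, `input_tokens`, `output_tokens`) are an assumed schema.

```python
from collections import defaultdict

# Hypothetical per-million-token prices; real prices vary by provider and model.
PRICE_PER_MTOK = {
    "large-model": {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.25},
}


def span_cost(span: dict) -> float:
    """Dollar cost of one span from its token counts and the model's price."""
    price = PRICE_PER_MTOK[span["model"]]
    return (span["input_tokens"] * price["input"]
            + span["output_tokens"] * price["output"]) / 1_000_000


def cost_share_by_step(spans: list) -> dict:
    """Fraction of total spend attributable to each step type."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["step"]] += span_cost(span)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {step: cost / grand_total for step, cost in totals.items()}


# Hypothetical spans from one run; aggregate across many runs the same way.
run_spans = [
    {"step": "plan", "model": "large-model", "input_tokens": 2000, "output_tokens": 200},
    {"step": "retrieval", "model": "large-model", "input_tokens": 10000, "output_tokens": 0},
    {"step": "retrieval", "model": "large-model", "input_tokens": 10000, "output_tokens": 0},
    {"step": "classify", "model": "small-model", "input_tokens": 4000, "output_tokens": 0},
]
shares = cost_share_by_step(run_spans)
```

Dividing the same totals by the number of successful outcomes instead of the number of runs gives cost per outcome.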

Production teams are implementing circuit breakers: automated kill switches that terminate runs exceeding cost thresholds or iteration limits. These are the agent equivalent of the flight envelope protections in modern aircraft: hard boundaries preventing the autopilot from entering an unrecoverable state regardless of what the reasoning chain "thinks" it should do. Hard limits on reasoning steps per task and mandatory human approval above a cost threshold (some teams set this around $50) are becoming standard. Without these guardrails, unconstrained autonomy and pay-per-token pricing are a volatile combination. Track cost per outcome alongside accuracy. Aggregate spend hides which steps are worth optimizing.
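A circuit breaker can be as small as a budget object the agent loop must charge before each step. This is a sketch under assumed limits; `RunBudget` and `BudgetExceeded` are hypothetical names, and a human-approval path above a threshold would hook in where the exception is raised.

```python
class BudgetExceeded(RuntimeError):
    """Raised to stop a run that crossed a hard limit."""


class RunBudget:
    def __init__(self, max_cost_usd: float, max_steps: int):
        self.max_cost_usd = max_cost_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, step_cost_usd: float) -> None:
        """Record one reasoning step and abort the run if either limit is crossed."""
        self.steps += 1
        self.spent_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit ${self.max_cost_usd:.2f} exceeded")


# Hypothetical runaway loop: each step costs $0.15 against a $1.00 budget.
budget = RunBudget(max_cost_usd=1.00, max_steps=20)
stop_reason = None
try:
    while True:
        budget.charge(0.15)
except BudgetExceeded as exc:
    stop_reason = str(exc)
```

The check lives outside the reasoning chain on purpose: the agent cannot talk its way past it.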

$0.12: Expected cost per run based on model pricing assumptions.
$0.41: Actual cost per run once the full reasoning chain was measured.
58%: Share of spend tied to repeated retrieval of already-seen documents.
$0.19: Per-run cost after a simple retrieval cache was introduced.

Open Standards

OpenTelemetry as the Emerging Standard

For the first few years, every observability tool spoke its own dialect. LangSmith captured traces one way, Arize another, Weights & Biases another. Switching tools meant re-instrumenting everything.

OpenTelemetry's GenAI semantic conventions are changing this. An LLM call is not just an HTTP request. It carries semantic meaning HTTP-level monitoring can't capture: which model was used, how many tokens were consumed, what prompt template ran, and whether the output passed evaluation. The conventions standardize how all of this is recorded, so any tool that speaks OTel can ingest and display it.

For agents specifically, the conventions define span types that map directly to the trace anatomy. An invoke_agent span represents the root of an agent run. Beneath it sit child spans for LLM calls, tool calls, and reasoning steps, each with standardized attributes. The span hierarchy mirrors the agent's decision tree, and because conventions are shared, a trace captured by one framework can be analyzed by any compatible backend. This is the equivalent of standardizing radar protocols: the manufacturer of the aircraft doesn't matter. The data format is the same, so any controller at any screen can track any aircraft.
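To show what shared conventions buy you, the sketch below models a small trace as plain dictionaries and sums token usage without knowing which framework produced it. The attribute keys (`gen_ai.operation.name`, `gen_ai.agent.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`) reflect the GenAI semantic conventions as we understand them; the conventions are still evolving, so verify the exact keys against the current spec. The agent, model, and tool names are made up, and no OpenTelemetry SDK is used here.

```python
from typing import Optional


def make_span(name: str, parent: Optional[str], attributes: dict) -> dict:
    """Plain-dict stand-in for an exported span."""
    return {"name": name, "parent": parent, "attributes": attributes}


ROOT = "invoke_agent research_assistant"   # hypothetical agent name

trace = [
    make_span(ROOT, None, {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": "research_assistant",
    }),
    make_span("chat example-model", ROOT, {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",   # placeholder model name
        "gen_ai.usage.input_tokens": 1200,
        "gen_ai.usage.output_tokens": 150,
    }),
    make_span("execute_tool search_literature", ROOT, {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search_literature",   # placeholder tool name
    }),
]


def total_tokens(spans: list) -> int:
    """Sum token usage across spans using only the shared attribute keys."""
    return sum(
        span["attributes"].get("gen_ai.usage.input_tokens", 0)
        + span["attributes"].get("gen_ai.usage.output_tokens", 0)
        for span in spans
    )


trace_tokens = total_tokens(trace)
```

`total_tokens` never inspects span names or framework-specific fields, which is the portability argument in miniature.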

The GenAI Special Interest Group within OpenTelemetry, active since early 2024, has been converging on conventions covering frameworks like CrewAI, AutoGen, LangGraph, IBM Bee Stack, and Pydantic AI.

Several major agent frameworks (CrewAI, AutoGen, LangGraph, Pydantic AI) now emit OTel-compatible traces natively. Teams investing in OTel-compatible instrumentation are betting on the format most likely to outlive any single vendor.

The conventions aren't fully mature. Early adopters sometimes patch gaps or contribute upstream fixes. But that cost is far lower than building on proprietary formats that get deprecated when vendors pivot.

Case Study

What It Looked Like in Practice

A health-tech company we worked with built an agent for clinical researchers to synthesize findings from published medical studies. About 800 queries per day. The agent searched a curated literature database, retrieved papers, extracted findings, and produced evidence summaries.

Researchers started reporting that summaries occasionally included findings not traceable to any paper in the database. Well-written, clinically plausible, difficult to catch without manual verification.

When they deployed span-level tracing, the root cause emerged within 48 hours. We assumed the hallucinations were coming from the synthesis step, that the model was just making things up. We were wrong. The traces showed that for complex multi-concept queries, initial retrieval returned borderline-relevant papers. The agent's reasoning chain then entered a "gap-filling" pattern: recognizing insufficient evidence but synthesizing plausible-sounding findings to fill gaps instead of returning partial answers. Retrieval spans showed marginally relevant results, followed by reasoning spans where chain-of-thought shifted from "summarize what was found" to "infer what is likely true."

Two targeted fixes followed: a relevance-scoring gate after retrieval that routed low-confidence results to clarification instead of synthesis, and a provenance check requiring every finding to be attributable to a retrieved source, with the source ID attached as span metadata.
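The two controls can be sketched as a pair of small functions. This is not the team's implementation: the threshold, the minimum document count, and the field names (`relevance`, `id`, `source_id`) are assumptions made for illustration.

```python
RELEVANCE_THRESHOLD = 0.6   # hypothetical cutoff; tune against labeled queries


def route_after_retrieval(docs: list, min_docs: int = 2) -> str:
    """Send weak retrievals to clarification instead of synthesis."""
    strong = [d for d in docs if d["relevance"] >= RELEVANCE_THRESHOLD]
    return "synthesize" if len(strong) >= min_docs else "clarify"


def unattributed_findings(findings: list, docs: list) -> list:
    """Return findings whose source_id is missing or not among the retrieved documents."""
    retrieved_ids = {d["id"] for d in docs}
    return [f for f in findings if f.get("source_id") not in retrieved_ids]


# Hypothetical retrieval results and draft findings.
docs = [
    {"id": "paper-17", "relevance": 0.82},
    {"id": "paper-42", "relevance": 0.71},
    {"id": "paper-90", "relevance": 0.35},
]
findings = [
    {"text": "Finding A", "source_id": "paper-17"},
    {"text": "Finding B", "source_id": "paper-99"},   # not in the retrieved set
    {"text": "Finding C"},                            # no source at all
]

next_step = route_after_retrieval(docs)
rejected = unattributed_findings(findings, docs)
```

Recording `next_step` and the rejected findings as span attributes keeps both decisions visible in later trace reviews.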

Hallucinated findings dropped 94% within two weeks. Total time from "we have a problem" to "fixed and verified": 23 days. Without tracing, the team estimated it would have spent months running experiments against the wrong hypotheses.

Researcher reports: Summaries started containing plausible findings that could not be traced to any paper in the database.
Trace discovery: Span-level tracing showed retrieval returning borderline-relevant papers, followed by a gap-filling reasoning pattern.
Targeted controls: The team added a relevance-scoring gate and a provenance requirement tying findings back to retrieved sources.
Measured outcome: Hallucinated findings dropped 94% within two weeks, and the fix was verified 23 days after the problem was first reported instead of the months the team estimated without tracing.

Operating Practice

Building the Practice

Tooling only matters if teams review and use the traces.

Trace review as routine discipline. Like code review, periodically examine traces even when nothing is broken. What does a "healthy" trace look like? How deep should the span tree be? What is the expected distribution of tool calls? Teams that review regularly develop intuition for drift before it becomes an incident.

Shared dashboards across roles. Engineering needs span trees and latency. Product needs success rates and quality scores. Finance needs cost breakdowns by agent type. Compliance needs audit trails showing that agent decisions can be explained and justified. The best platforms present the same underlying trace data through different lenses for each audience.

Connected pre-production and production monitoring. Use the same trace format, metrics, and dashboards across test and production. When a production trace shows a failure, replay it in test, fix the root cause, verify against the same data. This tight feedback loop is where observability stops being monitoring and starts being a continuous improvement engine.

The teams that get the most value here don't have the fanciest tooling. They treat traces as institutional knowledge: a record of how agents reason, where they fail, and what teams changed after seeing those failures.

Trace review as routine discipline: Examine healthy traces regularly so teams understand normal execution before incidents happen.
Shared dashboards across roles: Engineering, product, finance, and compliance teams need different views of the same trace data.
Connected pre-production and production monitoring: Use the same trace format and dashboards across environments so failures can be replayed and fixes verified fast.

Takeaways

Agent observability captures causality, not just events: the span-by-span record that makes non-deterministic behavior auditable.
The five invisible failure modes (confident wrong answers, reasoning loops, silent retries, tool misselection, context overflow) are undetectable without trace-level instrumentation.
You can't optimize agent costs without span-level attribution showing which steps consume the most tokens.
Invest in OTel-compatible instrumentation now; it's the format most likely to outlive any single vendor.
Trace review, shared dashboards, and connected test/production monitoring turn tooling into institutional knowledge.
