Designing Distributed Observability Infrastructure for LLM-Powered Systems

The Observability Gap in Production AI

LLM-powered systems fail in ways traditional APM tools were never designed to capture. A user-facing agent might return coherent text while silently burning $0.40 per request through redundant tool calls. A model upgrade might pass all unit tests while regressing on structured output compliance by 12%. A latency spike might originate not in the inference endpoint but in a prompt-preprocessing pipeline feeding too many tokens into the context window.

The 2026 survey by Stacklok found that 47% of organizations running LLMs in production cite observability, specifically the inability to trace failures across agent steps and to attribute cost to specific workflow stages, as their top infrastructure gap [1]. The OpenTelemetry GenAI Semantic Conventions, ratified by the CNCF in late 2025, provide the vocabulary for solving this. But vocabulary without architecture is just schema. This post builds the production system around it.

The Four-Layer Observability Topology

A production observability stack for LLM systems decomposes into four layers, each with distinct data models, retention policies, and access patterns.

Layer 1: Structured Event Logs

Every LLM invocation, tool execution, and retrieval step emits a structured log record. The critical design decision is schema enforcement at ingestion rather than at query time. The OTel GenAI Semantic Conventions define gen_ai.request.model, gen_ai.response.id, gen_ai.usage.completion_tokens, and gen_ai.usage.prompt_tokens as required span attributes [2]. Enforcing these at log ingestion means every downstream consumer — cost analysis, latency dashboards, eval pipelines — operates on normalized data.

A production-grade event schema extends the OTel base with application-layer fields:

{
  "trace_id": "abc123",
  "span_id": "def456",
  "gen_ai.system": "openai",
  "gen_ai.request.model": "gpt-4o",
  "gen_ai.usage.prompt_tokens": 2847,
  "gen_ai.usage.completion_tokens": 512,
  "gen_ai.response.id": "chatcmpl-xyz",
  "app.agent_id": "code-review-v3",
  "app.session_id": "sess_8a2f",
  "app.user_id": "org_42",
  "app.cost_estimate_usd": 0.0213,
  "llm.latency_ms": 1842,
  "llm.ttft_ms": 340
}
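
Enforcing the schema at ingestion can be as small as a validation function in front of the log writer. The following is a minimal sketch, assuming records arrive as Python dicts; validate_event and the REQUIRED_FIELDS type map are hypothetical names chosen for illustration, and the field list simply mirrors the attributes named above.

```python
from typing import Any

# Fields this sketch treats as mandatory, with the Python type each must have.
# The type map is an assumption for illustration, not a published schema.
REQUIRED_FIELDS: dict[str, type] = {
    "trace_id": str,
    "span_id": str,
    "gen_ai.request.model": str,
    "gen_ai.response.id": str,
    "gen_ai.usage.prompt_tokens": int,
    "gen_ai.usage.completion_tokens": int,
}


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return schema violations; an empty list means the record is accepted."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    return errors
```

Records that return a non-empty list would be rejected or routed to a dead-letter stream, so that downstream consumers never see a partially populated event.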

Datadog announced native mapping of these OTel GenAI attributes to its own LLM Observability schema in December 2025, meaning the same instrumentation can feed both APM-style dashboards and LLM-specific views without schema translation [3].

Layer 2: Span-Based Distributed Tracing

Structured logs capture individual events. Distributed tracing captures the relationship between them. The fundamental unit is the trace — a directed acyclic graph of spans representing an end-to-end agent operation.

Every agent run should produce a root trace. Each LLM call, tool execution, retrieval step, and guardrail evaluation becomes a child span. The OTel GenAI spec defines gen_ai.span.kind — LLM, TOOL, RETRIEVAL, GUARDRAIL, AGENT — as the span kind discriminator [2]. This lets downstream systems aggregate by semantic category without parsing application payloads.
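
To make the aggregation idea concrete, here is a small standard-library sketch of a trace as a tree of spans, with durations summed per span kind. The Span dataclass and duration_by_kind function are hypothetical stand-ins for whatever the tracing backend returns; a real system would read these values from exported spans rather than build them by hand.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str  # e.g. "AGENT", "LLM", "TOOL", "RETRIEVAL", "GUARDRAIL"
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def duration_by_kind(root: Span) -> dict[str, float]:
    """Walk the span tree and sum span durations per kind."""
    totals: dict[str, float] = defaultdict(float)
    stack = [root]
    while stack:
        span = stack.pop()
        totals[span.kind] += span.duration_ms
        stack.extend(span.children)
    return dict(totals)


# One agent run: a root AGENT span with two LLM calls and one tool call.
run = Span("code-review", "AGENT", 3000.0, [
    Span("plan", "LLM", 1800.0),
    Span("fetch_diff", "TOOL", 400.0),
    Span("summarize", "LLM", 600.0),
])
print(duration_by_kind(run))
```

Because the grouping key is the span kind alone, the aggregation never needs to look inside prompts, completions, or tool payloads.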

The effort required to add instrumentation is low. The OpenInferenceSpanProcessor from Arize’s Phoenix project (now the most widely adopted open-source LLM tracing backend) attaches OTel-compliant spans with zero manual span creation for OpenAI, Anthropic, and LangChain integrations [4]. For the most common providers, a single instrumentor call, such as OpenAIInstrumentor().instrument(tracer_provider=...) from the OpenInference OpenAI package, enables full trace capture.

Layer 3: Aggregated Metrics Pipeline

Spans and logs are high-cardinality; metrics are the compressed view. The OTel GenAI Metrics specification defines counters and histograms for token usage, request duration, and error rates by model and provider [5]. The key architectural choice is whether to derive metrics from spans (pull-based) or emit them independently (push-based).

For systems exceeding 5,000 requests per minute, the push-based approach wins. Emitting metrics from every request into Prometheus-style histograms, while sampling trace data at 10%, reduces trace storage costs by 90% and still preserves P50/P95/P99 latency distributions, because the histograms are built from all requests rather than from the sampled traces [1]. The trace pipeline feeds the debug workflow; the metrics pipeline feeds the alerting workflow. Never mix them.
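
The reason fixed-bucket histograms survive aggressive trace sampling is that each request costs only a counter increment. Below is a minimal sketch of that idea, assuming cumulative "less than or equal" bucket semantics in the Prometheus style; LatencyHistogram, its bucket bounds, and quantile_upper_bound are illustrative names, not part of any metrics SDK.

```python
import bisect

# Bucket upper bounds in milliseconds (assumed values for illustration).
BUCKETS_MS = [100, 250, 500, 1000, 2000, 5000, 10000]


class LatencyHistogram:
    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One counter per bucket plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value_ms):
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds, value_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket containing quantile q (inf if overflow)."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        target = q * self.total
        running = 0
        for idx, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds[idx] if idx < len(self.bounds) else float("inf")
        return float("inf")
```

The quantile is only as precise as the bucket layout, which is the trade being made: a handful of integers per model and provider instead of one stored span per request.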

Layer 4: Eval-Driven Quality Aggregation

The most consequential difference between traditional observability and LLM observability is the need to evaluate output quality at scale. A healthy latency dashboard is meaningless if the agent is also producing worse results. The eval layer runs LLM-as-judge evaluations on production traces (retrieved from the OTel backend) and joins quality scores back to the original spans.

Braintrust demonstrated that per-tool-call cost attribution combined with trajectory-level quality scoring reveals the workflows driving both high spend and low quality simultaneously [6]. The architectural pattern is a secondary pipeline: traces land in object storage, a batch eval worker runs scoring, and the results are joined back into the observability store via a shared trace_id.
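
The join step of that secondary pipeline can be sketched in a few lines. This example assumes span records and eval results are both available as lists of dicts; join_scores, high_spend_low_quality, the quality_score field, and the eval.quality_score attribute are hypothetical names introduced here, while trace_id and app.cost_estimate_usd follow the event schema shown earlier.

```python
def join_scores(spans, scores):
    """Attach each trace's eval score to its span records via trace_id."""
    score_by_trace = {row["trace_id"]: row["quality_score"] for row in scores}
    return [
        # Spans whose trace has not been scored yet get None.
        {**span, "eval.quality_score": score_by_trace.get(span["trace_id"])}
        for span in spans
    ]


def high_spend_low_quality(rows, min_cost_usd, max_score):
    """Trace ids that are both expensive and poorly scored."""
    return [
        row["trace_id"]
        for row in rows
        if row["eval.quality_score"] is not None
        and row["app.cost_estimate_usd"] >= min_cost_usd
        and row["eval.quality_score"] <= max_score
    ]


spans = [
    {"trace_id": "t1", "app.cost_estimate_usd": 0.31},
    {"trace_id": "t2", "app.cost_estimate_usd": 0.02},
    {"trace_id": "t3", "app.cost_estimate_usd": 0.40},
]
scores = [
    {"trace_id": "t1", "quality_score": 0.35},
    {"trace_id": "t2", "quality_score": 0.30},
]
joined = join_scores(spans, scores)
print(high_spend_low_quality(joined, min_cost_usd=0.25, max_score=0.5))
```

Keeping unscored traces in the output with a None score, rather than dropping them, lets the batch worker lag behind ingestion without creating gaps in the store.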

Topology for Multi-Step Agent Traces

Multi-step agents are the hardest case. A single agent task may produce 20+ spans across LLM calls, tool invocations, and retrieval steps. Arize’s analysis shows that on a 50,000-spans/month plan, a coding agent making 20 tool calls per task exhausts the quota after roughly 2,500 agent runs [7]. The architectural response is span sampling with weighted retention: retain 100% of error traces, 10% of successful traces for cost analysis, and 1% of successful traces for latency distribution tracking.

Cost Attribution Architecture

Cost attribution is the highest-ROI observability feature for production LLM systems. Every span should carry an estimated cost computed from token counts and provider pricing at ingestion time, not at query time. This avoids the join explosion of matching traces against changing rate cards.

The most common production pattern is a cost-attribution service that subscribes to the structured log stream, computes cost per span using a cached pricing table (refreshed daily from provider APIs), and emits a derived metric stream to the observability store. Portkey’s architecture maintains cost attribution at the span level, computing app.cost_estimate_usd at ingestion as gen_ai.usage.prompt_tokens * prompt_rate + gen_ai.usage.completion_tokens * completion_rate, where the two rates are the provider's separate per-token prices for prompt and completion tokens [8].
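
The per-span computation itself is a lookup and two multiplications. The following is a minimal sketch, assuming prices are cached as USD per million tokens; estimate_cost_usd, the pricing table layout, and the "example-model" rates are placeholders invented for illustration and do not reflect any provider's actual pricing.

```python
# Cached pricing table: model -> (prompt rate, completion rate),
# both in USD per one million tokens. Values are placeholders.
PRICING_USD_PER_MTOK = {
    "example-model": (2.00, 8.00),
}


def estimate_cost_usd(event, pricing=PRICING_USD_PER_MTOK):
    """Return the estimated cost of one LLM span, or None if the model is unpriced."""
    model = event["gen_ai.request.model"]
    if model not in pricing:
        # Unknown model: surface the gap instead of silently recording zero cost.
        return None
    prompt_rate, completion_rate = pricing[model]
    cost = (
        event["gen_ai.usage.prompt_tokens"] * prompt_rate
        + event["gen_ai.usage.completion_tokens"] * completion_rate
    ) / 1_000_000
    return round(cost, 6)


event = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.prompt_tokens": 2847,
    "gen_ai.usage.completion_tokens": 512,
}
print(estimate_cost_usd(event))
```

Writing the result onto the span at ingestion freezes the price that applied when the request ran, which is what removes the need to join historical traces against later rate cards.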

Failover and Backpressure

Observability instrumentation must never degrade the primary inference path. Every span exporter should implement a circuit breaker pattern: if the observability backend is unreachable for more than 5 seconds, switch to local batching on disk with a maximum queue of 50 MB. Span loss is acceptable; added request latency is not. The queue should be bounded and evict its oldest entries first, so that the most recent spans are preserved when it fills up.
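
The interaction between the breaker and the buffer is easier to see in code. This is a simplified sketch under several assumptions: the buffer lives in memory and is bounded by span count rather than on disk and by bytes, the send callable is supplied by the caller and is expected to use a short timeout, and a separate periodic task is assumed to call flush() to probe for recovery. GuardedExporter and its parameters are hypothetical names.

```python
import time
from collections import deque


class GuardedExporter:
    def __init__(self, send, max_buffered=1000, open_after_s=5.0, clock=time.monotonic):
        self._send = send
        # deque(maxlen=...) discards the oldest item when a new one overflows it.
        self._buffer = deque(maxlen=max_buffered)
        self._open_after_s = open_after_s
        self._clock = clock
        self._failing_since = None

    @property
    def circuit_open(self):
        """True once the backend has been failing for at least open_after_s."""
        if self._failing_since is None:
            return False
        return self._clock() - self._failing_since >= self._open_after_s

    def pending(self):
        return list(self._buffer)

    def export(self, span):
        self._buffer.append(span)
        if self.circuit_open:
            return  # Backend presumed down: buffer only, no network attempt.
        self.flush()

    def flush(self):
        """Send buffered spans oldest first; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except Exception:
                if self._failing_since is None:
                    self._failing_since = self._clock()
                return
            self._buffer.popleft()
        self._failing_since = None  # Fully drained: close the circuit.
```

In a real deployment the export call would hand spans to a background thread so that even the short send timeout stays off the request path; the sketch keeps everything synchronous to make the state transitions visible.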

Production Deployment Patterns

The dominant deployment topology for LLM observability in 2026 is a sidecar-based architecture:

Application sidecar runs the OTel SDK with batch span processor
Local OTel collector aggregates spans from all sidecars on the node
Regional OTel gateway performs sampling, filtering, and tenant routing
Observability store (Phoenix, Datadog, or self-hosted ClickHouse) with 30-day hot retention, 90-day warm retention in object storage

MLflow’s 2026 observability platform demonstrates a fully integrated variant of this topology, with trace capture, eval scoring, and cost analytics in a single backend that speaks the OTel GenAI protocol natively [9].

Metrics That Matter

Production LLM observability should expose at minimum:

Metric	Source	Alert Threshold
P95 TTFT (time-to-first-token)	Span attribute `llm.ttft_ms`	> 2000ms
P95 end-to-end latency per agent run	Root span duration	> 30s
Token usage / request (prompt + completion)	`gen_ai.usage.*`	Budget-dependent
Cost per agent run	Derived from token counts	> $0.05
Structured output parse failure rate	Eval pipeline	> 2%
Tool call error rate	Span status code	> 5%
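
The fixed thresholds in the table translate directly into an alert rule set. Below is a minimal sketch, assuming the metrics pipeline delivers a dict of current values; the metric key names and the breached function are hypothetical, and token usage per request is left out because its threshold is budget-dependent.

```python
# Alert thresholds taken from the table above; key names are illustrative.
THRESHOLDS = {
    "p95_ttft_ms": 2000,
    "p95_agent_run_latency_s": 30,
    "cost_per_agent_run_usd": 0.05,
    "structured_output_parse_failure_rate": 0.02,
    "tool_call_error_rate": 0.05,
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        # Metrics missing from the snapshot are skipped rather than alerted on.
        if name in metrics and metrics[name] > limit
    )


snapshot = {
    "p95_ttft_ms": 2400,
    "p95_agent_run_latency_s": 12,
    "cost_per_agent_run_usd": 0.07,
    "tool_call_error_rate": 0.05,
}
print(breached(snapshot))
```

The comparison is strictly greater-than, matching the ">" in the table, so a value sitting exactly on the threshold does not fire.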

Confident AI’s 2026 survey of production LLM deployments found that teams tracking fewer than these six metrics were 3.2x more likely to miss regressions until they reached user-facing impact [10].

Summary

LLM observability is not traditional APM with an AI label. It requires a four-layer topology — structured events, distributed traces, aggregated metrics, and eval-driven quality scores — joined by the OTel GenAI Semantic Conventions. The infrastructure pattern converges on a sidecar-based OTel pipeline with span-level cost attribution, weighted sampling for multi-step agent traces, and circuit-breaker backpressure. The tools are mature enough that the engineering challenge is no longer instrumentation but architecture: designing the data flow, retention policy, and alerting topology for systems that process 10K+ inference requests per minute.

References

[1] Stacklok, “2026 State of AI Infrastructure Survey,” 2026. [Online]. Available: https://stacklok.com/research/2026-ai-infrastructure

[2] OpenTelemetry, “Semantic Conventions for Generative AI,” CNCF, 2025–2026. [Online]. Available: https://opentelemetry.io/docs/specs/semconv/gen-ai/

[3] Datadog, “Datadog LLM Observability Natively Supports OpenTelemetry GenAI Semantic Conventions,” Dec. 2025. [Online]. Available: https://www.datadoghq.com/blog/llm-otel-semantic-convention/

[4] Arize AI, “Phoenix: AI Observability & Evaluation,” GitHub. [Online]. Available: https://github.com/arize-ai/phoenix

[5] OpenTelemetry, “Semantic Conventions for Generative AI Metrics,” CNCF. [Online]. Available: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

[6] Braintrust, “Best Tools for Tracking LLM Costs in Production (2026),” 2026. [Online]. Available: https://www.braintrust.dev/articles/best-tools-tracking-llm-costs-2026

[7] Augment Code, “7 Best AI Agent Observability Tools for Coding Teams in 2026,” 2026. [Online]. Available: https://www.augmentcode.com/tools/best-ai-agent-observability-tools

[8] Portkey, “The Complete Guide to LLM Observability for 2026,” 2026. [Online]. Available: https://portkey.ai/blog/the-complete-guide-to-llm-observability/

[9] MLflow, “Top LLM Observability Tools in 2026: A Pro Guide,” 2026. [Online]. Available: https://mlflow.org/articles/top-llm-observability-tools-in-2026-a-pro-guide/

[10] Confident AI, “Top 7 LLM Observability Tools in 2026,” 2026. [Online]. Available: https://www.confident-ai.com/knowledge-base/compare/top-7-llm-observability-tools