LLM Observability and Monitoring: From Development to Production
Hands-on guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.
Why LLM Observability Matters
The gap between a working demo and a production system is vast. From research: "The final stretch from 'demo quality' to 'production quality' consumes disproportionate effort. The organisations extracting real value are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty."
LLM observability provides the visibility needed to:
- Debug non-deterministic behavior
- Track costs before they spiral
- Detect quality degradation
- Optimize latency and throughput
- Build trust with stakeholders
This post covers the tools, standards, and practices for production LLM observability.
Key Metrics to Monitor
Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Latency (P50/P95/P99) | Response time distribution | P95 < 2s for chat |
| Time to First Token (TTFT) | Streaming responsiveness | < 500ms |
| Tokens per Second | Generation throughput | Model-dependent |
| Throughput | Requests per second | Based on capacity |
From research: "Performance monitoring tracks latency at every percentile, not just averages that hide outliers."
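The percentile targets above can be computed from raw latency samples without any dependencies. A minimal sketch (simple index-rounding approximation, good enough for monitoring dashboards):

```python
# Minimal sketch: latency percentiles from raw samples, since averages
# hide the outliers that P95/P99 expose. Index-rounding approximation.
def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    idx = round(p / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, idx))]

latencies_ms = [120, 140, 145, 150, 155, 160, 170, 180, 2400, 3100]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

Note how the P50 of this sample looks healthy while the tail percentiles surface the two slow outliers — exactly the pattern averages would hide.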
Cost Metrics
| Metric | Description | Why It Matters |
|---|---|---|
| Tokens per request | Input + output tokens | Normalizes usage |
| Cost per user/team | Attribution | Showback/chargeback |
| Cost per feature | Feature-level tracking | ROI analysis |
| Cache hit ratio | Saved spend | Optimization signal |
From research: "Key cost metrics include: tokens per request, cost per user/team/feature, cache hit ratio, requests routed to expensive models, and cost spikes/anomalies."
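The cache hit ratio row above translates directly into saved spend. A hypothetical helper (not from any SDK; `avg_cost_per_call` is an assumed per-request average):

```python
# Sketch: cache hit ratio and estimated spend saved, from counters you
# already track. `avg_cost_per_call` is an assumed per-request average.
def cache_stats(hits: int, misses: int, avg_cost_per_call: float) -> dict:
    total = hits + misses
    return {
        "hit_ratio": hits / total if total else 0.0,
        "saved_usd": hits * avg_cost_per_call,
    }
```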
Quality Metrics
| Metric | Description | Measurement |
|---|---|---|
| Task completion | Did user achieve goal? | User feedback, heuristics |
| Hallucination rate | Factual accuracy | LLM-as-judge, citations |
| Relevance score | Answer quality | Embedding similarity |
| User satisfaction | Explicit feedback | Thumbs up/down, ratings |
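The relevance-score row above is typically measured as embedding similarity. A minimal sketch, assuming you already have embedding vectors from any provider:

```python
import math

# Sketch: relevance as cosine similarity between the embedding of a
# query and of the generated answer (vectors from any embedding model).
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```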
Tracing Standards
OpenTelemetry for LLMs
OpenTelemetry is becoming the standard for LLM observability.
Why OpenTelemetry matters for LLM applications: Before OpenTelemetry, every observability vendor had proprietary instrumentation. If you used Datadog, you used Datadog's SDK. Switching vendors meant rewriting instrumentation code. OpenTelemetry provides a vendor-neutral standard: instrument once, export to any backend. For LLM applications—which might start with one observability tool and outgrow it—this portability is essential.
The LLM-specific challenge: Traditional APM tracks request/response cycles, database queries, and service calls. LLM applications add new concerns: token usage, prompt content, completion text, model parameters, and multi-step agent reasoning. OpenTelemetry's semantic conventions for GenAI provide standardized ways to capture these LLM-specific attributes, ensuring consistency across tools.
Why traces are essential for debugging LLM apps: A single user request might involve: query understanding (LLM call #1), retrieval (vector DB), context assembly, response generation (LLM call #2), and maybe tool use (additional LLM calls). When something goes wrong, you need to see the entire chain—what did each LLM call receive and return? Traces connect these dots, showing cause and effect across the request lifecycle.
From OpenTelemetry: "OpenTelemetry has defined semantic conventions for Generative AI operations across multiple signals: Events (for inputs and outputs), Metrics (for operations), Model spans, and Agent spans."
Key attributes:
```
# Standard GenAI semantic conventions (span attributes)
gen_ai.system = "openai"
gen_ai.request.model = "gpt-4o"
gen_ai.request.max_tokens = 1000
gen_ai.request.temperature = 0.7
gen_ai.usage.input_tokens = 150
gen_ai.usage.output_tokens = 500
gen_ai.response.finish_reason = "stop"
```
OpenInference
Arize's OpenInference extends OpenTelemetry for AI:
From research: "OpenInference is a set of conventions and plugins that is complementary to OpenTelemetry to enable tracing of AI applications. OpenInference defines standardized attributes for LLM interactions, including prompts, model parameters, token usage, responses, and key moments like time-to-first-token."
OpenLLMetry
From Traceloop: "OpenLLMetry is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application, and because it uses OpenTelemetry under the hood, it can be connected to existing observability solutions like Datadog and Honeycomb."
Integration example:
```python
import openai
from traceloop.sdk import Traceloop

# Initialize once at startup
Traceloop.init(app_name="my-llm-app")

# All LLM calls are automatically traced
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
# Traces are sent to the configured backend
```
Major Observability Platforms
The LLM observability landscape is evolving rapidly. The platforms below represent different philosophies: some optimize for specific frameworks, others for flexibility; some are cloud-only, others self-hostable. Your choice depends on your stack, team size, and data sensitivity requirements.
The build vs. buy decision: You could build observability with raw OpenTelemetry and your existing APM tool. But LLM-specific platforms provide: prompt playgrounds for iteration, LLM-as-judge evaluation pipelines, conversation thread visualization, and RAG-specific debugging tools. These specialized features often justify the platform cost.
LangSmith
Best for: LangChain/LangGraph users
From research: "LangChain users get the most from LangSmith, where the integration is automatic and the debugging tools understand LangChain's internals."
Key features:
- Automatic tracing for LangChain
- Prompt versioning and playground
- Dataset management for evaluation
- Hub for sharing prompts
Pricing: Free tier (5,000 traces/month), Plus ($39/user/month)
Limitations: "LangSmith doesn't offer a self-hosting option in the self-serve module" and "its operational capabilities are limited outside LangChain-centric workflows."
Setup:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# All LangChain operations are automatically traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
```
Langfuse
Best for: Framework-agnostic production monitoring, self-hosting
From research: "Langfuse is the open source leader in this space, with over 19,000 GitHub stars and an MIT license that lets you self-host without restrictions."
Key features:
- Multi-turn conversation tracing
- Prompt versioning with playground
- LLM-as-judge evaluation
- Cost tracking and analytics
- Self-hosting support
From research: "Langfuse has a larger open source adoption compared to Arize Phoenix and is considered battle-tested for production use cases."
Setup:
```python
import openai
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def my_llm_function(prompt: str):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Arize Phoenix
Best for: Experimentation, RAG evaluation, agent debugging
From research: "Arize Phoenix is an open-source LLM observability tool built by Arize AI. It is built entirely on OpenTelemetry standards and is designed to run in your local environment."
Key features:
- RAG-specific evaluation
- Agent trace visualization
- Embedding analysis
- Local development friendly
From research: "Compared with other open-source evaluation and tracing tools, Arize Phoenix provides deeper support for agent evaluation. It captures complete multi-step agent traces, allowing teams to assess how agents make decisions over time."
Setup:
```python
import phoenix as px

# Launch the local Phoenix server
session = px.launch_app()

# Instrument your application
from phoenix.otel import register

tracer_provider = register(project_name="my-project")
# View traces at http://localhost:6006
```
Platform Comparison
| Feature | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|
| License | Commercial | MIT (open source) | ELv2 |
| Self-hosting | Enterprise only | Yes | Yes |
| Framework support | LangChain-focused | All frameworks | All frameworks |
| Prompt management | Yes | Yes | Limited |
| Cost tracking | Yes | Yes | Limited |
| RAG evaluation | Basic | Yes | Excellent |
| Agent traces | Good | Good | Excellent |
| Production scale | Yes | Yes | Development focus |
Choosing a Platform
From research:
For LangChain users: "LangSmith is the most natural and powerful choice. Its deep, seamless integration provides unparalleled visibility into chains and agents."
For mid-size teams (10-50): "You can justify combining focused tools. Phoenix for evaluation plus Portkey for routing gives you depth without platform lock-in."
For enterprise (50+): "Extend existing infrastructure. If you run Datadog or New Relic, add their LLM modules. Otherwise, deploy Langfuse self-hosted for data control."
Other Notable Tools
Helicone
Gateway-based observability—intercepts API calls for zero-code instrumentation:
```python
import os
from openai import OpenAI

# Just change the base URL; Helicone proxies the call and logs it
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```
Datadog LLM Observability
From Datadog: Enterprise-grade LLM monitoring integrated with existing APM.
Weights & Biases
MLOps platform with LLM tracing capabilities, good for experiment tracking.
Portkey
AI gateway with routing, caching, and observability:
- Route between providers
- Automatic retries and fallbacks
- Cost optimization
- Unified observability
Implementation Best Practices
Start Early
From research: "Start monitoring from day one of development. Don't wait for production deployment. Instrument your LLM applications during prototyping so you understand baseline model behavior."
Define Thresholds
From research: "Define clear quality thresholds for your use case. Your customer service bot might require 95% accuracy and 90% user satisfaction. Your code generator needs 99% syntax correctness. Document these thresholds, align stakeholders on them, and configure alerts when reality diverges."
Layer Your Monitoring
From research: "Implement layered monitoring across the stack. Track application-level metrics (user satisfaction, task completion), model-level metrics (latency, token usage), and infrastructure metrics (API availability, rate limits)."
Prioritize Metrics
From research: "Start with cost and latency since they're easy to measure and immediately actionable. Add error rates next. Once you have baseline visibility, layer in quality metrics like hallucination detection and relevance scoring."
Cost Tracking Implementation
Adding Cost Attributes
From research: "Many teams extend OpenTelemetry by adding a custom span attribute for cost, calculated from token counts and the model's pricing schema."
```python
from opentelemetry import trace
from openai import OpenAI

client = OpenAI()

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},  # USD per 1M tokens
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

def traced_completion(model: str, messages: list):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("llm_completion") as span:
        response = client.chat.completions.create(model=model, messages=messages)
        usage = response.usage
        cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        span.set_attribute("gen_ai.cost_usd", cost)
        span.set_attribute("gen_ai.request.model", model)
        return response
```
Cost Attribution
From research: "Tag metadata such as user, team, environment, and feature for precise cost attribution."
```python
span.set_attribute("user_id", user_id)
span.set_attribute("team", team_name)
span.set_attribute("feature", "customer_support")
span.set_attribute("environment", "production")
```
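Once spans carry these tags plus a cost attribute, attribution is a simple roll-up. A sketch, assuming spans are exported as flat dicts (the shape here is illustrative, not any vendor's export format):

```python
from collections import defaultdict

# Sketch: aggregate per-span cost by an attribution key (team, feature,
# user_id, ...). Span dicts are an assumed, simplified export shape.
def cost_by(spans: list[dict], key: str) -> dict[str, float]:
    totals: defaultdict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(key, "unknown")] += span.get("gen_ai.cost_usd", 0.0)
    return dict(totals)

spans = [
    {"team": "support", "feature": "customer_support", "gen_ai.cost_usd": 0.012},
    {"team": "support", "feature": "customer_support", "gen_ai.cost_usd": 0.031},
    {"team": "growth", "feature": "email_drafts", "gen_ai.cost_usd": 0.008},
]
by_team = cost_by(spans, "team")
by_feature = cost_by(spans, "feature")
```

The same function answers "cost per user", "cost per feature", and "cost per environment" — which is exactly why consistent tagging matters.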
Production Monitoring Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                     Your LLM Application                     │
├──────────────────────────────────────────────────────────────┤
│ OpenTelemetry SDK + OpenLLMetry/OpenInference Instrumentation│
└─────────────────────────┬────────────────────────────────────┘
                          │ OTLP
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                     Collector / Gateway                      │
│         (OpenTelemetry Collector, Portkey, Helicone)         │
└─────────────────────────┬────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
     ┌──────────┐    ┌──────────┐    ┌──────────┐
     │ LangSmith│    │ Langfuse │    │ Datadog  │
     │ Phoenix  │    │ Grafana  │    │ etc.     │
     └──────────┘    └──────────┘    └──────────┘
```
Alerting and Dashboards
Key Alerts
```yaml
alerts:
  - name: high_latency
    condition: p95_latency > 5000ms
    severity: warning
  - name: cost_spike
    condition: hourly_cost > 2x_baseline
    severity: critical
  - name: error_rate
    condition: error_rate > 5%
    severity: critical
  - name: low_quality
    condition: user_satisfaction < 80%
    severity: warning
```
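The `cost_spike` rule above needs a baseline to compare against. A minimal sketch, assuming the baseline is the mean of the trailing hours in the window you pass in:

```python
# Sketch of the cost_spike rule: flag an hour whose spend exceeds
# `multiplier` times the trailing baseline (mean of the prior hours).
def is_cost_spike(hourly_costs: list[float], multiplier: float = 2.0) -> bool:
    *history, current = hourly_costs
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current > multiplier * baseline
```

A rolling mean is the simplest baseline; production systems often use a same-hour-last-week comparison instead to avoid flagging normal daily traffic peaks.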
Dashboard Panels
Essential panels:
- Request volume over time
- Latency distribution (P50, P95, P99)
- Token usage by model
- Cost by team/feature
- Error rate and types
- User feedback scores
Debugging with Traces
Trace Analysis Workflow
1. Identify problematic requests via metrics (high latency, errors)
2. Find the trace ID from logs or metrics
3. Inspect the trace waterfall to identify slow spans
4. Examine span attributes for input/output details
5. Compare with successful traces to isolate the issue
6. Fix and verify with an A/B comparison
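Step 1 of this workflow can be as simple as filtering exported spans. A sketch (the span shape here is assumed for illustration, not any vendor's export format):

```python
# Sketch of workflow step 1: collect trace IDs for failed or slow
# requests from exported spans. Span dicts are an assumed shape.
def problem_traces(spans: list[dict], latency_budget_ms: float) -> set[str]:
    return {
        s["trace_id"]
        for s in spans
        if s.get("status") == "error" or s.get("duration_ms", 0) > latency_budget_ms
    }

spans = [
    {"trace_id": "a1", "status": "ok", "duration_ms": 800},
    {"trace_id": "b2", "status": "error", "duration_ms": 300},
    {"trace_id": "c3", "status": "ok", "duration_ms": 7000},
]
slow_or_failed = problem_traces(spans, latency_budget_ms=5000)
```

Each ID in the result is a trace to open in your platform's waterfall view for steps 3-5.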
Common Issues and Diagnosis
| Symptom | Trace Pattern | Likely Cause |
|---|---|---|
| High latency | Long LLM span | Model capacity, prompt length |
| High latency | Long retrieval span | Vector DB performance |
| Errors | Missing spans | Timeout, rate limit |
| Poor quality | Short output | Max tokens too low |
| High cost | Many LLM spans | Unnecessary retries |
Conclusion
LLM observability is essential for production systems. Start with:
- Instrument early with OpenTelemetry-compatible tools
- Track costs from day one before they surprise you
- Define quality thresholds and alert on violations
- Choose tools based on your stack (LangSmith for LangChain, Langfuse for flexibility, Phoenix for experimentation)
The investment in observability pays dividends in debugging speed, cost control, and stakeholder confidence.
Related Articles
LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.
Building Production-Ready RAG Systems: Lessons from the Field
Production-focused guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
LLM Inference Optimization: From Quantization to Speculative Decoding
Practical guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.