LLM Observability and Monitoring: From Development to Production
Hands-on guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.
Why LLM Observability Matters
The gap between a working demo and a production system is vast. From research: "The final stretch from 'demo quality' to 'production quality' consumes disproportionate effort. The organisations extracting real value are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty."
LLM observability provides the visibility needed to:
- Debug non-deterministic behavior
- Track costs before they spiral
- Detect quality degradation
- Optimize latency and throughput
- Build trust with stakeholders
This post covers the tools, standards, and practices for production LLM observability.
Key Metrics to Monitor
Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Latency (P50/P95/P99) | Response time distribution | P95 < 2s for chat |
| Time to First Token (TTFT) | Streaming responsiveness | < 500ms |
| Tokens per Second | Generation throughput | Model-dependent |
| Throughput | Requests per second | Based on capacity |
From research: "Performance monitoring tracks latency at every percentile, not just averages that hide outliers."
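The percentile targets above can be computed from raw latency samples without any dependencies. A minimal sketch (simple index-rounding approximation, good enough for monitoring dashboards):

```python
# Minimal sketch: latency percentiles from raw samples, since averages
# hide the outliers that P95/P99 expose. Index-rounding approximation.
def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    idx = round(p / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, idx))]

latencies_ms = [120, 140, 145, 150, 155, 160, 170, 180, 2400, 3100]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

Note how the P50 of this sample looks healthy while the tail percentiles surface the two slow outliers — exactly the pattern averages would hide.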
Cost Metrics
| Metric | Description | Why It Matters |
|---|---|---|
| Tokens per request | Input + output tokens | Normalizes usage |
| Cost per user/team | Attribution | Showback/chargeback |
| Cost per feature | Feature-level tracking | ROI analysis |
| Cache hit ratio | Saved spend | Optimization signal |
From research: "Key cost metrics include: tokens per request, cost per user/team/feature, cache hit ratio, requests routed to expensive models, and cost spikes/anomalies."
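The cache hit ratio row above translates directly into saved spend. A hypothetical helper (not from any SDK; `avg_cost_per_call` is an assumed per-request average):

```python
# Sketch: cache hit ratio and estimated spend saved, from counters you
# already track. `avg_cost_per_call` is an assumed per-request average.
def cache_stats(hits: int, misses: int, avg_cost_per_call: float) -> dict:
    total = hits + misses
    return {
        "hit_ratio": hits / total if total else 0.0,
        "saved_usd": hits * avg_cost_per_call,
    }
```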
Quality Metrics
| Metric | Description | Measurement |
|---|---|---|
| Task completion | Did user achieve goal? | User feedback, heuristics |
| Hallucination rate | Factual accuracy | LLM-as-judge, citations |
| Relevance score | Answer quality | Embedding similarity |
| User satisfaction | Explicit feedback | Thumbs up/down, ratings |
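The relevance-score row above is typically measured as embedding similarity. A minimal sketch, assuming you already have embedding vectors from any provider:

```python
import math

# Sketch: relevance as cosine similarity between the embedding of a
# query and of the generated answer (vectors from any embedding model).
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```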
Tracing Standards
OpenTelemetry for LLMs
OpenTelemetry is becoming the standard for LLM observability.
Why OpenTelemetry matters for LLM applications: Before OpenTelemetry, every observability vendor had proprietary instrumentation. If you used Datadog, you used Datadog's SDK. Switching vendors meant rewriting instrumentation code. OpenTelemetry provides a vendor-neutral standard: instrument once, export to any backend. For LLM applications—which might start with one observability tool and outgrow it—this portability is essential.
The LLM-specific challenge: Traditional APM tracks request/response cycles, database queries, and service calls. LLM applications add new concerns: token usage, prompt content, completion text, model parameters, and multi-step agent reasoning. OpenTelemetry's semantic conventions for GenAI provide standardized ways to capture these LLM-specific attributes, ensuring consistency across tools.
Why traces are essential for debugging LLM apps: A single user request might involve: query understanding (LLM call #1), retrieval (vector DB), context assembly, response generation (LLM call #2), and maybe tool use (additional LLM calls). When something goes wrong, you need to see the entire chain—what did each LLM call receive and return? Traces connect these dots, showing cause and effect across the request lifecycle.
From OpenTelemetry: "OpenTelemetry has defined semantic conventions for Generative AI operations across multiple signals: Events (for inputs and outputs), Metrics (for operations), Model spans, and Agent spans."
Key attributes:
```
# Standard GenAI semantic conventions (span attributes)
gen_ai.system = "openai"
gen_ai.request.model = "gpt-4o"
gen_ai.request.max_tokens = 1000
gen_ai.request.temperature = 0.7
gen_ai.usage.input_tokens = 150
gen_ai.usage.output_tokens = 500
gen_ai.response.finish_reason = "stop"
```
OpenInference
Arize's OpenInference extends OpenTelemetry for AI:
From research: "OpenInference is a set of conventions and plugins that is complementary to OpenTelemetry to enable tracing of AI applications. OpenInference defines standardized attributes for LLM interactions, including prompts, model parameters, token usage, responses, and key moments like time-to-first-token."
OpenLLMetry
From Traceloop: "OpenLLMetry is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application, and because it uses OpenTelemetry under the hood, it can be connected to existing observability solutions like Datadog and Honeycomb."
Integration example:
```python
import openai
from traceloop.sdk import Traceloop

# Initialize once at startup
Traceloop.init(app_name="my-llm-app")

# All LLM calls are automatically traced
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
# Traces are sent to the configured backend
```
Major Observability Platforms
The LLM observability landscape is evolving rapidly. The platforms below represent different philosophies: some optimize for specific frameworks, others for flexibility; some are cloud-only, others self-hostable. Your choice depends on your stack, team size, and data sensitivity requirements.
The build vs. buy decision: You could build observability with raw OpenTelemetry and your existing APM tool. But LLM-specific platforms provide: prompt playgrounds for iteration, LLM-as-judge evaluation pipelines, conversation thread visualization, and RAG-specific debugging tools. These specialized features often justify the platform cost.
LangSmith
Best for: LangChain/LangGraph users
From research: "LangChain users get the most from LangSmith, where the integration is automatic and the debugging tools understand LangChain's internals."
Key features:
- Automatic tracing for LangChain
- Prompt versioning and playground
- Dataset management for evaluation
- Hub for sharing prompts
Pricing: Free tier (5,000 traces/month), Plus ($39/user/month)
Limitations: "LangSmith doesn't offer a self-hosting option in the self-serve module" and "its operational capabilities are limited outside LangChain-centric workflows."
Setup:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# All LangChain operations are automatically traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
```
Langfuse
Best for: Framework-agnostic production monitoring, self-hosting
From research: "Langfuse is the open source leader in this space, with over 19,000 GitHub stars and an MIT license that lets you self-host without restrictions."
Key features:
- Multi-turn conversation tracing
- Prompt versioning with playground
- LLM-as-judge evaluation
- Cost tracking and analytics
- Self-hosting support
From research: "Langfuse has a larger open source adoption compared to Arize Phoenix and is considered battle-tested for production use cases."
Setup:
```python
import openai
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def my_llm_function(prompt: str):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Arize Phoenix
Best for: Experimentation, RAG evaluation, agent debugging
From research: "Arize Phoenix is an open-source LLM observability tool built by Arize AI. It is built entirely on OpenTelemetry standards and is designed to run in your local environment."
Key features:
- RAG-specific evaluation
- Agent trace visualization
- Embedding analysis
- Local development friendly
From research: "Compared with other open-source evaluation and tracing tools, Arize Phoenix provides deeper support for agent evaluation. It captures complete multi-step agent traces, allowing teams to assess how agents make decisions over time."
Setup:
```python
import phoenix as px

# Launch the local Phoenix server
session = px.launch_app()

# Instrument your application
from phoenix.otel import register

tracer_provider = register(project_name="my-project")
# View traces at http://localhost:6006
```
Platform Comparison
| Feature | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|
| License | Commercial | MIT (open source) | ELv2 |
| Self-hosting | Enterprise only | Yes | Yes |
| Framework support | LangChain-focused | All frameworks | All frameworks |
| Prompt management | Yes | Yes | Limited |
| Cost tracking | Yes | Yes | Limited |
| RAG evaluation | Basic | Yes | Excellent |
| Agent traces | Good | Good | Excellent |
| Production scale | Yes | Yes | Development focus |
Choosing a Platform
From research:
For LangChain users: "LangSmith is the most natural and powerful choice. Its deep, seamless integration provides unparalleled visibility into chains and agents."
For mid-size teams (10-50): "You can justify combining focused tools. Phoenix for evaluation plus Portkey for routing gives you depth without platform lock-in."
For enterprise (50+): "Extend existing infrastructure. If you run Datadog or New Relic, add their LLM modules. Otherwise, deploy Langfuse self-hosted for data control."
Other Notable Tools
Helicone
Gateway-based observability—intercepts API calls for zero-code instrumentation:
```python
import os
from openai import OpenAI

# Just change the base URL; Helicone proxies the call and logs it
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```
Datadog LLM Observability
From Datadog: Enterprise-grade LLM monitoring integrated with existing APM.
Weights & Biases
MLOps platform with LLM tracing capabilities, good for experiment tracking.
Portkey
AI gateway with routing, caching, and observability:
- Route between providers
- Automatic retries and fallbacks
- Cost optimization
- Unified observability
Implementation Best Practices
Start Early
From research: "Start monitoring from day one of development. Don't wait for production deployment. Instrument your LLM applications during prototyping so you understand baseline model behavior."
Define Thresholds
From research: "Define clear quality thresholds for your use case. Your customer service bot might require 95% accuracy and 90% user satisfaction. Your code generator needs 99% syntax correctness. Document these thresholds, align stakeholders on them, and configure alerts when reality diverges."
Layer Your Monitoring
From research: "Implement layered monitoring across the stack. Track application-level metrics (user satisfaction, task completion), model-level metrics (latency, token usage), and infrastructure metrics (API availability, rate limits)."
Prioritize Metrics
From research: "Start with cost and latency since they're easy to measure and immediately actionable. Add error rates next. Once you have baseline visibility, layer in quality metrics like hallucination detection and relevance scoring."
Cost Tracking Implementation
Adding Cost Attributes
From research: "Many teams extend OpenTelemetry by adding a custom span attribute for cost, calculated from token counts and the model's pricing schema."
```python
from opentelemetry import trace
from openai import OpenAI

client = OpenAI()

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},  # USD per 1M tokens
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

def traced_completion(model: str, messages: list):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("llm_completion") as span:
        response = client.chat.completions.create(model=model, messages=messages)
        usage = response.usage
        cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        span.set_attribute("gen_ai.cost_usd", cost)
        span.set_attribute("gen_ai.request.model", model)
        return response
```
Cost Attribution
From research: "Tag metadata such as user, team, environment, and feature for precise cost attribution."
```python
span.set_attribute("user_id", user_id)
span.set_attribute("team", team_name)
span.set_attribute("feature", "customer_support")
span.set_attribute("environment", "production")
```
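Once spans carry these tags plus a cost attribute, attribution is a simple roll-up. A sketch, assuming spans are exported as flat dicts (the shape here is illustrative, not any vendor's export format):

```python
from collections import defaultdict

# Sketch: aggregate per-span cost by an attribution key (team, feature,
# user_id, ...). Span dicts are an assumed, simplified export shape.
def cost_by(spans: list[dict], key: str) -> dict[str, float]:
    totals: defaultdict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(key, "unknown")] += span.get("gen_ai.cost_usd", 0.0)
    return dict(totals)

spans = [
    {"team": "support", "feature": "customer_support", "gen_ai.cost_usd": 0.012},
    {"team": "support", "feature": "customer_support", "gen_ai.cost_usd": 0.031},
    {"team": "growth", "feature": "email_drafts", "gen_ai.cost_usd": 0.008},
]
by_team = cost_by(spans, "team")
by_feature = cost_by(spans, "feature")
```

The same function answers "cost per user", "cost per feature", and "cost per environment" — which is exactly why consistent tagging matters.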
Production Monitoring Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                     Your LLM Application                     │
├──────────────────────────────────────────────────────────────┤
│ OpenTelemetry SDK + OpenLLMetry/OpenInference Instrumentation│
└─────────────────────────┬────────────────────────────────────┘
                          │ OTLP
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                     Collector / Gateway                      │
│         (OpenTelemetry Collector, Portkey, Helicone)         │
└─────────────────────────┬────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
     ┌──────────┐    ┌──────────┐    ┌──────────┐
     │ LangSmith│    │ Langfuse │    │ Datadog  │
     │ Phoenix  │    │ Grafana  │    │ etc.     │
     └──────────┘    └──────────┘    └──────────┘
```
Alerting and Dashboards
Key Alerts
```yaml
alerts:
  - name: high_latency
    condition: p95_latency > 5000ms
    severity: warning
  - name: cost_spike
    condition: hourly_cost > 2x_baseline
    severity: critical
  - name: error_rate
    condition: error_rate > 5%
    severity: critical
  - name: low_quality
    condition: user_satisfaction < 80%
    severity: warning
```
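The `cost_spike` rule above needs a baseline to compare against. A minimal sketch, assuming the baseline is the mean of the trailing hours in the window you pass in:

```python
# Sketch of the cost_spike rule: flag an hour whose spend exceeds
# `multiplier` times the trailing baseline (mean of the prior hours).
def is_cost_spike(hourly_costs: list[float], multiplier: float = 2.0) -> bool:
    *history, current = hourly_costs
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current > multiplier * baseline
```

A rolling mean is the simplest baseline; production systems often use a same-hour-last-week comparison instead to avoid flagging normal daily traffic peaks.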
Dashboard Panels
Essential panels:
- Request volume over time
- Latency distribution (P50, P95, P99)
- Token usage by model
- Cost by team/feature
- Error rate and types
- User feedback scores
Debugging with Traces
Trace Analysis Workflow
1. Identify problematic requests via metrics (high latency, errors)
2. Find the trace ID from logs or metrics
3. Inspect the trace waterfall to identify slow spans
4. Examine span attributes for input/output details
5. Compare with successful traces to isolate the issue
6. Fix and verify with an A/B comparison
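Step 1 of this workflow can be as simple as filtering exported spans. A sketch (the span shape here is assumed for illustration, not any vendor's export format):

```python
# Sketch of workflow step 1: collect trace IDs for failed or slow
# requests from exported spans. Span dicts are an assumed shape.
def problem_traces(spans: list[dict], latency_budget_ms: float) -> set[str]:
    return {
        s["trace_id"]
        for s in spans
        if s.get("status") == "error" or s.get("duration_ms", 0) > latency_budget_ms
    }

spans = [
    {"trace_id": "a1", "status": "ok", "duration_ms": 800},
    {"trace_id": "b2", "status": "error", "duration_ms": 300},
    {"trace_id": "c3", "status": "ok", "duration_ms": 7000},
]
slow_or_failed = problem_traces(spans, latency_budget_ms=5000)
```

Each ID in the result is a trace to open in your platform's waterfall view for steps 3-5.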
Common Issues and Diagnosis
| Symptom | Trace Pattern | Likely Cause |
|---|---|---|
| High latency | Long LLM span | Model capacity, prompt length |
| High latency | Long retrieval span | Vector DB performance |
| Errors | Missing spans | Timeout, rate limit |
| Poor quality | Short output | Max tokens too low |
| High cost | Many LLM spans | Unnecessary retries |
Conclusion
LLM observability is essential for production systems. Start with:
- Instrument early with OpenTelemetry-compatible tools
- Track costs from day one before they surprise you
- Define quality thresholds and alert on violations
- Choose tools based on your stack (LangSmith for LangChain, Langfuse for flexibility, Phoenix for experimentation)
The investment in observability pays dividends in debugging speed, cost control, and stakeholder confidence.
Related Articles
LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.
Building Production-Ready RAG Systems: Lessons from the Field
Production-focused guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
LLM Inference Optimization: From Quantization to Speculative Decoding
Practical guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.