
LLM Observability and Monitoring: From Development to Production

Hands-on guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.


Why LLM Observability Matters

The gap between a working demo and a production system is vast. From research: "The final stretch from 'demo quality' to 'production quality' consumes disproportionate effort. The organisations extracting real value are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty."

LLM observability provides the visibility needed to:

  • Debug non-deterministic behavior
  • Track costs before they spiral
  • Detect quality degradation
  • Optimize latency and throughput
  • Build trust with stakeholders

This post covers the tools, standards, and practices for production LLM observability.

Key Metrics to Monitor

Performance Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Latency (P50/P95/P99) | Response time distribution | P95 < 2s for chat |
| Time to First Token (TTFT) | Streaming responsiveness | < 500ms |
| Tokens per Second | Generation throughput | Model-dependent |
| Throughput | Requests per second | Based on capacity |

From research: "Performance monitoring tracks latency at every percentile, not just averages that hide outliers."

Cost Metrics

| Metric | Description | Why It Matters |
| --- | --- | --- |
| Tokens per request | Input + output tokens | Normalizes usage |
| Cost per user/team | Attribution | Showback/chargeback |
| Cost per feature | Feature-level tracking | ROI analysis |
| Cache hit ratio | Saved spend | Optimization signal |

From research: "Key cost metrics include: tokens per request, cost per user/team/feature, cache hit ratio, requests routed to expensive models, and cost spikes/anomalies."

Quality Metrics

| Metric | Description | Measurement |
| --- | --- | --- |
| Task completion | Did the user achieve their goal? | User feedback, heuristics |
| Hallucination rate | Factual accuracy | LLM-as-judge, citations |
| Relevance score | Answer quality | Embedding similarity |
| User satisfaction | Explicit feedback | Thumbs up/down, ratings |
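
Quality metrics usually need an evaluator in the loop. The sketch below shows one way to score relevance with an LLM-as-judge; the prompt, scale, and model choice are illustrative assumptions, not a standard.

Python
from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt and 1-5 scale
JUDGE_PROMPT = (
    "Rate from 1 to 5 how well the answer addresses the question. "
    "Reply with a single digit.\n\nQuestion: {question}\n\nAnswer: {answer}"
)

def judge_relevance(question: str, answer: str) -> int:
    # A small, cheap model is usually enough for scoring
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())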

Tracing Standards

OpenTelemetry for LLMs

OpenTelemetry is becoming the standard for LLM observability.

Why OpenTelemetry matters for LLM applications: Before OpenTelemetry, every observability vendor had proprietary instrumentation. If you used Datadog, you used Datadog's SDK. Switching vendors meant rewriting instrumentation code. OpenTelemetry provides a vendor-neutral standard: instrument once, export to any backend. For LLM applications—which might start with one observability tool and outgrow it—this portability is essential.

The LLM-specific challenge: Traditional APM tracks request/response cycles, database queries, and service calls. LLM applications add new concerns: token usage, prompt content, completion text, model parameters, and multi-step agent reasoning. OpenTelemetry's semantic conventions for GenAI provide standardized ways to capture these LLM-specific attributes, ensuring consistency across tools.

Why traces are essential for debugging LLM apps: A single user request might involve: query understanding (LLM call #1), retrieval (vector DB), context assembly, response generation (LLM call #2), and maybe tool use (additional LLM calls). When something goes wrong, you need to see the entire chain—what did each LLM call receive and return? Traces connect these dots, showing cause and effect across the request lifecycle.
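
As a minimal sketch of what such a trace looks like in code, the parent span below ties together the two LLM calls and the retrieval step; `rewrite_query`, `search_vector_db`, and `generate_answer` are hypothetical placeholders for your own functions.

Python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def answer_question(question: str) -> str:
    # One parent span ties the whole request together
    with tracer.start_as_current_span("rag.request"):
        with tracer.start_as_current_span("llm.query_understanding"):
            rewritten = rewrite_query(question)        # LLM call #1 (placeholder)
        with tracer.start_as_current_span("retrieval.vector_search"):
            documents = search_vector_db(rewritten)    # vector DB lookup (placeholder)
        with tracer.start_as_current_span("llm.generation") as gen_span:
            gen_span.set_attribute("gen_ai.request.model", "gpt-4o")
            answer = generate_answer(question, documents)  # LLM call #2 (placeholder)
        return answer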

From OpenTelemetry: "OpenTelemetry has defined semantic conventions for Generative AI operations across multiple signals: Events (for inputs and outputs), Metrics (for operations), Model spans, and Agent spans."

Key attributes:

Python
# Standard GenAI semantic convention attributes, set on the LLM span
span.set_attribute("gen_ai.system", "openai")
span.set_attribute("gen_ai.request.model", "gpt-4o")
span.set_attribute("gen_ai.request.max_tokens", 1000)
span.set_attribute("gen_ai.request.temperature", 0.7)
span.set_attribute("gen_ai.usage.input_tokens", 150)
span.set_attribute("gen_ai.usage.output_tokens", 500)
span.set_attribute("gen_ai.response.finish_reason", "stop")

OpenInference

Arize's OpenInference extends OpenTelemetry for AI:

From research: "OpenInference is a set of conventions and plugins that is complementary to OpenTelemetry to enable tracing of AI applications. OpenInference defines standardized attributes for LLM interactions, including prompts, model parameters, token usage, responses, and key moments like time-to-first-token."

OpenLLMetry

From Traceloop: "OpenLLMetry is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application, and because it uses OpenTelemetry under the hood, it can be connected to existing observability solutions like Datadog and Honeycomb."

Integration example:

Python
from openai import OpenAI
from traceloop.sdk import Traceloop

# Initialize once at application startup
Traceloop.init(app_name="my-llm-app")

client = OpenAI()

# All LLM calls are traced automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
# Traces are sent to the configured backend

Major Observability Platforms

The LLM observability landscape is evolving rapidly. The platforms below represent different philosophies: some optimize for specific frameworks, others for flexibility; some are cloud-only, others self-hostable. Your choice depends on your stack, team size, and data sensitivity requirements.

The build vs. buy decision: You could build observability with raw OpenTelemetry and your existing APM tool. But LLM-specific platforms provide: prompt playgrounds for iteration, LLM-as-judge evaluation pipelines, conversation thread visualization, and RAG-specific debugging tools. These specialized features often justify the platform cost.

LangSmith

Best for: LangChain/LangGraph users

From research: "LangChain users get the most from LangSmith, where the integration is automatic and the debugging tools understand LangChain's internals."

Key features:

  • Automatic tracing for LangChain
  • Prompt versioning and playground
  • Dataset management for evaluation
  • Hub for sharing prompts

Pricing: Free tier (5,000 traces/month), Plus ($39/user/month)

Limitations: "LangSmith doesn't offer a self-hosting option in the self-serve module" and "its operational capabilities are limited outside LangChain-centric workflows."

Setup:

Python
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# All LangChain operations automatically traced
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")

Langfuse

Best for: Framework-agnostic production monitoring, self-hosting

From research: "Langfuse is the open source leader in this space, with over 19,000 GitHub stars and an MIT license that lets you self-host without restrictions."

Key features:

  • Multi-turn conversation tracing
  • Prompt versioning with playground
  • LLM-as-judge evaluation
  • Cost tracking and analytics
  • Self-hosting support

From research: "Langfuse has a larger open source adoption compared to Arize Phoenix and is considered battle-tested for production use cases."

Setup:

Python
from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()
client = OpenAI()

@observe()
def my_llm_function(prompt: str):
    # The decorator records inputs, outputs, and timings as a Langfuse trace
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Arize Phoenix

Best for: Experimentation, RAG evaluation, agent debugging

From research: "Arize Phoenix is an open-source LLM observability tool built by Arize AI. It is built entirely on OpenTelemetry standards and is designed to run in your local environment."

Key features:

  • RAG-specific evaluation
  • Agent trace visualization
  • Embedding analysis
  • Local development friendly

From research: "Compared with other open-source evaluation and tracing tools, Arize Phoenix provides deeper support for agent evaluation. It captures complete multi-step agent traces, allowing teams to assess how agents make decisions over time."

Setup:

Python
import phoenix as px

# Launch local Phoenix server
session = px.launch_app()

# Instrument your application
from phoenix.otel import register
tracer_provider = register(project_name="my-project")

# View traces at http://localhost:6006

Platform Comparison

| Feature | LangSmith | Langfuse | Arize Phoenix |
| --- | --- | --- | --- |
| License | Commercial | MIT (open source) | ELv2 |
| Self-hosting | Enterprise only | Yes | Yes |
| Framework support | LangChain-focused | All frameworks | All frameworks |
| Prompt management | Yes | Yes | Limited |
| Cost tracking | Yes | Yes | Limited |
| RAG evaluation | Basic | Yes | Excellent |
| Agent traces | Good | Good | Excellent |
| Production scale | Yes | Yes | Development focus |

Choosing a Platform

From research:

For LangChain users: "LangSmith is the most natural and powerful choice. Its deep, seamless integration provides unparalleled visibility into chains and agents."

For mid-size teams (10-50): "You can justify combining focused tools. Phoenix for evaluation plus Portkey for routing gives you depth without platform lock-in."

For enterprise (50+): "Extend existing infrastructure. If you run Datadog or New Relic, add their LLM modules. Otherwise, deploy Langfuse self-hosted for data control."

Other Notable Tools

Helicone

Gateway-based observability—intercepts API calls for zero-code instrumentation:

Python
import os
from openai import OpenAI

# Just change the base URL; Helicone logs requests as they pass through its gateway
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"}
)

Datadog LLM Observability

From Datadog: Enterprise-grade LLM monitoring integrated with existing APM.

Weights & Biases

MLOps platform with LLM tracing capabilities, good for experiment tracking.

Portkey

AI gateway with routing, caching, and observability:

  • Route between providers
  • Automatic retries and fallbacks
  • Cost optimization
  • Unified observability

Implementation Best Practices

Start Early

From research: "Start monitoring from day one of development. Don't wait for production deployment. Instrument your LLM applications during prototyping so you understand baseline model behavior."

Define Thresholds

From research: "Define clear quality thresholds for your use case. Your customer service bot might require 95% accuracy and 90% user satisfaction. Your code generator needs 99% syntax correctness. Document these thresholds, align stakeholders on them, and configure alerts when reality diverges."

Layer Your Monitoring

From research: "Implement layered monitoring across the stack. Track application-level metrics (user satisfaction, task completion), model-level metrics (latency, token usage), and infrastructure metrics (API availability, rate limits)."

Prioritize Metrics

From research: "Start with cost and latency since they're easy to measure and immediately actionable. Add error rates next. Once you have baseline visibility, layer in quality metrics like hallucination detection and relevance scoring."

Cost Tracking Implementation

Adding Cost Attributes

From research: "Many teams extend OpenTelemetry by adding a custom span attribute for cost, calculated from token counts and the model's pricing schema."

Python
from openai import OpenAI
from opentelemetry import trace

# Client used by traced_completion below
client = OpenAI()

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},  # per 1M tokens
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

def traced_completion(model: str, messages: list):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("llm_completion") as span:
        response = client.chat.completions.create(model=model, messages=messages)

        usage = response.usage
        cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)

        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        span.set_attribute("gen_ai.cost_usd", cost)
        span.set_attribute("gen_ai.request.model", model)

        return response

Cost Attribution

From research: "Tag metadata such as user, team, environment, and feature for precise cost attribution."

Python
span.set_attribute("user_id", user_id)
span.set_attribute("team", team_name)
span.set_attribute("feature", "customer_support")
span.set_attribute("environment", "production")

Production Monitoring Architecture

Code
┌─────────────────────────────────────────────────────────────┐
│                     Your LLM Application                     │
├─────────────────────────────────────────────────────────────┤
│  OpenTelemetry SDK + OpenLLMetry/OpenInference Instrumentation│
└─────────────────────────┬───────────────────────────────────┘
                          │ OTLP
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                    Collector / Gateway                       │
│         (OpenTelemetry Collector, Portkey, Helicone)        │
└─────────────────────────┬───────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │ LangSmith │   │ Langfuse │   │ Datadog  │
    │ Phoenix   │   │ Grafana  │   │ etc.     │
    └──────────┘   └──────────┘   └──────────┘
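
On the application side, the wiring above typically amounts to pointing the OpenTelemetry SDK at the collector over OTLP. A minimal sketch, assuming a collector listening on the default gRPC port 4317:

Python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans over OTLP to a local collector, which fans out to the backends above
provider = TracerProvider(resource=Resource.create({"service.name": "my-llm-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)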

Alerting and Dashboards

Key Alerts

YAML
alerts:
  - name: high_latency
    condition: p95_latency > 5000ms
    severity: warning

  - name: cost_spike
    condition: hourly_cost > 2x_baseline
    severity: critical

  - name: error_rate
    condition: error_rate > 5%
    severity: critical

  - name: low_quality
    condition: user_satisfaction < 80%
    severity: warning

Dashboard Panels

Essential panels:

  1. Request volume over time
  2. Latency distribution (P50, P95, P99)
  3. Token usage by model
  4. Cost by team/feature
  5. Error rate and types
  6. User feedback scores

Debugging with Traces

Trace Analysis Workflow

  1. Identify problematic requests via metrics (high latency, errors)
  2. Find trace ID from logs or metrics
  3. Inspect trace waterfall to identify slow spans
  4. Examine span attributes for input/output details
  5. Compare with successful traces to isolate issues
  6. Fix and verify with A/B comparison

Common Issues and Diagnosis

| Symptom | Trace Pattern | Likely Cause |
| --- | --- | --- |
| High latency | Long LLM span | Model capacity, prompt length |
| High latency | Long retrieval span | Vector DB performance |
| Errors | Missing spans | Timeout, rate limit |
| Poor quality | Short output | Max tokens too low |
| High cost | Many LLM spans | Unnecessary retries |

Conclusion

LLM observability is essential for production systems. Start with:

  1. Instrument early with OpenTelemetry-compatible tools
  2. Track costs from day one before they surprise you
  3. Define quality thresholds and alert on violations
  4. Choose tools based on your stack (LangSmith for LangChain, Langfuse for flexibility, Phoenix for experimentation)

The investment in observability pays dividends in debugging speed, cost control, and stakeholder confidence.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
