LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.
The Evaluation Crisis
Here's an uncomfortable truth: most teams deploying LLMs in production don't actually know how well their systems work. They've seen impressive benchmark scores, run a few manual tests, and shipped. When things go wrong—and they will—they have no systematic way to understand why or measure improvement.
Academic benchmarks like MMLU, HellaSwag, and HumanEval are useful for comparing foundation models, but they're nearly useless for evaluating your specific application. A model that tops the leaderboards might fail catastrophically on your data, while a smaller model with lower benchmark scores might excel.
At Goji AI, we learned this lesson the hard way. Our first production LLM deployment had impressive demo metrics but a 40% failure rate on real user queries. The problem wasn't the model—it was our evaluation. We were measuring the wrong things.
This post shares the evaluation framework we've built over two years and millions of production queries. It's not glamorous work, but it's the difference between deploying AI systems that actually work and deploying expensive failures.
Why Benchmarks Fail
Distribution Mismatch
Benchmarks are constructed datasets with specific properties. Your production data has different distributions:
- User queries are messier, more ambiguous, and more diverse
- Your domain has specific terminology and context
- Edge cases appear that benchmark creators never imagined
A model optimized for clean benchmark questions may struggle with "hey can u help me wit my tax stuff" from a real user.
Metric Mismatch
Benchmark metrics optimize for narrow capabilities:
- Accuracy on multiple-choice questions
- Exact match on code generation
- BLEU/ROUGE on translation
Production success is multi-dimensional:
- Did the user accomplish their goal?
- Was the response safe and appropriate?
- Did it match the expected format?
- Was latency acceptable?
- What was the cost per query?
Contamination
Models may have seen benchmark data during training, either directly or through similar content. This inflates benchmark scores without improving real capability. Your proprietary data is, by definition, not contaminated—and that's where true performance shows.
The Evaluation Stack
A robust evaluation system has four layers. Each layer catches different types of issues, and you need all of them working together.
Why four layers instead of just testing the final output? Consider an analogy to software testing. You wouldn't test a web application only by clicking through the UI—you'd have unit tests for individual functions, integration tests for API endpoints, and end-to-end tests for user workflows. LLM systems are the same: component-level tests catch prompt bugs, model-level evaluations catch capability regressions, integration tests catch system issues, and production monitoring catches real-world edge cases.
The cost of missing a layer: Without unit tests, you'll waste evaluation compute on runs that fail due to template bugs. Without golden datasets, you won't notice gradual quality degradation. Without integration tests, you'll miss context window issues and tool failures. Without production monitoring, you'll be blind to the gap between your test distribution and reality.
Layer 1: Unit Tests (Component Level)
Test individual components with deterministic checks:
Prompt Template Tests:
- Does the template render correctly with various inputs?
- Are edge cases (empty inputs, special characters) handled?
- Do format instructions produce parseable outputs?
Tool Integration Tests:
- Do tools return expected responses for known inputs?
- Are errors handled gracefully?
- Do timeouts trigger correctly?
Output Parser Tests:
- Does parsing work for all expected formats?
- Are malformed outputs detected and handled?
These tests run in CI on every commit. They catch regressions in the deterministic parts of your system.
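To make this concrete, here is a minimal pytest sketch of the kind of component-level checks we mean. The `render_prompt` and `parse_output` functions are simplified stand-ins defined inline for illustration, not production code.

```python
# test_components.py -- minimal component-level checks with pytest.
# render_prompt and parse_output are simplified stand-ins for your own code.
import json
import pytest

PROMPT_TEMPLATE = "Answer the customer question.\nQuestion: {query}\nAnswer:"


def render_prompt(query: str) -> str:
    return PROMPT_TEMPLATE.format(query=query)


def parse_output(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable model output: {raw!r}") from exc
    if "answer" not in data:
        raise ValueError("missing required 'answer' field")
    return data


@pytest.mark.parametrize("query", ["refund status?", "", "açaí & <b>html</b>"])
def test_template_renders_for_edge_case_inputs(query):
    prompt = render_prompt(query)
    assert "{query}" not in prompt        # no unfilled placeholders
    assert prompt.endswith("Answer:")     # format instructions intact


def test_parser_rejects_malformed_output():
    with pytest.raises(ValueError):
        parse_output("not json at all")


def test_parser_accepts_expected_format():
    assert parse_output('{"answer": "42"}')["answer"] == "42"
```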
Layer 2: Evaluations (Model Level)
Test model behavior on curated datasets:
Golden Datasets: Curated examples with human-labeled correct outputs. For each example:
- Input query/context
- Expected output (or acceptable output range)
- Evaluation criteria
- Edge case labels
We maintain ~500 golden examples per major use case, stratified by:
- Query type (factual, analytical, creative)
- Difficulty (easy, medium, hard)
- Edge case category
- User segment
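One possible shape for a single golden record, sketched as a Python dataclass; the field names are illustrative, not a prescribed schema:

```python
# A possible shape for one golden-dataset record; field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    example_id: str
    query: str                      # input query/context
    expected_output: str            # or a description of the acceptable range
    criteria: list[str]             # evaluation criteria to apply
    query_type: str                 # factual / analytical / creative
    difficulty: str                 # easy / medium / hard
    edge_case_tags: list[str] = field(default_factory=list)
    user_segment: str = "general"
```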
Automated Evaluation: For each golden example, generate model output and evaluate:
| Evaluation Type | What It Measures | How |
|---|---|---|
| Exact Match | Deterministic outputs | String comparison |
| Semantic Similarity | Meaning preservation | Embedding cosine similarity |
| Rubric Scoring | Multi-dimensional quality | LLM-as-judge with rubric |
| Factuality | Claim accuracy | LLM-as-judge against sources |
| Format Compliance | Output structure | Schema validation |
| Safety | Harmful content | Classification models + rules |
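As a sketch of two evaluators from the table above, exact match and embedding-based semantic similarity, here is one way to implement them with sentence-transformers. The embedding model and the 0.85 pass threshold are assumptions you would tune on your own data:

```python
# Sketch: exact-match and semantic-similarity checks for one golden example.
# sentence-transformers is one possible embedding backend; the model name and
# the 0.85 threshold are assumptions to calibrate on your own data.
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


def semantic_similarity(output: str, expected: str) -> float:
    emb = _embedder.encode([output, expected], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def passes(output: str, expected: str, threshold: float = 0.85) -> bool:
    # Exact match short-circuits; otherwise fall back to semantic similarity.
    return exact_match(output, expected) or semantic_similarity(output, expected) >= threshold
```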
LLM-as-Judge: The most powerful technique for nuanced evaluation. A separate LLM evaluates outputs against criteria:
Evaluate the following response on these criteria (1-5 scale):
1. Relevance: Does it address the user's question?
2. Accuracy: Are all factual claims correct?
3. Completeness: Does it cover all necessary information?
4. Clarity: Is it well-organized and easy to understand?
5. Safety: Is it free from harmful content?
Query: {query}
Response: {response}
Reference: {reference}
Key considerations:
- Use a different model family as judge (Claude judging GPT outputs, or vice versa)
- Include reference answers for calibration
- Validate judge accuracy against human labels
- Watch for position bias, verbosity bias, and self-preference
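Below is a minimal sketch of such a judge call, using the OpenAI Python client as an example backend. The judge model, the JSON output contract, and the score parsing are assumptions; as noted above, the judge should come from a different model family than the generator being evaluated.

```python
# Sketch: LLM-as-judge scoring one response against a rubric.
# The judge model and the JSON output contract are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the response on a 1-5 scale for relevance, accuracy,
completeness, clarity, and safety. Explain your reasoning first, then answer
with JSON only, in the form:
{{"reasoning": "...", "relevance": 5, "accuracy": 5, "completeness": 5, "clarity": 5, "safety": 5}}

Query: {query}
Response: {response}
Reference: {reference}"""


def judge(query: str, response: str, reference: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",   # assumed judge model; use a different family than your generator
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response, reference=reference),
        }],
    )
    return json.loads(completion.choices[0].message.content)
```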
Layer 3: Integration Tests (System Level)
Test the complete system end-to-end:
Scenario Tests: Multi-turn conversations that test realistic workflows:
- User asks initial question
- System responds
- User follows up
- System maintains context and provides coherent response
Regression Tests: Specific examples that previously failed. Every production bug becomes a test case. Our regression suite has 2,000+ examples from past failures.
Performance Tests:
- Latency percentiles (p50, p95, p99) under load
- Throughput limits
- Behavior under degraded conditions (slow APIs, partial failures)
Layer 4: Production Monitoring (Live Level)
Monitor real-world performance:
Implicit Signals:
- Conversation length (longer may indicate struggle)
- Retry rate (users rephrasing questions)
- Abandonment rate
- Time to completion
Explicit Feedback:
- Thumbs up/down
- Star ratings
- Correction submissions
- Support escalations
Quality Sampling: Randomly sample N% of production queries for human evaluation. This grounds your automated metrics in reality. We sample 1% of queries daily, stratified by user segment and query type.
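One way to draw that stratified sample with pandas; the column names and the 1% rate are assumptions:

```python
# Sketch: stratified daily sample of production queries for human review.
# Column names ("user_segment", "query_type") and the 1% rate are assumptions.
import pandas as pd


def sample_for_review(queries: pd.DataFrame, frac: float = 0.01, seed: int = 0) -> pd.DataFrame:
    # Sample the same fraction within each (segment, query type) stratum so
    # rare strata still show up in the review queue.
    return (
        queries.groupby(["user_segment", "query_type"], group_keys=False)
               .sample(frac=frac, random_state=seed)
    )
```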
Drift Detection: Monitor for distribution shift:
- Input embedding drift (are queries changing?)
- Output characteristic drift (response length, format, tone)
- Metric drift (are automated scores changing?)
Alert on significant drift for investigation.
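As one simple (and admittedly crude) input-drift signal, you can track the cosine distance between the centroid of recent query embeddings and a reference window. The alert threshold below is an assumption to calibrate against known-good periods:

```python
# Sketch: a crude input-drift signal -- cosine distance between the centroid
# of current query embeddings and a reference window's centroid.
# The 0.1 alert threshold is an assumption; tune it on known-good weeks.
import numpy as np


def centroid_drift(reference_embs: np.ndarray, current_embs: np.ndarray) -> float:
    ref_c = reference_embs.mean(axis=0)
    cur_c = current_embs.mean(axis=0)
    cos = np.dot(ref_c, cur_c) / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return 1.0 - float(cos)


def should_alert(reference_embs: np.ndarray, current_embs: np.ndarray,
                 threshold: float = 0.1) -> bool:
    return centroid_drift(reference_embs, current_embs) > threshold
```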
Metrics That Matter
Primary Metrics (Business Outcomes)
What actually matters for your application:
| Application Type | Primary Metric | Target |
|---|---|---|
| Customer Support | Resolution rate without escalation | > 75% |
| Code Assistant | User acceptance rate | > 60% |
| Content Generation | Publication rate | > 80% |
| Data Analysis | Insight actionability (user survey) | > 4/5 |
| RAG System | Answer correctness (sampled) | > 90% |
These are your north star metrics. Everything else exists to predict or improve them.
Secondary Metrics (Quality Dimensions)
Multi-dimensional quality assessment:
Accuracy/Faithfulness: Are factual claims correct? For RAG systems, are claims grounded in retrieved context?
Measurement: LLM-as-judge against sources, human spot-check, citation verification.
Target: > 95% for high-stakes applications.
Relevance: Does the response address what the user actually asked?
Measurement: LLM-as-judge, semantic similarity to reference, user feedback.
Target: > 90%.
Completeness: Does it cover all necessary information without critical omissions?
Measurement: Checklist-based LLM evaluation, human review.
Target: Context-dependent.
Safety: Is it free from harmful, biased, or inappropriate content?
Measurement: Safety classifiers, rule-based filters, human review queue for edge cases.
Target: > 99.9% (false negatives are costly).
Format/Structure: Does output match expected schema and style?
Measurement: Automated parsing, schema validation.
Target: > 99%.
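A format-compliance check can be as simple as validating model output against a schema, for example with pydantic v2; the schema below is illustrative:

```python
# Sketch: format-compliance check with pydantic v2; the schema is illustrative.
from pydantic import BaseModel, ValidationError


class SupportAnswer(BaseModel):
    answer: str
    sources: list[str]
    confidence: float


def format_compliant(raw_json: str) -> bool:
    try:
        SupportAnswer.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False
```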
Operational Metrics
System health indicators:
| Metric | Description | Target |
|---|---|---|
| Latency P50 | Median response time | < 2s |
| Latency P95 | 95th percentile response time | < 5s |
| Error Rate | % of requests that fail | < 1% |
| Cost per Query | LLM API + compute costs | Application-specific |
| Token Efficiency | Output quality per input token | Trending up |
The Correlation Problem
The critical insight: your offline metrics must correlate with online outcomes.
We've seen systems where:
- LLM-as-judge scores improved by 10% but user satisfaction dropped
- Automated accuracy increased but support tickets spiked
- All metrics looked great but users churned
Run correlation analysis monthly:
- Sample production queries with explicit feedback
- Run all automated evaluations
- Calculate correlation coefficients
- If correlation < 0.7, your evaluation needs work
When offline and online metrics diverge, trust online metrics and fix your evaluation.
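A sketch of that monthly check, using Spearman rank correlation since explicit feedback is usually ordinal; the 0.7 bar mirrors the rule of thumb above:

```python
# Sketch: monthly offline/online correlation check.
# judge_scores: automated scores for sampled production queries.
# user_ratings: explicit feedback for the same queries, in the same order.
from scipy.stats import spearmanr


def offline_online_correlation(judge_scores: list[float], user_ratings: list[float]) -> float:
    rho, _p_value = spearmanr(judge_scores, user_ratings)
    return rho


# If rho < 0.7, revisit your evaluators before trusting them for launch decisions.
```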
Building Golden Datasets
Your golden dataset is the foundation of everything. Here's how to build one that actually works:
Collection Strategy
Seed from Production: Start with real user queries, not synthetic examples. Sample across:
- Time periods (capture temporal variation)
- User segments (beginners vs. experts)
- Query types (all categories you support)
- Difficulty levels (estimated or labeled)
Targeted Expansion: Identify coverage gaps through error analysis. If a certain query type keeps failing, add more examples.
Adversarial Examples: Include examples designed to break your system:
- Prompt injection attempts
- Ambiguous queries
- Out-of-scope requests
- Edge cases from production failures
Labeling Protocol
Clear Criteria: Write detailed rubrics for each evaluation dimension. Labelers should agree on what "good" means.
Multiple Raters: Use 3+ raters per example for subjective dimensions. Measure inter-rater reliability (Krippendorff's alpha > 0.8 is good).
Calibration Sessions: Regular meetings where raters discuss disagreements and align on standards.
Expert Review: Domain experts validate labels for specialized content.
Dataset Hygiene
Versioning: Track dataset versions. When you update labels or add examples, create a new version. Keep old versions for comparison.
Stratification: Ensure balanced representation across important dimensions. Use stratified sampling for evaluation.
Contamination Prevention: Never use golden examples for training, prompting, or few-shot examples. Keep evaluation data strictly held-out.
Regular Refresh: Production distribution shifts. Refresh 10-20% of your dataset quarterly with new production examples.
LLM-as-Judge: The Details
LLM-as-judge is powerful but requires careful implementation:
Prompt Design
Explicit Rubrics: Don't ask "is this good?" Ask for specific dimensions with clear criteria:
Rate the response on ACCURACY (1-5):
5: All claims are factually correct and verifiable
4: Nearly all claims correct, minor inaccuracies that don't affect usefulness
3: Mostly correct but contains at least one significant error
2: Multiple errors that substantially reduce reliability
1: Predominantly incorrect or misleading
Reference Answers: When available, provide reference answers for comparison. Judges are more reliable comparing than evaluating in isolation.
Chain-of-Thought: Ask the judge to explain reasoning before scoring. This improves consistency and provides debugging information.
Bias Mitigation
Position Bias: LLMs prefer responses in certain positions. When comparing two responses, run both orderings and aggregate.
Verbosity Bias: Longer responses often score higher regardless of quality. Normalize for length or explicitly instruct against this bias.
Self-Preference: Models prefer their own outputs. Use a different model family for judging.
Primacy/Recency: Information early or late in prompts gets more attention. Structure prompts carefully.
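For position bias specifically, a simple mitigation is to run each pairwise comparison in both orderings and only count consistent verdicts; `compare` below is a hypothetical judge call returning "A", "B", or "tie":

```python
# Sketch: pairwise comparison run in both orders to cancel position bias.
# compare(a, b) is a hypothetical judge call returning "A", "B", or "tie".
def debiased_preference(response_1: str, response_2: str, compare) -> str:
    first = compare(response_1, response_2)    # response_1 shown in position A
    second = compare(response_2, response_1)   # response_1 shown in position B

    if first == "A" and second == "B":
        return "response_1"                    # preferred in both orderings
    if first == "B" and second == "A":
        return "response_2"
    return "tie"                               # inconsistent verdicts count as a tie
```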
Calibration
Human Alignment: Calculate correlation between LLM-as-judge scores and human ratings on a held-out set. If correlation < 0.8, revise your prompts.
Score Distribution: Examine score distributions. If all scores are 4-5, your rubric isn't discriminative. If scores are bimodal, criteria may be ambiguous.
Failure Cases: Manually review cases where LLM-as-judge significantly disagrees with humans. These reveal prompt improvements.
Implementing Evaluation Infrastructure
Evaluation Pipeline Architecture
Production
│
▼
[Query Sampling] ──► [Golden Dataset Store]
│ │
▼ ▼
[Model Inference] ◄─── [Eval Set Loader]
│
▼
[Output Store] ──────► [Automated Evaluators]
│ │
▼ ▼
[Human Review Queue] [Metrics Store]
│ │
▼ ▼
[Label Store] ◄───────► [Dashboard/Alerting]
Tooling
Evaluation Framework: Use or build a harness that:
- Loads evaluation sets
- Runs inference in parallel
- Applies multiple evaluators
- Stores results with full provenance
- Generates reports
We use a custom framework built on top of pytest for deterministic tests and a separate async pipeline for LLM-based evaluations.
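For illustration, a stripped-down version of that kind of async evaluation loop might look like the following; `generate` and the evaluator callables are placeholders for your inference call and scoring functions, not our actual framework:

```python
# Sketch: a stripped-down async evaluation loop. `generate` and each evaluator
# are placeholders for your own inference call and scoring functions.
import asyncio


async def evaluate_example(example: dict, generate, evaluators: dict) -> dict:
    output = await generate(example["query"])
    scores = {name: fn(output, example) for name, fn in evaluators.items()}
    return {"example_id": example["example_id"], "output": output, **scores}


async def run_eval(golden_set: list[dict], generate, evaluators: dict,
                   concurrency: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)       # cap parallel inference calls

    async def bounded(example: dict) -> dict:
        async with sem:
            return await evaluate_example(example, generate, evaluators)

    return await asyncio.gather(*(bounded(e) for e in golden_set))
```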
Human Evaluation Platform: For labeling and quality sampling:
- Task assignment and routing
- Annotation interfaces
- Agreement calculation
- Label aggregation
- Quality control (trap questions)
Tools like Label Studio, Prodigy, or custom interfaces work. The key is workflow integration.
Dashboards: Visualize trends:
- Metric time series
- Slice-and-dice by dimensions
- Regression detection
- Correlation monitoring
We use Grafana with a custom backend that aggregates evaluation results.
Running Evaluations
On Every PR:
- Unit tests (fast, deterministic)
- Smoke tests on subset of golden set
- Performance regression tests
Nightly:
- Full golden set evaluation
- All automated metrics
- Comparison against previous day
Weekly:
- Human evaluation of production samples
- Correlation analysis
- Dataset health checks
On Model/Prompt Changes:
- A/B comparison against baseline
- Statistical significance testing
- Regression analysis across all dimensions
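One way to do the significance check is a paired bootstrap over per-example scores, shipping the change only if the confidence interval on the improvement excludes zero. The resample count and 95% interval below are conventional choices, not requirements:

```python
# Sketch: paired bootstrap test for "is the candidate better than the baseline?"
# Inputs are per-example scores (e.g. 0/1 pass/fail) aligned by golden example.
import numpy as np


def bootstrap_delta(baseline: np.ndarray, candidate: np.ndarray,
                    n_resamples: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = len(baseline)
    deltas = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # resample examples with replacement
        deltas[i] = candidate[idx].mean() - baseline[idx].mean()
    low, high = np.percentile(deltas, [2.5, 97.5])
    return deltas.mean(), (low, high)          # ship only if the 95% CI excludes 0
```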
Case Study: Improving Resolution Rate from 60% to 92%
Let me share a concrete example. We had a customer support assistant with 60% resolution rate. Here's how evaluation drove improvement:
Week 1: Diagnosis
- Ran full evaluation suite
- Found 40% failure rate was split: 25% retrieval failures, 15% generation failures
- Retrieval failures: wrong documents fetched for multi-step questions
- Generation failures: model confusing similar products
Week 2: Retrieval Improvement
- Added query decomposition for complex questions
- Implemented hybrid search (was using dense only)
- Evaluation showed retrieval recall improved from 72% to 89%
- Resolution rate improved to 71%
Week 3: Generation Improvement
- Analyzed generation failures in detail
- Added product disambiguation prompt engineering
- Included relevant product context in all responses
- Generation accuracy improved from 85% to 94%
- Resolution rate improved to 81%
Week 4: Edge Cases
- Reviewed remaining failures manually
- Found cluster of questions about deprecated products
- Added explicit handling for legacy product queries
- Final resolution rate: 92%
Each improvement was validated by our evaluation suite before production deployment. Without systematic evaluation, we'd have been guessing.
Common Mistakes
1. Relying on Vibes: "The outputs look good" is not evaluation. Build systematic measurement from day one.
2. Over-indexing on Automated Metrics: Automated metrics are proxies. Regularly validate against human judgment and business outcomes.
3. Static Golden Sets: Production changes. Update your golden sets quarterly or risk measuring performance on stale data.
4. Ignoring Latency: A perfect answer in 30 seconds is worse than a good answer in 2 seconds. Include latency in your evaluation.
5. Single-Dimensional Evaluation: Quality is multi-dimensional. A response can be accurate but unhelpful, or helpful but unsafe. Measure all dimensions.
6. No Failure Analysis: When metrics drop, dig into specific failures. Aggregate statistics hide actionable insights.
7. Testing in Production Only: By the time you see production failures, users are affected. Catch issues in offline evaluation.
Conclusion
Evaluation is the unglamorous foundation of production LLM systems. Without it, you're deploying hope, not software.
Build evaluation infrastructure early. Start with a small golden set and expand. Implement LLM-as-judge with careful prompt design. Monitor production continuously. Close the loop between offline and online metrics.
The teams that build robust evaluation capabilities ship better systems, iterate faster, and catch problems before users do. The investment pays for itself many times over.