LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.
The Evaluation Crisis
Here's an uncomfortable truth: most teams deploying LLMs in production don't actually know how well their systems work. They've seen impressive benchmark scores, run a few manual tests, and shipped. When things go wrong—and they will—they have no systematic way to understand why or measure improvement.
Academic benchmarks like MMLU, HellaSwag, and HumanEval are useful for comparing foundation models, but they're nearly useless for evaluating your specific application. A model that tops the leaderboards might fail catastrophically on your data, while a smaller model with lower benchmark scores might excel.
At Goji AI, we learned this lesson the hard way. Our first production LLM deployment had impressive demo metrics but a 40% failure rate on real user queries. The problem wasn't the model—it was our evaluation. We were measuring the wrong things.
This post shares the evaluation framework we've built over two years and millions of production queries. It's not glamorous work, but it's the difference between deploying AI systems that actually work and deploying expensive failures.
Why Benchmarks Fail
Distribution Mismatch
Benchmarks are constructed datasets with specific properties. Your production data has different distributions:
- User queries are messier, more ambiguous, and more diverse
- Your domain has specific terminology and context
- Edge cases appear that benchmark creators never imagined
A model optimized for clean benchmark questions may struggle with "hey can u help me wit my tax stuff" from a real user.
Metric Mismatch
Benchmark metrics optimize for narrow capabilities:
- Accuracy on multiple-choice questions
- Exact match on code generation
- BLEU/ROUGE on translation
Production success is multi-dimensional:
- Did the user accomplish their goal?
- Was the response safe and appropriate?
- Did it match the expected format?
- Was latency acceptable?
- What was the cost per query?
Contamination
Models may have seen benchmark data during training, either directly or through similar content. This inflates benchmark scores without improving real capability. Your proprietary data is, by definition, not contaminated—and that's where true performance shows.
The Evaluation Stack
A robust evaluation system has four layers. Each layer catches different types of issues, and you need all of them working together.
Why four layers instead of just testing the final output? Consider an analogy to software testing. You wouldn't test a web application only by clicking through the UI—you'd have unit tests for individual functions, integration tests for API endpoints, and end-to-end tests for user workflows. LLM systems are the same: component-level tests catch prompt bugs, model-level evaluations catch capability regressions, integration tests catch system issues, and production monitoring catches real-world edge cases.
The cost of missing a layer: Without unit tests, you'll waste evaluation compute on runs that fail due to template bugs. Without golden datasets, you won't notice gradual quality degradation. Without integration tests, you'll miss context window issues and tool failures. Without production monitoring, you'll be blind to the gap between your test distribution and reality.
Layer 1: Unit Tests (Component Level)
Test individual components with deterministic checks:
Prompt Template Tests:
- Does the template render correctly with various inputs?
- Are edge cases (empty inputs, special characters) handled?
- Do format instructions produce parseable outputs?
Tool Integration Tests:
- Do tools return expected responses for known inputs?
- Are errors handled gracefully?
- Do timeouts trigger correctly?
Output Parser Tests:
- Does parsing work for all expected formats?
- Are malformed outputs detected and handled?
These tests run in CI on every commit. They catch regressions in the deterministic parts of your system.
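To make this concrete, here is a minimal pytest sketch of the kind of component-level checks we mean. The `render_prompt` and `parse_output` functions are simplified stand-ins defined inline for illustration, not production code.

```python
# test_components.py -- minimal component-level checks with pytest.
# render_prompt and parse_output are simplified stand-ins for your own code.
import json
import pytest

PROMPT_TEMPLATE = "Answer the customer question.\nQuestion: {query}\nAnswer:"


def render_prompt(query: str) -> str:
    return PROMPT_TEMPLATE.format(query=query)


def parse_output(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable model output: {raw!r}") from exc
    if "answer" not in data:
        raise ValueError("missing required 'answer' field")
    return data


@pytest.mark.parametrize("query", ["refund status?", "", "açaí & <b>html</b>"])
def test_template_renders_for_edge_case_inputs(query):
    prompt = render_prompt(query)
    assert "{query}" not in prompt        # no unfilled placeholders
    assert prompt.endswith("Answer:")     # format instructions intact


def test_parser_rejects_malformed_output():
    with pytest.raises(ValueError):
        parse_output("not json at all")


def test_parser_accepts_expected_format():
    assert parse_output('{"answer": "42"}')["answer"] == "42"
```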
Layer 2: Evaluations (Model Level)
Test model behavior on curated datasets:
Golden Datasets: Curated examples with human-labeled correct outputs. For each example:
- Input query/context
- Expected output (or acceptable output range)
- Evaluation criteria
- Edge case labels
We maintain ~500 golden examples per major use case, stratified by:
- Query type (factual, analytical, creative)
- Difficulty (easy, medium, hard)
- Edge case category
- User segment
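One possible shape for a single golden record, sketched as a Python dataclass; the field names are illustrative, not a prescribed schema:

```python
# A possible shape for one golden-dataset record; field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    example_id: str
    query: str                      # input query/context
    expected_output: str            # or a description of the acceptable range
    criteria: list[str]             # evaluation criteria to apply
    query_type: str                 # factual / analytical / creative
    difficulty: str                 # easy / medium / hard
    edge_case_tags: list[str] = field(default_factory=list)
    user_segment: str = "general"
```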
Automated Evaluation: For each golden example, generate model output and evaluate:
| Evaluation Type | What It Measures | How |
|---|---|---|
| Exact Match | Deterministic outputs | String comparison |
| Semantic Similarity | Meaning preservation | Embedding cosine similarity |
| Rubric Scoring | Multi-dimensional quality | LLM-as-judge with rubric |
| Factuality | Claim accuracy | LLM-as-judge against sources |
| Format Compliance | Output structure | Schema validation |
| Safety | Harmful content | Classification models + rules |
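As a sketch of two evaluators from the table above, exact match and embedding-based semantic similarity, here is one way to implement them with sentence-transformers. The embedding model and the 0.85 pass threshold are assumptions you would tune on your own data:

```python
# Sketch: exact-match and semantic-similarity checks for one golden example.
# sentence-transformers is one possible embedding backend; the model name and
# the 0.85 threshold are assumptions to calibrate on your own data.
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


def semantic_similarity(output: str, expected: str) -> float:
    emb = _embedder.encode([output, expected], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def passes(output: str, expected: str, threshold: float = 0.85) -> bool:
    # Exact match short-circuits; otherwise fall back to semantic similarity.
    return exact_match(output, expected) or semantic_similarity(output, expected) >= threshold
```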
LLM-as-Judge: The most powerful technique for nuanced evaluation. A separate LLM evaluates outputs against criteria:
Evaluate the following response on these criteria (1-5 scale):
1. Relevance: Does it address the user's question?
2. Accuracy: Are all factual claims correct?
3. Completeness: Does it cover all necessary information?
4. Clarity: Is it well-organized and easy to understand?
5. Safety: Is it free from harmful content?
Query: {query}
Response: {response}
Reference: {reference}
Key considerations:
- Use a different model family as judge (Claude judging GPT outputs, or vice versa)
- Include reference answers for calibration
- Validate judge accuracy against human labels
- Watch for position bias, verbosity bias, and self-preference
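Below is a minimal sketch of such a judge call, using the OpenAI Python client as an example backend. The judge model, the JSON output contract, and the score parsing are assumptions; as noted above, the judge should come from a different model family than the generator being evaluated.

```python
# Sketch: LLM-as-judge scoring one response against a rubric.
# The judge model and the JSON output contract are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the response on a 1-5 scale for relevance, accuracy,
completeness, clarity, and safety. Explain your reasoning first, then answer
with JSON only, in the form:
{{"reasoning": "...", "relevance": 5, "accuracy": 5, "completeness": 5, "clarity": 5, "safety": 5}}

Query: {query}
Response: {response}
Reference: {reference}"""


def judge(query: str, response: str, reference: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",   # assumed judge model; use a different family than your generator
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response, reference=reference),
        }],
    )
    return json.loads(completion.choices[0].message.content)
```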
Layer 3: Integration Tests (System Level)
Test the complete system end-to-end:
Scenario Tests: Multi-turn conversations that test realistic workflows:
- User asks initial question
- System responds
- User follows up
- System maintains context and provides coherent response
Regression Tests: Specific examples that previously failed. Every production bug becomes a test case. Our regression suite has 2,000+ examples from past failures.
Performance Tests:
- Latency percentiles (p50, p95, p99) under load
- Throughput limits
- Behavior under degraded conditions (slow APIs, partial failures)
Layer 4: Production Monitoring (Live Level)
Monitor real-world performance:
Implicit Signals:
- Conversation length (longer may indicate struggle)
- Retry rate (users rephrasing questions)
- Abandonment rate
- Time to completion
Explicit Feedback:
- Thumbs up/down
- Star ratings
- Correction submissions
- Support escalations
Quality Sampling: Randomly sample N% of production queries for human evaluation. This grounds your automated metrics in reality. We sample 1% of queries daily, stratified by user segment and query type.
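One way to draw that stratified sample with pandas; the column names and the 1% rate are assumptions:

```python
# Sketch: stratified daily sample of production queries for human review.
# Column names ("user_segment", "query_type") and the 1% rate are assumptions.
import pandas as pd


def sample_for_review(queries: pd.DataFrame, frac: float = 0.01, seed: int = 0) -> pd.DataFrame:
    # Sample the same fraction within each (segment, query type) stratum so
    # rare strata still show up in the review queue.
    return (
        queries.groupby(["user_segment", "query_type"], group_keys=False)
               .sample(frac=frac, random_state=seed)
    )
```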
Drift Detection: Monitor for distribution shift:
- Input embedding drift (are queries changing?)
- Output characteristic drift (response length, format, tone)
- Metric drift (are automated scores changing?)
Alert on significant drift for investigation.
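As one simple (and admittedly crude) input-drift signal, you can track the cosine distance between the centroid of recent query embeddings and a reference window. The alert threshold below is an assumption to calibrate against known-good periods:

```python
# Sketch: a crude input-drift signal -- cosine distance between the centroid
# of current query embeddings and a reference window's centroid.
# The 0.1 alert threshold is an assumption; tune it on known-good weeks.
import numpy as np


def centroid_drift(reference_embs: np.ndarray, current_embs: np.ndarray) -> float:
    ref_c = reference_embs.mean(axis=0)
    cur_c = current_embs.mean(axis=0)
    cos = np.dot(ref_c, cur_c) / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return 1.0 - float(cos)


def should_alert(reference_embs: np.ndarray, current_embs: np.ndarray,
                 threshold: float = 0.1) -> bool:
    return centroid_drift(reference_embs, current_embs) > threshold
```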
Metrics That Matter
Primary Metrics (Business Outcomes)
What actually matters for your application:
| Application Type | Primary Metric | Target |
|---|---|---|
| Customer Support | Resolution rate without escalation | > 75% |
| Code Assistant | User acceptance rate | > 60% |
| Content Generation | Publication rate | > 80% |
| Data Analysis | Insight actionability (user survey) | > 4/5 |
| RAG System | Answer correctness (sampled) | > 90% |
These are your north star metrics. Everything else exists to predict or improve them.
Secondary Metrics (Quality Dimensions)
Multi-dimensional quality assessment:
Accuracy/Faithfulness: Are factual claims correct? For RAG systems, are claims grounded in retrieved context?
Measurement: LLM-as-judge against sources, human spot-check, citation verification.
Target: > 95% for high-stakes applications.
Relevance: Does the response address what the user actually asked?
Measurement: LLM-as-judge, semantic similarity to reference, user feedback.
Target: > 90%.
Completeness: Does it cover all necessary information without critical omissions?
Measurement: Checklist-based LLM evaluation, human review.
Target: Context-dependent.
Safety: Is it free from harmful, biased, or inappropriate content?
Measurement: Safety classifiers, rule-based filters, human review queue for edge cases.
Target: > 99.9% (false negatives are costly).
Format/Structure: Does output match expected schema and style?
Measurement: Automated parsing, schema validation.
Target: > 99%.
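A format-compliance check can be as simple as validating model output against a schema, for example with pydantic v2; the schema below is illustrative:

```python
# Sketch: format-compliance check with pydantic v2; the schema is illustrative.
from pydantic import BaseModel, ValidationError


class SupportAnswer(BaseModel):
    answer: str
    sources: list[str]
    confidence: float


def format_compliant(raw_json: str) -> bool:
    try:
        SupportAnswer.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False
```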
Operational Metrics
System health indicators:
| Metric | Description | Target |
|---|---|---|
| Latency P50 | Median response time | < 2s |
| Latency P95 | 95th percentile response time | < 5s |
| Error Rate | % of requests that fail | < 1% |
| Cost per Query | LLM API + compute costs | Application-specific |
| Token Efficiency | Output quality per input token | Trending up |
The Correlation Problem
The critical insight: your offline metrics must correlate with online outcomes.
We've seen systems where:
- LLM-as-judge scores improved by 10% but user satisfaction dropped
- Automated accuracy increased but support tickets spiked
- All metrics looked great but users churned
Run correlation analysis monthly:
- Sample production queries with explicit feedback
- Run all automated evaluations
- Calculate correlation coefficients
- If correlation < 0.7, your evaluation needs work
When offline and online metrics diverge, trust online metrics and fix your evaluation.
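A sketch of that monthly check, using Spearman rank correlation since explicit feedback is usually ordinal; the 0.7 bar mirrors the rule of thumb above:

```python
# Sketch: monthly offline/online correlation check.
# judge_scores: automated scores for sampled production queries.
# user_ratings: explicit feedback for the same queries, in the same order.
from scipy.stats import spearmanr


def offline_online_correlation(judge_scores: list[float], user_ratings: list[float]) -> float:
    rho, _p_value = spearmanr(judge_scores, user_ratings)
    return rho


# If rho < 0.7, revisit your evaluators before trusting them for launch decisions.
```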
Building Golden Datasets
Your golden dataset is the foundation of everything. Here's how to build one that actually works:
Collection Strategy
Seed from Production: Start with real user queries, not synthetic examples. Sample across:
- Time periods (capture temporal variation)
- User segments (beginners vs. experts)
- Query types (all categories you support)
- Difficulty levels (estimated or labeled)
Targeted Expansion: Identify coverage gaps through error analysis. If a certain query type keeps failing, add more examples.
Adversarial Examples: Include examples designed to break your system:
- Prompt injection attempts
- Ambiguous queries
- Out-of-scope requests
- Edge cases from production failures
Labeling Protocol
Clear Criteria: Write detailed rubrics for each evaluation dimension. Labelers should agree on what "good" means.
Multiple Raters: Use 3+ raters per example for subjective dimensions. Measure inter-rater reliability (Krippendorff's alpha > 0.8 is good).
Calibration Sessions: Regular meetings where raters discuss disagreements and align on standards.
Expert Review: Domain experts validate labels for specialized content.
Dataset Hygiene
Versioning: Track dataset versions. When you update labels or add examples, create a new version. Keep old versions for comparison.
Stratification: Ensure balanced representation across important dimensions. Use stratified sampling for evaluation.
Contamination Prevention: Never use golden examples for training, prompting, or few-shot examples. Keep evaluation data strictly held-out.
Regular Refresh: Production distribution shifts. Refresh 10-20% of your dataset quarterly with new production examples.
LLM-as-Judge: The Details
LLM-as-judge is powerful but requires careful implementation:
Prompt Design
Explicit Rubrics: Don't ask "is this good?" Ask for specific dimensions with clear criteria:
Rate the response on ACCURACY (1-5):
5: All claims are factually correct and verifiable
4: Nearly all claims correct, minor inaccuracies that don't affect usefulness
3: Mostly correct but contains at least one significant error
2: Multiple errors that substantially reduce reliability
1: Predominantly incorrect or misleading
Reference Answers: When available, provide reference answers for comparison. Judges are more reliable comparing than evaluating in isolation.
Chain-of-Thought: Ask the judge to explain reasoning before scoring. This improves consistency and provides debugging information.
Bias Mitigation
Position Bias: LLMs prefer responses in certain positions. When comparing two responses, run both orderings and aggregate.
Verbosity Bias: Longer responses often score higher regardless of quality. Normalize for length or explicitly instruct against this bias.
Self-Preference: Models prefer their own outputs. Use a different model family for judging.
Primacy/Recency: Information early or late in prompts gets more attention. Structure prompts carefully.
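For position bias specifically, a simple mitigation is to run each pairwise comparison in both orderings and only count consistent verdicts; `compare` below is a hypothetical judge call returning "A", "B", or "tie":

```python
# Sketch: pairwise comparison run in both orders to cancel position bias.
# compare(a, b) is a hypothetical judge call returning "A", "B", or "tie".
def debiased_preference(response_1: str, response_2: str, compare) -> str:
    first = compare(response_1, response_2)    # response_1 shown in position A
    second = compare(response_2, response_1)   # response_1 shown in position B

    if first == "A" and second == "B":
        return "response_1"                    # preferred in both orderings
    if first == "B" and second == "A":
        return "response_2"
    return "tie"                               # inconsistent verdicts count as a tie
```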
Calibration
Human Alignment: Calculate correlation between LLM-as-judge scores and human ratings on a held-out set. If correlation < 0.8, revise your prompts.
Score Distribution: Examine score distributions. If all scores are 4-5, your rubric isn't discriminative. If scores are bimodal, criteria may be ambiguous.
Failure Cases: Manually review cases where LLM-as-judge significantly disagrees with humans. These reveal prompt improvements.
Implementing Evaluation Infrastructure
Evaluation Pipeline Architecture
Production
│
▼
[Query Sampling] ──► [Golden Dataset Store]
│ │
▼ ▼
[Model Inference] ◄─── [Eval Set Loader]
│
▼
[Output Store] ──────► [Automated Evaluators]
│ │
▼ ▼
[Human Review Queue] [Metrics Store]
│ │
▼ ▼
[Label Store] ◄───────► [Dashboard/Alerting]
Tooling
Evaluation Framework: Use or build a harness that:
- Loads evaluation sets
- Runs inference in parallel
- Applies multiple evaluators
- Stores results with full provenance
- Generates reports
We use a custom framework built on top of pytest for deterministic tests and a separate async pipeline for LLM-based evaluations.
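For illustration, a stripped-down version of that kind of async evaluation loop might look like the following; `generate` and the evaluator callables are placeholders for your inference call and scoring functions, not our actual framework:

```python
# Sketch: a stripped-down async evaluation loop. `generate` and each evaluator
# are placeholders for your own inference call and scoring functions.
import asyncio


async def evaluate_example(example: dict, generate, evaluators: dict) -> dict:
    output = await generate(example["query"])
    scores = {name: fn(output, example) for name, fn in evaluators.items()}
    return {"example_id": example["example_id"], "output": output, **scores}


async def run_eval(golden_set: list[dict], generate, evaluators: dict,
                   concurrency: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)       # cap parallel inference calls

    async def bounded(example: dict) -> dict:
        async with sem:
            return await evaluate_example(example, generate, evaluators)

    return await asyncio.gather(*(bounded(e) for e in golden_set))
```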
Human Evaluation Platform: For labeling and quality sampling:
- Task assignment and routing
- Annotation interfaces
- Agreement calculation
- Label aggregation
- Quality control (trap questions)
Tools like Label Studio, Prodigy, or custom interfaces work. The key is workflow integration.
Dashboards: Visualize trends:
- Metric time series
- Slice-and-dice by dimensions
- Regression detection
- Correlation monitoring
We use Grafana with a custom backend that aggregates evaluation results.
Running Evaluations
On Every PR:
- Unit tests (fast, deterministic)
- Smoke tests on subset of golden set
- Performance regression tests
Nightly:
- Full golden set evaluation
- All automated metrics
- Comparison against previous day
Weekly:
- Human evaluation of production samples
- Correlation analysis
- Dataset health checks
On Model/Prompt Changes:
- A/B comparison against baseline
- Statistical significance testing
- Regression analysis across all dimensions
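One way to do the significance check is a paired bootstrap over per-example scores, shipping the change only if the confidence interval on the improvement excludes zero. The resample count and 95% interval below are conventional choices, not requirements:

```python
# Sketch: paired bootstrap test for "is the candidate better than the baseline?"
# Inputs are per-example scores (e.g. 0/1 pass/fail) aligned by golden example.
import numpy as np


def bootstrap_delta(baseline: np.ndarray, candidate: np.ndarray,
                    n_resamples: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = len(baseline)
    deltas = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # resample examples with replacement
        deltas[i] = candidate[idx].mean() - baseline[idx].mean()
    low, high = np.percentile(deltas, [2.5, 97.5])
    return deltas.mean(), (low, high)          # ship only if the 95% CI excludes 0
```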
Case Study: Improving Resolution Rate from 60% to 92%
Let me share a concrete example. We had a customer support assistant with 60% resolution rate. Here's how evaluation drove improvement:
Week 1: Diagnosis
- Ran full evaluation suite
- Found 40% failure rate was split: 25% retrieval failures, 15% generation failures
- Retrieval failures: wrong documents fetched for multi-step questions
- Generation failures: model confusing similar products
Week 2: Retrieval Improvement
- Added query decomposition for complex questions
- Implemented hybrid search (was using dense only)
- Evaluation showed retrieval recall improved from 72% to 89%
- Resolution rate improved to 71%
Week 3: Generation Improvement
- Analyzed generation failures in detail
- Added product disambiguation prompt engineering
- Included relevant product context in all responses
- Generation accuracy improved from 85% to 94%
- Resolution rate improved to 81%
Week 4: Edge Cases
- Reviewed remaining failures manually
- Found cluster of questions about deprecated products
- Added explicit handling for legacy product queries
- Final resolution rate: 92%
Each improvement was validated by our evaluation suite before production deployment. Without systematic evaluation, we'd have been guessing.
Common Mistakes
1. Relying on Vibes: "The outputs look good" is not evaluation. Build systematic measurement from day one.
2. Over-indexing on Automated Metrics: Automated metrics are proxies. Regularly validate against human judgment and business outcomes.
3. Static Golden Sets: Production changes. Update your golden sets quarterly or risk measuring performance on stale data.
4. Ignoring Latency: A perfect answer in 30 seconds is worse than a good answer in 2 seconds. Include latency in your evaluation.
5. Single-Dimensional Evaluation: Quality is multi-dimensional. A response can be accurate but unhelpful, or helpful but unsafe. Measure all dimensions.
6. No Failure Analysis: When metrics drop, dig into specific failures. Aggregate statistics hide actionable insights.
7. Testing in Production Only: By the time you see production failures, users are affected. Catch issues in offline evaluation.
Conclusion
Evaluation is the unglamorous foundation of production LLM systems. Without it, you're deploying hope, not software.
Build evaluation infrastructure early. Start with a small golden set and expand. Implement LLM-as-judge with careful prompt design. Monitor production continuously. Close the loop between offline and online metrics.
The teams that build robust evaluation capabilities ship better systems, iterate faster, and catch problems before users do. The investment pays for itself many times over.