
Testing LLM Applications: A Practical Guide for Production Systems

Testing LLM-powered applications requires fundamentally different approaches than traditional software testing. The probabilistic nature of language models, the cost of API calls, and the challenge of evaluating semantic correctness all demand specialized strategies. Yet rigorous testing remains essential—perhaps more so, given how easily subtle prompt changes can break production systems in unexpected ways.

This guide provides practical patterns for testing LLM applications at every level, from fast unit tests that never touch an API to comprehensive evaluation frameworks that measure semantic quality. We'll cover the 2025 landscape of testing tools and techniques that enable teams to ship reliable AI applications with confidence.


Why LLM Testing Is Fundamentally Different

Traditional software testing relies on deterministic behavior: given input X, expect output Y. LLM applications break this assumption in several fundamental ways that require adapted testing strategies.

Non-Deterministic Outputs

Even with temperature set to zero, model outputs can vary between calls. The same prompt might produce "Here's the answer:" in one response and "The answer is:" in another—both semantically correct but different strings. As a result, traditional equality assertions produce spurious failures, and tests must be designed to check semantic meaning rather than exact text matches.

The variation increases with higher temperatures used for creative tasks. A test that passes one day might fail the next simply due to natural model variation, not because of any actual regression. This randomness isn't a bug—it's inherent to how language models work—but it requires fundamentally different assertion strategies.

Cost Considerations

Running a thousand test cases against GPT-4 costs real money—potentially hundreds of dollars for a comprehensive test suite. Unlike traditional unit tests that execute instantly and freely, LLM integration tests consume API credits. Teams must carefully balance test coverage against testing budgets, using cheaper models in CI and reserving expensive models for critical validation.

This cost pressure creates a natural incentive to minimize LLM calls during testing, which leads to the testing pyramid structure where most tests avoid API calls entirely.

Latency Impact

A single LLM call typically takes 500 milliseconds to 5 seconds depending on the model and response length. A test suite with 100 integration tests becomes painfully slow—potentially 10+ minutes just for LLM latency. This makes the traditional "run all tests on every commit" approach impractical for LLM-heavy test suites.

The latency problem compounds with agent tests that involve multiple LLM calls per test case. A single agent workflow test might require 5-10 sequential LLM calls, making individual test runs take 30+ seconds.

Semantic Correctness

Testing whether a summary is "good" isn't a simple equality check. The model might produce a perfectly valid summary using completely different wording than any expected output. Traditional assertions can't capture this—you need fuzzy matching, semantic similarity comparisons, or LLM-as-judge approaches where another model evaluates the output quality.

This semantic evaluation challenge applies to most LLM outputs: answering questions, generating content, extracting information, and making decisions. The output needs to be evaluated for meaning and quality, not just string matching.


The Testing Pyramid for LLM Applications

The classic testing pyramid applies to LLM applications, but with different proportions and cost considerations than traditional software:

Unit Tests (80% of your tests) form the base of the pyramid. These tests never call an LLM—they're fast, free, and run on every commit. They cover prompt templates, output parsers, tool schemas, validation logic, and all the code around your LLM calls.

Integration Tests (15% of your tests) make real LLM calls to verify actual model behavior. They use cheaper models in CI (like GPT-4o-mini instead of GPT-4o), run sparingly (on main branch merges, not every PR), and track costs to stay within budget.

End-to-End Tests (5% of your tests) exercise complete workflows—full agent loops, multi-step reasoning, production-like scenarios. These are slow, expensive, and reserved for critical paths and pre-release validation. They might run weekly or only before major releases.

This distribution means most of your testing happens without touching an LLM at all. The key insight is that most LLM application code—prompt construction, response parsing, tool execution, state management—doesn't require live model calls to test effectively.
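
This tiering is easy to encode in the test runner itself. Below is a minimal sketch using pytest markers; the marker names and the LLM_TEST_TIER environment variable are illustrative assumptions rather than a fixed convention.

```python
# conftest.py -- illustrative sketch; marker names and the env var are assumptions
import os
import pytest

def pytest_configure(config):
    config.addinivalue_line("markers", "integration: tests that make real LLM calls")
    config.addinivalue_line("markers", "e2e: slow end-to-end workflow tests")

def pytest_collection_modifyitems(config, items):
    # Skip paid tiers unless explicitly enabled, so a plain `pytest` run stays fast and free.
    tier = os.getenv("LLM_TEST_TIER", "unit")
    skip_integration = pytest.mark.skip(reason="set LLM_TEST_TIER=integration to run")
    skip_e2e = pytest.mark.skip(reason="set LLM_TEST_TIER=e2e to run")
    for item in items:
        if "e2e" in item.keywords and tier != "e2e":
            item.add_marker(skip_e2e)
        elif "integration" in item.keywords and tier not in ("integration", "e2e"):
            item.add_marker(skip_integration)
```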


Unit Testing: Fast, Free, and Comprehensive

The majority of LLM application code doesn't actually call an LLM. Prompt templates are string-building functions. Output parsers convert text to structured data. Tool definitions specify schemas. Validation logic checks inputs and outputs. All of this can be tested without API calls.

Testing Prompt Templates

Prompt templates combine static instructions with dynamic data to create the final prompt sent to the model. Testing them is straightforward string testing: verify that required elements appear in the output, that variable substitution works correctly, and that edge cases (empty inputs, special characters, very long inputs) are handled gracefully.

Key things to test include: persona and role instructions appear correctly, all provided tools are listed, user input is properly escaped and delimited, context and conversation history are formatted correctly, and length limits are respected. These tests run in milliseconds and catch many prompt bugs before they reach the model.
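
As a sketch of what these checks can look like, assuming a hypothetical build_agent_prompt helper in your application code:

```python
# test_prompt_templates.py -- sketch; build_agent_prompt and its signature are hypothetical
from myapp.prompts import build_agent_prompt  # assumed application module

def test_prompt_includes_persona_tools_and_delimited_input():
    prompt = build_agent_prompt(
        user_input="What's the weather in Paris?",
        tools=["search", "calculator"],
        history=[],
    )
    assert "You are a helpful assistant" in prompt          # persona instruction present
    assert "search" in prompt and "calculator" in prompt    # every provided tool is listed
    assert "<user_input>" in prompt and "</user_input>" in prompt  # user input is delimited

def test_prompt_handles_empty_and_oversized_input():
    assert build_agent_prompt(user_input="", tools=[], history=[])  # no crash on empty input
    long_prompt = build_agent_prompt(user_input="x" * 100_000, tools=[], history=[])
    assert len(long_prompt) < 400_000  # truncation keeps the prompt bounded
```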

Testing Output Parsers

Output parsers convert LLM text responses into structured data your application can use. They must handle the full range of model outputs—well-formatted JSON, JSON wrapped in markdown code blocks, partial responses, and complete garbage when the model fails to follow instructions.

Parser tests should verify successful parsing of well-formed responses, graceful handling of wrapped or prefixed content (models often add "Here's the JSON:" before the actual JSON), appropriate error handling for malformed responses, and extraction of the right fields even when models use slightly different key names. Since parser inputs are just strings, these tests are fast and deterministic.
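
A minimal sketch of such tests, assuming a hypothetical parse_json_response function that strips chatter and markdown fences before parsing:

```python
# test_output_parser.py -- sketch; parse_json_response is a hypothetical parser
import pytest
from myapp.parsing import parse_json_response  # assumed application module

def test_parses_clean_json():
    raw = '{"action": "search", "query": "weather in Paris"}'
    assert parse_json_response(raw) == {"action": "search", "query": "weather in Paris"}

def test_parses_json_wrapped_in_chatter_and_code_fences():
    fence = "`" * 3  # build the markdown fence without embedding one literally
    raw = f"Here's the JSON:\n{fence}json\n" + '{"action": "search", "query": "weather"}' + f"\n{fence}"
    assert parse_json_response(raw)["action"] == "search"

def test_malformed_response_raises_a_clear_error():
    with pytest.raises(ValueError):
        parse_json_response("I'm sorry, I can't help with that.")
```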

Testing Tool Schemas

If you're using structured tool definitions (Pydantic models, JSON schemas), test that they validate inputs correctly. This includes accepting valid inputs with all required fields, rejecting invalid inputs with appropriate error messages, applying constraints (min/max values, string patterns, required fields), and coercing types appropriately where expected.

Schema tests catch bugs where your tool definitions don't match what the model actually produces, or where edge cases bypass validation. They're especially important for tools that perform sensitive operations—you want tight validation before executing anything.
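
For example, with Pydantic v2 a schema for a hypothetical refund tool might be exercised like this (the model and its bounds are illustrative):

```python
# test_tool_schemas.py -- sketch using Pydantic v2; the SendRefund tool model is hypothetical
import pytest
from pydantic import BaseModel, Field, ValidationError

class SendRefund(BaseModel):
    order_id: str = Field(min_length=1)
    amount: float = Field(gt=0, le=500)  # tight bounds before executing anything sensitive
    reason: str

def test_valid_input_is_accepted_and_coerced():
    call = SendRefund.model_validate({"order_id": "A123", "amount": "49.99", "reason": "damaged"})
    assert call.amount == 49.99  # string amount coerced to float

def test_out_of_range_and_missing_fields_are_rejected():
    with pytest.raises(ValidationError):
        SendRefund.model_validate({"order_id": "A123", "amount": 10_000, "reason": "damaged"})
    with pytest.raises(ValidationError):
        SendRefund.model_validate({"order_id": "A123"})
```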

Testing Token Counting and Context Limits

Context overflow is a common production failure mode. Test that your prompt construction respects token limits: calculate expected token counts for typical prompts, verify truncation logic works correctly, ensure critical content (system instructions, current query) is preserved when truncation is needed, and test behavior near context boundaries.

Token counting functions are deterministic and fast to test. Building a comprehensive suite of context management tests prevents production failures when users provide unexpectedly long inputs.
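
A sketch of such a test, assuming a hypothetical truncate_history helper and using tiktoken on the assumption that you target OpenAI-style models:

```python
# test_context_limits.py -- sketch; truncate_history is a hypothetical helper
import tiktoken
from myapp.context import truncate_history  # assumed application module

ENC = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(ENC.encode(text))

def test_truncation_respects_budget_and_keeps_critical_content():
    system = "You are a support agent."
    query = "Where is my order?"
    history = [f"turn {i}: " + "filler " * 200 for i in range(50)]  # deliberately oversized

    prompt = truncate_history(system=system, history=history, query=query, max_tokens=4_000)

    assert count_tokens(prompt) <= 4_000          # stays within the context budget
    assert system in prompt and query in prompt   # critical content survives truncation
    assert "turn 49" in prompt                    # most recent turns are kept over the oldest
```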


Mocking LLM Responses

For tests that verify LLM interaction logic without making real API calls, mocking provides a fast, free, deterministic alternative. The goal is testing your application's handling of model responses, not the model itself.

When to Use Mocks

Mocks are appropriate when testing how your application handles different response types (successful responses, tool calls, refusals), error conditions (API failures, rate limits, timeouts), streaming behavior (chunk handling, partial responses), and multi-turn conversation logic (context accumulation, state management).

Mocks are not appropriate when testing actual model behavior, prompt effectiveness, or semantic quality—those require real LLM calls.

Effective Mocking Strategies

Create mock fixtures that return realistic response structures matching the actual API format. For OpenAI-style APIs, this means mocking the nested structure of choices, messages, and tool calls. For streaming, mock an iterator that yields chunks with realistic timing simulation.

The key is making mocks realistic enough that your code exercises the same paths it would with real responses. Poorly structured mocks can pass tests while hiding bugs that appear with real API responses.
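
A minimal sketch of this approach for an OpenAI-style chat API, where answer_question is a hypothetical application function that takes an injected client:

```python
# test_llm_handling.py -- sketch; answer_question and its client usage are hypothetical
from types import SimpleNamespace
from unittest.mock import MagicMock

from myapp.qa import answer_question  # assumed: calls client.chat.completions.create(...)

def fake_completion(content: str, tool_calls=None):
    # Mirror the nested shape of an OpenAI-style response: choices -> message -> content
    message = SimpleNamespace(content=content, tool_calls=tool_calls)
    return SimpleNamespace(choices=[SimpleNamespace(message=message, finish_reason="stop")])

def test_successful_response_is_returned_to_the_caller():
    client = MagicMock()
    client.chat.completions.create.return_value = fake_completion("Paris is the capital of France.")
    assert "Paris" in answer_question(client, "What is the capital of France?")

def test_refusal_is_surfaced_as_a_fallback_message():
    client = MagicMock()
    client.chat.completions.create.return_value = fake_completion("I can't help with that request.")
    assert answer_question(client, "do something disallowed") != ""
```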

Recording and Replaying

VCR-style testing records real API responses and replays them in future test runs. This provides the realism of real responses with the speed and determinism of mocks. Record responses for representative scenarios, then replay them to catch regressions.

The tradeoff is that recorded responses can become stale as models update. Establish a cadence for refreshing recordings—perhaps monthly or when model versions change.
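
One way to set this up is with the vcrpy library; the cassette directory, record mode, and header filtering shown here are assumptions to adapt to your setup:

```python
# test_recorded_responses.py -- sketch using vcrpy; record once, replay thereafter
import vcr

recorder = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    filter_headers=["authorization"],   # never write API keys into cassettes
    record_mode="once",                 # hit the real API only when no cassette exists
)

@recorder.use_cassette("summary_basic.yaml")
def test_summary_against_recorded_response():
    from myapp.summarize import summarize  # assumed: makes a real HTTP call underneath
    result = summarize("Long article text ...")
    assert len(result) < 500
```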


Integration Testing: Real LLM Calls with Cost Control

Some tests must use real LLM calls to verify actual model behavior. The key is making them cost-effective and strategically valuable.

Cost-Controlled Test Execution

Structure your test suite to control costs through several mechanisms:

Model tiering: Use cheaper models (GPT-4o-mini, Claude 3.5 Haiku) for most integration tests. Reserve expensive models (GPT-4o, Claude 3.5 Sonnet) for critical tests where capability differences matter.

Selective execution: Mark integration tests so they can be skipped in routine CI runs. Run them on main branch merges rather than every pull request. Use environment variables to control which test tiers execute.

Token limits: Set conservative max_tokens limits in tests. Most test scenarios don't need 4000-token responses—100-500 tokens usually suffices for verification.

Batching: Where possible, batch multiple test queries into single API calls, or design tests that verify multiple aspects of a single response.
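
These mechanisms can be wired together with a small amount of configuration. A sketch, where the tier names, environment variable, and defaults are all assumptions:

```python
# llm_test_config.py -- sketch; tier names, env var, and defaults are assumptions
import os

# Cheap models for routine CI, expensive models only for critical validation runs.
MODEL_TIERS = {
    "ci": {"model": "gpt-4o-mini", "max_tokens": 300},
    "critical": {"model": "gpt-4o", "max_tokens": 800},
}

def llm_test_settings() -> dict:
    tier = os.getenv("LLM_TEST_MODEL_TIER", "ci")
    return MODEL_TIERS[tier]

# In a test:
#   settings = llm_test_settings()
#   client.chat.completions.create(model=settings["model"],
#                                  max_tokens=settings["max_tokens"], ...)
```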

Assertion Strategies for Non-Deterministic Outputs

Since exact string matching fails with LLM outputs, use assertion strategies that accommodate variation:

Key concept checking verifies that responses contain expected concepts without requiring exact wording. If testing a summary, check that key entities and themes appear, not that specific sentences match.

Structural validation ensures responses have expected structure—proper JSON, required fields present, values within expected ranges—without checking exact content.

Constraint checking verifies outputs meet specified constraints: length limits, format requirements, inclusion/exclusion of specific elements.

Semantic similarity uses embeddings to verify that responses are semantically similar to expected outputs, even with different wording. A similarity threshold (e.g., 0.85) allows for natural variation while catching major deviations.
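
A sketch of a semantic-similarity assertion built on OpenAI embeddings and cosine similarity; the model choice and the 0.85 threshold are illustrative:

```python
# semantic_assert.py -- sketch; model name and threshold are illustrative
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assert_semantically_similar(actual: str, expected: str, threshold: float = 0.85):
    response = client.embeddings.create(model="text-embedding-3-small", input=[actual, expected])
    vec_actual = np.array(response.data[0].embedding)
    vec_expected = np.array(response.data[1].embedding)
    score = cosine_similarity(vec_actual, vec_expected)
    assert score >= threshold, f"semantic similarity {score:.2f} below threshold {threshold}"
```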

Response Quality Thresholds

Rather than pass/fail assertions, integration tests can use quality thresholds. A test might require responses to score above 0.7 on relevance, or to contain at least 3 of 5 expected elements. This accommodates LLM variation while still catching serious regressions.

Track threshold metrics over time to identify gradual degradation that might not trigger hard failures but indicates declining quality.


LLM-as-Judge Evaluation

One of the most powerful techniques for evaluating LLM outputs is using another LLM as a judge. Research shows that state-of-the-art judge models can agree with human judgments up to 85% of the time—higher than the typical human-to-human agreement rate of roughly 81%.

How LLM-as-Judge Works

The pattern involves sending the original query, the generated response, and evaluation criteria to a judge model. The judge returns a score and reasoning. This enables automated evaluation of subjective qualities like helpfulness, relevance, accuracy, and tone that would otherwise require human review.

The judge model should be different from the model being evaluated to avoid self-evaluation bias. Using GPT-4 to judge GPT-4 outputs can inflate scores; using Claude to judge GPT-4 (or vice versa) provides more objective assessment.
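
A minimal sketch of the pattern; the prompt wording, five-point scale, and choice of judge model are assumptions rather than a prescribed setup:

```python
# judge.py -- minimal direct-scoring sketch
import json
from openai import OpenAI

judge = OpenAI()  # ideally a different model family than the one being evaluated

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Question: {question}
Response: {response}

Criteria: {criteria}

First explain your reasoning, then output JSON on the final line:
{{"score": <1-5>, "reasoning": "<one sentence>"}}"""

def judge_response(question: str, response: str, criteria: str) -> dict:
    completion = judge.chat.completions.create(
        model="gpt-4o",          # judge model; pick a different family than the generator
        temperature=0.1,         # near-deterministic judging
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response, criteria=criteria)}],
    )
    last_line = completion.choices[0].message.content.strip().splitlines()[-1]
    return json.loads(last_line)  # e.g. {"score": 4, "reasoning": "..."}
```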

Evaluation Approaches

Direct scoring evaluates a single response against criteria, returning a numeric score. This is efficient and suitable for objective assessments like faithfulness checking or format compliance. The judge sees only one response and rates it independently.

Pairwise comparison presents two responses and asks the judge which is better. Research shows pairwise comparisons lead to more stable results with smaller differences between LLM judgments and human annotations. This is particularly effective for subjective qualities like persuasiveness or coherence.

Improving Judge Accuracy

Several techniques significantly improve LLM-as-judge accuracy:

Chain-of-thought prompting asks the judge to explain its reasoning before providing a score. This improves accuracy because the final score is supported by explicit reasoning, and obvious errors in reasoning can be caught.

Few-shot examples provide calibration for the judge. Research from Databricks shows that providing 2-3 examples per score level can improve accuracy by 25-30% compared to zero-shot evaluation.

Rubric design matters significantly. Binary evaluations (Pass/Fail) are most reliable. Three-point scales (Excellent/Acceptable/Poor) work well. Ten-point or hundred-point scales without clear examples lead to inconsistent scoring.

Low temperature (0.1 or lower) ensures deterministic judge outputs. However, extremely low temperatures may bias toward lower scores, so some calibration is needed.

Known Biases and Mitigations

LLM judges exhibit predictable biases that can be mitigated:

Position bias causes judges to favor responses based on presentation order rather than quality. Mitigation: evaluate both orderings (A,B) and (B,A) and average or compare results.

Verbosity bias causes judges to rate longer responses higher regardless of quality. Mitigation: explicitly instruct the judge to reward conciseness, or normalize for length in scoring.

Self-preference bias causes models to prefer outputs similar to what they would generate. Mitigation: use a different model family for judging than for generation.
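
The position-bias mitigation above is mechanical enough to sketch directly; judge_pair is a hypothetical function that returns "A" or "B" for a single ordering:

```python
# pairwise_judge.py -- sketch of order-swapped comparison to counter position bias
def judge_both_orders(question: str, response_1: str, response_2: str, judge_pair) -> str:
    first = judge_pair(question, a=response_1, b=response_2)   # ordering (A, B)
    second = judge_pair(question, a=response_2, b=response_1)  # ordering (B, A)

    if first == "A" and second == "B":
        return "response_1"   # wins regardless of position
    if first == "B" and second == "A":
        return "response_2"
    return "tie"              # verdict flipped with position, so treat as no clear winner
```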


The 2025 LLM Testing Tool Landscape

Several specialized frameworks have emerged for LLM testing, each with different strengths and approaches.

DeepEval

DeepEval is an open-source evaluation framework designed specifically for testing LLM outputs. It provides a pytest-like experience specialized for LLM applications, with 60+ built-in metrics covering prompt quality, RAG faithfulness, conversation coherence, and safety.

Key strengths include its code-first approach that integrates naturally into existing test suites, extensive metric library including G-Eval and task completion measures, built-in red teaming capabilities for security testing, and clean APIs that minimize setup friction.

DeepEval is particularly strong for teams that want programmatic control over their evaluation logic and prefer open-source solutions they can customize and extend.
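
A sketch of what a DeepEval test can look like; metric names and signatures may vary between versions, so treat this as illustrative and check the current documentation:

```python
# test_with_deepeval.py -- illustrative sketch of a DeepEval-style test
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from myapp.qa import generate_answer  # assumed application entry point

def test_answer_relevancy():
    question = "What is your refund policy?"
    test_case = LLMTestCase(
        input=question,
        actual_output=generate_answer(question),
    )
    # LLM-as-judge metric with a quality threshold rather than exact matching
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```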

Promptfoo

Promptfoo is a developer-focused testing tool emphasizing red teaming and security testing alongside performance evaluation. It uses YAML configuration for defining test cases, making it easy to specify large test suites declaratively.

Key strengths include its red teaming capabilities—Promptfoo can probe prompts for vulnerabilities, test for prompt injections, check for PII leaks, and identify edge cases that break guardrails. Its native CI/CD integration through GitHub Actions makes it straightforward to incorporate into deployment pipelines.

Promptfoo is particularly valuable for security-conscious teams and those who prefer declarative test configuration over programmatic test code.

Braintrust

Braintrust provides an end-to-end platform for LLM evaluation with strong CI/CD integration. Its GitHub Action automatically runs experiments and posts detailed comparisons directly on pull requests, showing score breakdowns without requiring custom code.

Key strengths include comprehensive integration support for major frameworks, production-ready infrastructure, and team collaboration features. The platform approach means less setup work but also less flexibility for custom evaluation logic.

Braintrust is well-suited for teams wanting a managed solution with minimal setup, particularly those already using supported frameworks.

RAGAs

RAGAs specializes in evaluating Retrieval-Augmented Generation systems. It calculates metrics specifically relevant to RAG: context relevancy (is retrieved content relevant to the query?), faithfulness (is the response grounded in retrieved content?), and answer relevance (does the response actually answer the question?).

Key strengths include purpose-built RAG metrics, integration with common RAG frameworks, and evaluation of both retrieval and generation quality separately.

RAGAs is essential for teams building RAG systems who need to understand whether failures come from retrieval, generation, or both.

Choosing the Right Tool

The choice depends on your priorities:

  • Open-source preference with extensive metrics: DeepEval
  • Security and red teaming focus: Promptfoo
  • Managed platform with minimal setup: Braintrust
  • RAG-specific evaluation: RAGAs
  • Combination approach: Many teams use multiple tools—DeepEval for unit-level metrics, Promptfoo for security testing, RAGAs for RAG evaluation

Regression Testing for Prompts

Prompt changes can have unexpected downstream effects. A minor wording adjustment might dramatically change model behavior on edge cases. Regression testing catches these changes before they reach production.

Building Test Case Suites

Effective regression suites include:

Representative queries covering the full range of expected user inputs—simple questions, complex multi-part requests, edge cases, and known difficult inputs.

Expected behavior specifications for each query: what tool should be called, what type of response is appropriate, what elements must be present or absent.

Historical failure cases that previously caused issues. Every production bug should become a regression test to prevent recurrence.

Adversarial inputs that test robustness: prompt injection attempts, confusing queries, requests outside scope.

Store test cases in structured formats (YAML, JSON) that can be version-controlled alongside prompts. When prompts change, the same test suite validates that behavior remains correct.
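
A sketch of a parametrized regression suite driven by such a file; the YAML layout and the run_agent entry point are assumptions:

```python
# test_prompt_regression.py -- sketch; file layout and run_agent are assumptions
import pytest
import yaml

from myapp.agent import run_agent  # assumed application entry point

# tests/cases/routing.yaml might contain entries like:
#   - query: "What's 15% of 2,340?"
#     expected_tool: calculator
#     must_include: ["351"]
with open("tests/cases/routing.yaml") as f:
    CASES = yaml.safe_load(f)

@pytest.mark.integration
@pytest.mark.parametrize("case", CASES, ids=lambda c: c["query"][:40])
def test_prompt_regression(case):
    result = run_agent(case["query"])
    assert result.tool_called == case["expected_tool"]
    for fragment in case.get("must_include", []):
        assert fragment in result.final_answer
```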

Baseline Comparison

Track evaluation metrics over time and fail tests when metrics regress beyond a threshold. If your routing accuracy is typically 92%, fail the build if it drops below 87%. This catches gradual degradation that wouldn't trigger individual test failures.

Baselines should be updated deliberately, not automatically. When metrics improve, explicitly update the baseline. This creates an audit trail of expected quality levels.
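
A sketch of a baseline check; the baseline file, the routing_accuracy fixture, and the five-point regression budget are assumptions:

```python
# test_baseline.py -- sketch; baseline file and regression budget are assumptions
import json

REGRESSION_BUDGET = 0.05  # fail if accuracy drops more than 5 points below baseline

def test_routing_accuracy_against_baseline(routing_accuracy):  # fixture computing current accuracy
    with open("tests/baselines.json") as f:
        baseline = json.load(f)["routing_accuracy"]  # e.g. 0.92, updated deliberately
    assert routing_accuracy >= baseline - REGRESSION_BUDGET, (
        f"routing accuracy {routing_accuracy:.2f} regressed past baseline {baseline:.2f}"
    )
```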

Prompt Change Workflow

Before merging prompt changes:

  1. Run the full regression suite against the new prompt
  2. Compare metrics to baseline—no regression beyond threshold
  3. Review any cases where behavior changed (even if still passing)
  4. Update baseline if metrics improved
  5. Add new test cases for any new capabilities or edge cases

This workflow catches prompt regressions before they affect users and builds confidence that changes are safe.


Testing Agents and Multi-Step Workflows

Agents combine multiple components—LLM calls, tool executions, state management—into complex workflows. Testing them requires strategies at multiple granularities.

Testing Individual Tools

Each tool should be tested independently of the LLM and agent framework. Tool tests verify that the tool executes correctly given valid inputs, handles errors gracefully (API failures, invalid data), respects resource limits (timeouts, rate limits), and returns properly formatted outputs the agent can process.

Tool tests don't require LLM calls—they test the tool implementation itself. This isolation makes debugging easier: if an agent fails, you can determine whether the problem is tool execution or LLM decision-making.
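
A sketch of such tests for a hypothetical weather tool, including an injected upstream failure:

```python
# test_tools.py -- sketch; the weather tool and its error contract are hypothetical
import pytest
from myapp.tools import get_weather, ToolError  # assumed application module

def test_tool_returns_structured_output_for_valid_input():
    result = get_weather(city="Paris")
    assert set(result) >= {"temperature_c", "conditions"}  # agent-consumable fields present

def test_tool_surfaces_upstream_failure_as_a_clean_error(monkeypatch):
    def boom(*args, **kwargs):
        raise TimeoutError("upstream weather API timed out")
    monkeypatch.setattr("myapp.tools._fetch_weather", boom)  # assumed internal HTTP helper
    with pytest.raises(ToolError):                           # not a raw traceback the agent can't use
        get_weather(city="Paris")
```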

Testing Tool Selection

Integration tests verify that the LLM selects appropriate tools for different queries. Given a factual question, does the agent choose search? Given a calculation, does it choose the calculator? These tests require real LLM calls but can use cheaper models since they're testing routing logic, not tool execution quality.

Build a matrix of query types and expected tool selections. This becomes your routing regression suite—any prompt change that affects tool selection should show up in these tests.
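
A sketch of a routing matrix as a parametrized integration test; run_agent_once and the expected tools are assumptions:

```python
# test_tool_selection.py -- sketch; run_agent_once and the routing matrix are assumptions
import pytest
from myapp.agent import run_agent_once  # assumed: one LLM call, returns the chosen tool

ROUTING_CASES = [
    ("Who won the 2022 World Cup?", "search"),
    ("What is 17 * 243?", "calculator"),
    ("Summarize this document for me.", "summarizer"),
]

@pytest.mark.integration
@pytest.mark.parametrize("query,expected_tool", ROUTING_CASES)
def test_agent_picks_the_expected_tool(query, expected_tool):
    # Routing only needs a cheap model: we're testing the decision, not answer quality
    decision = run_agent_once(query, model="gpt-4o-mini", max_tokens=100)
    assert decision.tool == expected_tool
```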

Testing Full Trajectories

End-to-end tests exercise complete agent workflows: multiple turns of conversation, multiple tool calls, synthesis of results into final responses. These tests are slow and expensive but catch integration issues that component tests miss.

Focus trajectory tests on critical user journeys and known complex scenarios. Verify that the agent completes tasks, stays within iteration limits, handles errors gracefully, and produces useful final outputs.

Testing Error Recovery

Agents should handle failures gracefully—tool errors, unexpected model outputs, timeout conditions. Inject failures into agent tests: make tools throw exceptions, return malformed data, or timeout. Verify that the agent recovers appropriately rather than crashing or producing nonsense.

Error recovery tests are particularly important for production agents that will inevitably encounter real-world failures.
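
A sketch of a failure-injection test, assuming the agent exposes a tool registry that tests can patch:

```python
# test_error_recovery.py -- sketch; run_agent and the TOOLS registry are assumptions
import pytest
from myapp.agent import run_agent, TOOLS  # assumed application module

@pytest.mark.integration
def test_agent_recovers_when_a_tool_fails(monkeypatch):
    calls = {"n": 0}

    def flaky_search(query):
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError("search backend unavailable")  # injected failure on first call
        return {"results": ["Paris is the capital of France."]}

    monkeypatch.setitem(TOOLS, "search", flaky_search)

    result = run_agent("What is the capital of France?", max_iterations=5)
    assert result.completed                 # agent finished instead of crashing
    assert "Paris" in result.final_answer   # and still produced a useful answer
```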


CI/CD Integration

Automated testing in CI/CD pipelines ensures consistent quality without manual intervention.

Pipeline Structure

A well-structured pipeline separates test tiers by speed and cost:

On every pull request: Run all unit tests (fast, free). These catch obvious bugs before review.

On merge to main: Run integration tests with cheaper models. These verify LLM behavior without excessive cost.

Nightly or weekly: Run full regression suites with production models. These catch subtle quality changes.

Pre-release: Run end-to-end tests and performance benchmarks. These validate production readiness.

This structure balances fast feedback (unit tests on every PR) with comprehensive validation (full suites on scheduled runs).

Cost Tracking and Budgets

Track token usage and costs for each CI run. Set budget thresholds—if a test run exceeds the expected cost, fail the build and investigate. This prevents runaway costs from test suite growth or misconfiguration.

Store cost data historically to identify trends. Gradually increasing test costs might indicate test suite bloat or inefficient test design.
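
A sketch of a session-level cost guard; the per-token prices and budget are illustrative and should be replaced with current rates:

```python
# conftest.py -- sketch of per-run cost tracking; prices and budget are illustrative
import pytest

PRICE_PER_1K_TOKENS = {"gpt-4o-mini": 0.00015, "gpt-4o": 0.0025}  # check current pricing
BUDGET_USD = 2.00                                                  # hard cap for one CI run

class CostTracker:
    def __init__(self):
        self.total = 0.0

    def record(self, model: str, prompt_tokens: int, completion_tokens: int):
        # Simplified: single blended rate per model; split input/output rates in practice
        rate = PRICE_PER_1K_TOKENS.get(model, 0.01)
        self.total += (prompt_tokens + completion_tokens) / 1000 * rate

@pytest.fixture(scope="session")
def cost_tracker():
    tracker = CostTracker()
    yield tracker
    # Tests call cost_tracker.record(model, usage.prompt_tokens, usage.completion_tokens)
    assert tracker.total <= BUDGET_USD, f"test run cost ${tracker.total:.2f}, over budget"
```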

Environment Configuration

Use environment variables to control test behavior in CI:

Model selection: CI runs should use cheaper models than local development or production validation.

Test tier selection: Skip expensive tests on routine runs; enable them for release validation.

Timeout configuration: CI environments might need different timeouts than local development.

API key management: Securely inject API keys for different testing tiers.


Performance Testing

Understanding performance characteristics is essential for production LLM applications.

Latency Benchmarking

Measure and track key latency metrics:

Time to first token (TTFT) measures how quickly streaming responses begin. This affects perceived responsiveness—users should see output beginning within 1-2 seconds.

Total response time measures end-to-end latency including all model processing. Set SLOs appropriate for your use case—interactive chat needs faster responses than batch processing.

Percentile distributions matter more than averages. A system with 500ms average but 10s P99 will frustrate users on every 100th request. Track P50, P95, and P99 separately.
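
A sketch of a TTFT measurement against an OpenAI-style streaming API; the model and token limit are illustrative:

```python
# benchmark_latency.py -- sketch; measures TTFT and total time for one streamed completion
import time
from openai import OpenAI

client = OpenAI()

def measure_latency(prompt: str, model: str = "gpt-4o-mini") -> dict:
    start = time.perf_counter()
    first_token_at = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=300,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter()  # first content token arrived
    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "total_s": end - start,
    }

# Collect many samples and report P50/P95/P99 rather than the mean.
```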

Throughput Testing

Load testing reveals how your system behaves under concurrent usage:

Requests per second your infrastructure can handle while maintaining latency SLOs.

Queue depth when load exceeds capacity—how requests back up and what happens to waiting requests.

Degradation patterns as load increases—does latency increase gradually or cliff at some threshold?

Load testing should simulate realistic traffic patterns: mostly simple queries with occasional complex ones, bursts during peak hours, sustained load during busy periods.

Cost Efficiency

Track cost per request across different query types and user segments. Identify expensive patterns—perhaps certain query types trigger much higher token usage—and optimize or budget accordingly.

Efficiency improvements (better prompts that need fewer tokens, caching, model routing) directly reduce operating costs at scale.


Best Practices Summary

Effective LLM testing combines several principles:

Maximize unit test coverage. Most LLM application code doesn't need real LLM calls to test. Build comprehensive unit tests for prompt construction, parsing, validation, and tool logic.

Use mocks strategically. Test your application's handling of model responses without paying for API calls. Save integration tests for verifying actual model behavior.

Control integration test costs. Use cheaper models in CI, limit token counts, run expensive tests selectively. Track costs to prevent budget surprises.

Embrace LLM-as-judge. Automated semantic evaluation enables testing subjective qualities at scale. Invest in good rubrics and judge prompts.

Build regression suites. Every prompt change should run against comprehensive test cases. Track metrics over time and catch regressions early.

Test at multiple granularities. Unit tests for components, integration tests for LLM behavior, end-to-end tests for complete workflows. Each level catches different issues.

Automate in CI/CD. Consistent automated testing catches issues before they reach users. Structure pipelines to balance speed with thoroughness.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
