LLM Cost Engineering: Token Budgeting, Caching, and Model Routing for Production
Comprehensive guide to reducing LLM costs by 60-80% in production. Covers prompt caching (OpenAI vs Anthropic), semantic caching with Redis and GPTCache, model routing and cascading, batch processing, and token optimization strategies.
LLM API costs can spiral quickly in production. A single customer service chatbot handling 100,000 daily conversations might consume $10,000+ monthly in API fees. A RAG application sending 10,000-token contexts with every query multiplies costs dramatically. Without deliberate cost engineering, LLM expenses can become the dominant line item in your infrastructure budget.
The good news is that systematic optimization can reduce LLM costs by 60-80% without sacrificing quality. Through intelligent caching, model routing, prompt optimization, and batching, organizations achieve dramatic savings while often improving latency and user experience. This guide covers the complete toolkit for LLM cost engineering in production.
Understanding LLM Cost Drivers
Before optimizing, you need to understand what drives LLM costs. The cost structure differs significantly from traditional compute resources.
Token-Based Pricing
LLM APIs charge per token—roughly 4 characters or 0.75 words in English. Costs vary dramatically between models: 2025 API pricing reaches $15 per million input tokens and $75 per million output tokens at the high end, while small models cost a small fraction of that.
Output tokens cost 3-5x more than input tokens for most providers. This makes controlling response length one of the most impactful cost levers available. A verbose response that could have been concise directly increases costs. Instructing models to be brief, setting appropriate max_tokens limits, and post-processing to trim unnecessary content all reduce output token consumption.
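To make the arithmetic concrete, here is a minimal, hypothetical cost estimator; the prices are placeholders to replace with your provider's current per-million-token rates.

```python
# Rough per-request cost estimator; substitute your provider's current
# per-million-token rates for the illustrative prices below.
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

# Same 2,000-token prompt, verbose vs. concise answer, at hypothetical
# $2.50 input / $10.00 output per million tokens.
verbose = estimate_cost(2_000, 1_200, 2.50, 10.00)   # ~$0.017
concise = estimate_cost(2_000, 300, 2.50, 10.00)     # ~$0.008
print(f"verbose=${verbose:.4f}  concise=${concise:.4f}")
```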
The Hidden Cost Multipliers
Several factors multiply base token costs in ways that aren't immediately obvious:
Context accumulation in multi-turn conversations means you're resending the entire conversation history with each turn. A 20-turn conversation might send 50,000+ tokens just in context, even if each individual message is short.
RAG overhead adds retrieved documents to every query. If you're retrieving 5 chunks of 500 tokens each, that's 2,500 additional input tokens per request—often more than the query and response combined.
Retry and fallback logic can silently double or triple costs. If your application retries failed requests or falls back to alternative models, those additional calls accumulate.
Development and testing often uses production models during debugging, burning through tokens on non-production traffic.
Cost Monitoring Fundamentals
You can't optimize what you don't measure. Implement comprehensive cost tracking:
Per-request attribution tags each API call with user ID, feature, conversation ID, and other dimensions. This enables understanding which users, features, or use cases drive costs.
Token breakdown separates input tokens, output tokens, cached tokens, and any other token categories your provider reports. Different token types have different costs.
Budget alerting notifies teams when spending approaches thresholds. Hierarchical budgets can allocate spending across teams, applications, or customer segments with granular controls.
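A minimal sketch of per-request attribution, assuming an OpenAI-style `usage` object; the `record_llm_call` helper and `PRICES` table are illustrative, and the final `print` stands in for whatever metrics pipeline or warehouse you use.

```python
import time

# Illustrative $/million-token price table and a hypothetical attribution hook.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def record_llm_call(model: str, usage, *, user_id: str, feature: str, conversation_id: str):
    in_price, out_price = PRICES[model]
    details = getattr(usage, "prompt_tokens_details", None)
    event = {
        "ts": time.time(),
        "model": model,
        "user_id": user_id,
        "feature": feature,
        "conversation_id": conversation_id,
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cached_tokens": getattr(details, "cached_tokens", 0) if details else 0,
        "cost_usd": round(usage.prompt_tokens / 1e6 * in_price
                          + usage.completion_tokens / 1e6 * out_price, 6),
    }
    print(event)  # replace with your metrics pipeline / warehouse write
```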
Organizations using systematic cost monitoring report 30-50% cost reductions just from visibility—identifying and fixing inefficient patterns they didn't know existed.
Prompt Caching: 90% Cost Reduction on Repeated Context
Prompt caching is the single most impactful cost optimization for applications with repeated context. When you send the same system prompt, few-shot examples, or RAG context repeatedly, prompt caching avoids reprocessing that content—reducing both cost and latency dramatically.
How Prompt Caching Works
Modern LLM providers cache the computed key-value (KV) states for prompt prefixes. When a subsequent request shares the same prefix, the provider retrieves the cached computation rather than reprocessing. Cached tokens cost 90% less than fresh tokens, and latency improves because the model doesn't need to recompute attention for the cached portion.
The key constraint is that caching works on prefixes—the beginning of your prompt. If your system prompt is 1,000 tokens and you send it identically across requests, those 1,000 tokens can be cached. But if you insert variable content before the system prompt, caching breaks because the prefix changed.
OpenAI's Automatic Caching
OpenAI implements caching automatically with no code changes required. When you send a request, OpenAI checks if a cached prefix exists and routes to it when possible. The minimum cacheable prefix is 1,024 tokens.
Pricing: Cache writes have no additional cost. Cache reads are charged at 50% of normal input pricing for most models. Cached prefixes typically persist for 5-10 minutes of inactivity and up to about an hour during off-peak periods; the exact TTL isn't guaranteed.
Implications: OpenAI's automatic approach is convenient but provides limited control. You can't force caching or know definitively whether a request hit cache until you check the response metadata. Design your prompts with stable prefixes—put variable content at the end, not the beginning.
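A sketch of this prefix-stable structure with the OpenAI Python SDK; the system prompt is a placeholder, and the `usage` field names reflect recent SDK versions.

```python
from openai import OpenAI

client = OpenAI()
STATIC_SYSTEM_PROMPT = "..."  # your 1,000+ token system prompt and few-shot examples

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},        # stable prefix, cacheable
        {"role": "user", "content": "What's your refund policy?"},  # variable suffix
    ],
)

# The usage metadata reports how much of the prompt was served from cache.
details = resp.usage.prompt_tokens_details
print("cached prompt tokens:", getattr(details, "cached_tokens", 0))
```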
Anthropic's Manual Caching
Anthropic provides explicit cache control through the cache_control parameter. You mark specific content blocks as cacheable, and Anthropic guarantees routing to cached entries when available. Up to four cache breakpoints are supported.
Pricing: Anthropic charges a premium for cache writes—1.25x base cost for 5-minute TTL entries, 2x for 1-hour TTL entries. But cache reads are just 0.1x base cost (90% savings). The write premium means you need multiple reads to break even, but high-traffic applications see massive net savings.
TTL options: 5-minute cache (default) suits conversational applications where context is reused across turns. 1-hour cache suits applications with stable context reused across many requests, like documentation assistants.
Control advantages: Explicit caching lets you guarantee that expensive context (like large RAG retrievals or comprehensive system prompts) is cached. You know exactly what's cached and can design around it.
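A sketch of explicit caching with the Anthropic Python SDK; the model ID and `LONG_SYSTEM_PROMPT` are placeholders, and the usage fields printed at the end are the ones the API reports for cache writes and reads.

```python
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # large, stable context: product docs, tool specs, policies

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute your model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cached; 5-minute TTL by default
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my API key?"}],
)

# Usage metadata distinguishes cache writes from cache reads for this request.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```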
Caching Economics
The break-even calculation determines whether caching saves money:
For OpenAI (no write premium, 50% read discount): Break-even is immediate—any cache hit saves money.
For Anthropic (1.25x write cost, 0.1x read cost): Each cache write adds a 0.25x premium on the cached tokens, while each cache read saves 0.9x. With the 5-minute TTL, a single read already more than recovers the write premium (break-even at roughly 0.3 reads per write). With the 1-hour TTL (2x write cost), the 1.0x premium needs about 1.1 reads per write to break even.
In practice, applications with any meaningful traffic easily exceed these thresholds. A customer service bot that caches its system prompt and context at the start of each conversation typically collects 5-10 cache hits over the following turns, yielding substantial savings.
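The break-even arithmetic is simple enough to express directly; this hypothetical helper just compares one cached write plus R reads against sending the same prefix uncached 1 + R times.

```python
# Compare one cached write plus R reads against sending the same prefix
# uncached 1 + R times: write_mult + R * read_mult = 1 + R, solved for R.
def breakeven_reads(write_mult: float, read_mult: float) -> float:
    return (write_mult - 1) / (1 - read_mult)

print(breakeven_reads(1.25, 0.10))  # Anthropic 5-minute TTL -> ~0.28 reads per write
print(breakeven_reads(2.00, 0.10))  # Anthropic 1-hour TTL   -> ~1.11 reads per write
print(breakeven_reads(1.00, 0.50))  # OpenAI, no write premium -> 0 (immediate)
```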
Latency Benefits
Beyond cost, prompt caching dramatically reduces latency. Anthropic reports latency reductions up to 85% for long prompts—a 100,000-token context that took 11.5 seconds dropped to 2.4 seconds with caching. Time-to-first-token improves by 50-80% for cached prefixes because the model skips recomputing attention for cached content.
For user-facing applications, this latency improvement often matters more than cost savings. Faster responses improve user experience and enable use cases that would be impractical with full-prompt latency.
Semantic Caching: Reuse Across Similar Queries
Prompt caching requires exact prefix matches. Semantic caching goes further—caching responses based on query meaning rather than exact text. If a user asks "What's the weather in NYC?" and another asks "How's the weather in New York City?", semantic caching recognizes these as equivalent and returns the cached response.
How Semantic Caching Works
Semantic caching systems convert queries to vector embeddings that capture semantic meaning. When a new query arrives, its embedding is compared against cached query embeddings using similarity search. If a sufficiently similar query exists in cache, the cached response is returned without calling the LLM.
The similarity threshold determines the tradeoff between cache hit rate and response accuracy. A threshold of 0.95 (very similar) provides high accuracy but lower hit rates. A threshold of 0.85 (moderately similar) increases hit rates but risks returning inappropriate cached responses. The optimal threshold depends on your application's tolerance for variation.
Implementation Options
GPTCache is the most popular open-source semantic caching library. GPTCache integrates with LangChain and LlamaIndex, supporting multiple embedding models and vector stores. It handles embedding generation, similarity search, and cache eviction automatically.
Redis with vector search provides a production-grade foundation for semantic caching. RedisVL's SemanticCache offers turnkey semantic caching with configurable similarity thresholds, TTL-based expiration, and high-performance similarity search via HNSW indexing.
LangChain's RedisSemanticCache provides a drop-in cache backend for LangChain applications. With minimal configuration, you specify your Redis connection and embedding model, and LangChain automatically caches and retrieves responses based on semantic similarity.
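A minimal configuration sketch for the LangChain route; import paths and constructor arguments vary across LangChain versions, so treat this as indicative rather than exact.

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings

# Register a semantic cache globally; LLM calls made through LangChain then
# check it before hitting the provider.
set_llm_cache(
    RedisSemanticCache(
        redis_url="redis://localhost:6379",
        embedding=OpenAIEmbeddings(),
        score_threshold=0.2,  # lower is stricter; tune to your tolerance for variation
    )
)
```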
Semantic Caching Considerations
Embedding costs: Every cache lookup requires embedding the query, which costs money (though much less than an LLM call). For very short queries, embedding cost might exceed the savings from occasional cache hits.
Staleness: Semantic caches can return outdated information. If cached responses reference time-sensitive data, stale responses create problems. Set appropriate TTLs and consider cache invalidation for responses that depend on changing data.
Context sensitivity: Two semantically similar questions might require different answers depending on conversation context. "Tell me more" means different things in different conversations. Simple semantic caching doesn't capture this context, though advanced systems like ContextCache address multi-turn awareness.
Cache poisoning: If incorrect responses enter the cache, they'll be served to subsequent users. Implement quality checks before caching and provide mechanisms to invalidate bad entries.
When to Use Semantic Caching
Semantic caching works well for:
- FAQ-style queries where users ask the same questions with different wording
- Lookup queries requesting specific facts that don't change frequently
- High-traffic applications where cache hit rates compound into significant savings
- Latency-sensitive applications where cache hits provide instant responses
Semantic caching works poorly for:
- Personalized responses that should differ per user
- Time-sensitive queries where cached responses quickly become stale
- Creative or generative tasks where variation is desirable
- Low-traffic applications where cache infrastructure costs exceed savings
Redis for LLM Caching: Deep Dive
Redis has emerged as the de facto standard for LLM caching infrastructure. Its sub-millisecond latency, built-in vector search, and flexible data structures make it ideal for both semantic caching and session management.
Why Redis for LLM Caching
Redis provides the performance characteristics LLM caching demands:
Sub-millisecond latency: Redis operations complete in under 1ms, ensuring cache lookups don't add noticeable delay. This is critical when the goal is reducing latency, not just cost.
Vector search capabilities: Redis supports native vector similarity search through the RediSearch module, enabling semantic caching without additional infrastructure. HNSW (Hierarchical Navigable Small World) indexing provides efficient approximate nearest neighbor search.
TTL-based expiration: Built-in key expiration handles cache invalidation automatically. Set TTLs appropriate to your data freshness requirements—responses expire without manual cleanup.
Persistence options: Redis can persist data to disk, surviving restarts. For caching, you might prefer volatile storage; for conversation state, persistence matters.
Horizontal scaling: Redis Cluster distributes data across nodes for applications exceeding single-node capacity.
RedisVL: The AI-Native Python Client
RedisVL (Redis Vector Library) provides a purpose-built Python client for AI applications. It simplifies vector search, semantic caching, and session management with an intuitive API.
SemanticCache interface: RedisVL's SemanticCache combines Redis's caching capabilities with vector search. When a query arrives, RedisVL generates an embedding, searches for similar cached queries, and returns cached responses if similarity exceeds the threshold.
Key SemanticCache features:
- Configurable similarity thresholds (distance metrics: cosine, L2, IP)
- TTL-based automatic expiration
- Multiple embedding model support
- Built-in HNSW indexing for fast similarity search
How SemanticCache works:
- Query arrives at your application
- RedisVL generates an embedding using your configured embedding model
- Vector similarity search finds the most similar cached query
- If similarity exceeds threshold, return cached response
- If no match, call the LLM, cache the response with its embedding, return response
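A minimal sketch of that flow using RedisVL's SemanticCache; the import path and constructor arguments may differ slightly between RedisVL versions, and `call_llm` is a placeholder for your actual model call.

```python
from redisvl.extensions.llmcache import SemanticCache  # import path varies by version

cache = SemanticCache(
    name="llm_answers",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # smaller = stricter semantic match
    ttl=3600,                # expire entries after an hour
)

def answer(query: str) -> str:
    if hits := cache.check(prompt=query):
        return hits[0]["response"]            # semantic cache hit
    response = call_llm(query)                # placeholder for your LLM call
    cache.store(prompt=query, response=response)
    return response
```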
HNSW Vector Indexing
Redis uses HNSW (Hierarchical Navigable Small World) graphs for efficient vector similarity search. HNSW provides approximate nearest neighbor search—trading perfect accuracy for dramatically faster search times.
HNSW parameters:
- M: Number of connections per node. Higher M improves recall but increases memory and indexing time. Typical values: 16-64.
- EF_CONSTRUCTION: Size of dynamic candidate list during index building. Higher values improve index quality but slow construction. Typical values: 100-500.
- EF_RUNTIME: Size of dynamic candidate list during search. Higher values improve recall but slow queries. Can be tuned at query time.
For semantic caching, HNSW's approximate search is ideal—you don't need perfect nearest neighbor, just "similar enough" to return cached responses.
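For reference, creating an HNSW vector index directly with redis-py looks roughly like this; the index name, key prefix, and embedding dimension are placeholders for your own setup.

```python
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis()
schema = (
    TextField("prompt"),
    TextField("response"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 384,                  # must match your embedding model's dimension
        "DISTANCE_METRIC": "COSINE",
        "M": 16,                     # connections per node
        "EF_CONSTRUCTION": 200,      # build-time candidate list size
    }),
)
r.ft("llm_cache_idx").create_index(
    schema,
    definition=IndexDefinition(prefix=["llmcache:"], index_type=IndexType.HASH),
)
```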
Vector Quantization for Cost Efficiency
Redis now supports int8 quantization for vector embeddings. Quantization compresses float embeddings to 8-bit integers, reducing memory usage by 75% and improving search speed by 30% while maintaining 99.99% search accuracy.
For high-volume caching with millions of cached queries, quantization significantly reduces infrastructure costs without meaningful quality degradation.
LangCache: Managed Semantic Caching
LangCache, introduced in Redis's 2025 Spring Release, provides managed semantic caching as a service. Key features:
- REST API: Simple HTTP interface for cache operations, no Redis client library required
- Optimized performance: Advanced optimizations for caching accuracy and speed
- Managed infrastructure: Redis handles scaling, persistence, and maintenance
LangCache suits teams wanting semantic caching benefits without managing Redis infrastructure.
Redis for Rate Limiting
Beyond caching, Redis excels at rate limiting—essential for managing LLM API costs:
Token bucket implementation: Store bucket state in Redis. Each request decrements tokens; tokens regenerate over time. Redis's atomic operations ensure accurate counting across distributed systems.
Sliding window rate limiting: Track request timestamps in sorted sets. Count requests in the recent window. Redis's sorted set operations make this efficient.
Distributed coordination: When running multiple application instances, Redis provides the shared state needed for accurate rate limiting across the fleet.
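A simple sliding-window limiter along these lines, using a sorted set of timestamps per key and a transactional pipeline; the key naming and limits are illustrative.

```python
import time
import uuid
import redis

r = redis.Redis()

def allow_request(key: str, limit: int, window_seconds: int = 60) -> bool:
    now = time.time()
    pipe = r.pipeline()  # transactional by default, so the steps apply atomically
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop timestamps outside the window
    pipe.zadd(key, {str(uuid.uuid4()): now})             # record this request
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, window_seconds)                     # clean up idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit

# e.g. cap a user at 100 LLM requests per minute
if not allow_request("rl:user:42", limit=100):
    raise RuntimeError("rate limit exceeded")
```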
Redis Architecture for LLM Applications
A typical Redis deployment for LLM applications includes:
Cache layer: SemanticCache for LLM responses. Configure appropriate TTLs based on content freshness requirements.
Session layer: Conversation history and user state. Use Redis hashes for structured session data with per-session TTLs.
Rate limiting layer: Request counters and token buckets. Use Redis strings with atomic increment operations.
Embedding cache: Cache embeddings for repeated text to avoid redundant embedding API calls. Embeddings are deterministic—cache them aggressively.
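A sketch of the embedding-cache layer, keyed on a hash of model plus text; `embed` is a placeholder for your embedding API call and the TTL is illustrative.

```python
import hashlib
import json
import redis

r = redis.Redis()

def get_embedding_cached(text: str, model: str = "text-embedding-3-small") -> list[float]:
    key = "emb:" + hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if cached := r.get(key):
        return json.loads(cached)
    vector = embed(text, model)                         # placeholder embedding API call
    r.set(key, json.dumps(vector), ex=30 * 24 * 3600)   # embeddings are stable: long TTL
    return vector
```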
Performance Optimization
Maximize Redis caching performance:
Connection pooling: Reuse Redis connections rather than creating new ones per request. Connection establishment adds latency.
Pipelining: Batch multiple Redis operations into single round-trips when possible. Reduces network overhead.
Local caching: For extremely hot data, add in-process caching (LRU cache) in front of Redis. Check local cache first, fall back to Redis.
Appropriate TTLs: Balance cache hit rates against staleness. Too short TTLs reduce hits; too long TTLs serve stale data.
Monitoring: Track cache hit rates, latency percentiles, and memory usage. Redis provides extensive metrics through INFO command and Redis Insight.
Model Routing: Right-Size Every Request
Not every query requires the most powerful (and expensive) model. A simple factual question doesn't need GPT-4's reasoning capabilities—a smaller, cheaper model handles it equally well. Model routing directs each request to the most cost-effective model capable of handling it.
The Routing Opportunity
Research on model routing demonstrates that starting 90% of queries with smaller models and escalating only complex requests to premium models achieves 87% cost reduction while maintaining quality. The insight is that task difficulty varies dramatically, but uniform model selection treats every request identically.
Consider the cost differential: GPT-4o costs roughly $5 per million input tokens while GPT-4o-mini costs roughly $0.15 per million—a 33x difference. If 80% of your queries can be handled by the smaller model, routing cuts the cost of those queries by roughly 97%, close to an 80% reduction in total spend, while keeping full capability for the complex 20%.
Routing Strategies
Complexity classification uses a small model or classifier to assess query complexity before routing. Simple queries (greetings, factual lookups, straightforward instructions) route to cheap models. Complex queries (multi-step reasoning, nuanced analysis, creative tasks) route to capable models.
The classifier can be rule-based (keyword matching, query length, presence of specific patterns), a small ML model trained on query-difficulty labels, or even an LLM-based classifier (though this adds latency and cost).
Confidence-based routing sends all queries to a cheap model first, then escalates to a more capable model if the cheap model's confidence is low. This requires models that provide reliable confidence signals—calibrated probabilities, explicit uncertainty expressions, or consistency across multiple samples.
Domain-based routing uses different models for different task types. Code generation might use a specialized code model. Creative writing might use a model fine-tuned for that purpose. Factual queries might use a smaller general model. The routing logic matches query intent to specialized models.
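A minimal rule-based complexity classifier in the spirit of the first strategy; the model names, patterns, and thresholds are illustrative and would need tuning against your own traffic.

```python
import re

CHEAP_MODEL = "gpt-4o-mini"   # illustrative tier names
PREMIUM_MODEL = "gpt-4o"

COMPLEX_PATTERNS = re.compile(
    r"(explain why|step[- ]by[- ]step|compare|analy[sz]e|refactor|prove|write (a|an) )",
    re.IGNORECASE,
)

def choose_model(query: str, context_tokens: int = 0) -> str:
    # Long queries, large contexts, or reasoning-style phrasing go to the premium tier.
    if len(query) > 400 or context_tokens > 4000 or COMPLEX_PATTERNS.search(query):
        return PREMIUM_MODEL
    return CHEAP_MODEL

print(choose_model("What are your opening hours?"))                   # -> gpt-4o-mini
print(choose_model("Compare these two contracts clause by clause."))  # -> gpt-4o
```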
Cascading: Sequential Escalation
Cascading extends routing by trying models in sequence: start with the cheapest model, evaluate the response, and escalate to more capable models only if needed. Research shows cascade routing combines the adaptability of routing with the cost-efficiency of cascading, improving performance by 4% while reducing costs.
The cascade sequence typically progresses from smallest/cheapest to largest/most expensive: first try a 7B parameter model, then a 70B model, then GPT-4 class. Escalation triggers when the current model's response fails quality checks, expresses uncertainty, or matches patterns indicating the task exceeds its capabilities.
Cascading adds latency for queries that escalate—you pay for multiple model calls. But for applications where most queries are simple, the savings on non-escalated queries outweigh the cost of occasional escalation.
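A compact sketch of the cascade loop; `call_model` and `passes_quality_check` are placeholders for your client call and whatever verification you use (rules, self-consistency, or a judge model).

```python
CASCADE = ["small-7b", "medium-70b", "frontier"]  # illustrative tier names

def cascade_answer(query: str) -> tuple[str, str]:
    for tier in CASCADE:
        response = call_model(tier, query)                 # placeholder client call
        if tier == CASCADE[-1] or passes_quality_check(query, response):
            return tier, response  # accept, or return the top tier's best effort
```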
Quality Verification
Both routing and cascading require mechanisms to verify response quality:
Self-consistency checks whether the model gives the same answer across multiple samples. High consistency suggests confidence; low consistency suggests the model is uncertain and might benefit from escalation.
Verification queries ask a model to check another model's response. This adds cost but catches errors before they reach users.
Rule-based checks verify that responses meet format requirements, contain expected elements, and don't contain obvious errors. Failed checks trigger escalation or retry.
LLM-as-judge uses a capable model to evaluate responses from cheaper models, escalating when quality scores fall below thresholds.
Implementation Considerations
Latency tradeoffs: Classification and quality verification add latency. For latency-sensitive applications, keep classification fast (small models, simple rules) and be willing to over-route to capable models rather than adding verification steps.
Training data: Supervised classifiers need labeled data mapping queries to appropriate models. This data can come from historical request logs annotated with which model successfully handled each query.
Threshold tuning: Quality thresholds for escalation require tuning. Too aggressive escalation wastes money on unnecessary model calls. Too conservative escalation serves poor-quality responses. Monitor quality metrics across model tiers and adjust thresholds based on observed performance.
Batch Processing: 50% Discount for Asynchronous Work
When immediate responses aren't required, batch processing offers substantial savings. OpenAI's Batch API provides a 50% discount on both input and output tokens for requests that can wait up to 24 hours (though most complete within minutes to hours).
When Batching Applies
Batch processing suits workloads that don't require real-time responses:
Evaluation and testing: Running prompts against large test datasets, evaluating model changes, or generating baseline metrics. These tasks run in the background and results are analyzed later.
Data generation: Creating training data for fine-tuning, generating synthetic datasets, or producing content at scale. The volume makes batching essential for cost control.
Scheduled processing: Nightly summarization of daily content, weekly report generation, or periodic data enrichment. The schedule accommodates batch processing windows.
Bulk operations: Migrating data, backfilling embeddings, or processing historical records. These one-time or periodic jobs don't need immediate completion.
Batch API Mechanics
The Batch API accepts requests in JSONL format—one request per line in a file. You upload the file, create a batch job, and poll for completion. Results arrive in another JSONL file with responses matched to original requests.
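A sketch of that end-to-end flow with the OpenAI Python SDK; `documents` stands in for your own data, and the model and token limits are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()
documents = ["..."]  # your own texts to process

# 1. One request per line, each with a unique custom_id for matching results.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 200,
            },
        }) + "\n")

# 2. Upload the file and create the batch job.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll for completion, then download results via the job's output_file_id.
print(client.batches.retrieve(job.id).status)
```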
Rate limits are dramatically higher for batch processing—up to 250 million input tokens can be enqueued for GPT-4 class models, far beyond standard per-minute limits. This enables processing volumes that would be impractical with synchronous APIs.
Completion time is guaranteed within 24 hours but typically much faster. Most batch jobs complete within minutes to a few hours depending on size and current load.
Cost calculation: Combining batch processing with prompt caching yields remarkable efficiency. A workflow using both might pay around $4 per million output tokens—a fraction of synchronous full-price rates.
Batch Design Patterns
Chunked processing splits large jobs into multiple batches to monitor progress and handle failures gracefully. Rather than one million-request batch, submit ten 100,000-request batches with checkpointing between them.
Priority queuing separates truly non-urgent work (run whenever) from time-sensitive batch work (complete within hours). Different queues can have different retry and monitoring policies.
Hybrid architectures use synchronous APIs for user-facing requests and batch APIs for background processing, even within the same application. User interactions get immediate responses; derived processing (generating embeddings, creating summaries, updating indexes) happens asynchronously.
Token Optimization: Doing More with Less
Beyond caching and routing, optimizing token usage directly reduces costs at the source.
Prompt Compression
Verbose prompts waste tokens. Prompt compression techniques reduce token count while preserving effectiveness:
Instruction condensation rewrites lengthy instructions into concise equivalents. "Please provide a detailed and comprehensive summary of the following document, making sure to include all key points and relevant details" becomes "Summarize comprehensively." The model understands either; the second costs fewer tokens.
Example pruning reduces few-shot examples to the minimum needed for quality. Often 2-3 examples work as well as 5-6, halving example token cost.
Context selection for RAG retrieves only the most relevant chunks rather than padding to a fixed count. If 2 chunks answer the question, don't retrieve 5 chunks "just in case."
Dynamic prompting adjusts prompt detail based on query complexity. Simple queries get minimal instructions; complex queries get detailed guidance. This right-sizes prompts to tasks.
Output Control
Since output tokens cost more than input tokens, controlling output length has outsized impact:
Max_tokens limits prevent runaway responses. Set limits appropriate to expected response length—don't allow 4,000 tokens when 500 suffices.
Format instructions guide concise responses. "Respond in 2-3 sentences" or "Provide a bulleted list with up to 5 items" constrains output length.
Structured outputs (JSON mode, function calling) produce predictable response lengths without conversational padding. A JSON response with specific fields is typically shorter than prose covering the same information.
Token-Budget-Aware Reasoning
Chain-of-thought reasoning improves quality but increases token usage substantially. Research on token-budget-aware reasoning shows that reasoning processes can be compressed based on task complexity. Simple tasks don't need lengthy reasoning chains; complex tasks benefit from more tokens.
The practical application: don't request chain-of-thought for simple queries where the direct answer is obvious. Reserve detailed reasoning for problems that genuinely benefit from step-by-step thinking.
Context Management
Multi-turn conversations accumulate context that may no longer be relevant. Strategies for managing context:
Summarization checkpoints periodically summarize conversation history and restart context with the summary. A 50-turn conversation might be summarized every 10 turns, keeping context bounded while preserving key information.
Selective inclusion includes only relevant prior turns rather than complete history. If the current question is self-contained, prior turns may be unnecessary.
Sliding windows keep only the most recent N turns, dropping older context. This works for conversations where recent context matters most.
Memory systems store conversation information externally and retrieve relevant portions rather than including everything in context. This moves from "keep all context in prompt" to "retrieve relevant context on demand."
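A sketch combining the sliding-window and summarization-checkpoint ideas; `summarize` is a placeholder for a cheap-model call that folds older turns into the running summary.

```python
MAX_RECENT_TURNS = 10

def build_context(history: list[dict], summary: str) -> tuple[list[dict], str]:
    # Fold turns older than the window into a running summary, keeping only
    # the most recent turns verbatim.
    if len(history) > MAX_RECENT_TURNS:
        overflow, history = history[:-MAX_RECENT_TURNS], history[-MAX_RECENT_TURNS:]
        summary = summarize(summary, overflow)  # placeholder summarization call
    messages = []
    if summary:
        messages.append({"role": "system", "content": f"Conversation so far: {summary}"})
    return messages + history, summary
```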
Putting It Together: Compound Savings
Individual optimizations yield incremental savings. Combining them produces compound savings that transform cost structures:
Base cost: 100% (no optimization)
Add prompt caching: 40% savings on repeated context → 60% of base
Add semantic caching: 20% additional queries served from cache → 48% of base
Add model routing: 50% of remaining queries handled by models costing roughly half as much → 36% of base
Add batch processing for async work: 50% discount on 30% of volume → 31% of base
Add token optimization: 10% reduction in tokens across all requests → 28% of base
This compound effect—starting from 100% and reaching 28%—represents 72% total savings. Real-world results vary, but organizations report 60-80% cost reductions through systematic optimization.
Implementation Priority
Not all optimizations are equally easy to implement. Prioritize by effort-to-impact ratio:
High impact, low effort: Enable prompt caching (often just prompt restructuring), set appropriate max_tokens limits, use cheaper models where possible.
High impact, medium effort: Implement semantic caching, build model routing logic, convert async workloads to batch processing.
Medium impact, higher effort: Build sophisticated context management, implement cascade routing with quality verification, develop custom prompt compression.
Start with quick wins, measure results, then tackle higher-effort optimizations based on remaining cost drivers.
Monitoring and Continuous Optimization
Cost optimization isn't a one-time project—it requires ongoing monitoring and adjustment.
Key Metrics to Track
Cost per conversation/request reveals trends and anomalies. Sudden increases indicate problems; gradual increases suggest growing complexity or feature additions.
Cache hit rates for both prompt and semantic caching indicate efficiency. Low hit rates suggest caching isn't providing value; investigate whether cache TTL is appropriate or whether query patterns preclude effective caching.
Model utilization by tier shows routing effectiveness. If the expensive model handles 90% of queries, routing isn't working. If the cheap model handles 99%, you might be under-utilizing capability.
Token efficiency (output tokens per request, context tokens per conversation) reveals optimization opportunities. Rising token counts indicate growing prompts or responses.
Alerting and Budgets
Configure alerts for cost anomalies—sudden spikes, budget threshold approaches, unusual patterns. Alerts should trigger before costs become problems, not after.
Implement hard budget caps where appropriate. A runaway process or bug shouldn't be able to spend unlimited money. Rate limiting by cost (not just requests) provides financial protection.
A/B Testing Optimizations
Before fully deploying optimizations, A/B test them:
- Does the cheaper model actually maintain quality for routed queries?
- Does semantic cache return appropriate responses, or do users notice degradation?
- Does prompt compression affect task success rates?
Quality metrics should accompany cost metrics. Savings that damage quality aren't real savings—they're just deferred costs in user churn and remediation.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
LLM Observability and Monitoring: From Development to Production
A comprehensive guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.
Testing LLM Applications: A Practical Guide for Production Systems
Comprehensive guide to testing LLM-powered applications. Covers unit testing strategies, integration testing with cost control, LLM-as-judge evaluation, regression testing, and CI/CD integration with 2025 tools like DeepEval and Promptfoo.
Mastering LLM Context Windows: Strategies for Long-Context Applications
Practical techniques for managing context windows in production LLM applications—from compression to hierarchical processing to infinite context architectures.