LLM Cost Engineering: Token Budgeting, Caching, and Model Routing for Production
Comprehensive guide to reducing LLM costs by 60-80% in production. Covers prompt caching (OpenAI vs Anthropic), semantic caching with Redis and GPTCache, model routing and cascading, batch processing, and token optimization strategies.
LLM API costs can spiral quickly in production. A single customer service chatbot handling 100,000 daily conversations might consume $10,000+ monthly in API fees. A RAG application sending 10,000-token contexts with every query multiplies costs dramatically. Without deliberate cost engineering, LLM expenses can become the dominant line item in your infrastructure budget.
The good news is that systematic optimization can reduce LLM costs by 60-80% without sacrificing quality. Through intelligent caching, model routing, prompt optimization, and batching, organizations achieve dramatic savings while often improving latency and user experience. This guide covers the complete toolkit for LLM cost engineering in production.
Understanding LLM Cost Drivers
Before optimizing, you need to understand what drives LLM costs. The cost structure differs significantly from traditional compute resources.
Token-Based Pricing
LLM APIs charge per token—roughly 4 characters or 0.75 words in English. Costs vary dramatically between models: 2025 API pricing reaches $15 per million input tokens and $75 per million output tokens at the high end, while small models cost a small fraction of that.
Output tokens cost 3-5x more than input tokens for most providers. This makes controlling response length one of the most impactful cost levers available. A verbose response that could have been concise directly increases costs. Instructing models to be brief, setting appropriate max_tokens limits, and post-processing to trim unnecessary content all reduce output token consumption.
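To make the arithmetic concrete, here is a minimal, hypothetical cost estimator; the prices are placeholders to replace with your provider's current per-million-token rates.

```python
# Rough per-request cost estimator; substitute your provider's current
# per-million-token rates for the illustrative prices below.
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

# Same 2,000-token prompt, verbose vs. concise answer, at hypothetical
# $2.50 input / $10.00 output per million tokens.
verbose = estimate_cost(2_000, 1_200, 2.50, 10.00)   # ~$0.017
concise = estimate_cost(2_000, 300, 2.50, 10.00)     # ~$0.008
print(f"verbose=${verbose:.4f}  concise=${concise:.4f}")
```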
The Hidden Cost Multipliers
Several factors multiply base token costs in ways that aren't immediately obvious:
Context accumulation in multi-turn conversations means you're resending the entire conversation history with each turn. A 20-turn conversation might send 50,000+ tokens just in context, even if each individual message is short.
RAG overhead adds retrieved documents to every query. If you're retrieving 5 chunks of 500 tokens each, that's 2,500 additional input tokens per request—often more than the query and response combined.
Retry and fallback logic can silently double or triple costs. If your application retries failed requests or falls back to alternative models, those additional calls accumulate.
Development and testing often uses production models during debugging, burning through tokens on non-production traffic.
Cost Monitoring Fundamentals
You can't optimize what you don't measure. Implement comprehensive cost tracking:
Per-request attribution tags each API call with user ID, feature, conversation ID, and other dimensions. This enables understanding which users, features, or use cases drive costs.
Token breakdown separates input tokens, output tokens, cached tokens, and any other token categories your provider reports. Different token types have different costs.
Budget alerting notifies teams when spending approaches thresholds. Hierarchical budgets can allocate spending across teams, applications, or customer segments with granular controls.
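A minimal sketch of per-request attribution, assuming an OpenAI-style `usage` object; the `record_llm_call` helper and `PRICES` table are illustrative, and the final `print` stands in for whatever metrics pipeline or warehouse you use.

```python
import time

# Illustrative $/million-token price table and a hypothetical attribution hook.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def record_llm_call(model: str, usage, *, user_id: str, feature: str, conversation_id: str):
    in_price, out_price = PRICES[model]
    details = getattr(usage, "prompt_tokens_details", None)
    event = {
        "ts": time.time(),
        "model": model,
        "user_id": user_id,
        "feature": feature,
        "conversation_id": conversation_id,
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cached_tokens": getattr(details, "cached_tokens", 0) if details else 0,
        "cost_usd": round(usage.prompt_tokens / 1e6 * in_price
                          + usage.completion_tokens / 1e6 * out_price, 6),
    }
    print(event)  # replace with your metrics pipeline / warehouse write
```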
Organizations using systematic cost monitoring report 30-50% cost reductions just from visibility—identifying and fixing inefficient patterns they didn't know existed.
Prompt Caching: 90% Cost Reduction on Repeated Context
Prompt caching is the single most impactful cost optimization for applications with repeated context. When you send the same system prompt, few-shot examples, or RAG context repeatedly, prompt caching avoids reprocessing that content—reducing both cost and latency dramatically.
How Prompt Caching Works
Modern LLM providers cache the computed key-value (KV) states for prompt prefixes. When a subsequent request shares the same prefix, the provider retrieves the cached computation rather than reprocessing. Cached tokens cost 90% less than fresh tokens, and latency improves because the model doesn't need to recompute attention for the cached portion.
The key constraint is that caching works on prefixes—the beginning of your prompt. If your system prompt is 1,000 tokens and you send it identically across requests, those 1,000 tokens can be cached. But if you insert variable content before the system prompt, caching breaks because the prefix changed.
OpenAI's Automatic Caching
OpenAI implements caching automatically with no code changes required. When you send a request, OpenAI checks if a cached prefix exists and routes to it when possible. The minimum cacheable prefix is 1,024 tokens.
Pricing: Cache writes have no additional cost. Cache reads are charged at 50% of normal input pricing for most models. Cached prefixes typically persist for 5-10 minutes of inactivity and up to about an hour during off-peak periods; the exact TTL isn't guaranteed.
Implications: OpenAI's automatic approach is convenient but provides limited control. You can't force caching or know definitively whether a request hit cache until you check the response metadata. Design your prompts with stable prefixes—put variable content at the end, not the beginning.
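A sketch of this prefix-stable structure with the OpenAI Python SDK; the system prompt is a placeholder, and the `usage` field names reflect recent SDK versions.

```python
from openai import OpenAI

client = OpenAI()
STATIC_SYSTEM_PROMPT = "..."  # your 1,000+ token system prompt and few-shot examples

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},        # stable prefix, cacheable
        {"role": "user", "content": "What's your refund policy?"},  # variable suffix
    ],
)

# The usage metadata reports how much of the prompt was served from cache.
details = resp.usage.prompt_tokens_details
print("cached prompt tokens:", getattr(details, "cached_tokens", 0))
```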
Anthropic's Manual Caching
Anthropic provides explicit cache control through the cache_control parameter. You mark specific content blocks as cacheable, and Anthropic guarantees routing to cached entries when available. Up to four cache breakpoints are supported.
Pricing: Anthropic charges a premium for cache writes—1.25x base cost for 5-minute TTL entries, 2x for 1-hour TTL entries. But cache reads are just 0.1x base cost (90% savings). The write premium means you need multiple reads to break even, but high-traffic applications see massive net savings.
TTL options: 5-minute cache (default) suits conversational applications where context is reused across turns. 1-hour cache suits applications with stable context reused across many requests, like documentation assistants.
Control advantages: Explicit caching lets you guarantee that expensive context (like large RAG retrievals or comprehensive system prompts) is cached. You know exactly what's cached and can design around it.
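A sketch of explicit caching with the Anthropic Python SDK; the model ID and `LONG_SYSTEM_PROMPT` are placeholders, and the usage fields printed at the end are the ones the API reports for cache writes and reads.

```python
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # large, stable context: product docs, tool specs, policies

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute your model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cached; 5-minute TTL by default
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my API key?"}],
)

# Usage metadata distinguishes cache writes from cache reads for this request.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```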
Caching Economics
The break-even calculation determines whether caching saves money:
For OpenAI (no write premium, 50% read discount): Break-even is immediate—any cache hit saves money.
For Anthropic (1.25x write cost, 0.1x read cost): Each cache write adds a 0.25x premium on the cached tokens, while each cache read saves 0.9x. With the 5-minute TTL, a single read already more than recovers the write premium (break-even at roughly 0.3 reads per write). With the 1-hour TTL (2x write cost), the 1.0x premium needs about 1.1 reads per write to break even.
In practice, applications with any meaningful traffic easily exceed these thresholds. A customer service bot that caches its system prompt and context at the start of each conversation typically collects 5-10 cache hits over the following turns, yielding substantial savings.
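The break-even arithmetic is simple enough to express directly; this hypothetical helper just compares one cached write plus R reads against sending the same prefix uncached 1 + R times.

```python
# Compare one cached write plus R reads against sending the same prefix
# uncached 1 + R times: write_mult + R * read_mult = 1 + R, solved for R.
def breakeven_reads(write_mult: float, read_mult: float) -> float:
    return (write_mult - 1) / (1 - read_mult)

print(breakeven_reads(1.25, 0.10))  # Anthropic 5-minute TTL -> ~0.28 reads per write
print(breakeven_reads(2.00, 0.10))  # Anthropic 1-hour TTL   -> ~1.11 reads per write
print(breakeven_reads(1.00, 0.50))  # OpenAI, no write premium -> 0 (immediate)
```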
Latency Benefits
Beyond cost, prompt caching dramatically reduces latency. Anthropic reports latency reductions up to 85% for long prompts—a 100,000-token context that took 11.5 seconds dropped to 2.4 seconds with caching. Time-to-first-token improves by 50-80% for cached prefixes because the model skips recomputing attention for cached content.
For user-facing applications, this latency improvement often matters more than cost savings. Faster responses improve user experience and enable use cases that would be impractical with full-prompt latency.
Semantic Caching: Reuse Across Similar Queries
Prompt caching requires exact prefix matches. Semantic caching goes further—caching responses based on query meaning rather than exact text. If a user asks "What's the weather in NYC?" and another asks "How's the weather in New York City?", semantic caching recognizes these as equivalent and returns the cached response.
How Semantic Caching Works
Semantic caching systems convert queries to vector embeddings that capture semantic meaning. When a new query arrives, its embedding is compared against cached query embeddings using similarity search. If a sufficiently similar query exists in cache, the cached response is returned without calling the LLM.
The similarity threshold determines the tradeoff between cache hit rate and response accuracy. A threshold of 0.95 (very similar) provides high accuracy but lower hit rates. A threshold of 0.85 (moderately similar) increases hit rates but risks returning inappropriate cached responses. The optimal threshold depends on your application's tolerance for variation.
Implementation Options
GPTCache is the most popular open-source semantic caching library. GPTCache integrates with LangChain and LlamaIndex, supporting multiple embedding models and vector stores. It handles embedding generation, similarity search, and cache eviction automatically.
Redis with vector search provides a production-grade foundation for semantic caching. RedisVL's SemanticCache offers turnkey semantic caching with configurable similarity thresholds, TTL-based expiration, and high-performance similarity search via HNSW indexing.
LangChain's RedisSemanticCache provides a drop-in cache backend for LangChain applications. With minimal configuration, you specify your Redis connection and embedding model, and LangChain automatically caches and retrieves responses based on semantic similarity.
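A minimal configuration sketch for the LangChain route; import paths and constructor arguments vary across LangChain versions, so treat this as indicative rather than exact.

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings

# Register a semantic cache globally; LLM calls made through LangChain then
# check it before hitting the provider.
set_llm_cache(
    RedisSemanticCache(
        redis_url="redis://localhost:6379",
        embedding=OpenAIEmbeddings(),
        score_threshold=0.2,  # lower is stricter; tune to your tolerance for variation
    )
)
```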
Semantic Caching Considerations
Embedding costs: Every cache lookup requires embedding the query, which costs money (though much less than an LLM call). For very short queries, embedding cost might exceed the savings from occasional cache hits.
Staleness: Semantic caches can return outdated information. If cached responses reference time-sensitive data, stale responses create problems. Set appropriate TTLs and consider cache invalidation for responses that depend on changing data.
Context sensitivity: Two semantically similar questions might require different answers depending on conversation context. "Tell me more" means different things in different conversations. Simple semantic caching doesn't capture this context, though advanced systems like ContextCache address multi-turn awareness.
Cache poisoning: If incorrect responses enter the cache, they'll be served to subsequent users. Implement quality checks before caching and provide mechanisms to invalidate bad entries.
When to Use Semantic Caching
Semantic caching works well for:
- FAQ-style queries where users ask the same questions with different wording
- Lookup queries requesting specific facts that don't change frequently
- High-traffic applications where cache hit rates compound into significant savings
- Latency-sensitive applications where cache hits provide instant responses
Semantic caching works poorly for:
- Personalized responses that should differ per user
- Time-sensitive queries where cached responses quickly become stale
- Creative or generative tasks where variation is desirable
- Low-traffic applications where cache infrastructure costs exceed savings
Redis for LLM Caching: Deep Dive
Redis has emerged as the de facto standard for LLM caching infrastructure. Its sub-millisecond latency, built-in vector search, and flexible data structures make it ideal for both semantic caching and session management.
Why Redis for LLM Caching
Redis provides the performance characteristics LLM caching demands:
Sub-millisecond latency: Redis operations complete in under 1ms, ensuring cache lookups don't add noticeable delay. This is critical when the goal is reducing latency, not just cost.
Vector search capabilities: Redis supports native vector similarity search through the RediSearch module, enabling semantic caching without additional infrastructure. HNSW (Hierarchical Navigable Small World) indexing provides efficient approximate nearest neighbor search.
TTL-based expiration: Built-in key expiration handles cache invalidation automatically. Set TTLs appropriate to your data freshness requirements—responses expire without manual cleanup.
Persistence options: Redis can persist data to disk, surviving restarts. For caching, you might prefer volatile storage; for conversation state, persistence matters.
Horizontal scaling: Redis Cluster distributes data across nodes for applications exceeding single-node capacity.
RedisVL: The AI-Native Python Client
RedisVL (Redis Vector Library) provides a purpose-built Python client for AI applications. It simplifies vector search, semantic caching, and session management with an intuitive API.
SemanticCache interface: RedisVL's SemanticCache combines Redis's caching capabilities with vector search. When a query arrives, RedisVL generates an embedding, searches for similar cached queries, and returns cached responses if similarity exceeds the threshold.
Key SemanticCache features:
- Configurable similarity thresholds (distance metrics: cosine, L2, IP)
- TTL-based automatic expiration
- Multiple embedding model support
- Built-in HNSW indexing for fast similarity search
How SemanticCache works:
- Query arrives at your application
- RedisVL generates an embedding using your configured embedding model
- Vector similarity search finds the most similar cached query
- If similarity exceeds threshold, return cached response
- If no match, call the LLM, cache the response with its embedding, return response
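A minimal sketch of that flow using RedisVL's SemanticCache; the import path and constructor arguments may differ slightly between RedisVL versions, and `call_llm` is a placeholder for your actual model call.

```python
from redisvl.extensions.llmcache import SemanticCache  # import path varies by version

cache = SemanticCache(
    name="llm_answers",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # smaller = stricter semantic match
    ttl=3600,                # expire entries after an hour
)

def answer(query: str) -> str:
    if hits := cache.check(prompt=query):
        return hits[0]["response"]            # semantic cache hit
    response = call_llm(query)                # placeholder for your LLM call
    cache.store(prompt=query, response=response)
    return response
```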
HNSW Vector Indexing
Redis uses HNSW (Hierarchical Navigable Small World) graphs for efficient vector similarity search. HNSW provides approximate nearest neighbor search—trading perfect accuracy for dramatically faster search times.
HNSW parameters:
- M: Number of connections per node. Higher M improves recall but increases memory and indexing time. Typical values: 16-64.
- EF_CONSTRUCTION: Size of dynamic candidate list during index building. Higher values improve index quality but slow construction. Typical values: 100-500.
- EF_RUNTIME: Size of dynamic candidate list during search. Higher values improve recall but slow queries. Can be tuned at query time.
For semantic caching, HNSW's approximate search is ideal—you don't need perfect nearest neighbor, just "similar enough" to return cached responses.
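For reference, creating an HNSW vector index directly with redis-py looks roughly like this; the index name, key prefix, and embedding dimension are placeholders for your own setup.

```python
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis()
schema = (
    TextField("prompt"),
    TextField("response"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 384,                  # must match your embedding model's dimension
        "DISTANCE_METRIC": "COSINE",
        "M": 16,                     # connections per node
        "EF_CONSTRUCTION": 200,      # build-time candidate list size
    }),
)
r.ft("llm_cache_idx").create_index(
    schema,
    definition=IndexDefinition(prefix=["llmcache:"], index_type=IndexType.HASH),
)
```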
Vector Quantization for Cost Efficiency
Redis now supports int8 quantization for vector embeddings. Quantization compresses float embeddings to 8-bit integers, reducing memory usage by 75% and improving search speed by 30% while maintaining 99.99% search accuracy.
For high-volume caching with millions of cached queries, quantization significantly reduces infrastructure costs without meaningful quality degradation.
LangCache: Managed Semantic Caching
LangCache, introduced in Redis's 2025 Spring Release, provides managed semantic caching as a service. Key features:
- REST API: Simple HTTP interface for cache operations, no Redis client library required
- Optimized performance: Advanced optimizations for caching accuracy and speed
- Managed infrastructure: Redis handles scaling, persistence, and maintenance
LangCache suits teams wanting semantic caching benefits without managing Redis infrastructure.
Redis for Rate Limiting
Beyond caching, Redis excels at rate limiting—essential for managing LLM API costs:
Token bucket implementation: Store bucket state in Redis. Each request decrements tokens; tokens regenerate over time. Redis's atomic operations ensure accurate counting across distributed systems.
Sliding window rate limiting: Track request timestamps in sorted sets. Count requests in the recent window. Redis's sorted set operations make this efficient.
Distributed coordination: When running multiple application instances, Redis provides the shared state needed for accurate rate limiting across the fleet.
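A simple sliding-window limiter along these lines, using a sorted set of timestamps per key and a transactional pipeline; the key naming and limits are illustrative.

```python
import time
import uuid
import redis

r = redis.Redis()

def allow_request(key: str, limit: int, window_seconds: int = 60) -> bool:
    now = time.time()
    pipe = r.pipeline()  # transactional by default, so the steps apply atomically
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop timestamps outside the window
    pipe.zadd(key, {str(uuid.uuid4()): now})             # record this request
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, window_seconds)                     # clean up idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit

# e.g. cap a user at 100 LLM requests per minute
if not allow_request("rl:user:42", limit=100):
    raise RuntimeError("rate limit exceeded")
```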
Redis Architecture for LLM Applications
A typical Redis deployment for LLM applications includes:
Cache layer: SemanticCache for LLM responses. Configure appropriate TTLs based on content freshness requirements.
Session layer: Conversation history and user state. Use Redis hashes for structured session data with per-session TTLs.
Rate limiting layer: Request counters and token buckets. Use Redis strings with atomic increment operations.
Embedding cache: Cache embeddings for repeated text to avoid redundant embedding API calls. Embeddings are deterministic—cache them aggressively.
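A sketch of the embedding-cache layer, keyed on a hash of model plus text; `embed` is a placeholder for your embedding API call and the TTL is illustrative.

```python
import hashlib
import json
import redis

r = redis.Redis()

def get_embedding_cached(text: str, model: str = "text-embedding-3-small") -> list[float]:
    key = "emb:" + hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if cached := r.get(key):
        return json.loads(cached)
    vector = embed(text, model)                         # placeholder embedding API call
    r.set(key, json.dumps(vector), ex=30 * 24 * 3600)   # embeddings are stable: long TTL
    return vector
```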
Performance Optimization
Maximize Redis caching performance:
Connection pooling: Reuse Redis connections rather than creating new ones per request. Connection establishment adds latency.
Pipelining: Batch multiple Redis operations into single round-trips when possible. Reduces network overhead.
Local caching: For extremely hot data, add in-process caching (LRU cache) in front of Redis. Check local cache first, fall back to Redis.
Appropriate TTLs: Balance cache hit rates against staleness. Too short TTLs reduce hits; too long TTLs serve stale data.
Monitoring: Track cache hit rates, latency percentiles, and memory usage. Redis provides extensive metrics through INFO command and Redis Insight.
Model Routing: Right-Size Every Request
Not every query requires the most powerful (and expensive) model. A simple factual question doesn't need GPT-4's reasoning capabilities—a smaller, cheaper model handles it equally well. Model routing directs each request to the most cost-effective model capable of handling it.
The Routing Opportunity
Research on model routing demonstrates that starting 90% of queries with smaller models and escalating only complex requests to premium models achieves 87% cost reduction while maintaining quality. The insight is that task difficulty varies dramatically, but uniform model selection treats every request identically.
Consider the cost differential: GPT-4o costs roughly $5 per million input tokens while GPT-4o-mini costs roughly $0.15 per million—a 33x difference. If 80% of your queries can be handled by the smaller model, routing cuts the cost of those queries by roughly 97%, close to an 80% reduction in total spend, while keeping full capability for the complex 20%.
Routing Strategies
Complexity classification uses a small model or classifier to assess query complexity before routing. Simple queries (greetings, factual lookups, straightforward instructions) route to cheap models. Complex queries (multi-step reasoning, nuanced analysis, creative tasks) route to capable models.
The classifier can be rule-based (keyword matching, query length, presence of specific patterns), a small ML model trained on query-difficulty labels, or even an LLM-based classifier (though this adds latency and cost).
Confidence-based routing sends all queries to a cheap model first, then escalates to a more capable model if the cheap model's confidence is low. This requires models that provide reliable confidence signals—calibrated probabilities, explicit uncertainty expressions, or consistency across multiple samples.
Domain-based routing uses different models for different task types. Code generation might use a specialized code model. Creative writing might use a model fine-tuned for that purpose. Factual queries might use a smaller general model. The routing logic matches query intent to specialized models.
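A minimal rule-based complexity classifier in the spirit of the first strategy; the model names, patterns, and thresholds are illustrative and would need tuning against your own traffic.

```python
import re

CHEAP_MODEL = "gpt-4o-mini"   # illustrative tier names
PREMIUM_MODEL = "gpt-4o"

COMPLEX_PATTERNS = re.compile(
    r"(explain why|step[- ]by[- ]step|compare|analy[sz]e|refactor|prove|write (a|an) )",
    re.IGNORECASE,
)

def choose_model(query: str, context_tokens: int = 0) -> str:
    # Long queries, large contexts, or reasoning-style phrasing go to the premium tier.
    if len(query) > 400 or context_tokens > 4000 or COMPLEX_PATTERNS.search(query):
        return PREMIUM_MODEL
    return CHEAP_MODEL

print(choose_model("What are your opening hours?"))                   # -> gpt-4o-mini
print(choose_model("Compare these two contracts clause by clause."))  # -> gpt-4o
```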
Cascading: Sequential Escalation
Cascading extends routing by trying models in sequence: start with the cheapest model, evaluate the response, and escalate to more capable models only if needed. Research shows cascade routing combines the adaptability of routing with the cost-efficiency of cascading, improving performance by 4% while reducing costs.
The cascade sequence typically progresses from smallest/cheapest to largest/most expensive: first try a 7B parameter model, then a 70B model, then GPT-4 class. Escalation triggers when the current model's response fails quality checks, expresses uncertainty, or matches patterns indicating the task exceeds its capabilities.
Cascading adds latency for queries that escalate—you pay for multiple model calls. But for applications where most queries are simple, the savings on non-escalated queries outweigh the cost of occasional escalation.
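A compact sketch of the cascade loop; `call_model` and `passes_quality_check` are placeholders for your client call and whatever verification you use (rules, self-consistency, or a judge model).

```python
CASCADE = ["small-7b", "medium-70b", "frontier"]  # illustrative tier names

def cascade_answer(query: str) -> tuple[str, str]:
    for tier in CASCADE:
        response = call_model(tier, query)                 # placeholder client call
        if tier == CASCADE[-1] or passes_quality_check(query, response):
            return tier, response  # accept, or return the top tier's best effort
```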
Quality Verification
Both routing and cascading require mechanisms to verify response quality:
Self-consistency checks whether the model gives the same answer across multiple samples. High consistency suggests confidence; low consistency suggests the model is uncertain and might benefit from escalation.
Verification queries ask a model to check another model's response. This adds cost but catches errors before they reach users.
Rule-based checks verify that responses meet format requirements, contain expected elements, and don't contain obvious errors. Failed checks trigger escalation or retry.
LLM-as-judge uses a capable model to evaluate responses from cheaper models, escalating when quality scores fall below thresholds.
Implementation Considerations
Latency tradeoffs: Classification and quality verification add latency. For latency-sensitive applications, keep classification fast (small models, simple rules) and be willing to over-route to capable models rather than adding verification steps.
Training data: Supervised classifiers need labeled data mapping queries to appropriate models. This data can come from historical request logs annotated with which model successfully handled each query.
Threshold tuning: Quality thresholds for escalation require tuning. Too aggressive escalation wastes money on unnecessary model calls. Too conservative escalation serves poor-quality responses. Monitor quality metrics across model tiers and adjust thresholds based on observed performance.
Batch Processing: 50% Discount for Asynchronous Work
When immediate responses aren't required, batch processing offers substantial savings. OpenAI's Batch API provides a 50% discount on both input and output tokens for requests that can wait up to 24 hours (though most complete within minutes to hours).
When Batching Applies
Batch processing suits workloads that don't require real-time responses:
Evaluation and testing: Running prompts against large test datasets, evaluating model changes, or generating baseline metrics. These tasks run in the background and results are analyzed later.
Data generation: Creating training data for fine-tuning, generating synthetic datasets, or producing content at scale. The volume makes batching essential for cost control.
Scheduled processing: Nightly summarization of daily content, weekly report generation, or periodic data enrichment. The schedule accommodates batch processing windows.
Bulk operations: Migrating data, backfilling embeddings, or processing historical records. These one-time or periodic jobs don't need immediate completion.
Batch API Mechanics
The Batch API accepts requests in JSONL format—one request per line in a file. You upload the file, create a batch job, and poll for completion. Results arrive in another JSONL file with responses matched to original requests.
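A sketch of that end-to-end flow with the OpenAI Python SDK; `documents` stands in for your own data, and the model and token limits are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()
documents = ["..."]  # your own texts to process

# 1. One request per line, each with a unique custom_id for matching results.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 200,
            },
        }) + "\n")

# 2. Upload the file and create the batch job.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll for completion, then download results via the job's output_file_id.
print(client.batches.retrieve(job.id).status)
```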
Rate limits are dramatically higher for batch processing—up to 250 million input tokens can be enqueued for GPT-4 class models, far beyond standard per-minute limits. This enables processing volumes that would be impractical with synchronous APIs.
Completion time is guaranteed within 24 hours but typically much faster. Most batch jobs complete within minutes to a few hours depending on size and current load.
Cost calculation: Combining batch processing with prompt caching yields remarkable efficiency. A workflow using both might pay around $4 per million output tokens—a fraction of synchronous full-price rates.
Batch Design Patterns
Chunked processing splits large jobs into multiple batches to monitor progress and handle failures gracefully. Rather than one million-request batch, submit ten 100,000-request batches with checkpointing between them.
Priority queuing separates truly non-urgent work (run whenever) from time-sensitive batch work (complete within hours). Different queues can have different retry and monitoring policies.
Hybrid architectures use synchronous APIs for user-facing requests and batch APIs for background processing, even within the same application. User interactions get immediate responses; derived processing (generating embeddings, creating summaries, updating indexes) happens asynchronously.
Token Optimization: Doing More with Less
Beyond caching and routing, optimizing token usage directly reduces costs at the source.
Prompt Compression
Verbose prompts waste tokens. Prompt compression techniques reduce token count while preserving effectiveness:
Instruction condensation rewrites lengthy instructions into concise equivalents. "Please provide a detailed and comprehensive summary of the following document, making sure to include all key points and relevant details" becomes "Summarize comprehensively." The model understands either; the second costs fewer tokens.
Example pruning reduces few-shot examples to the minimum needed for quality. Often 2-3 examples work as well as 5-6, halving example token cost.
Context selection for RAG retrieves only the most relevant chunks rather than padding to a fixed count. If 2 chunks answer the question, don't retrieve 5 chunks "just in case."
Dynamic prompting adjusts prompt detail based on query complexity. Simple queries get minimal instructions; complex queries get detailed guidance. This right-sizes prompts to tasks.
Output Control
Since output tokens cost more than input tokens, controlling output length has outsized impact:
Max_tokens limits prevent runaway responses. Set limits appropriate to expected response length—don't allow 4,000 tokens when 500 suffices.
Format instructions guide concise responses. "Respond in 2-3 sentences" or "Provide a bulleted list with up to 5 items" constrains output length.
Structured outputs (JSON mode, function calling) produce predictable response lengths without conversational padding. A JSON response with specific fields is typically shorter than prose covering the same information.
Token-Budget-Aware Reasoning
Chain-of-thought reasoning improves quality but increases token usage substantially. Research on token-budget-aware reasoning shows that reasoning processes can be compressed based on task complexity. Simple tasks don't need lengthy reasoning chains; complex tasks benefit from more tokens.
The practical application: don't request chain-of-thought for simple queries where the direct answer is obvious. Reserve detailed reasoning for problems that genuinely benefit from step-by-step thinking.
Context Management
Multi-turn conversations accumulate context that may no longer be relevant. Strategies for managing context:
Summarization checkpoints periodically summarize conversation history and restart context with the summary. A 50-turn conversation might be summarized every 10 turns, keeping context bounded while preserving key information.
Selective inclusion includes only relevant prior turns rather than complete history. If the current question is self-contained, prior turns may be unnecessary.
Sliding windows keep only the most recent N turns, dropping older context. This works for conversations where recent context matters most.
Memory systems store conversation information externally and retrieve relevant portions rather than including everything in context. This moves from "keep all context in prompt" to "retrieve relevant context on demand."
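A sketch combining the sliding-window and summarization-checkpoint ideas; `summarize` is a placeholder for a cheap-model call that folds older turns into the running summary.

```python
MAX_RECENT_TURNS = 10

def build_context(history: list[dict], summary: str) -> tuple[list[dict], str]:
    # Fold turns older than the window into a running summary, keeping only
    # the most recent turns verbatim.
    if len(history) > MAX_RECENT_TURNS:
        overflow, history = history[:-MAX_RECENT_TURNS], history[-MAX_RECENT_TURNS:]
        summary = summarize(summary, overflow)  # placeholder summarization call
    messages = []
    if summary:
        messages.append({"role": "system", "content": f"Conversation so far: {summary}"})
    return messages + history, summary
```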
Putting It Together: Compound Savings
Individual optimizations yield incremental savings. Combining them produces compound savings that transform cost structures:
Base cost: 100% (no optimization)
Add prompt caching: 40% savings on repeated context → 60% of base
Add semantic caching: 20% additional queries served from cache → 48% of base
Add model routing: 50% of remaining queries handled by models costing roughly half as much → 36% of base
Add batch processing for async work: 50% discount on 30% of volume → 31% of base
Add token optimization: 10% reduction in tokens across all requests → 28% of base
This compound effect—starting from 100% and reaching 28%—represents 72% total savings. Real-world results vary, but organizations report 60-80% cost reductions through systematic optimization.
Implementation Priority
Not all optimizations are equally easy to implement. Prioritize by effort-to-impact ratio:
High impact, low effort: Enable prompt caching (often just prompt restructuring), set appropriate max_tokens limits, use cheaper models where possible.
High impact, medium effort: Implement semantic caching, build model routing logic, convert async workloads to batch processing.
Medium impact, higher effort: Build sophisticated context management, implement cascade routing with quality verification, develop custom prompt compression.
Start with quick wins, measure results, then tackle higher-effort optimizations based on remaining cost drivers.
Monitoring and Continuous Optimization
Cost optimization isn't a one-time project—it requires ongoing monitoring and adjustment.
Key Metrics to Track
Cost per conversation/request reveals trends and anomalies. Sudden increases indicate problems; gradual increases suggest growing complexity or feature additions.
Cache hit rates for both prompt and semantic caching indicate efficiency. Low hit rates suggest caching isn't providing value; investigate whether cache TTL is appropriate or whether query patterns preclude effective caching.
Model utilization by tier shows routing effectiveness. If the expensive model handles 90% of queries, routing isn't working. If the cheap model handles 99%, you might be under-utilizing capability.
Token efficiency (output tokens per request, context tokens per conversation) reveals optimization opportunities. Rising token counts indicate growing prompts or responses.
Alerting and Budgets
Configure alerts for cost anomalies—sudden spikes, budget threshold approaches, unusual patterns. Alerts should trigger before costs become problems, not after.
Implement hard budget caps where appropriate. A runaway process or bug shouldn't be able to spend unlimited money. Rate limiting by cost (not just requests) provides financial protection.
A/B Testing Optimizations
Before fully deploying optimizations, A/B test them:
- Does the cheaper model actually maintain quality for routed queries?
- Does semantic cache return appropriate responses, or do users notice degradation?
- Does prompt compression affect task success rates?
Quality metrics should accompany cost metrics. Savings that damage quality aren't real savings—they're just deferred costs in user churn and remediation.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
LLM Observability and Monitoring: From Development to Production
A comprehensive guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.
Testing LLM Applications: A Practical Guide for Production Systems
Comprehensive guide to testing LLM-powered applications. Covers unit testing strategies, integration testing with cost control, LLM-as-judge evaluation, regression testing, and CI/CD integration with 2025 tools like DeepEval and Promptfoo.
Mastering LLM Context Windows: Strategies for Long-Context Applications
Practical techniques for managing context windows in production LLM applications—from compression to hierarchical processing to infinite context architectures.