LLM Routing & Model Selection: Intelligent Multi-Model Orchestration for Production
Comprehensive guide to LLM routing strategies that cut costs by up to 85% while maintaining quality. Covers the 2025 model landscape (GPT-5.2, Claude 4.5, Gemini 3, DeepSeek-V3), RouteLLM, Martian, cascade routing, and production patterns.
The 2025 LLM landscape offers unprecedented choice—from lightweight models like Gemini 3 Flash at $0.50 per million input tokens to frontier models priced at $15 or more per million. Not every query requires the most powerful model. A simple factual question doesn't need GPT-5.2's advanced reasoning; a smaller, cheaper model handles it equally well. Yet uniformly using cheaper models sacrifices quality on complex tasks.
LLM routing solves this by examining incoming queries and directing them to the best-suited model for each specific request. Research demonstrates that intelligent routing can reduce inference costs by up to 85% while maintaining—or even improving—response quality. This guide covers the complete landscape of routing strategies, from the 2025 model ecosystem to production-ready tools and implementation patterns.
The 2025 Model Landscape for Routing
Understanding the current model ecosystem is essential for effective routing. The landscape has evolved dramatically, with clear tiers emerging based on capability, cost, and specialization.
Frontier Models (Tier 1)
These models represent the cutting edge of capability, commanding premium pricing but delivering superior performance on complex reasoning, coding, and analysis tasks.
GPT-5.2 from OpenAI leads mathematical reasoning with a perfect AIME score. The model introduced adaptive reasoning that dynamically adjusts "thinking time" based on task complexity. Pricing sits around $1.25 input / $10 output per million tokens—75% cheaper than GPT-4o while delivering superior reasoning.
Claude Opus 4.5 from Anthropic leads coding quality benchmarks, achieving 77.2% on SWE-bench Verified—the highest success rate for resolving real GitHub issues. The model excels at long-context analysis and complex document understanding. Pricing is premium at $15 input / $75 output per million tokens, but justified for coding-intensive work where quality matters more than speed.
Gemini 3 Pro from Google integrates "Deep Think" mode for extended reasoning, competing directly with OpenAI's o-series models. Pricing around $2 input / $12 output per million tokens positions it competitively, with search grounding available in a free tier for applications that benefit from real-time information.
Grok 4.1 from xAI emphasizes speed and competitive pricing at $15 per million output tokens, with particular strength in real-time information synthesis.
High-Capability Models (Tier 2)
These models offer excellent capability at significantly lower cost—the sweet spot for many production applications.
Claude 4.5 Sonnet delivers most of Opus's capability at $3 input / $15 output per million tokens—5x cheaper than Opus. For most tasks that don't require Opus-level reasoning, Sonnet provides better value.
GPT-4o remains highly capable at $5 input / $20 output per million tokens, though it is increasingly superseded by the GPT-5 series for new deployments.
Gemini 2.5 Pro offers strong multimodal capabilities with tiered input pricing: $1.25 per million tokens for standard contexts, rising to $2.50 for longer contexts.
DeepSeek-V3.2 represents the open-source frontier with 671B parameters, offering GPT-4 level performance at dramatically lower self-hosted costs. For organizations with GPU infrastructure, DeepSeek provides a compelling alternative to API-based models.
Efficient Models (Tier 3)
These models handle routine tasks at minimal cost—essential for high-volume applications where routing sends the majority of traffic.
Gemini 3 Flash stands out as the value champion, delivering 78% on coding benchmarks and 90.4% on scientific reasoning at just $0.50 per million input tokens. For routing strategies, Flash serves as an excellent default for the majority of queries.
GPT-4o Mini at $0.60 input / $2.40 output per million tokens provides OpenAI-ecosystem compatibility at roughly 8x lower cost than GPT-4o.
Claude Haiku 4.5 at $0.25 input / $1.25 output per million tokens offers the Anthropic ecosystem's most cost-effective option for simple tasks.
Gemini 1.5 Flash-8B represents the efficiency extreme at $0.0375 per million tokens—suitable for very high-volume, simple classification or filtering tasks.
Open-Source Models for Self-Hosted Routing
Organizations with GPU infrastructure can dramatically reduce costs by self-hosting open-source models as routing targets.
Qwen3 from Alibaba offers a 235B parameter MoE model with ~22B active parameters per token. Strong on reasoning, code, and multilingual tasks with 32K native context extending to 131K. Apache 2.0 licensed for commercial use.
DeepSeek-R1 excels at long reasoning chains and math-heavy tasks, with performance approaching o3 and Gemini 2.5 Pro. The model family includes distilled variants from tiny to mid-sized, enabling flexible deployment.
Llama 4 Maverick from Meta focuses on coding efficiency with strong multimodal capabilities for text and image understanding.
Mistral Small 3 (24B parameters) achieves state-of-the-art capabilities comparable to larger models, ideal for fast-response conversational agents and low-latency function calling.
Model Routing Implications
The 2025 pricing landscape creates clear routing opportunities:
| Model Tier | Representative Models | Input Cost (per 1M) | Best For |
|---|---|---|---|
| Frontier | Claude Opus 4.5, GPT-5.2 | $3-15 | Complex reasoning, critical coding |
| High-Cap | Claude Sonnet 4.5, GPT-5.1 | $1.25-3 | General complex tasks |
| Efficient | Gemini 3 Flash, GPT-4o Mini | $0.50-0.60 | Routine queries, high volume |
| Ultra-Efficient | Gemini Flash-8B, Haiku 4.5 | $0.04-0.25 | Simple tasks, filtering |
The cost differential between tiers is dramatic—Claude Opus 4.5 costs 30x more than Gemini 3 Flash. Routing 70% of queries to efficient models while reserving frontier models for complex tasks yields significantly better ROI than uniform model selection.
Routing Fundamentals: Strategies and Tradeoffs
Understanding the landscape of routing approaches helps choose the right strategy for your application.
Routing vs. Cascading vs. Cascade Routing
Three distinct approaches exist for multi-model selection:
Routing selects a single model per query upfront. A classifier or rule system examines the query and routes to one model, which produces the final response. This is fast—only one model call per query—but requires accurate upfront classification. Misrouting means either wasted cost (routing simple queries to expensive models) or quality loss (routing complex queries to weak models).
Cascading tries models sequentially, starting with the cheapest. If the first model's response fails quality checks, the system escalates to the next model. This continues until quality thresholds are met or all models are exhausted. Cascading doesn't require accurate upfront classification but adds latency for queries that escalate—you pay for multiple model calls.
Cascade routing unifies both approaches into a theoretically optimal strategy. Research from ETH Zurich proves that cascade routing combines the adaptability of routing with the cost-efficiency of cascading, improving performance by 4% while reducing costs compared to either approach alone. The system both routes and cascades: initial routing directs queries to likely-appropriate models, and cascading provides a safety net when initial routing errs.
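To make the distinction concrete, here is a minimal sketch of cascade routing: a complexity estimate picks the starting tier, and a quality check escalates when needed. The tier names, `estimate_complexity`, `call_model`, and `passes_quality` are hypothetical placeholders for your own classifier, LLM client, and verifier, not any specific library's API.

```python
# Minimal cascade-routing sketch. `call_model` and `passes_quality` are
# hypothetical stand-ins for your LLM client and quality check.
from typing import Callable

TIERS = ["gemini-3-flash", "claude-sonnet-4.5", "claude-opus-4.5"]  # cheapest first

def route_then_cascade(
    query: str,
    estimate_complexity: Callable[[str], float],   # 0.0 = trivial, 1.0 = hardest
    call_model: Callable[[str, str], str],
    passes_quality: Callable[[str, str], bool],
) -> str:
    # Routing step: pick a starting tier from the complexity estimate.
    score = estimate_complexity(query)
    start = 0 if score < 0.4 else 1 if score < 0.8 else 2

    # Cascading step: escalate if the response fails the quality check.
    for model in TIERS[start:]:
        response = call_model(model, query)
        if passes_quality(query, response):
            return response
    return response  # all tiers exhausted; return the best effort
```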
The Planner-Executor Pattern
A practical pattern emerging in 2025 production systems is the planner-executor hybrid. Enterprise implementations typically use:
Planner models (GPT-5.2, Gemini 3 Deep Think, Claude Sonnet 4.5) handle hard reasoning, tool orchestration, and complex decision-making. These models process fewer requests but require high capability.
Executor models (Haiku 4.5, DeepSeek-V3.2, Qwen3-30B) handle high-volume, low-latency calls—the routine work that comprises most traffic. These models optimize for cost and speed.
The planner determines what needs to happen; executors carry out the plan. For agentic workflows, this might mean Claude Sonnet 4.5 deciding which tools to call and in what order, while Gemini 3 Flash executes individual API calls and processes results.
Classification Approaches
The core challenge in routing is determining query complexity or appropriate model. Several classification approaches exist:
Rule-based classification uses keywords, patterns, query length, and heuristics. Simple queries (greetings, factual lookups) route to efficient models; complex queries (multi-step reasoning, code generation) route to capable models. Rule-based systems are fast and interpretable but brittle—they miss nuanced complexity and require manual maintenance.
Common rule signals include (see the sketch after this list):
- Query length (longer queries often indicate complexity)
- Presence of code blocks or technical terminology
- Keywords indicating reasoning ("analyze," "compare," "explain why")
- Conversation context (follow-up questions may be simpler)
- User tier (premium users might default to better models)
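A minimal rule-based router built from the signals above; the keyword lists, thresholds, and tier names are illustrative assumptions and would need tuning against real traffic.

```python
import re

REASONING_KEYWORDS = ("analyze", "compare", "explain why", "prove", "design")

def classify_by_rules(query: str, is_premium_user: bool = False) -> str:
    """Return a routing tier ('efficient', 'capable', 'frontier') from simple heuristics."""
    text = query.lower()
    has_code = "```" in query or bool(re.search(r"\bdef |\bclass |\bimport ", query))
    has_reasoning = any(kw in text for kw in REASONING_KEYWORDS)
    long_query = len(query.split()) > 150

    if has_code or (has_reasoning and long_query):
        return "frontier" if is_premium_user else "capable"
    if has_reasoning or long_query:
        return "capable"
    return "efficient"
```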
ML classifier routing trains a small model to predict which LLM will best handle each query. The classifier might be a fine-tuned BERT model (adding ~50ms latency) or a lightweight neural network. Training requires labeled data mapping queries to appropriate models, which can come from historical logs, human annotation, or LLM judges.
LLM-based classification uses a small, cheap LLM to assess query complexity before routing. This adds cost and latency but captures nuance that rules miss. The classifier LLM might output a complexity score, a task category, or a direct model recommendation.
Embedding similarity compares incoming queries to historical queries where model performance is known. If similar past queries performed well on a specific model, route the new query there. This approach improves with scale as the historical database grows.
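A sketch of the embedding-similarity approach, assuming query embeddings are computed elsewhere (any embedding model works): it votes among the k nearest historical queries for the model that served them best.

```python
import numpy as np

def route_by_similarity(
    query_embedding: np.ndarray,        # shape (d,)
    history_embeddings: np.ndarray,     # shape (n, d), embeddings of past queries
    history_best_model: list[str],      # best-performing model for each past query
    k: int = 5,
    default_model: str = "claude-sonnet-4.5",
) -> str:
    """Vote among the k most similar historical queries for the best model."""
    if len(history_best_model) == 0:
        return default_model
    # Cosine similarity between the new query and every historical query
    norms = np.linalg.norm(history_embeddings, axis=1) * np.linalg.norm(query_embedding)
    sims = history_embeddings @ query_embedding / np.maximum(norms, 1e-9)
    top_k = np.argsort(sims)[-k:]
    votes = [history_best_model[i] for i in top_k]
    return max(set(votes), key=votes.count)   # majority vote among neighbors
```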
Quality Verification for Cascading
Cascading requires mechanisms to verify response quality and decide whether to escalate:
Self-consistency checking runs the same query multiple times and compares responses. High consistency suggests confidence; low consistency suggests uncertainty that might benefit from a more capable model. Research shows this correlates well with actual correctness.
Confidence extraction analyzes response patterns for uncertainty signals. Hedging language ("I think," "possibly," "it might be") or explicit uncertainty statements trigger escalation. Some systems use a verification prompt asking the model to rate its own confidence.
Rule-based validation checks that responses meet format requirements, contain expected elements, and don't contain obvious errors. For structured outputs, schema validation catches malformed responses. For text, length and relevance checks identify problematic outputs.
LLM-as-judge uses a capable model to evaluate responses from cheaper models. When quality scores fall below thresholds, the system escalates. This adds cost but catches subtle quality issues that rule-based checks miss. Using a different model family for judging (e.g., Claude judging GPT outputs) reduces self-preference bias.
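A small escalation check combining the cheaper signals, with an optional judge score passed in from a separate LLM-as-judge call; the phrase list and thresholds are illustrative assumptions.

```python
HEDGE_PHRASES = ("i think", "possibly", "it might be", "i'm not sure")

def should_escalate(response: str, judge_score: float | None = None,
                    min_length: int = 20, judge_threshold: float = 0.7) -> bool:
    """Decide whether to escalate a cheap model's response to a stronger model."""
    text = response.strip().lower()
    if len(text) < min_length:                           # rule-based validation
        return True
    if any(phrase in text for phrase in HEDGE_PHRASES):  # confidence extraction
        return True
    if judge_score is not None and judge_score < judge_threshold:  # LLM-as-judge
        return True
    return False
```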
RouteLLM: The Research Foundation
RouteLLM from LMSYS provides the foundational research framework for LLM routing. Published at ICLR 2025, it formalizes the routing problem and provides open-source implementations of multiple router architectures.
The RouteLLM Framework
RouteLLM frames routing as learning to predict which model will provide better responses for a given query. The framework trains routers on preference data—human judgments of which model performed better on specific prompts. This training data comes primarily from Chatbot Arena, a platform where users interact with anonymous models and vote on which response they prefer.
Each training data point consists of a prompt and a comparison between two models' response quality (a win for model A, a win for model B, or a tie). The router learns to predict these preferences and routes queries accordingly.
The key insight is that routers don't need to predict absolute quality—they need to predict relative performance between model tiers. This is a simpler learning problem that generalizes well.
Router Architectures
RouteLLM explores four router architectures, each with different tradeoffs:
Similarity-weighted (SW) ranking computes weighted Elo ratings based on query similarity to training examples. When a new query arrives, the router finds similar historical queries and aggregates the model preferences from those examples. This approach requires no model training—only a similarity search over the preference database. It's simple to implement and update but requires maintaining a large preference database.
Matrix factorization learns latent representations of both queries and models. The router predicts preference by computing the interaction between query embeddings and model embeddings. This captures complex patterns in compact representations but requires training and periodic retraining as models evolve.
BERT classifier fine-tunes a BERT model to predict which model will provide a better response. Given a query, the classifier outputs a probability distribution over models. BERT provides strong language understanding with inference latency around 20-50ms—acceptable for most applications but potentially significant for latency-critical use cases.
Causal LLM classifier fine-tunes a small language model (like Llama-3-8B) on the preference prediction task. The LLM receives the query and outputs a routing decision. This approach captures the most nuance but has the highest inference cost among the router options—100-500ms depending on model size and hardware.
Performance Results
RouteLLM evaluations demonstrate dramatic cost savings:
- MT-Bench: Over 85% cost reduction compared to always using the expensive model
- MMLU: 45% cost reduction while maintaining quality
- GSM8K: 35% cost reduction on math reasoning tasks
The routers achieve these savings by correctly identifying that most queries don't require frontier model capabilities. On MT-Bench, for example, the router determined that over 85% of queries could be handled adequately by cheaper models.
Generalization and Transfer
Critically, the routers demonstrate strong generalization. When the strong and weak models are changed at test time (different from training), performance is maintained. This suggests routers learn general query complexity rather than model-specific patterns.
This generalization has practical implications: you can train a router once and continue using it as you update model versions. You don't need to retrain every time OpenAI releases a new GPT variant or Anthropic updates Claude.
Data Augmentation with LLM Judges
The researchers found that augmenting human preference data with synthetic labels from GPT-4 as a judge significantly improves router performance. The augmented dataset enables the causal LLM classifier to achieve over 50% improvement in routing accuracy compared to random routing.
This finding enables practical bootstrapping: teams can generate initial training data using LLM judges to evaluate which model performs better on their specific queries. As the system runs in production, human feedback can refine and improve the router over time.
Production Routing Tools
Several production-ready tools have emerged for LLM routing, each with different approaches and strengths.
Martian
Martian takes a unique approach to routing by focusing on understanding model internals. Rather than training classifiers on preference data, Martian builds predictive models of how different LLMs will behave on specific queries. Their technology "tries to understand the internals of what's going on inside of these models" to predict behavior accurately.
Martian claims their routing helps enterprises "exceed the quality of even the best individual frontier models at a fraction of the cost" by leveraging the collective intelligence of many models. The insight is that different models excel at different tasks—Martian's routing exploits this specialization.
Accenture's investment in Martian and integration into their enterprise AI services validates the approach for production use. As Martian notes, "Agents are like the killer use case for routing"—each step in an agentic workflow benefits from model-specific optimization.
Not Diamond
Not Diamond provides an AI model router that automatically determines which LLM is best-suited for any query. The system leverages evaluation data to predictively route requests, claiming to outperform every individual LLM on accuracy by up to 25% while reducing costs up to 10x.
Not Diamond's key innovation is Prompt Adaptation—an agentic system that programmatically rewrites and optimizes prompts for different models. Rather than just routing the original prompt, the system adapts prompts to each model's strengths. This can improve performance beyond what any single model achieves with the original prompt.
The platform supports both out-of-the-box routing and custom router training. Teams can use Not Diamond's pre-trained router immediately or train custom routers on their specific evaluation data to optimize for their use case.
Samwell AI reports a 10% improvement in LLM output quality with a 10% reduction in inference costs and latency using Not Diamond's technology—demonstrating that routing can improve both quality and cost simultaneously.
Arch-Router
Arch-Router represents the efficiency frontier—a 1.5B parameter model that runs in tens of milliseconds on commodity GPUs. Benchmarks show approximately 28x lower end-to-end latency than commercial competitors while matching or exceeding routing accuracy.
For latency-sensitive applications where even 50ms of routing overhead is problematic, Arch-Router provides a compelling option. The small model size enables edge deployment scenarios where larger routers are impractical. Self-hosting the router also eliminates external dependencies and associated latency.
RouterArena
RouterArena provides an open platform for comprehensive comparison of LLM routers. The benchmark enables fair evaluation across router implementations, helping teams choose the right approach for their requirements. As the routing space matures, standardized benchmarks become essential for making informed decisions.
LLM Gateways: Unified Access with Routing
LLM gateways provide unified API access to multiple models while incorporating routing, fallback, and observability features. For teams not building custom routing, gateways offer immediate value.
OpenRouter
OpenRouter serves as an LLM API gateway and marketplace. Teams integrate once—typically by pointing their OpenAI SDK at OpenRouter's base URL—and gain access to 300+ models across providers including GPT-5 series, Claude 4.5, Gemini 3, and open-source options.
Auto Router: OpenRouter's Auto Router (powered by Not Diamond) automatically selects the best model for each prompt. Instead of manually choosing a model, the router analyzes prompts and selects from curated options based on complexity, task type, and model capabilities.
Fallback handling: OpenRouter documents automatic failover when upstream providers return errors, alongside continuous health monitoring and rate-limit management. If OpenAI experiences issues, requests automatically route to alternatives.
Privacy controls: Zero Data Retention (ZDR) configuration routes requests only to providers that don't store prompts. By default, only request metadata is retained; prompt logging requires explicit opt-in.
Latency overhead: OpenRouter adds approximately 25-40ms under typical production conditions—minimal for most applications but potentially significant for real-time voice or gaming applications.
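Integration is typically a one-line change to the OpenAI SDK's base URL. A minimal sketch, assuming the `openai` Python package (v1+) and the `openrouter/auto` model slug for the Auto Router; verify both against OpenRouter's current documentation.

```python
# Point the OpenAI SDK at OpenRouter and let the Auto Router pick the model.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openrouter/auto",  # Auto Router selects a model per prompt
    messages=[{"role": "user", "content": "Summarize the tradeoffs of cascade routing."}],
)
print(completion.choices[0].message.content)
```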
LiteLLM
LiteLLM is an open-source gateway supporting 100+ LLMs through a unified API. It provides both a Python SDK and a proxy server for production deployments.
Routing and load balancing: LiteLLM's router provides multiple strategies for distributing calls across deployments. The recommended "simple-shuffle" strategy optimizes for production performance. Teams can configure provider lists, health checks, and automatic fallbacks to handle outages.
Performance: Benchmarks show 8ms P95 latency at 1,000 requests per second—suitable for high-throughput production workloads where gateway overhead must be minimal.
2025 features: Recent updates include MCP server support for tool integration, batch API routing to different provider accounts, prompt management with versioning, and an Agent Hub for registering organizational agents. GPT-5.1 support, NVIDIA NIM integration, and RunwayML for video generation extend capabilities beyond text.
Self-hosted advantage: Unlike cloud gateways, LiteLLM can be self-hosted for maximum data control. This suits organizations with strict data residency requirements or those wanting to eliminate external dependencies. The open-source model also enables customization of routing logic.
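A minimal Router configuration sketch along the lines described above. Parameter names follow LiteLLM's documented Router interface but should be checked against the installed version; the model names and group aliases are illustrative.

```python
# Two deployments load-balanced behind one alias, plus a fallback group.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "default-chat",                 # alias callers use
            "litellm_params": {"model": "gpt-4o-mini"},
        },
        {
            "model_name": "default-chat",                 # second deployment, shuffled
            "litellm_params": {"model": "gemini/gemini-2.0-flash"},
        },
        {
            "model_name": "strong-chat",
            "litellm_params": {"model": "claude-sonnet-4-5"},
        },
    ],
    routing_strategy="simple-shuffle",
    fallbacks=[{"default-chat": ["strong-chat"]}],        # escalate on failure
)

response = router.completion(
    model="default-chat",
    messages=[{"role": "user", "content": "Classify this support ticket."}],
)
```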
Portkey
Portkey provides an enterprise-grade AI gateway with integrated observability and governance. The platform handles over 10 billion LLM requests monthly with 99.9999% uptime and sub-10 millisecond latency.
Routing features: Portkey enables dynamic model switching, workload distribution, and failover with configurable rules. Routing can be based on latency, cost, availability, or custom criteria. Canary deployments enable gradual rollout of new models. Circuit breakers prevent cascade failures when providers degrade.
Observability integration: Recognized as a Gartner Cool Vendor in LLM Observability (2025), Portkey tracks 50+ AI-specific metrics per request including hallucination rates, token optimization opportunities, and response quality. Observability is native to the gateway rather than requiring separate integration.
Scale validation: One food delivery platform scaled from 37 million to over 2 billion monthly requests on Portkey, cutting effective spend growth by half through caching, batching, and routing optimization.
Gateway Comparison
| Gateway | Deployment | Models | Routing | Latency | Pricing | Best For |
|---|---|---|---|---|---|---|
| OpenRouter | Cloud | 300+ | Auto + Manual | ~25-40ms | Usage-based | Quick start, broad access |
| LiteLLM | Self-hosted/Cloud | 100+ | Configurable | ~8ms P95 | Free (OSS) | Data control, customization |
| Portkey | Cloud | 1600+ | Dynamic | <10ms | From $49/mo | Enterprise, observability |
Fallback and Reliability Strategies
Production routing requires robust fallback handling. Models fail, rate limits trigger, and providers experience outages. Effective fallback design keeps requests alive when primary providers fail without breaking user experience.
Fallback Architecture Patterns
Primary-secondary chains define explicit fallback sequences. If GPT-5.2 fails, try Claude 4.5 Sonnet. If that fails, try Gemini 3 Pro. The chain continues until success or exhaustion. Chains should be ordered by capability similarity—falling back from GPT-5.2 to Claude Sonnet 4.5 maintains capability better than falling back to GPT-4o Mini.
Health-aware routing monitors provider health and routes around trouble proactively. Rather than waiting for failures, the system tracks error rates, latency percentiles, and availability metrics. When a provider shows degradation, traffic shifts to alternatives before failures occur.
Circuit breakers prevent cascade failures by stopping requests to failing providers. When error rates exceed thresholds (e.g., >10% errors in 1 minute), the circuit "opens" and routes all traffic to alternatives. After a cooldown period, the circuit "half-opens" to test a small percentage of traffic. If tests succeed, the circuit closes and normal traffic resumes.
Graceful degradation accepts reduced capability when necessary. If all frontier models fail, fall back to efficient models rather than returning errors. Users get answers—perhaps simpler or less nuanced—rather than failures. For many queries, efficient model responses are perfectly adequate.
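A compact sketch combining a primary-secondary chain with a per-provider circuit breaker; the thresholds, cooldowns, and the `call_model` client are assumptions to adapt to your stack.

```python
import time

class CircuitBreaker:
    """Open the circuit when a provider keeps failing; probe again after a cooldown."""
    def __init__(self, max_failures: int = 5, cooldown_s: float = 60.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True
        return time.monotonic() - self.opened_at > self.cooldown_s  # half-open probe

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

FALLBACK_CHAIN = ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-pro", "gemini-3-flash"]
breakers = {m: CircuitBreaker() for m in FALLBACK_CHAIN}

def call_with_fallback(query: str, call_model) -> str:
    # `call_model(model, query)` is a hypothetical client that raises on failure.
    for model in FALLBACK_CHAIN:
        if not breakers[model].available():
            continue                                 # skip providers with open circuits
        try:
            response = call_model(model, query)
            breakers[model].record(success=True)
            return response
        except Exception:
            breakers[model].record(success=False)
    raise RuntimeError("all providers unavailable")
```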
Network-Aware Routing
NetMCP research from the University of Hong Kong adds network awareness to routing decisions. The SONAR algorithm considers both semantic relevance and network performance, accounting for latency, reliability, and congestion to different providers.
In testing with unstable networks, traditional routing approaches showed 90% failure rates while SONAR avoided all failures. For applications deployed across regions or on unreliable networks, network-aware routing significantly improves reliability.
Timeout and Retry Strategies
LLM calls can hang indefinitely without proper timeout handling. Production systems need:
Connection timeouts that fail fast when providers are unreachable. Five seconds is typically appropriate—if connection isn't established by then, the provider is likely having issues.
Read timeouts that bound how long to wait for responses. This depends on expected response length and model. Streaming responses need different handling than complete responses. For GPT-5.2 with extended thinking, timeouts might need to be several minutes.
Exponential backoff on retries to avoid overwhelming recovering providers. Start with short delays (100-200ms), double on each retry, cap at reasonable maximums (30-60 seconds). Add jitter to prevent thundering herd problems when many clients retry simultaneously.
Retry budgets that limit total retry attempts across the request lifecycle. Without budgets, retries can cascade into self-inflicted denial of service. A typical budget might allow 3 total attempts across all fallback providers.
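A minimal retry helper implementing exponential backoff with full jitter and a per-request attempt budget; the delays and budget values mirror the illustrative defaults mentioned above.

```python
import random
import time

def call_with_retries(call_once, max_attempts: int = 3,
                      base_delay: float = 0.2, max_delay: float = 30.0):
    """Retry a single provider call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_once()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # retry budget exhausted
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))        # jitter avoids thundering herd
```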
Implementation Patterns
Practical routing implementation involves several key decisions and patterns.
Starting Simple: Category-Based Routing
Enterprise guidance recommends starting with simple category-based routing before advancing to ML classifiers. Define broad categories and assign appropriate models:
| Category | Example Queries | Recommended Model |
|---|---|---|
| Simple Q&A | "What's the capital of France?" | Gemini 3 Flash |
| General Chat | Casual conversation, greetings | Haiku 4.5 |
| Code Generation | "Write a Python function to..." | Claude Sonnet 4.5 |
| Complex Analysis | "Compare these architectures..." | GPT-5.1 or Claude Sonnet 4.5 |
| Critical Reasoning | High-stakes decisions, complex math | Claude Opus 4.5 or GPT-5.2 |
This provides immediate cost savings while you gather data for more sophisticated routing. As traffic grows, refine categories based on observed patterns. Perhaps "Code Generation" splits into "simple scripts," "complex algorithms," and "debugging"—each with different optimal models.
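Expressed as configuration, the category table might look like the sketch below; `classify_category` stands in for whatever classifier you use (rules, a small model, or a cheap LLM call), and the model names are illustrative.

```python
# Category-based routing as plain configuration.
CATEGORY_ROUTES = {
    "simple_qa": "gemini-3-flash",
    "general_chat": "claude-haiku-4.5",
    "code_generation": "claude-sonnet-4.5",
    "complex_analysis": "gpt-5.1",
    "critical_reasoning": "claude-opus-4.5",
}
DEFAULT_MODEL = "claude-sonnet-4.5"   # safe default when classification is uncertain

def route(query: str, classify_category) -> str:
    category = classify_category(query)
    return CATEGORY_ROUTES.get(category, DEFAULT_MODEL)
```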
Router Training Pipeline
For teams building custom routers, the pipeline typically involves:
Data collection: Gather queries with quality labels indicating which model performed best. Sources include:
- Production logs with user feedback (thumbs up/down, regeneration requests)
- Human evaluation of model responses on representative samples
- Synthetic labels from LLM judges (GPT-5.2 evaluating other models' outputs)
- A/B test results comparing model performance on identical queries
Feature engineering: Extract features that predict query complexity:
- Query length and vocabulary complexity
- Presence of code, math, or technical terminology
- Named entities and domain indicators
- Embedding-based features capturing semantic complexity
- Conversation context and history
Model training: Train classifiers on the labeled data. BERT-based classifiers provide good accuracy with acceptable latency (~50ms). For faster routing, logistic regression on engineered features works surprisingly well. For maximum accuracy, fine-tuned LLM classifiers capture the most nuance.
Threshold tuning: Adjust routing thresholds based on cost-quality tradeoffs. The threshold determines what "confidence" is required to route to a cheaper model. Conservative thresholds route more to capable models (higher quality, higher cost); aggressive thresholds route more to cheap models (lower cost, potential quality loss). Tune based on your specific quality requirements and budget constraints.
Continuous refinement: Monitor routing decisions in production. Identify cases where routing erred—queries routed to cheap models that got poor responses, or queries routed to expensive models that didn't need them. Add these cases to training data. Routers should improve continuously as they see more traffic.
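The threshold-tuning step above can be run as a simple offline sweep over a labeled evaluation set. In this sketch, each example records the router's confidence that the cheap model suffices plus measured quality and cost for both models; the field names and the quality floor are assumptions.

```python
def pick_threshold(examples, quality_floor: float = 0.92):
    """Return the most aggressive threshold that keeps average quality above the floor."""
    best = None
    for threshold in [i / 20 for i in range(21)]:          # 0.00, 0.05, ..., 1.00
        quality, cost = [], 0.0
        for ex in examples:
            use_cheap = ex["cheap_confidence"] >= threshold
            quality.append(ex["cheap_quality"] if use_cheap else ex["strong_quality"])
            cost += ex["cheap_cost"] if use_cheap else ex["strong_cost"]
        avg_q = sum(quality) / len(quality)
        if avg_q >= quality_floor and (best is None or cost < best[1]):
            best = (threshold, cost, avg_q)
    return best   # (threshold, total_cost, avg_quality), or None if the floor is unreachable
```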
Cost-Quality Tradeoff Optimization
The fundamental routing tradeoff is cost versus quality. Several strategies help optimize this tradeoff:
Quality thresholds by use case: Not all queries need the same quality bar. Customer-facing responses might require 95% quality threshold; internal analytics might accept 80%. Route accordingly.
User tier differentiation: Premium users might default to better models; free users might see more aggressive routing to cheaper models. This aligns cost with revenue.
Time-based routing: During peak hours when latency matters most, route to faster models. During off-peak, route to more capable models that might be slower.
Confidence-based escalation: If the router is uncertain about classification, default to more capable models. Only route to cheap models when confidence is high.
Latency Considerations
Routing adds latency before the actual LLM call. For most applications, 10-50ms of routing overhead is acceptable—dwarfed by 500ms-5s LLM response times. For latency-critical applications (real-time chat, voice interfaces, gaming), consider:
Pre-computed routing for known patterns. If you can classify queries before the full request arrives (based on conversation context, user profile, or URL parameters), routing latency can be hidden in parallel with other setup.
Cached routing decisions reuse decisions for similar queries. If the router recently classified a semantically similar query, use that decision rather than re-running classification. Embedding similarity enables efficient cache lookup.
Parallel speculation starts requests to multiple models simultaneously, returning the first adequate response. This trades cost for latency—you pay for multiple model calls but get the fastest response. Useful for latency-critical, high-value queries.
Edge-deployed routers like Arch-Router minimize network latency by running routing logic close to the application. The 1.5B parameter model runs in tens of milliseconds on commodity GPUs.
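The parallel-speculation idea can be sketched with asyncio: fire requests to several models at once, return the first response that passes a quality gate, and cancel the rest. `call_model` and `is_adequate` are hypothetical async client and gate functions.

```python
import asyncio

async def speculative_call(query: str, models: list[str], call_model, is_adequate):
    """Race several models and return the first adequate response."""
    tasks = [asyncio.create_task(call_model(m, query)) for m in models]
    best = None
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                best = await finished
            except Exception:
                continue                 # one provider failing shouldn't sink the race
            if is_adequate(best):
                return best
        return best                      # nothing passed the gate; return the last success
    finally:
        for t in tasks:
            t.cancel()                   # stop paying for in-flight calls we no longer need
```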
Monitoring and Observability
Effective routing requires comprehensive monitoring:
Routing distribution tracks what percentage of traffic goes to each model. Sudden shifts indicate either changed query patterns or router issues. Expected distribution should be relatively stable; deviations warrant investigation.
Quality by route measures response quality segmented by routed model. If cheap-model quality drops, routing thresholds may need adjustment. Quality metrics might include user feedback, task completion rates, or LLM-judge scores.
Fallback rates track how often primary routing fails and fallbacks activate. High fallback rates suggest provider issues or routing to inappropriate models that frequently fail.
Cost attribution tracks spending by route, query type, user segment, and feature. This enables cost optimization and anomaly detection. Unexpected cost spikes often indicate routing bugs or traffic pattern changes.
Latency breakdown separates routing latency from model latency. If total latency increases, this identifies whether routing or models are responsible.
Specialized Routing Scenarios
Different application types benefit from tailored routing strategies.
Agentic Workflows
AI agents chain multiple model calls to accomplish tasks. Each step may benefit from different models: planning might use a reasoning-optimized model, code generation a code-specialized model, and synthesis a general-purpose model.
Production patterns for agentic routing include:
Step-specific routing: Different agent steps route to different models. A pipeline might classify a query using Gemini 3 Flash, retrieve documents via search, send complex reasoning to GPT-5.2, then use Claude Sonnet 4.5 for code generation.
Context-aware routing: Earlier steps inform routing for later steps. If early planning steps indicate high complexity, subsequent steps might proactively route to more capable models.
Tool-specific routing: Different tools might be backed by different models. A calculator tool might use a math-specialized model; a code executor might use a code-specialized model.
Failure recovery routing: When agent steps fail, routing can select alternative models for retry. If Claude Sonnet 4.5 produces invalid code, retry with Claude Opus 4.5 rather than the same model.
RAG Applications
Retrieval-augmented generation adds document context to queries. Routing considerations include:
Context length routing: Queries with large retrieved contexts route to models with appropriate context windows. A 100K token context can't route to a 32K-context model. The 2025 landscape offers options from 32K (many efficient models) to 200K+ (Gemini, Claude).
Domain routing: If retrieved documents indicate a specific domain (legal, medical, financial), route to domain-specialized models or models known to perform well in that domain.
Synthesis complexity routing: Simple factual lookups in retrieved context route cheaply; complex synthesis across multiple contradictory documents routes to capable models. The presence of multiple documents or conflicting information signals complexity.
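A small sketch of the context-length check described above: pick the cheapest model whose window fits the query plus retrieved documents. The window sizes, model names, and the four-characters-per-token estimate are rough assumptions.

```python
CONTEXT_WINDOWS = {          # usable input tokens; illustrative values
    "efficient-32k": 32_000,
    "sonnet-200k": 200_000,
    "gemini-1m": 1_000_000,
}

def route_by_context(query: str, retrieved_docs: list[str],
                     reserve_for_output: int = 4_000) -> str:
    approx_tokens = sum(len(t) // 4 for t in [query, *retrieved_docs])  # ~4 chars/token
    for model, window in sorted(CONTEXT_WINDOWS.items(), key=lambda kv: kv[1]):
        if approx_tokens + reserve_for_output <= window:
            return model      # smallest (cheapest) model whose window fits
    raise ValueError("context too large for any window; re-chunk or trim retrieval")
```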
Multi-Modal Applications
Applications handling images, audio, or video alongside text need routing across modalities:
Modality-specific routing: Image analysis routes to vision models (GPT-5.2 Vision, Claude 4.5 Vision, Gemini 3); audio transcription routes to speech models; text routes to language models.
Unified routing for multi-modal queries: Queries combining modalities (e.g., "describe this image and answer questions about it") route to multi-modal models capable of handling all components—Gemini 3 Pro, Claude 4.5, or GPT-5.2 with vision.
Cost optimization across modalities: Vision and audio processing often cost more than text. Route to cheaper text-only models when queries don't require multi-modal understanding.
Related Articles
LLM Cost Engineering: Token Budgeting, Caching, and Model Routing for Production
Comprehensive guide to reducing LLM costs by 60-80% in production. Covers prompt caching (OpenAI vs Anthropic), semantic caching with Redis and GPTCache, model routing and cascading, batch processing, and token optimization strategies.
Building Agentic AI Systems: A Complete Implementation Guide
A comprehensive guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
LLM Observability and Monitoring: From Development to Production
A comprehensive guide to LLM observability—tracing, metrics, cost tracking, and the tools that make production AI systems reliable. Comparing LangSmith, Langfuse, Arize Phoenix, and more.
Testing LLM Applications: A Practical Guide for Production Systems
Comprehensive guide to testing LLM-powered applications. Covers unit testing strategies, integration testing with cost control, LLM-as-judge evaluation, regression testing, and CI/CD integration with 2025 tools like DeepEval and Promptfoo.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.