
Error Handling & Resilience for LLM Applications: Production Patterns

Comprehensive guide to building resilient LLM applications. Covers retry strategies with exponential backoff, circuit breakers, fallback patterns, rate limit handling, timeout management, and multi-provider failover for production systems.

LLM APIs fail. Rate limits trigger. Providers experience outages. Network connections time out. In production, these aren't edge cases; they're routine occurrences that your application must handle gracefully. Robust retry logic alone can reduce user-visible LLM API failures by up to 90% and noticeably improve the user experience.

This guide covers production patterns for building resilient LLM applications: retry strategies, circuit breakers, fallback mechanisms, and multi-provider architectures that keep your application running when individual components fail.


Understanding LLM Failure Modes

Before implementing resilience patterns, understand what can go wrong.

Transient Failures

Transient failures are temporary issues that typically resolve on their own:

Rate limits (429): You've exceeded the provider's request quota. The API rejects requests until the limit window resets. Most providers return a retry-after header indicating when to retry.

Server overload (503): The provider's infrastructure is temporarily overwhelmed. Requests fail, but the service will recover. Common during high-traffic periods or after new model launches.

Network timeouts: Connection couldn't be established or the response didn't arrive within the timeout window. Often caused by network congestion, not provider issues.

Gateway errors (502, 504): Intermediate infrastructure (load balancers, proxies) failed to connect to backend services. Usually resolves quickly.

Transient failures are candidates for retry—waiting briefly and trying again often succeeds.

Persistent Failures

Persistent failures won't resolve with simple retries:

Authentication failures (401, 403): Invalid API key, expired credentials, or insufficient permissions. Retrying won't help; the underlying issue must be fixed.

Invalid requests (400): Malformed input, unsupported parameters, or content policy violations. The request itself is problematic.

Context length exceeded: The prompt exceeds the model's context window. Retrying with the same input will always fail.

Provider outages: Extended service unavailability. Retries might eventually succeed, but failover to alternative providers is more effective.

Distinguishing transient from persistent failures is crucial—retrying persistent failures wastes resources and delays appropriate error handling.

Partial Failures

Some failures occur mid-response:

Streaming interruptions: Connection drops during token streaming. Some output was received; the rest is lost.

Incomplete responses: The model stops generating before completing the task. The finish_reason might indicate truncation or content filtering.

Tool call failures: In agentic workflows, individual tool executions may fail while the overall conversation continues.

Partial failures require decisions about whether to retry the entire request, resume from a checkpoint, or surface the partial result to users.


Retry Strategies

Retries are the first line of defense against transient failures.

Exponential Backoff with Jitter

The standard retry pattern uses exponential backoff: wait increasingly longer between attempts to give the system time to recover. Adding jitter—randomness to wait times—prevents "thundering herd" problems where many clients retry simultaneously and overwhelm the recovering service.

The pattern works as follows: if a request fails, wait a base delay (say 1 second). If the retry fails, wait double (2 seconds). Then 4 seconds, 8 seconds, and so on. Add random jitter (±20-30%) to each wait time so clients don't synchronize their retries.

Typical configuration: Base delay of 1 second, multiplier of 2, maximum delay cap of 60 seconds, and maximum 3-5 attempts. This means a failing request might wait 1s, then 2s, then 4s before giving up—a total of about 7 seconds of retry time.
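
In Python, that configuration looks roughly like the sketch below. `TransientError` is a stand-in for whatever exception your client raises for 429s, 5xx responses, and timeouts, not a real library type.

```python
import random
import time

class TransientError(Exception):
    """Placeholder for errors worth retrying (429, 5xx, timeouts)."""

def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                  cap: float = 60.0, jitter: float = 0.25) -> float:
    # Exponential growth (1s, 2s, 4s, ...) capped at `cap`, with +/-25% jitter
    delay = min(cap, base * factor ** attempt)
    return delay * random.uniform(1 - jitter, 1 + jitter)

def call_with_retries(send_request, max_attempts: int = 4):
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; surface the error
            time.sleep(backoff_delay(attempt))
```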

Retry Budgets

Unbounded retries can cascade into self-inflicted denial of service. Retry budgets limit total retry attempts across the request lifecycle:

Per-request budget: Maximum attempts for a single user request, typically 3-5. Beyond this, fail fast and return an error.

Time-based budget: Maximum total time spent on retries, say 30 seconds. Even if attempts remain, stop retrying after the time budget expires.

System-wide budget: When overall retry rate exceeds a threshold (e.g., 10% of requests are retries), stop retrying entirely. High retry rates indicate systemic issues that retries won't solve.
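
A sketch of how per-request and time-based budgets can bound the retry loop, reusing the hypothetical `TransientError` and `backoff_delay` helpers from the previous example:

```python
import time

def call_with_retry_budget(send_request, max_attempts: int = 5,
                           time_budget_s: float = 30.0):
    deadline = time.monotonic() + time_budget_s
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TransientError:
            delay = backoff_delay(attempt)
            # Stop early if the next wait would exceed the time budget
            if attempt == max_attempts - 1 or time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
```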

Retry Conditions

Not all errors should be retried:

Retry: 429 (rate limit), 500 (server error), 502 (bad gateway), 503 (service unavailable), 504 (gateway timeout), connection timeouts, network errors.

Don't retry: 400 (bad request), 401 (unauthorized), 403 (forbidden), 404 (not found), context length errors, content policy violations.

Maybe retry: 500 errors might be transient or might indicate a persistent bug triggered by your specific input. Consider limited retries with careful monitoring.
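
These rules can be encoded as a small classifier that the retry loop consults. The status-code sets below follow the lists above and assume your client surfaces the HTTP status:

```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}
NON_RETRYABLE_STATUSES = {400, 401, 403, 404}

def should_retry(status_code: int | None, is_network_error: bool = False) -> bool:
    # Network-level failures (timeouts, connection resets) have no status code
    if is_network_error:
        return True
    if status_code in NON_RETRYABLE_STATUSES:
        return False
    return status_code in RETRYABLE_STATUSES
```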

Idempotency Considerations

For operations with side effects, ensure retries don't cause duplicate actions:

Idempotent operations: Read-only queries, stateless generations. Safe to retry freely.

Non-idempotent operations: Tool calls that modify data, send emails, or charge payments. Require idempotency keys or explicit deduplication to prevent duplicate execution.

When using function calling or tool use, track which tool calls have already executed. If a retry includes previously-completed tool calls, skip them rather than re-executing.
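
One way to sketch that deduplication is to derive an idempotency key from the tool name and arguments; the `execute_tool` callable is a placeholder for your own tool dispatcher:

```python
import hashlib
import json

_completed_tool_calls: dict[str, object] = {}  # idempotency key -> cached result

def run_tool_once(tool_name: str, arguments: dict, execute_tool):
    # Stable key from tool name + arguments (assumes arguments are JSON-serializable)
    key = hashlib.sha256(
        json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True).encode()
    ).hexdigest()
    if key in _completed_tool_calls:
        return _completed_tool_calls[key]  # already executed; skip re-execution
    result = execute_tool(tool_name, arguments)
    _completed_tool_calls[key] = result
    return result
```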


Circuit Breaker Pattern

Retries assume failures are transient. When failures are persistent, retries waste resources and add latency. The circuit breaker pattern detects sustained failures and stops calling failing services.

How Circuit Breakers Work

The circuit breaker monitors success and failure rates:

Closed state: Normal operation. Requests flow through. Failures are counted.

Open state: Too many failures. Requests fail immediately without attempting the call. This "fails fast" rather than waiting for inevitable timeouts.

Half-open state: After a cooldown period, the circuit allows a few test requests. If they succeed, the circuit closes. If they fail, it opens again.
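
A minimal in-process sketch of those three states; the threshold and cooldown values are illustrative, and a production implementation would add a sliding failure window and thread safety:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let a test request through
        return False     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open (or re-open) the circuit
```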

Salesforce's Agentforce implements circuit breakers: "If 40% or more of OpenAI traffic fails within a 60-second window, Agentforce bypasses retries entirely and routes all traffic to the equivalent model on Azure OpenAI." This prevents wasting time on a provider that's clearly having issues.

Circuit Breaker Configuration

Key parameters to tune:

Failure threshold: What percentage or count of failures triggers opening? Too sensitive and normal variation triggers false alarms. Too insensitive and real outages aren't detected quickly.

Monitoring window: Over what time period are failures counted? Short windows detect issues quickly but may react to brief blips. Longer windows smooth out noise but delay detection.

Cooldown period: How long does the circuit stay open before testing? Long enough for the provider to recover, short enough to resume normal operation promptly. Typically 30-60 seconds.

Test request count: How many requests pass through in half-open state? Enough to reliably assess recovery, few enough to limit exposure if the service is still failing.

Circuit Breaker Scope

Circuit breakers can operate at different scopes:

Per-provider: One breaker for OpenAI, another for Anthropic. A single provider's issues don't affect traffic to others.

Per-model: Separate breakers for GPT-4o and GPT-4o-mini. Model-specific issues are isolated.

Per-endpoint: Different breakers for chat completions, embeddings, and fine-tuning APIs. Endpoint-specific problems don't affect unrelated functionality.

Finer granularity provides better isolation but requires more configuration and monitoring.
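
Scoping can be as simple as keying breaker instances by provider and model, reusing the `CircuitBreaker` sketch above:

```python
from collections import defaultdict

# One breaker per (provider, model) pair, created lazily on first use
breakers: dict[tuple[str, str], CircuitBreaker] = defaultdict(CircuitBreaker)

def breaker_for(provider: str, model: str) -> CircuitBreaker:
    return breakers[(provider, model)]
```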


Fallback Strategies

When the primary service fails, fallbacks provide continuity.

Provider Fallback Chains

Define explicit sequences of alternative providers:

Primary: GPT-5.2 for best capability.

Secondary: Claude 4.5 Sonnet if OpenAI fails.

Tertiary: Gemini 3 Pro if both fail.

Last resort: Self-hosted Llama 4 for basic functionality.

Order chains by capability similarity. Falling back from GPT-5.2 to Claude maintains capability better than falling back to a much smaller model.

Multi-provider libraries automate this pattern: "Resilient-LLM wraps every call in bounded retries with exponential backoff and flips a circuit-breaker when the error count crosses a threshold, automatically redirecting traffic to the next provider in the list."
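
For a hand-rolled version, here is a sketch of a fallback chain combined with the per-provider breakers from earlier; the provider objects and their `complete` method are assumptions about your own abstraction layer, not a specific library's API:

```python
def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each provider in order; raise only if every one fails."""
    last_error: Exception | None = None
    for provider in providers:
        breaker = breaker_for(provider.name, provider.model)
        if not breaker.allow_request():
            continue  # this provider's circuit is open; skip it
        try:
            result = provider.complete(prompt)  # hypothetical unified interface
            breaker.record_success()
            return result
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
    raise RuntimeError("All providers in the fallback chain failed") from last_error
```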

Model Fallback

Within a single provider, fall back to alternative models:

Capability fallback: If GPT-5.2 times out, try GPT-4o. Less capable but more likely to respond.

Cost-aware fallback: Primary model is premium (for best quality); fallback is cheaper (for availability). Accept lower quality rather than complete failure.

Context-aware fallback: If a large model's context window is exceeded, fall back to a model with a larger window, or truncate context and retry with the original model.

Graceful Degradation

When all sophisticated options fail, provide reduced but useful functionality:

Cached responses: Return cached responses for common queries. Stale data is better than no data.

Simplified processing: Skip advanced features (like RAG enhancement or multi-step reasoning) and provide basic responses.

Queue for later: Accept the request but process it asynchronously when services recover. Appropriate for non-time-sensitive operations.

Honest errors: When nothing works, tell users clearly. A message such as "Our AI service is temporarily unavailable. Please try again in a few minutes" is better than hanging indefinitely.
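
A sketch of the cached-response degradation path, assuming an in-memory cache and reusing the hypothetical `complete_with_fallback` helper from the fallback chain sketch:

```python
_response_cache: dict[str, str] = {}

def answer(query: str, providers: list) -> str:
    key = query.strip().lower()
    try:
        response = complete_with_fallback(query, providers)
        _response_cache[key] = response   # refresh the cache on success
        return response
    except RuntimeError:
        if key in _response_cache:
            return _response_cache[key]   # stale, but better than nothing
        return ("Our AI service is temporarily unavailable. "
                "Please try again in a few minutes.")
```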

Fallback Quality Considerations

Fallbacks trade quality for availability. Consider:

Quality thresholds: If fallback quality is unacceptable for certain use cases, fail rather than return poor results. A medical advice application might prefer errors over incorrect guidance.

User notification: Consider informing users when fallback is active. "We're using a backup system, so responses may be slower or less detailed."

Fallback monitoring: Track how often fallbacks activate and their quality metrics. High fallback rates indicate problems that need attention.


Rate Limit Handling

Rate limits are the most common LLM API failure mode. Sophisticated handling is essential.

Understanding Rate Limit Types

Providers impose multiple limit types:

Requests per minute (RPM): Total API calls allowed per minute, regardless of size.

Tokens per minute (TPM): Total tokens (input + output) processed per minute. Large requests consume more quota.

Requests per day (RPD): Daily caps, typically for free tiers or trial accounts.

Concurrent requests: Maximum simultaneous in-flight requests.

Different models often have different limits. GPT-4 typically has lower limits than GPT-3.5; capacity constraints vary by model.

Proactive Rate Limit Management

Rather than hitting limits and handling errors, avoid hitting limits:

Client-side throttling: Track your request rate and slow down before hitting limits. If you're at 80% of your RPM limit, introduce small delays.

Token budgeting: Estimate token usage before sending requests. If a batch of requests would exceed TPM limits, spread them across time or reduce batch size.

Request queuing: Queue requests and release them at a controlled rate. Smooth traffic patterns avoid bursts that trigger limits.

Reactive Rate Limit Handling

When limits are hit despite proactive measures:

Respect retry-after: Most 429 responses include a retry-after header indicating when to retry. Honor it rather than guessing.

Backoff strategies: If no retry-after is provided, use exponential backoff. Rate limits often reset on minute boundaries, so waiting 60 seconds may be more effective than waiting 10.

Request prioritization: When rate-limited, prioritize which queued requests to send first. User-facing requests might take priority over background tasks.
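
A sketch of honoring the header when present, assuming a `requests`-style response object and falling back to the jittered backoff helper (capped at 60 seconds) when it is absent:

```python
def wait_time_for_429(response, attempt: int) -> float:
    retry_after = response.headers.get("retry-after")
    if retry_after is not None:
        try:
            return float(retry_after)   # seconds, as most providers send it
        except ValueError:
            pass                        # HTTP-date form: fall through to backoff
    return backoff_delay(attempt, base=1.0, cap=60.0)
```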

Token Bucket Pattern

The token bucket pattern provides smooth rate limiting:

The bucket holds tokens representing request capacity. Tokens are added at a steady rate (matching your API limit). Each request consumes tokens. If the bucket is empty, requests wait until tokens are available.

This pattern smooths bursty traffic and naturally handles varying request sizes when configured with token-based (rather than request-based) consumption.
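
A minimal token bucket sketch, sized in API tokens per minute. The `estimated_tokens` you pass in is your own estimate, for example a tokenizer count of the prompt plus the maximum output tokens:

```python
import time

class TokenBucket:
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.refill_rate = tokens_per_minute / 60.0  # tokens added per second
        self.available = float(tokens_per_minute)
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.available = min(
            self.capacity,
            self.available + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now

    def acquire(self, estimated_tokens: int) -> None:
        """Block until the bucket holds enough tokens, then consume them."""
        while True:
            self._refill()
            if self.available >= estimated_tokens:
                self.available -= estimated_tokens
                return
            deficit = estimated_tokens - self.available
            time.sleep(deficit / self.refill_rate)  # sleep just long enough to refill
```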


Timeout Management

LLM calls can take anywhere from 500ms to several minutes. Proper timeout configuration prevents hung requests from blocking your application.

Timeout Types

Connection timeout: Maximum time to establish a connection. Typically 5-10 seconds. If the provider is unreachable, fail fast.

Read timeout: Maximum time to receive data after connection is established. Must account for model processing time, which varies significantly by model and prompt complexity.

Total timeout: End-to-end maximum for the entire operation. Provides a backstop regardless of individual timeout configurations.
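
With an HTTP client such as httpx, the connection and read layers can be configured directly; the values below are illustrative:

```python
import httpx

# 5s to establish a connection, 120s to read the (possibly slow) model response,
# 10s each for writes and connection-pool acquisition
timeout = httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=10.0)
client = httpx.Client(timeout=timeout)
```

Note that these limits apply per phase rather than end to end, so a total backstop for the whole operation typically still has to be enforced at the application level.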

Timeout Configuration by Use Case

Different use cases need different timeouts:

Interactive chat: Users expect responses within seconds. Set aggressive timeouts (30-60 seconds) and fail fast. Streaming reduces perceived latency.

Complex reasoning: Extended thinking or multi-step agent workflows may legitimately take minutes. Set longer timeouts (5-10 minutes) but provide progress indicators.

Batch processing: Background jobs can tolerate longer waits. Set generous timeouts but implement checkpointing for very long operations.

Tool execution: Individual tool calls in agentic workflows should have short timeouts. Better to fail one tool quickly and adapt than wait indefinitely.

Streaming Timeout Considerations

Streaming responses require different timeout thinking:

Time to first token: How long until streaming begins? If no tokens arrive within 30 seconds, something is wrong.

Inter-token timeout: Maximum gap between tokens. If 60 seconds pass without a token, the stream may have died.

Total stream timeout: Maximum duration for the complete response. Even streaming responses should have upper bounds.
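
A sketch of enforcing those three bounds around an async token iterator; the `stream` object stands in for whatever async generator of text chunks your client returns:

```python
import asyncio
import time

async def consume_stream(stream, first_token_timeout: float = 30.0,
                         inter_token_timeout: float = 60.0,
                         total_timeout: float = 300.0) -> str:
    chunks: list[str] = []
    started = time.monotonic()
    gap_timeout = first_token_timeout
    while True:
        remaining = total_timeout - (time.monotonic() - started)
        if remaining <= 0:
            raise TimeoutError("stream exceeded total timeout")
        try:
            # Bound the wait for the next chunk by both the gap and total budgets
            chunk = await asyncio.wait_for(stream.__anext__(),
                                           timeout=min(gap_timeout, remaining))
        except StopAsyncIteration:
            return "".join(chunks)  # stream completed normally
        chunks.append(chunk)
        gap_timeout = inter_token_timeout  # after the first token, use the gap timeout
```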

Cancellation and Cleanup

When timeouts trigger:

Clean cancellation: Stop waiting for the response but don't assume the request didn't process. The model may have generated a complete response that simply wasn't delivered.

Resource cleanup: Release held resources (database connections, memory buffers) immediately when timeouts occur.

User feedback: Inform users that the operation timed out and suggest retry or alternative actions.


Multi-Provider Architecture

Production applications increasingly use multiple LLM providers for resilience and optimization.

Provider Abstraction Layer

Abstract provider-specific details behind a unified interface:

Common request format: Translate your application's request format to each provider's API format.

Response normalization: Convert provider-specific responses to a common format your application understands.

Error standardization: Map provider-specific error codes to common error types for consistent handling.

This abstraction enables seamless provider switching without application changes.
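
A sketch of the shape such a layer might take; the `NormalizedResponse` fields and the error taxonomy are illustrative assumptions, not any provider's actual schema:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class NormalizedResponse:
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    finish_reason: str  # normalized: "stop", "length", "content_filter", ...

# Common error types that provider-specific errors are mapped onto
class RateLimitError(Exception): ...
class ProviderUnavailableError(Exception): ...
class InvalidRequestError(Exception): ...

class LLMProvider(Protocol):
    name: str
    model: str

    def complete(self, prompt: str, **options) -> NormalizedResponse:
        """Translate to the provider's API, call it, and normalize the result.
        Implementations map provider-specific errors onto the common types above."""
        ...
```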

Health Monitoring

Continuously monitor provider health:

Active health checks: Periodically send test requests to each provider. Detect issues before user traffic is affected.

Passive monitoring: Track success rates, latency, and error types for real traffic. Detect degradation that health checks might miss.

Alerting: Notify on-call engineers when provider health degrades. Enable proactive response before users are significantly impacted.

Load Balancing Across Providers

Beyond failover, consider active load balancing:

Cost optimization: Route requests to the cheapest provider capable of handling them. Reserve expensive providers for tasks requiring their capabilities.

Latency optimization: Route to the fastest provider for latency-sensitive requests.

Quota management: Spread traffic across providers to avoid hitting any single provider's limits.

Geographic routing: Route to geographically appropriate providers for data residency or latency reasons.

Provider-Specific Quirks

Each provider has unique behaviors to handle:

Different error formats: OpenAI returns structured error objects; Anthropic has different field names; open-source models via vLLM have their own formats.

Capability differences: Tool calling syntax varies. Response formats differ. Some providers support features others don't.

Rate limit semantics: How limits are counted, when they reset, and how they're communicated differs by provider.

Your abstraction layer should handle these differences so application code doesn't need to.


Observability for Resilience

You can't improve what you can't see.

Key Metrics to Track

Success rate: Percentage of requests that complete successfully. Track by provider, model, and endpoint.

Error distribution: Breakdown of error types. Are you seeing mostly rate limits? Timeouts? Server errors? Different error types require different responses.

Latency percentiles: P50, P95, P99 latency. Averages hide problems; percentiles reveal them.

Retry rate: What percentage of requests require retries? High retry rates indicate issues even if requests eventually succeed.

Fallback rate: How often do fallback providers activate? Frequent fallbacks suggest primary provider problems.

Circuit breaker state: Which breakers are open? How often do they trip? How long do they stay open?

Alerting Strategy

Alert on conditions that require action:

Error rate spikes: Sudden increases in error rates indicate emerging issues.

Extended circuit breaker open: If a breaker stays open for extended periods, manual investigation is needed.

Elevated latency: Sustained latency increases may indicate provider degradation.

Approaching rate limits: Alert before hitting limits so you can proactively adjust traffic.

Avoid alert fatigue by tuning thresholds and focusing on actionable conditions.

Incident Response

When things go wrong:

Runbooks: Document response procedures for common failure scenarios. When the on-call engineer gets paged at 3 AM, clear guidance helps.

Manual overrides: Enable operators to manually trigger fallbacks, disable providers, or adjust thresholds without code deployments.

Post-mortems: After incidents, analyze what happened, what the impact was, and how to prevent recurrence.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
