Streaming & Real-Time Patterns for LLM Applications
Comprehensive guide to implementing streaming in LLM applications. Covers SSE vs WebSockets, token-by-token rendering, streaming with tool calls, backpressure handling, error recovery, and production best practices.
LLM responses can take seconds to generate. Without streaming, users stare at blank screens, unsure if the system is working. Streaming transforms this experience—tokens appear as they're generated, providing immediate feedback and dramatically improving perceived performance. A time-to-first-token in the 300-700ms range feels snappy; without streaming, users wait the full response time before seeing anything.
This guide covers production patterns for streaming LLM responses: choosing between SSE and WebSockets, implementing token-by-token rendering, handling streaming with tool calls, managing backpressure, and recovering from mid-stream errors.
Why Streaming Matters
The difference between streaming and non-streaming is user experience.
Perceived Performance
A 3-second response that streams feels faster than a 3-second response that appears all at once. Users see progress immediately—tokens appearing, content forming. The cognitive experience shifts from "waiting" to "watching the AI think." This psychological difference significantly impacts user satisfaction.
Time to First Token (TTFT)
The key metric for streaming performance is time-to-first-token: how quickly does the first token appear? This is what users perceive as "response time," even though the full response takes longer. Modern LLM APIs typically achieve TTFT of 200-500ms, far faster than full response times of 2-10 seconds.
Progressive Enhancement
Streaming enables progressive user experiences:
Early interaction: Users can start reading while generation continues. They might find their answer in the first few sentences and stop reading.
Cancellation: Users can abort generation mid-stream if the response is going in the wrong direction. This saves tokens (cost) and time.
Real-time feedback: Users see when responses are cut off, when the model is struggling, or when generation takes unusually long.
SSE vs WebSockets: Choosing the Right Protocol
Two protocols dominate LLM streaming: Server-Sent Events (SSE) and WebSockets.
Server-Sent Events (SSE)
SSE wins for most LLM applications because LLM streaming is fundamentally one-way: the server sends tokens to the client. SSE is purpose-built for this pattern.
Advantages:
- Native browser support via the EventSource API
- Automatic reconnection on connection drops
- Simple implementation—just HTTP with a specific content type
- Works through proxies and load balancers without special configuration
- Built-in event types and last-event-ID for resumption
Disadvantages:
- Unidirectional only (server to client)
- Limited to text data (binary requires encoding)
- Some browsers limit concurrent connections per domain
SSE is the right choice for chat interfaces, content generation, and any use case where the client primarily receives data.
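As a concrete reference, here is a minimal browser-side sketch. The endpoint path and the token/done event names are assumptions, and appendToTranscript and showConnectionStatus stand in for your own UI code.

```typescript
// Minimal SSE client sketch. The endpoint path and event names are assumptions;
// adapt them to whatever your backend actually emits.
declare function appendToTranscript(text: string): void;
declare function showConnectionStatus(message: string): void;

const source = new EventSource("/api/chat/stream?id=abc123");

source.addEventListener("token", (event) => {
  // Each event carries a small JSON payload with the newly generated text.
  const { text } = JSON.parse((event as MessageEvent).data);
  appendToTranscript(text);
});

source.addEventListener("done", () => {
  // Close explicitly, or EventSource will keep reconnecting after the stream ends.
  source.close();
});

source.onerror = () => {
  // EventSource reconnects automatically; show status rather than failing silently.
  showConnectionStatus("Reconnecting...");
};
```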
WebSockets
WebSockets provide full-duplex, bidirectional communication. Both client and server can send messages at any time.
Advantages:
- Bidirectional—client can send data mid-stream
- Binary data support
- Lower overhead per message after connection establishment
- Better for high-frequency, bidirectional messaging
Disadvantages:
- More complex implementation and infrastructure
- Requires special handling for load balancers and proxies
- No automatic reconnection—must implement manually
- State management complexity at scale
WebSockets are the right choice for collaborative editing, voice interfaces, or scenarios requiring client-to-server communication during generation.
Decision Framework
The rule of thumb: use SSE for chat unless you truly need bidirectional messaging. SSE gets you 90% of the benefit with 10% of the complexity.
| Use Case | Recommendation |
|---|---|
| Chat interfaces | SSE |
| Content generation | SSE |
| Code completion | SSE |
| Voice assistants | WebSockets |
| Collaborative editing | WebSockets |
| Interactive tool progress | WebSockets |
| Multi-party applications | WebSockets |
Hybrid Architectures
For complex systems, consider hybrid approaches:
Frontend to backend: WebSockets for bidirectional interactivity
Backend to LLM providers: SSE for token streaming
This keeps the user-facing layer interactive while using the simpler protocol for LLM communication.
Implementing Token Streaming
Getting tokens from the LLM to the user's screen involves several components.
Backend Streaming
Most LLM providers support streaming through a stream: true parameter. The response arrives as a series of chunks rather than a single complete message.
OpenAI-style streaming: Returns chunks in JSON format, each containing a delta with the new token(s). The stream ends with a [DONE] marker or a chunk with finish_reason set.
Anthropic-style streaming: Uses Server-Sent Events with typed events: message_start, content_block_delta, message_stop. Different events carry different information.
Your backend must handle these chunks, extract the token content, and forward them to the client.
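To make the chunk handling concrete, here is a sketch of parsing an OpenAI-style stream body with fetch and the Web Streams API. The data:/[DONE] framing and the choices[0].delta.content path follow the OpenAI chat format; verify against your provider's documentation before relying on it.

```typescript
// Sketch: read an OpenAI-style streaming body and yield each text delta.
// The `data: {...}` framing, the [DONE] sentinel, and the
// choices[0].delta.content path follow the OpenAI chat format; other providers
// (e.g. Anthropic's typed events) need different parsing.
async function* streamDeltas(response: Response): AsyncGenerator<string> {
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by blank lines; keep any trailing partial event.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";

    for (const event of events) {
      if (!event.startsWith("data: ")) continue; // skip comments/keep-alives
      const data = event.slice("data: ".length).trim();
      if (data === "[DONE]") return;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) yield delta; // forward this fragment to the client
    }
  }
}
```

On the server you would iterate this generator and re-emit each delta to the browser over your own SSE or NDJSON channel.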
Frontend Rendering
Rendering streamed tokens requires balancing responsiveness with performance:
Naive approach: Append each token to the DOM as it arrives. Simple but can cause performance issues with rapid updates.
Batched rendering: Accumulate tokens and render every 30-60ms or every 20-60 characters. This prevents "reflow storms" where constant DOM updates cause jank.
Virtual rendering: For very long responses, use virtualization to render only visible content. Full response stays in memory; only the visible portion updates the DOM.
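A minimal sketch of the batched approach, assuming a #response element and a roughly 50ms flush interval (tune both to your UI):

```typescript
// Batched rendering sketch: tokens accumulate in a string buffer and the DOM
// is updated on a fixed interval (~50ms here) instead of on every token.
const output = document.getElementById("response")!;
let pending = "";

function onToken(text: string) {
  pending += text; // cheap: string concatenation only
}

const flushTimer = setInterval(() => {
  if (!pending) return;
  output.textContent += pending; // one DOM write per tick, not per token
  pending = "";
}, 50);

function onStreamEnd() {
  output.textContent += pending; // flush whatever is left
  pending = "";
  clearInterval(flushTimer);
}
```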
Markdown Rendering
LLM responses often contain Markdown. Rendering Markdown while streaming is tricky:
Problem: Partial Markdown can be invalid. A table started but not finished can't render correctly. An unclosed code block breaks formatting.
Solutions:
- Buffer until Markdown structures complete (detect closing markers)
- Use incremental Markdown parsers that handle incomplete input
- Render as plain text during streaming, convert to formatted Markdown on completion
- Accept occasional rendering glitches during streaming for simplicity
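The first option can be sketched for the most common case, an unclosed code fence: anything after the last opening fence is held back until its closing fence arrives. Tables and nested lists would need similar checks; this is a sketch, not a full incremental parser.

```typescript
// Sketch: split the streamed buffer into a renderable prefix and a held-back
// suffix so the Markdown renderer never sees a half-open code block. Other
// structures (tables, nested lists) would need similar checks.
const FENCE = "`".repeat(3); // a Markdown code fence: three backticks

function splitRenderable(buffer: string): { renderable: string; held: string } {
  const fenceCount = buffer.split(FENCE).length - 1;
  if (fenceCount % 2 === 0) {
    // All code fences are balanced; render everything received so far.
    return { renderable: buffer, held: "" };
  }
  // An odd count means the last fence is still open; hold that part back.
  const lastFence = buffer.lastIndexOf(FENCE);
  return { renderable: buffer.slice(0, lastFence), held: buffer.slice(lastFence) };
}
```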
Code Highlighting
Streaming code presents similar challenges:
Language detection: Can't always determine the language until more code is visible
Syntax highlighting: Partial code may not parse correctly
Line numbers: Must update as lines are added
Consider delaying syntax highlighting until code blocks complete, or use heuristic-based highlighting that tolerates incomplete input.
Streaming with Tool Calls
Modern LLMs support tool/function calling. Streaming with tool calls adds complexity.
Tool Call Detection
During streaming, the model may decide to call a tool. This is signaled in the stream:
OpenAI: Tool calls appear in chunks with tool_calls field. The tool name and arguments are streamed incrementally.
Anthropic: Tool use is indicated by tool_use content blocks. Arguments stream as JSON fragments.
Your application must detect when tool calls begin, accumulate the complete tool call specification, and execute the tool.
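A sketch of that accumulation for the OpenAI-style format, where each chunk's delta.tool_calls entry carries an index plus fragments of the id, function name, and JSON arguments string:

```typescript
// Sketch of accumulating OpenAI-style streamed tool calls. Each chunk's delta
// may carry a fragment keyed by index; the JSON arguments string grows across
// chunks and is only parseable once the stream signals completion.
interface PendingToolCall {
  id?: string;
  name?: string;
  argumentsJson: string;
}

const pendingCalls = new Map<number, PendingToolCall>();

function onDelta(delta: any) {
  for (const fragment of delta.tool_calls ?? []) {
    const call = pendingCalls.get(fragment.index) ?? { argumentsJson: "" };
    if (fragment.id) call.id = fragment.id;
    if (fragment.function?.name) call.name = fragment.function.name;
    if (fragment.function?.arguments) call.argumentsJson += fragment.function.arguments;
    pendingCalls.set(fragment.index, call);
  }
}

// Once finish_reason === "tool_calls", parse each accumulated argument string.
function finalizeToolCalls() {
  return [...pendingCalls.values()].map((call) => ({
    id: call.id,
    name: call.name,
    arguments: JSON.parse(call.argumentsJson || "{}"),
  }));
}
```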
Parallel Tool Execution
Multiple tool calls may be requested:
Sequential streaming: Tool calls appear one after another in the stream. Wait for all tool specifications before executing.
Parallel execution: Once all tool calls are known, execute them in parallel for efficiency.
Progress feedback: While tools execute, show users that work is happening. "Searching..." or "Calculating..." maintains the real-time feel.
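A sketch of the parallel-execution step, assuming executeTool and notifyProgress are your own application helpers rather than part of any provider SDK:

```typescript
// Sketch: run all known tool calls concurrently and keep the UI informed.
declare function executeTool(name: string, args: unknown): Promise<unknown>;
declare function notifyProgress(message: string): void;

interface ToolCall {
  id?: string;
  name?: string;
  arguments: unknown;
}

async function runToolCalls(calls: ToolCall[]) {
  return Promise.all(
    calls.map(async (call) => {
      notifyProgress(`Running ${call.name}...`); // e.g. "Searching..."
      try {
        return { id: call.id, output: await executeTool(call.name!, call.arguments) };
      } catch (error) {
        // One failed tool should not abort its siblings; report the error as a result.
        return { id: call.id, output: { error: String(error) } };
      }
    })
  );
}
```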
Tool Results and Continuation
After tool execution, results return to the model for continuation:
Non-streaming continuation: Send tool results and get a complete response. Simpler but loses streaming benefits.
Streaming continuation: Send tool results and stream the continuation. Maintains the real-time experience but requires managing multiple streaming phases.
Cancellation During Tool Execution
Users may cancel during tool execution:
Before tool execution: Simply stop processing
During tool execution: May need to cancel in-flight operations
After tool execution: Tool results may be discarded; partial work may have side effects
Consider whether tool operations are idempotent and how cancellation affects system state.
Error Handling in Streams
Streams can fail at any point. Robust error handling maintains user experience.
Connection Failures
SSE auto-reconnection: EventSource automatically reconnects on connection loss. Include event IDs to enable resumption from the last received event.
WebSocket reconnection: Must implement manually. Maintain connection state and implement exponential backoff for reconnection attempts.
User feedback: Show connection status. "Reconnecting..." is better than silent failure.
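For the WebSocket case, here is a sketch of manual reconnection with exponential backoff and jitter; the URL and message handler are placeholders for your application's own.

```typescript
// Sketch of manual WebSocket reconnection with exponential backoff and jitter.
function connectWithRetry(url: string, onMessage: (data: string) => void, attempt = 0) {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // reset the backoff after a successful connection
  };

  socket.onmessage = (event) => onMessage(String(event.data));

  socket.onclose = () => {
    // Exponential backoff capped at 30s, with jitter to avoid thundering herds.
    const delay = Math.min(30_000, 1_000 * 2 ** attempt) * (0.5 + Math.random() / 2);
    setTimeout(() => connectWithRetry(url, onMessage, attempt + 1), delay);
  };
}
```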
Mid-Stream Errors
The stream may fail after some content is delivered:
Partial content handling: Decide whether to show partial content or discard it. For chat, partial responses are usually better than nothing.
Error indicators: Clearly indicate that generation was interrupted. "Response was interrupted. [Retry]" gives users a clear path forward.
Retry logic: Offer to retry from the beginning. For long responses, consider implementing checkpoint-based resumption.
Content Filter Interruptions
Models may stop generating due to content policy violations:
Detection: The finish_reason field indicates why generation stopped. Values like content_filter or length distinguish normal completion from interruptions.
User communication: If content was filtered, inform users appropriately. "The response was filtered due to content guidelines."
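A small sketch of mapping finish_reason values to user-facing messages; the value names follow the OpenAI API, and other providers report stop reasons through different fields.

```typescript
// Sketch: translate an OpenAI-style finish_reason into a message for the user.
function describeFinish(finishReason: string | null): string | null {
  switch (finishReason) {
    case "stop":
      return null; // normal completion, nothing to explain
    case "length":
      return "The response hit the maximum length and was cut off.";
    case "content_filter":
      return "The response was filtered due to content guidelines.";
    default:
      return finishReason ? `Generation stopped early (${finishReason}).` : null;
  }
}
```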
Timeout Handling
Streams can stall without explicitly failing:
Inter-token timeout: If no token arrives within a threshold (e.g., 30-60 seconds), assume the stream has failed.
Total timeout: Cap total stream duration. Even streaming responses should have upper bounds.
Graceful termination: On timeout, close the connection cleanly and show users what was received.
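A sketch of both timeouts as a watchdog around an AbortController: every received chunk resets the idle timer, and the total cap fires regardless. The 45-second and 5-minute values are illustrative, not recommendations.

```typescript
// Sketch of an inter-token watchdog plus a total-duration cap. Call tick() on
// every received chunk and stop() when the stream ends.
function createStreamWatchdog(controller: AbortController, interTokenMs = 45_000, totalMs = 300_000) {
  let idleTimer = setTimeout(() => controller.abort(), interTokenMs);
  const totalTimer = setTimeout(() => controller.abort(), totalMs);

  return {
    tick() {
      // A chunk arrived: push the idle deadline out again.
      clearTimeout(idleTimer);
      idleTimer = setTimeout(() => controller.abort(), interTokenMs);
    },
    stop() {
      clearTimeout(idleTimer);
      clearTimeout(totalTimer);
    },
  };
}
```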
Backpressure and Flow Control
When consumers can't keep up with producers, backpressure mechanisms prevent overwhelming the system.
Client-Side Backpressure
If the frontend can't render tokens as fast as they arrive:
Buffering: Accumulate tokens in memory, render at a sustainable rate. Risk: memory growth if the gap persists.
Token dropping: In extreme cases, drop tokens and show an indicator. Rare in practice since LLM generation isn't that fast.
Render throttling: Intentionally render at a fixed rate regardless of arrival rate. Smooths the visual experience.
Server-Side Backpressure
If your backend can't forward tokens as fast as the LLM generates:
Connection buffering: Most networking stacks buffer automatically. Monitor buffer sizes to detect problems.
Explicit flow control: Some streaming protocols support explicit pause/resume signals.
Adaptive behavior: If a client consistently falls behind, consider reducing quality of service rather than accumulating unbounded buffers.
Network Buffering Considerations
Proxies and load balancers can buffer streaming responses:
Nginx buffering: By default, Nginx buffers proxy responses. Disable with proxy_buffering off for streaming.
CDN behavior: Some CDNs don't handle streaming well. Test your specific CDN or bypass it for streaming endpoints.
Connection keep-alive: Ensure infrastructure doesn't close idle-seeming connections during slow streaming.
User Experience Patterns
Streaming enables specific UX patterns that improve the chat experience.
Cancellation (Stop Button)
Users should be able to stop generation at any time:
Immediate response: The stop button should work instantly. Use AbortController to cancel the request.
Clear feedback: Show that generation was stopped. The partial response remains visible.
Cost savings: Cancelled generations stop token billing (for most providers).
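A sketch of wiring a stop button to AbortController; the endpoint path and the showStatus helper are assumptions.

```typescript
// Sketch of a stop button wired to AbortController.
declare function showStatus(message: string): void;

const controller = new AbortController();

async function startGeneration(prompt: string) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
    signal: controller.signal, // aborting cancels the in-flight stream
  });
  // ...read response.body and render tokens as shown earlier...
}

document.getElementById("stop")?.addEventListener("click", () => {
  controller.abort(); // the next read() on the body rejects with an AbortError
  showStatus("Generation stopped."); // partial text already rendered stays visible
});
```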
Regeneration
After generation completes (or is stopped), offer regeneration:
Full regeneration: Request a new response from scratch. May differ due to model randomness.
Resume: For interrupted responses, resume from where it stopped. More complex but preserves partial progress.
Edit and Continue
Some interfaces allow editing the prompt and continuing:
Mid-conversation editing: User edits their message and regenerates the response. Requires re-running from the edited point.
Streaming implications: The previous stream must be fully cancelled before starting a new one.
Typing Indicators
Show when the AI is "thinking" before tokens arrive:
Before streaming: Show indicator during TTFT wait
During streaming: Tokens provide their own indicator; typing indicator may be unnecessary
During tool execution: Show what's happening ("Searching web...")
Progressive Disclosure
For long responses, consider progressive disclosure:
Collapsible sections: Show summaries that expand to full content
Scroll anchoring: Keep the viewport stable as new content appears below
"More" loading: Paginate very long responses
Production Best Practices
Lessons from operating streaming at scale.
Performance Optimization
Minimize TTFT: Time-to-first-token is the key metric. Optimize everything between user input and first token appearance: API latency, routing decisions, preprocessing.
Batch client updates: Don't update the DOM on every token. Batch updates every 30-60ms to prevent jank.
Avoid large payloads: Keep individual stream events small. Large events increase latency variance.
Connection reuse: Maintain persistent connections where possible. Connection establishment adds latency.
Monitoring and Metrics
Track streaming-specific metrics:
TTFT distribution: P50, P95, P99 time-to-first-token
Token rate: Tokens per second during streaming
Stream completion rate: What percentage of streams complete successfully vs. error/cancel?
Stream duration distribution: How long do streams typically last?
Anti-Patterns to Avoid
Common mistakes in streaming implementations:
Sending entire conversation on every turn: Use summaries or sliding windows to manage context. Resending everything is slow and expensive.
Rendering every single token: Coalesce into small batches for performance.
No abort path: Users get stuck waiting. Always implement cancellation.
Custom binary framing: Use SSE/NDJSON when they work. Custom protocols add complexity without benefit for most use cases.
Ignoring proxy buffering: Production proxies buffer by default. Explicitly disable for streaming endpoints.
Security Considerations
Streaming introduces security considerations:
Authentication: Validate auth before streaming begins. Don't stream to unauthorized users.
Content filtering: Apply output filters even during streaming. Don't let partial streams bypass moderation.
Resource limits: Prevent abuse through maximum stream duration and token limits.
Connection limits: Limit concurrent streams per user to prevent resource exhaustion.
Implementation Patterns by Framework
Different frameworks have different streaming approaches.
Next.js / React
Next.js supports streaming through React Server Components and the Vercel AI SDK:
Edge Runtime: Use edge functions for lower latency streaming
Vercel AI SDK: Provides hooks like useChat that handle streaming automatically
ReadableStream: For custom implementations, use Web Streams API
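A sketch of a custom App Router route handler built directly on the Web Streams API; streamDeltas stands in for whatever yields text chunks from your LLM provider, and the Vercel AI SDK offers higher-level helpers for the same pattern.

```typescript
// Sketch of a streaming route handler (Next.js App Router style).
declare function streamDeltas(prompt: string): AsyncIterable<string>;

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const delta of streamDeltas(prompt)) {
        controller.enqueue(encoder.encode(delta)); // push each chunk to the client
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8", "Cache-Control": "no-cache" },
  });
}
```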
Python / FastAPI
FastAPI supports streaming through StreamingResponse:
Generator functions: Yield chunks from an async generator
SSE libraries: Use sse-starlette for proper SSE formatting
Async iteration: Stream directly from async LLM client responses
Node.js / Express
Express streaming through chunked transfer encoding:
res.write(): Send chunks as they arrive
Server-Sent Events: Use libraries like express-sse or implement manually
Backpressure: Respect drain events on the response stream
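A sketch of manual SSE over Express that covers all three points; generateTokens stands in for your LLM streaming client, and the X-Accel-Buffering header is the conventional way to disable Nginx proxy buffering for a single response.

```typescript
// Sketch: manual SSE over Express with backpressure handling.
import express from "express";

declare function generateTokens(prompt: string): AsyncIterable<string>;

const app = express();

app.get("/stream", async (req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no", // ask Nginx not to buffer this response
  });

  for await (const token of generateTokens(String(req.query.prompt ?? ""))) {
    // Respect backpressure: if the socket buffer is full, wait for 'drain'.
    const ok = res.write(`data: ${JSON.stringify({ token })}\n\n`);
    if (!ok) await new Promise((resolve) => res.once("drain", resolve));
  }

  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```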
Rails
Rails supports streaming through ActionController::Live:
SSE: Rails has built-in SSE support
Turbo Streams: Hotwire's Turbo Streams provide an alternative streaming approach
Thread safety: Be aware of thread safety when streaming in Rails