Streaming & Real-Time Patterns for LLM Applications
Comprehensive guide to implementing streaming in LLM applications. Covers SSE vs WebSockets, token-by-token rendering, streaming with tool calls, backpressure handling, error recovery, and production best practices.
LLM responses can take seconds to generate. Without streaming, users stare at blank screens, unsure if the system is working. Streaming transforms this experience—tokens appear as they're generated, providing immediate feedback and dramatically improving perceived performance. A time-to-first-token in the 300-700ms range feels snappy; without streaming, users wait the full response time before seeing anything.
This guide covers production patterns for streaming LLM responses: choosing between SSE and WebSockets, implementing token-by-token rendering, handling streaming with tool calls, managing backpressure, and recovering from mid-stream errors.
Why Streaming Matters
The difference between streaming and non-streaming is user experience.
Perceived Performance
A 3-second response that streams feels faster than a 3-second response that appears all at once. Users see progress immediately—tokens appearing, content forming. The cognitive experience shifts from "waiting" to "watching the AI think." This psychological difference significantly impacts user satisfaction.
Time to First Token (TTFT)
The key metric for streaming performance is time-to-first-token: how quickly does the first token appear? This is what users perceive as "response time," even though the full response takes longer. Modern LLM APIs typically achieve TTFT of 200-500ms, far faster than full response times of 2-10 seconds.
Progressive Enhancement
Streaming enables progressive user experiences:
Early interaction: Users can start reading while generation continues. They might find their answer in the first few sentences and stop reading.
Cancellation: Users can abort generation mid-stream if the response is going in the wrong direction. This saves tokens (cost) and time.
Real-time feedback: Users see when responses are cut off, when the model is struggling, or when generation takes unusually long.
SSE vs WebSockets: Choosing the Right Protocol
Two protocols dominate LLM streaming: Server-Sent Events (SSE) and WebSockets.
Server-Sent Events (SSE)
SSE wins for most LLM applications because LLM streaming is fundamentally one-way: the server sends tokens to the client. SSE is purpose-built for this pattern.
Advantages:
- Native browser support via the EventSource API
- Automatic reconnection on connection drops
- Simple implementation—just HTTP with a specific content type
- Works through proxies and load balancers without special configuration
- Built-in event types and last-event-ID for resumption
Disadvantages:
- Unidirectional only (server to client)
- Limited to text data (binary requires encoding)
- Some browsers limit concurrent connections per domain
SSE is the right choice for chat interfaces, content generation, and any use case where the client primarily receives data.
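As a concrete reference, here is a minimal browser-side sketch. The endpoint path and the token/done event names are assumptions, and appendToTranscript and showConnectionStatus stand in for your own UI code.

```typescript
// Minimal SSE client sketch. The endpoint path and event names are assumptions;
// adapt them to whatever your backend actually emits.
declare function appendToTranscript(text: string): void;
declare function showConnectionStatus(message: string): void;

const source = new EventSource("/api/chat/stream?id=abc123");

source.addEventListener("token", (event) => {
  // Each event carries a small JSON payload with the newly generated text.
  const { text } = JSON.parse((event as MessageEvent).data);
  appendToTranscript(text);
});

source.addEventListener("done", () => {
  // Close explicitly, or EventSource will keep reconnecting after the stream ends.
  source.close();
});

source.onerror = () => {
  // EventSource reconnects automatically; show status rather than failing silently.
  showConnectionStatus("Reconnecting...");
};
```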
WebSockets
WebSockets provide full-duplex, bidirectional communication. Both client and server can send messages at any time.
Advantages:
- Bidirectional—client can send data mid-stream
- Binary data support
- Lower overhead per message after connection establishment
- Better for high-frequency, bidirectional messaging
Disadvantages:
- More complex implementation and infrastructure
- Requires special handling for load balancers and proxies
- No automatic reconnection—must implement manually
- State management complexity at scale
WebSockets are the right choice for collaborative editing, voice interfaces, or scenarios requiring client-to-server communication during generation.
Decision Framework
The rule of thumb: use SSE for chat unless you truly need bidirectional messaging. SSE gets you 90% of the benefit with 10% of the complexity.
| Use Case | Recommendation |
|---|---|
| Chat interfaces | SSE |
| Content generation | SSE |
| Code completion | SSE |
| Voice assistants | WebSockets |
| Collaborative editing | WebSockets |
| Interactive tool progress | WebSockets |
| Multi-party applications | WebSockets |
Hybrid Architectures
For complex systems, consider hybrid approaches:
Frontend to backend: WebSockets for bidirectional interactivity
Backend to LLM providers: SSE for token streaming
This keeps the user-facing layer interactive while using the simpler protocol for LLM communication.
Implementing Token Streaming
Getting tokens from the LLM to the user's screen involves several components.
Backend Streaming
Most LLM providers support streaming through a stream: true parameter. The response arrives as a series of chunks rather than a single complete message.
OpenAI-style streaming: Returns chunks in JSON format, each containing a delta with the new token(s). The stream ends with a [DONE] marker or a chunk with finish_reason set.
Anthropic-style streaming: Uses Server-Sent Events with typed events: message_start, content_block_delta, message_stop. Different events carry different information.
Your backend must handle these chunks, extract the token content, and forward them to the client.
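To make the chunk handling concrete, here is a sketch of parsing an OpenAI-style stream body with fetch and the Web Streams API. The data:/[DONE] framing and the choices[0].delta.content path follow the OpenAI chat format; verify against your provider's documentation before relying on it.

```typescript
// Sketch: read an OpenAI-style streaming body and yield each text delta.
// The `data: {...}` framing, the [DONE] sentinel, and the
// choices[0].delta.content path follow the OpenAI chat format; other providers
// (e.g. Anthropic's typed events) need different parsing.
async function* streamDeltas(response: Response): AsyncGenerator<string> {
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by blank lines; keep any trailing partial event.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";

    for (const event of events) {
      if (!event.startsWith("data: ")) continue; // skip comments/keep-alives
      const data = event.slice("data: ".length).trim();
      if (data === "[DONE]") return;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) yield delta; // forward this fragment to the client
    }
  }
}
```

On the server you would iterate this generator and re-emit each delta to the browser over your own SSE or NDJSON channel.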
Frontend Rendering
Rendering streamed tokens requires balancing responsiveness with performance:
Naive approach: Append each token to the DOM as it arrives. Simple but can cause performance issues with rapid updates.
Batched rendering: Accumulate tokens and render every 30-60ms or every 20-60 characters. This prevents "reflow storms" where constant DOM updates cause jank.
Virtual rendering: For very long responses, use virtualization to render only visible content. Full response stays in memory; only the visible portion updates the DOM.
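A minimal sketch of the batched approach, assuming a #response element and a roughly 50ms flush interval (tune both to your UI):

```typescript
// Batched rendering sketch: tokens accumulate in a string buffer and the DOM
// is updated on a fixed interval (~50ms here) instead of on every token.
const output = document.getElementById("response")!;
let pending = "";

function onToken(text: string) {
  pending += text; // cheap: string concatenation only
}

const flushTimer = setInterval(() => {
  if (!pending) return;
  output.textContent += pending; // one DOM write per tick, not per token
  pending = "";
}, 50);

function onStreamEnd() {
  output.textContent += pending; // flush whatever is left
  pending = "";
  clearInterval(flushTimer);
}
```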
Markdown Rendering
LLM responses often contain Markdown. Rendering Markdown while streaming is tricky:
Problem: Partial Markdown can be invalid. A table started but not finished can't render correctly. An unclosed code block breaks formatting.
Solutions:
- Buffer until Markdown structures complete (detect closing markers)
- Use incremental Markdown parsers that handle incomplete input
- Render as plain text during streaming, convert to formatted Markdown on completion
- Accept occasional rendering glitches during streaming for simplicity
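The first option can be sketched for the most common case, an unclosed code fence: anything after the last opening fence is held back until its closing fence arrives. Tables and nested lists would need similar checks; this is a sketch, not a full incremental parser.

```typescript
// Sketch: split the streamed buffer into a renderable prefix and a held-back
// suffix so the Markdown renderer never sees a half-open code block. Other
// structures (tables, nested lists) would need similar checks.
const FENCE = "`".repeat(3); // a Markdown code fence: three backticks

function splitRenderable(buffer: string): { renderable: string; held: string } {
  const fenceCount = buffer.split(FENCE).length - 1;
  if (fenceCount % 2 === 0) {
    // All code fences are balanced; render everything received so far.
    return { renderable: buffer, held: "" };
  }
  // An odd count means the last fence is still open; hold that part back.
  const lastFence = buffer.lastIndexOf(FENCE);
  return { renderable: buffer.slice(0, lastFence), held: buffer.slice(lastFence) };
}
```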
Code Highlighting
Streaming code presents similar challenges:
Language detection: Can't always determine the language until more code is visible
Syntax highlighting: Partial code may not parse correctly
Line numbers: Must update as lines are added
Consider delaying syntax highlighting until code blocks complete, or use heuristic-based highlighting that tolerates incomplete input.
Streaming with Tool Calls
Modern LLMs support tool/function calling. Streaming with tool calls adds complexity.
Tool Call Detection
During streaming, the model may decide to call a tool. This is signaled in the stream:
OpenAI: Tool calls appear in chunks with tool_calls field. The tool name and arguments are streamed incrementally.
Anthropic: Tool use is indicated by tool_use content blocks. Arguments stream as JSON fragments.
Your application must detect when tool calls begin, accumulate the complete tool call specification, and execute the tool.
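A sketch of that accumulation for the OpenAI-style format, where each chunk's delta.tool_calls entry carries an index plus fragments of the id, function name, and JSON arguments string:

```typescript
// Sketch of accumulating OpenAI-style streamed tool calls. Each chunk's delta
// may carry a fragment keyed by index; the JSON arguments string grows across
// chunks and is only parseable once the stream signals completion.
interface PendingToolCall {
  id?: string;
  name?: string;
  argumentsJson: string;
}

const pendingCalls = new Map<number, PendingToolCall>();

function onDelta(delta: any) {
  for (const fragment of delta.tool_calls ?? []) {
    const call = pendingCalls.get(fragment.index) ?? { argumentsJson: "" };
    if (fragment.id) call.id = fragment.id;
    if (fragment.function?.name) call.name = fragment.function.name;
    if (fragment.function?.arguments) call.argumentsJson += fragment.function.arguments;
    pendingCalls.set(fragment.index, call);
  }
}

// Once finish_reason === "tool_calls", parse each accumulated argument string.
function finalizeToolCalls() {
  return [...pendingCalls.values()].map((call) => ({
    id: call.id,
    name: call.name,
    arguments: JSON.parse(call.argumentsJson || "{}"),
  }));
}
```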
Parallel Tool Execution
Multiple tool calls may be requested:
Sequential streaming: Tool calls appear one after another in the stream. Wait for all tool specifications before executing.
Parallel execution: Once all tool calls are known, execute them in parallel for efficiency.
Progress feedback: While tools execute, show users that work is happening. "Searching..." or "Calculating..." maintains the real-time feel.
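A sketch of the parallel-execution step, assuming executeTool and notifyProgress are your own application helpers rather than part of any provider SDK:

```typescript
// Sketch: run all known tool calls concurrently and keep the UI informed.
declare function executeTool(name: string, args: unknown): Promise<unknown>;
declare function notifyProgress(message: string): void;

interface ToolCall {
  id?: string;
  name?: string;
  arguments: unknown;
}

async function runToolCalls(calls: ToolCall[]) {
  return Promise.all(
    calls.map(async (call) => {
      notifyProgress(`Running ${call.name}...`); // e.g. "Searching..."
      try {
        return { id: call.id, output: await executeTool(call.name!, call.arguments) };
      } catch (error) {
        // One failed tool should not abort its siblings; report the error as a result.
        return { id: call.id, output: { error: String(error) } };
      }
    })
  );
}
```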
Tool Results and Continuation
After tool execution, results return to the model for continuation:
Non-streaming continuation: Send tool results and get a complete response. Simpler but loses streaming benefits.
Streaming continuation: Send tool results and stream the continuation. Maintains the real-time experience but requires managing multiple streaming phases.
Cancellation During Tool Execution
Users may cancel during tool execution:
Before tool execution: Simply stop processing
During tool execution: May need to cancel in-flight operations
After tool execution: Tool results may be discarded; partial work may have side effects
Consider whether tool operations are idempotent and how cancellation affects system state.
Error Handling in Streams
Streams can fail at any point. Robust error handling maintains user experience.
Connection Failures
SSE auto-reconnection: EventSource automatically reconnects on connection loss. Include event IDs to enable resumption from the last received event.
WebSocket reconnection: Must implement manually. Maintain connection state and implement exponential backoff for reconnection attempts.
User feedback: Show connection status. "Reconnecting..." is better than silent failure.
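For the WebSocket case, here is a sketch of manual reconnection with exponential backoff and jitter; the URL and message handler are placeholders for your application's own.

```typescript
// Sketch of manual WebSocket reconnection with exponential backoff and jitter.
function connectWithRetry(url: string, onMessage: (data: string) => void, attempt = 0) {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // reset the backoff after a successful connection
  };

  socket.onmessage = (event) => onMessage(String(event.data));

  socket.onclose = () => {
    // Exponential backoff capped at 30s, with jitter to avoid thundering herds.
    const delay = Math.min(30_000, 1_000 * 2 ** attempt) * (0.5 + Math.random() / 2);
    setTimeout(() => connectWithRetry(url, onMessage, attempt + 1), delay);
  };
}
```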
Mid-Stream Errors
The stream may fail after some content is delivered:
Partial content handling: Decide whether to show partial content or discard it. For chat, partial responses are usually better than nothing.
Error indicators: Clearly indicate that generation was interrupted. "Response was interrupted. [Retry]" gives users a clear path forward.
Retry logic: Offer to retry from the beginning. For long responses, consider implementing checkpoint-based resumption.
Content Filter Interruptions
Models may stop generating due to content policy violations:
Detection: The finish_reason field indicates why generation stopped. Values like content_filter or length distinguish normal completion from interruptions.
User communication: If content was filtered, inform users appropriately. "The response was filtered due to content guidelines."
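A small sketch of mapping finish_reason values to user-facing messages; the value names follow the OpenAI API, and other providers report stop reasons through different fields.

```typescript
// Sketch: translate an OpenAI-style finish_reason into a message for the user.
function describeFinish(finishReason: string | null): string | null {
  switch (finishReason) {
    case "stop":
      return null; // normal completion, nothing to explain
    case "length":
      return "The response hit the maximum length and was cut off.";
    case "content_filter":
      return "The response was filtered due to content guidelines.";
    default:
      return finishReason ? `Generation stopped early (${finishReason}).` : null;
  }
}
```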
Timeout Handling
Streams can stall without explicitly failing:
Inter-token timeout: If no token arrives within a threshold (e.g., 30-60 seconds), assume the stream has failed.
Total timeout: Cap total stream duration. Even streaming responses should have upper bounds.
Graceful termination: On timeout, close the connection cleanly and show users what was received.
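A sketch of both timeouts as a watchdog around an AbortController: every received chunk resets the idle timer, and the total cap fires regardless. The 45-second and 5-minute values are illustrative, not recommendations.

```typescript
// Sketch of an inter-token watchdog plus a total-duration cap. Call tick() on
// every received chunk and stop() when the stream ends.
function createStreamWatchdog(controller: AbortController, interTokenMs = 45_000, totalMs = 300_000) {
  let idleTimer = setTimeout(() => controller.abort(), interTokenMs);
  const totalTimer = setTimeout(() => controller.abort(), totalMs);

  return {
    tick() {
      // A chunk arrived: push the idle deadline out again.
      clearTimeout(idleTimer);
      idleTimer = setTimeout(() => controller.abort(), interTokenMs);
    },
    stop() {
      clearTimeout(idleTimer);
      clearTimeout(totalTimer);
    },
  };
}
```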
Backpressure and Flow Control
When consumers can't keep up with producers, backpressure mechanisms prevent overwhelming the system.
Client-Side Backpressure
If the frontend can't render tokens as fast as they arrive:
Buffering: Accumulate tokens in memory, render at a sustainable rate. Risk: memory growth if the gap persists.
Token dropping: In extreme cases, drop tokens and show an indicator. Rare in practice since LLM generation isn't that fast.
Render throttling: Intentionally render at a fixed rate regardless of arrival rate. Smooths the visual experience.
Server-Side Backpressure
If your backend can't forward tokens as fast as the LLM generates:
Connection buffering: Most networking stacks buffer automatically. Monitor buffer sizes to detect problems.
Explicit flow control: Some streaming protocols support explicit pause/resume signals.
Adaptive behavior: If a client consistently falls behind, consider reducing quality of service rather than accumulating unbounded buffers.
Network Buffering Considerations
Proxies and load balancers can buffer streaming responses:
Nginx buffering: By default, Nginx buffers proxy responses. Disable with proxy_buffering off for streaming.
CDN behavior: Some CDNs don't handle streaming well. Test your specific CDN or bypass it for streaming endpoints.
Connection keep-alive: Ensure infrastructure doesn't close idle-seeming connections during slow streaming.
User Experience Patterns
Streaming enables specific UX patterns that improve the chat experience.
Cancellation (Stop Button)
Users should be able to stop generation at any time:
Immediate response: The stop button should work instantly. Use AbortController to cancel the request.
Clear feedback: Show that generation was stopped. The partial response remains visible.
Cost savings: Cancelled generations stop token billing (for most providers).
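A sketch of wiring a stop button to AbortController; the endpoint path and the showStatus helper are assumptions.

```typescript
// Sketch of a stop button wired to AbortController.
declare function showStatus(message: string): void;

const controller = new AbortController();

async function startGeneration(prompt: string) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
    signal: controller.signal, // aborting cancels the in-flight stream
  });
  // ...read response.body and render tokens as shown earlier...
}

document.getElementById("stop")?.addEventListener("click", () => {
  controller.abort(); // the next read() on the body rejects with an AbortError
  showStatus("Generation stopped."); // partial text already rendered stays visible
});
```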
Regeneration
After generation completes (or is stopped), offer regeneration:
Full regeneration: Request a new response from scratch. May differ due to model randomness.
Resume: For interrupted responses, resume from where it stopped. More complex but preserves partial progress.
Edit and Continue
Some interfaces allow editing the prompt and continuing:
Mid-conversation editing: User edits their message and regenerates the response. Requires re-running from the edited point.
Streaming implications: The previous stream must be fully cancelled before starting a new one.
Typing Indicators
Show when the AI is "thinking" before tokens arrive:
Before streaming: Show indicator during TTFT wait
During streaming: Tokens provide their own indicator; typing indicator may be unnecessary
During tool execution: Show what's happening ("Searching web...")
Progressive Disclosure
For long responses, consider progressive disclosure:
Collapsible sections: Show summaries that expand to full content
Scroll anchoring: Keep the viewport stable as new content appears below
"More" loading: Paginate very long responses
Production Best Practices
Lessons from operating streaming at scale.
Performance Optimization
Minimize TTFT: Time-to-first-token is the key metric. Optimize everything between user input and first token appearance: API latency, routing decisions, preprocessing.
Batch client updates: Don't update the DOM on every token. Batch updates every 30-60ms to prevent jank.
Avoid large payloads: Keep individual stream events small. Large events increase latency variance.
Connection reuse: Maintain persistent connections where possible. Connection establishment adds latency.
Monitoring and Metrics
Track streaming-specific metrics:
TTFT distribution: P50, P95, P99 time-to-first-token
Token rate: Tokens per second during streaming
Stream completion rate: What percentage of streams complete successfully vs. error/cancel?
Stream duration distribution: How long do streams typically last?
Anti-Patterns to Avoid
Common mistakes in streaming implementations:
Sending entire conversation on every turn: Use summaries or sliding windows to manage context. Resending everything is slow and expensive.
Rendering every single token: Coalesce into small batches for performance.
No abort path: Users get stuck waiting. Always implement cancellation.
Custom binary framing: Use SSE/NDJSON when they work. Custom protocols add complexity without benefit for most use cases.
Ignoring proxy buffering: Production proxies buffer by default. Explicitly disable for streaming endpoints.
Security Considerations
Streaming introduces security considerations:
Authentication: Validate auth before streaming begins. Don't stream to unauthorized users.
Content filtering: Apply output filters even during streaming. Don't let partial streams bypass moderation.
Resource limits: Prevent abuse through maximum stream duration and token limits.
Connection limits: Limit concurrent streams per user to prevent resource exhaustion.
Implementation Patterns by Framework
Different frameworks have different streaming approaches.
Next.js / React
Next.js supports streaming through React Server Components and the Vercel AI SDK:
Edge Runtime: Use edge functions for lower latency streaming
Vercel AI SDK: Provides hooks like useChat that handle streaming automatically
ReadableStream: For custom implementations, use Web Streams API
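A sketch of a custom App Router route handler built directly on the Web Streams API; streamDeltas stands in for whatever yields text chunks from your LLM provider, and the Vercel AI SDK offers higher-level helpers for the same pattern.

```typescript
// Sketch of a streaming route handler (Next.js App Router style).
declare function streamDeltas(prompt: string): AsyncIterable<string>;

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const delta of streamDeltas(prompt)) {
        controller.enqueue(encoder.encode(delta)); // push each chunk to the client
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8", "Cache-Control": "no-cache" },
  });
}
```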
Python / FastAPI
FastAPI supports streaming through StreamingResponse:
Generator functions: Yield chunks from an async generator
SSE libraries: Use sse-starlette for proper SSE formatting
Async iteration: Stream directly from async LLM client responses
Node.js / Express
Express streaming through chunked transfer encoding:
res.write(): Send chunks as they arrive
Server-Sent Events: Use libraries like express-sse or implement manually
Backpressure: Respect drain events on the response stream
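A sketch of manual SSE over Express that covers all three points; generateTokens stands in for your LLM streaming client, and the X-Accel-Buffering header is the conventional way to disable Nginx proxy buffering for a single response.

```typescript
// Sketch: manual SSE over Express with backpressure handling.
import express from "express";

declare function generateTokens(prompt: string): AsyncIterable<string>;

const app = express();

app.get("/stream", async (req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no", // ask Nginx not to buffer this response
  });

  for await (const token of generateTokens(String(req.query.prompt ?? ""))) {
    // Respect backpressure: if the socket buffer is full, wait for 'drain'.
    const ok = res.write(`data: ${JSON.stringify({ token })}\n\n`);
    if (!ok) await new Promise((resolve) => res.once("drain", resolve));
  }

  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```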
Rails
Rails supports streaming through ActionController::Live:
SSE: Rails has built-in SSE support
Turbo Streams: Hotwire's Turbo Streams provide an alternative streaming approach
Thread safety: Be aware of thread safety when streaming in Rails