Conversation State Management for LLM Applications
Comprehensive guide to managing conversation state in LLM applications. Covers memory architectures, context window management, summarization strategies, long-term memory systems, and 2025 approaches including Mem0 and hierarchical memory.
LLMs are stateless. Each API call processes its input independently, with no inherent memory of previous interactions. Yet users expect conversations—coherent multi-turn dialogues where the AI remembers what was discussed, what preferences were expressed, and what context matters.
Bridging this gap between stateless models and stateful conversations is one of the core challenges in LLM application development. This guide covers production strategies for conversation state management: from basic context accumulation to sophisticated memory architectures that report up to 90% token cost reductions and 26% relative quality improvements over naive approaches.
The State Management Challenge
Understanding why conversation state is hard helps design better solutions.
The Stateless Reality
Every LLM API call is independent. The model doesn't remember your previous calls. When users say "tell me more about that," the model has no idea what "that" refers to unless you explicitly provide that context in the current request.
Applications create the illusion of memory by including conversation history in each request. The model reads the history as context and responds as if it remembers—but it's actually just reading and responding to text.
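A minimal sketch of this pattern, assuming a generic chat-completion client (call_llm here is a hypothetical stand-in, not a specific vendor API):

```python
# The application re-sends the accumulated transcript on every call; the model
# only "remembers" what is included in the messages list.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    raise NotImplementedError

def send_turn(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the model sees the full transcript every time
    history.append({"role": "assistant", "content": reply})
    return reply
```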
Context Window Limits
The fundamental constraint is the context window—the maximum number of tokens the model can process in a single request. Context windows have grown dramatically (from 4K to 200K+ tokens), but they're still finite. Long conversations eventually exceed any window.
When history exceeds the context window, something must give:
- Truncate older messages (losing potentially relevant context)
- Summarize history (losing detail)
- Selectively include relevant portions (requiring relevance detection)
Cost Implications
Including full conversation history is expensive. Cumulative token costs grow quadratically with conversation length, because each new request resends all previous messages. A 50-turn conversation might consume 50,000+ tokens in context alone, before the new query and response.
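A back-of-the-envelope illustration of that quadratic growth, assuming roughly 200 tokens per exchange:

```python
# Each turn resends all earlier turns, so cumulative billed context tokens grow
# with the square of conversation length (illustrative numbers only).
TOKENS_PER_EXCHANGE = 200
turns = 50
cumulative = sum(TOKENS_PER_EXCHANGE * t for t in range(1, turns + 1))
print(cumulative)  # 255000 context tokens billed across the conversation
```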
Smart state management isn't just about fitting in context windows—it's about controlling costs while maintaining conversation quality.
Basic Approaches
Start simple, add complexity only when needed.
Full History Inclusion
The simplest approach includes the complete conversation history in each request. Every user message and assistant response is sent as context for the next turn.
Advantages: Perfect recall. No information loss. Simple implementation.
Disadvantages: Costs grow quadratically. Eventually exceeds context limits. Performance degrades with very long contexts (models struggle with "lost in the middle" problems).
When to use: Short conversations (under 10-20 turns), cost-insensitive applications, or use cases requiring perfect recall.
Sliding Window
Keep only the most recent N messages, discarding older history as new messages arrive.
Advantages: Bounded cost and context size. Simple implementation. Works well when recent context is most relevant.
Disadvantages: Abrupt information loss. A user might reference something from early in the conversation that's no longer in context.
When to use: Conversations where recent context matters most, or as a simple fallback when more sophisticated approaches aren't needed.
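A minimal sliding-window sketch (the window size of 10 messages is illustrative):

```python
MAX_MESSAGES = 10  # illustrative window size

def build_context(system_prompt: str, history: list[dict]) -> list[dict]:
    recent = history[-MAX_MESSAGES:]  # keep only the most recent N messages
    return [{"role": "system", "content": system_prompt}, *recent]
```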
Token-Based Truncation
Similar to sliding window, but truncate based on token count rather than message count. Keep messages totaling under a token budget.
Advantages: More precise control over context size. Adapts to varying message lengths.
Disadvantages: Same abrupt information loss. May cut mid-message if not careful.
When to use: When precise token budget control is needed, especially for cost management.
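A sketch of token-budget truncation that never cuts mid-message, assuming tiktoken as the tokenizer (any tokenizer matched to your model works):

```python
import tiktoken  # assumption: tiktoken approximates your model's tokenizer

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_budget(history: list[dict], budget: int = 3000) -> list[dict]:
    """Keep the most recent whole messages that fit under the token budget."""
    kept, used = [], 0
    for msg in reversed(history):                # walk from newest to oldest
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break                                # stop before cutting mid-message
        kept.append(msg)
        used += cost
    return list(reversed(kept))                  # restore chronological order
```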
Summarization Strategies
Summarization compresses conversation history, preserving essential information in fewer tokens.
Periodic Summarization
At regular intervals (every N turns or when context reaches a threshold), summarize the conversation so far and replace detailed history with the summary.
How it works: When the conversation reaches 20 turns, use the LLM to generate a concise summary. Future requests include only the summary plus recent messages.
Summary prompt design: Ask for specific elements: key topics discussed, decisions made, user preferences expressed, important facts mentioned, open questions remaining.
Advantages: Bounded context growth. Preserves important information across long conversations.
Disadvantages: Summarization loses detail. The summary itself costs tokens to generate. Summarization latency adds to response time.
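A sketch of the periodic pattern; summarize() is a hypothetical helper that prompts the LLM for the elements listed above, and the thresholds are illustrative:

```python
SUMMARY_EVERY = 20  # summarize once the conversation reaches 20 turns
KEEP_RECENT = 6     # always keep the last few messages in full detail

def summarize(messages: list[dict]) -> str:
    """Hypothetical LLM call: key topics, decisions, preferences, open questions."""
    raise NotImplementedError

def maybe_compact(history: list[dict]) -> list[dict]:
    if len(history) < SUMMARY_EVERY:
        return history
    summary = summarize(history[:-KEEP_RECENT])
    return [{"role": "system", "content": f"Conversation so far: {summary}"},
            *history[-KEEP_RECENT:]]
```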
Hierarchical Summarization
Rather than one summary, maintain summaries at multiple granularities:
Recent messages: Full detail, last 5-10 turns
Session summary: Key points from this conversation session
Long-term summary: Persistent themes and preferences across sessions
This hierarchy provides detail for recent context while preserving high-level information from older interactions.
Incremental Summarization
Instead of periodic batch summarization, update the summary incrementally with each turn:
After each exchange, briefly update the running summary with any new important information. This spreads summarization cost across turns and avoids latency spikes.
Advantages: Smoother latency profile. Always-current summary.
Disadvantages: More complex to implement. Each turn incurs summarization overhead.
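A sketch of the incremental variant; update_summary() stands in for a short LLM call made after each exchange:

```python
def update_summary(summary: str, user_msg: str, assistant_msg: str) -> str:
    """Hypothetical LLM call: fold new important details into the running summary."""
    raise NotImplementedError

class RunningSummary:
    def __init__(self) -> None:
        self.summary = ""

    def on_turn(self, user_msg: str, assistant_msg: str) -> None:
        # A small, bounded cost every turn instead of a large periodic spike.
        self.summary = update_summary(self.summary, user_msg, assistant_msg)
```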
Subconscious Memory Formation
Emerging approaches process conversations after they occur rather than during:
Background processing: After a conversation ends (or during idle periods), a background process analyzes the conversation, extracts patterns, and updates memory.
Advantages: No latency impact on conversations. Can do more sophisticated analysis.
Disadvantages: Memory updates aren't immediately available. Requires background processing infrastructure.
Memory Architectures
Beyond simple history management, sophisticated memory architectures enable more nuanced conversation state.
Memory Types
Cognitive architectures distinguish several memory types:
Sensory memory: The immediate input—the current user message and API request.
Short-term memory (Working memory): The active context window. Information currently being processed. Cleared after each session.
Long-term memory: Persistent storage across sessions. Requires external storage (databases, vector stores) since it exceeds what fits in context.
Short-Term Memory Management
Short-term memory is what's in the current context window:
Conversation buffer: Recent messages in full detail
Active context: Retrieved information relevant to current query
Session state: Variables and flags tracking conversation state
The challenge is deciding what stays in short-term memory and what gets compressed or moved to long-term storage.
Long-Term Memory Implementation
Long-term memory persists across sessions and requires external storage:
Vector stores: Store conversation snippets as embeddings. Retrieve semantically relevant history when needed.
Structured databases: Store facts, preferences, and entities extracted from conversations. Query by entity or relationship.
Graph databases: Model relationships between conversation elements. Enable complex queries about conversation history.
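A simplified in-memory sketch of the vector-store approach; embed() is a hypothetical embedding call, and a real deployment would use an actual vector database rather than Python lists:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call."""
    raise NotImplementedError

class LongTermMemory:
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def store(self, snippet: str) -> None:
        self.texts.append(snippet)
        self.vectors.append(embed(snippet))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        sims = [float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
                for v in self.vectors]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.texts[i] for i in top]
```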
Multi-Level Hierarchies
Advanced systems use multi-level memory hierarchies:
Core memory: Always-included fundamental information (user identity, critical preferences)
Episodic memory: Specific past interactions, retrievable by similarity
Semantic memory: Extracted facts and knowledge, organized conceptually
Procedural memory: Learned patterns and behaviors
Systems like MIRIX and MemoryOS implement these hierarchies, achieving significant improvements over flat memory approaches.
Context Window Management
Even with sophisticated memory, context window management remains essential.
Strategic Context Assembly
Each request should assemble context strategically:
Fixed elements: System prompt, core instructions (always included)
Dynamic context: Relevant retrieved information
Conversation history: Recent messages or summarized history
Current query: The user's new message
Budget tokens across these categories. If context is tight, compress history rather than cutting retrieved information.
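A sketch of budgeted assembly with illustrative per-category budgets and a placeholder token counter:

```python
BUDGETS = {"retrieved": 1500, "history": 2000}  # illustrative token budgets

def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder; use your model's tokenizer

def fit(texts: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for t in texts:
        cost = count_tokens(t)
        if used + cost > budget:
            break
        kept.append(t)
        used += cost
    return kept

def assemble(system: str, retrieved: list[str], history: list[str], query: str) -> str:
    parts = [system,                                  # fixed elements always included
             *fit(retrieved, BUDGETS["retrieved"]),   # keep retrieved context
             *fit(history, BUDGETS["history"]),       # compress history first when tight
             query]
    return "\n\n".join(parts)
```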
Priority-Based Inclusion
Not all context is equally important. Prioritize:
- Current query (must include)
- System instructions (must include)
- Directly relevant history (references to current topic)
- Retrieved context (RAG results)
- Recent history (last few turns)
- Older history (earlier conversation)
When trimming, cut from the bottom of the priority list.
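A sketch of priority-ordered trimming, where each context item carries a priority (0 = must include) and optional items are dropped when the budget runs out:

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder token counter

def trim_by_priority(items: list[tuple[int, str]], budget: int) -> list[str]:
    """items: (priority, text) pairs; priority 0 means must-include."""
    kept, used = [], 0
    for priority, text in sorted(items, key=lambda pair: pair[0]):
        cost = count_tokens(text)
        if priority > 0 and used + cost > budget:
            continue              # cut from the bottom of the priority list
        kept.append(text)
        used += cost
    return kept
```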
Relevance Filtering
Rather than including all recent history, filter for relevance:
Semantic similarity: Include only history semantically similar to the current query
Entity matching: Include history mentioning entities in the current query
Topic detection: Include history from the same conversation topic
This is essentially RAG applied to conversation history.
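A sketch of semantic-similarity filtering over history; embed() is a hypothetical embedding call and the similarity threshold is illustrative:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call."""
    raise NotImplementedError

def relevant_history(history: list[str], query: str, threshold: float = 0.75) -> list[str]:
    q = embed(query)
    q = q / np.linalg.norm(q)
    keep = []
    for msg in history:
        v = embed(msg)
        if float(v @ q / np.linalg.norm(v)) >= threshold:  # cosine similarity
            keep.append(msg)
    return keep
```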
The "Lost in the Middle" Problem
Models struggle to attend to information in the middle of long contexts. Information at the beginning and end is better utilized than information in the middle.
Mitigation strategies:
- Place critical information at the beginning or end
- Use markers or formatting to highlight important sections
- Periodically "refresh" important context by moving it to recent positions
2025 Memory Systems
Specialized memory systems have emerged for production LLM applications.
Mem0
Mem0 provides intelligent memory management for LLM applications:
Adaptive memory: Automatically determines what to remember and what to forget
Multi-level storage: Short-term and long-term memory with automatic promotion
Semantic retrieval: Retrieves relevant memories based on query meaning
Mem0 reports 91% lower P95 latency and >90% token cost reduction compared to full-context approaches.
LangMem
LangMem from LangChain provides memory primitives:
Hot path memory: Updates during conversation for critical information
Background memory: Reflective processing after conversations
Memory tools: LLM-callable tools for memory operations
Reflective Memory Management (RMM)
Recent research introduces reflective memory management:
Adaptive granularity: Memory stored at appropriate levels (utterance, turn, session, topic)
Feedback-driven refinement: Uses response citations to improve memory retrieval
Online learning: Continuously improves memory relevance through reinforcement learning
RMM achieves 26% relative improvement on LLM judge metrics compared to non-reflective approaches.
Session and User Management
Production applications must manage state across multiple users and sessions.
Session Isolation
Each conversation session should have isolated state:
Session ID: Unique identifier for each conversation
Session storage: State associated with the session (history, memory, preferences)
Session lifecycle: Clear creation, continuation, and termination
Prevent state leakage between sessions—one user's conversation shouldn't influence another's.
User-Level Memory
Persistent information across sessions for the same user:
User profile: Name, preferences, established context
Interaction history: Summary of past sessions
Learned preferences: Patterns observed across interactions
This enables continuity: "Last time we discussed X" or "Based on your preference for Y..."
Multi-Tenant Considerations
For applications serving multiple organizations:
Tenant isolation: Strict separation of data between tenants
Memory policies: Different retention and access policies per tenant
Compliance: Audit trails, data deletion, export capabilities
Security is critical—memory systems are high-value targets containing user conversations.
Redis for Conversation State: Deep Dive
Redis has become the standard infrastructure for LLM conversation state. Its sub-millisecond latency, flexible data structures, and built-in expiration make it ideal for session management, conversation history, and memory retrieval.
Why Redis for Conversation State
Redis addresses the core requirements of conversation state management:
Ultra-fast read/write operations: Redis delivers sub-millisecond latency (<1ms typical). When every LLM request needs to fetch conversation history, this speed is essential for responsive user experiences.
Flexible data structures: Redis supports strings, hashes, lists, sets, sorted sets, and streams. Conversation state naturally maps to these structures—hashes for session metadata, lists for message history, sorted sets for time-ordered retrieval.
Built-in TTL expiration: Conversation sessions should expire automatically. Redis's native TTL support means abandoned sessions clean up without manual intervention—set a 24-hour TTL and sessions expire automatically.
Persistence options: Redis can persist to disk (RDB snapshots, AOF logs), ensuring conversation state survives restarts. For critical conversations, persistence prevents data loss.
Pub/Sub for real-time updates: Redis Pub/Sub enables real-time notifications when conversation state changes—useful for multi-device sync or agent collaboration.
Redis Data Structures for Conversations
Different data structures serve different conversation state needs:
Hashes for session metadata: Store structured session data—user ID, session start time, current topic, preferences, slot values. Hashes allow atomic updates to individual fields without rewriting the entire session.
Lists for message history: Store conversation messages as a list. LPUSH adds new messages; LRANGE retrieves recent messages. Lists maintain insertion order naturally.
Sorted sets for time-ordered retrieval: When you need messages by timestamp (for summarization windows or relevance filtering), sorted sets with timestamp scores enable efficient range queries.
Strings for simple state: Session-level flags, counters, or small JSON blobs. Atomic operations (INCR, APPEND) enable safe concurrent updates.
Streams for conversation events: For complex applications, Redis Streams provide append-only logs of conversation events with consumer group support. Useful for multi-consumer scenarios or audit logging.
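A minimal redis-py sketch combining a hash for session metadata with a list for message history and a 24-hour TTL (key names are illustrative):

```python
import json
import redis  # assumption: redis-py client with a local Redis instance

r = redis.Redis(decode_responses=True)

def save_turn(session_id: str, role: str, content: str, ttl: int = 86400) -> None:
    msg_key = f"session:{session_id}:messages"
    meta_key = f"session:{session_id}:meta"
    r.rpush(msg_key, json.dumps({"role": role, "content": content}))  # keeps order
    r.hset(meta_key, mapping={"last_role": role})                     # atomic field update
    r.expire(msg_key, ttl)
    r.expire(meta_key, ttl)

def recent_messages(session_id: str, n: int = 10) -> list[dict]:
    raw = r.lrange(f"session:{session_id}:messages", -n, -1)          # last n messages
    return [json.loads(m) for m in raw]
```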
RedisVL Session Management
RedisVL provides purpose-built session management for LLM applications:
StandardSessionManager: Stores messages with role (system, user, assistant) and content fields aligned with LLM API formats. Add messages individually or in batches; retrieve by count or token limit.
SemanticSessionManager: Combines session storage with vector similarity search. Rather than returning all recent messages, returns semantically relevant portions of conversation history. Useful for long conversations where not all history is relevant to the current query.
Context window management: Configure how many messages to retrieve. During each LLM call, only the configured number of recent messages are fetched, effectively implementing sliding window within Redis.
TTL Strategies for Sessions
Different TTL strategies suit different applications:
Fixed session TTL: All sessions expire after a fixed duration (e.g., 24 hours). Simple to implement; suitable when sessions have predictable lifespans.
Activity-based TTL: Reset TTL on each interaction. Sessions expire after inactivity (e.g., 30 minutes idle). Keeps active sessions alive while cleaning up abandoned ones.
Sliding TTL with maximum: Reset TTL on activity up to a maximum total duration. Prevents indefinitely-long sessions while accommodating active usage.
Per-message TTL: Individual messages expire independently. Useful for privacy-sensitive applications where older messages should be deleted even if the session continues.
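A sketch of the activity-based and sliding-with-maximum variants using redis-py EXPIRE (durations are illustrative):

```python
import time
import redis  # assumption: redis-py client

r = redis.Redis(decode_responses=True)

IDLE_TTL = 30 * 60          # expire after 30 minutes of inactivity
MAX_LIFETIME = 8 * 60 * 60  # hard cap for the sliding-with-maximum variant

def touch_session(session_id: str) -> None:
    """Reset the idle timer on each interaction, capped at a maximum lifetime."""
    meta_key = f"session:{session_id}:meta"
    msg_key = f"session:{session_id}:messages"
    started = r.hget(meta_key, "started_at")
    if started is None:
        started = int(time.time())
        r.hset(meta_key, "started_at", started)
    remaining = MAX_LIFETIME - (int(time.time()) - int(started))
    ttl = max(1, min(IDLE_TTL, remaining))
    r.expire(msg_key, ttl)
    r.expire(meta_key, ttl)
```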
Scaling Redis for Conversations
As conversation volume grows, Redis scaling strategies include:
Redis Cluster: Shards data across multiple nodes. Conversation keys (with session IDs) distribute naturally across shards. Each conversation stays on one shard, ensuring atomic operations.
Read replicas: For read-heavy workloads (many reads per write), replicas handle read traffic while the primary handles writes. Conversation retrieval often exceeds write volume.
Memory optimization: Use appropriate data structures. Hashes are more memory-efficient than individual keys for related data. Consider compression for large message content.
Connection pooling: Maintain connection pools rather than creating connections per request. Connection establishment adds latency; pooling eliminates this overhead.
LangGraph Integration with Redis
LangGraph integrates with Redis for agent memory and persistence:
Checkpoint savers: Redis stores LangGraph checkpoints, enabling conversation resumption after interruptions. Thread-level "short-term memory" persists across agent steps.
Redis Store for long-term memory: Cross-thread memory that persists across conversations. Enables agents to remember user preferences, learned facts, and interaction patterns.
Vector database integration: Redis serves as both the checkpointer and the vector database for memory retrieval—single infrastructure for multiple memory needs.
Redis Architecture for Production
A production Redis deployment for LLM conversations typically includes:
Session store: Hash-based session metadata with message lists. Per-session TTLs based on activity patterns.
Memory retrieval layer: Vector indexes for semantic memory search. SemanticSessionManager for relevant history retrieval.
Rate limiting: Request counters per user/session. Token bucket implementations for API cost control.
Analytics pipeline: Stream-based event logging for conversation analytics. Consumer groups for processing.
Performance Best Practices
Optimize Redis for conversation workloads:
Pipeline operations: Batch multiple Redis commands into single round-trips. Retrieving session metadata and message history can be a single pipelined request.
Lua scripts: For complex atomic operations (like "add message and trim history to N messages"), Lua scripts execute atomically on the server without round-trips; see the sketch after this list.
Memory monitoring: Track memory usage per session. Alert on unusually large sessions that might indicate bugs or abuse. Redis INFO provides detailed memory statistics.
Backup strategy: Regular RDB snapshots for point-in-time recovery. AOF for durability of recent operations. Test restoration procedures.
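A sketch of the Lua pattern mentioned above, registered through redis-py so that appending a message and trimming history happen atomically in one server-side call:

```python
import json
import redis  # assumption: redis-py client

r = redis.Redis(decode_responses=True)

# Atomic "append message and trim history to N entries" without extra round-trips.
ADD_AND_TRIM = r.register_script("""
redis.call('RPUSH', KEYS[1], ARGV[1])
redis.call('LTRIM', KEYS[1], -tonumber(ARGV[2]), -1)
return redis.call('LLEN', KEYS[1])
""")

def add_message(session_id: str, role: str, content: str, max_messages: int = 50) -> int:
    payload = json.dumps({"role": role, "content": content})
    return ADD_AND_TRIM(keys=[f"session:{session_id}:messages"],
                        args=[payload, max_messages])
```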
Implementation Patterns
Practical patterns for building conversation state management.
State Machine Conversations
Model conversations as state machines:
States: Conversation phases (greeting, information gathering, task execution, closing)
Transitions: Conditions that move between states
State-specific context: Each state may need different context assembly
This structure helps manage what context is relevant at each conversation phase.
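A minimal state-machine sketch; phase names and transition conditions are illustrative:

```python
from enum import Enum, auto

class Phase(Enum):
    GREETING = auto()
    GATHERING = auto()
    EXECUTING = auto()
    CLOSING = auto()

NEXT = {Phase.GREETING: Phase.GATHERING,
        Phase.GATHERING: Phase.EXECUTING,
        Phase.EXECUTING: Phase.CLOSING}

def next_phase(current: Phase, task_ready: bool, task_done: bool) -> Phase:
    if current is Phase.GATHERING and not task_ready:
        return current                 # keep gathering until required info is present
    if current is Phase.EXECUTING and not task_done:
        return current
    return NEXT.get(current, Phase.CLOSING)
```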
Slot Filling
For goal-oriented conversations, track what information has been gathered:
Slots: Information needed to complete the task (name, date, preference)
Filled status: Which slots have values
Gathering strategy: How to ask for missing information
Include slot status in context so the model knows what's still needed.
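A slot-tracking sketch with illustrative slot names; the status string can be appended to the system prompt so the model knows what remains to be asked:

```python
REQUIRED_SLOTS = ("name", "date", "preference")  # illustrative slot names

def missing_slots(slots: dict[str, str | None]) -> list[str]:
    return [s for s in REQUIRED_SLOTS if not slots.get(s)]

def slot_status_prompt(slots: dict[str, str | None]) -> str:
    filled = {k: v for k, v in slots.items() if v}
    missing = missing_slots(slots)
    return (f"Known so far: {filled}. "
            f"Still needed: {', '.join(missing) if missing else 'nothing'}.")
```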
Entity Tracking
Track entities (people, products, concepts) mentioned in conversation:
Entity extraction: Identify entities as they're mentioned
Entity context: Store relevant information about each entity
Entity inclusion: Include information about entities relevant to current query
This enables coherent discussion of complex topics with multiple entities.
Checkpoint and Recovery
For long conversations or agent workflows:
Checkpoints: Save conversation state at regular intervals
Recovery: Restore from checkpoint if processing fails
Resumption: Enable conversations to resume after interruptions
Performance Optimization
Memory operations can add significant latency if not optimized.
Caching Strategies
Embedding cache: Cache embeddings for messages to avoid recomputation
Retrieval cache: Cache retrieval results for similar queries
Summary cache: Reuse summaries until conversation changes significantly
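A sketch of the embedding cache keyed by content hash; embed() is a hypothetical embedding call:

```python
import hashlib

_cache: dict[str, list[float]] = {}

def embed(text: str) -> list[float]:
    """Hypothetical embedding call."""
    raise NotImplementedError

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)   # compute once, reuse on repeated messages
    return _cache[key]
```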
Async Memory Operations
Don't block on memory operations:
Async retrieval: Start memory retrieval before it's needed (while processing previous response)
Background updates: Update memory in background rather than during request
Parallel operations: Execute multiple memory operations simultaneously
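An asyncio sketch of non-blocking memory operations; both coroutines are hypothetical placeholders:

```python
import asyncio

async def retrieve_memories(query: str) -> list[str]:
    """Hypothetical async memory retrieval."""
    return []

async def update_memory(turn: dict) -> None:
    """Hypothetical background memory update."""
    return None

async def handle_turn(query: str, previous_turn: dict) -> list[str]:
    retrieval = asyncio.create_task(retrieve_memories(query))  # start early
    asyncio.create_task(update_memory(previous_turn))          # fire-and-forget update
    return await retrieval
```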
Memory Pruning
Prevent unbounded memory growth:
TTL (Time-to-Live): Automatically remove old memories
Relevance decay: Reduce retrieval score for older, unused memories
Explicit forgetting: Allow users or systems to delete specific memories
Security Considerations
Memory systems introduce security risks.
Data Exposure Risks
Research shows stored memory can be vulnerable:
MEXTRA attacks: Prompt attacks that extract stored memories through retrieval manipulation
Cross-session leakage: Memory from one session affecting another inappropriately
Inference attacks: Deducing sensitive information from memory patterns
Mitigation Strategies
Memory de-identification: Remove or anonymize PII in stored memories
User/session isolation: Strict access controls on memory retrieval
Input sanitization: Filter prompts that attempt memory extraction
Output monitoring: Detect when responses reveal unexpected memory content
Compliance Requirements
Memory systems may need:
Audit logging: Track all memory access and modifications
Data retention policies: Automatic deletion after retention periods
Export capabilities: Allow users to retrieve their stored data
Deletion capabilities: Support "right to be forgotten" requests