LLM Memory Systems: From MemGPT to Long-Term Agent Memory
Understanding memory architectures for LLM agents—MemGPT's hierarchical memory, Letta's agent framework, and patterns for building agents that learn and remember across conversations.
The Stateless Problem
LLMs are fundamentally stateless. Each conversation starts fresh—no memory of past interactions, no learning from experience. This "conversational amnesia" limits their usefulness for:
- Long-running assistants
- Personalized applications
- Agents that improve over time
- Multi-session workflows
From research: "The explosive growth of large language models (LLMs) has reshaped the AI landscape. Yet their core design is still fundamentally stateless. An LLM can only operate within a limited context window and loses more signal as that window grows, making it unable to reliably carry information forward across extended interactions. This limitation remains the key blocker to building truly persistent, collaborative, and personalized AI agents."
Understanding Memory Types: From Cognitive Science to LLMs
Before diving into technical implementations, it's essential to understand the different types of memory and how they map from human cognition to AI systems. This conceptual foundation will help you design more effective memory systems for your agents.
The Cognitive Science Foundation
Human memory isn't a single system—it's a collection of interconnected systems, each serving different purposes. Cognitive scientists have identified several distinct memory types, and remarkably, the same categorizations apply to effective LLM agent design.
┌─────────────────────────────────────────────────────────────────────────┐
│ MEMORY TYPES: HUMANS VS LLMs │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Cognitive Term Human Brain LLM Implementation │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Short-term Memory Immediate awareness Context window │
│ ~7 items, seconds ~128K tokens, one call │
│ │
│ Working Memory Active manipulation Scratchpad / reasoning │
│ Problem-solving space Chain-of-thought state │
│ │
│ Long-term Memory Permanent storage Vector DB / external │
│ Lifetime capacity Unlimited capacity │
│ │
│ Episodic Memory Personal experiences Conversation history │
│ "What happened" Session logs │
│ │
│ Semantic Memory Facts and knowledge Knowledge base / RAG │
│ "What I know" Retrieved documents │
│ │
│ Procedural Memory Skills and habits Fine-tuned weights │
│ "How to do" Learned behaviors │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Let's examine each memory type in detail.
Short-Term Memory: The Immediate Present
In humans: Short-term memory holds information you're currently aware of—the sentence you just read, the number someone told you, the name you're trying to remember. It's extremely limited (famously "7 ± 2 items") and decays within seconds unless actively maintained through rehearsal. When you're introduced to someone at a party and forget their name moments later, that's short-term memory failing. When you repeat a phone number to yourself while walking to find a pen, you're actively maintaining it in short-term memory.
In LLMs: The context window is the direct equivalent of short-term memory. It contains everything the model can "see" right now—the system prompt, conversation history, retrieved documents, and the current query. Once the context window is full, older information must be removed or summarized. When a conversation ends, this memory vanishes entirely—the model has no recollection that the conversation ever happened.
The fundamental constraint: Just as humans can't hold an entire book in short-term memory while reading, LLMs can't hold unlimited conversation history in their context window. This constraint shapes every memory system design. Every architectural decision in LLM memory systems ultimately traces back to this limitation.
Why Short-Term Memory Matters
Consider what happens in a typical conversation with an AI assistant:
- Turn 1: User asks about Python data structures
- Turn 5: User mentions they're building a web scraper
- Turn 12: User asks "Can you help me with that project?"
For the assistant to understand "that project" in turn 12, it needs to remember the web scraper mentioned in turn 5. This is short-term memory at work—maintaining conversational context across turns.
But what if the conversation has 100 turns? Or 1,000? At some point, early turns fall outside the context window. The model literally cannot see them anymore. This is the cliff edge of short-term memory—information doesn't gradually fade like human memory; it's either fully present or completely gone.
The Capacity Illusion
Modern LLMs advertise impressive context windows—128K tokens for GPT-4, 200K for Claude. This sounds like a lot, but consider:
- A typical conversation turn is 100-500 tokens
- A retrieved document chunk is 500-2,000 tokens
- A code file is 1,000-10,000 tokens
- System prompts with instructions can be 500-2,000 tokens
A 128K context window sounds enormous until you're building an agent that needs conversation history, retrieved documents, tool outputs, and a detailed system prompt. Suddenly you're making hard choices about what to keep and what to discard.
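To make the arithmetic concrete, here is a rough back-of-the-envelope sketch; the numbers are illustrative assumptions rather than measurements:
# Back-of-the-envelope: how quickly a 128K-token window fills up.
# All numbers below are illustrative assumptions, not measurements.
CONTEXT_WINDOW = 128_000

fixed_costs = {
    "system_prompt": 1_500,       # instructions and persona
    "tool_definitions": 2_000,    # tool schemas
    "retrieved_docs": 4 * 1_500,  # four retrieved chunks
    "code_file": 5_000,           # one medium-sized file
}

remaining = CONTEXT_WINDOW - sum(fixed_costs.values())
avg_turn_tokens = 300             # a typical conversation turn

print(f"Tokens left for conversation: {remaining}")
print(f"Approximate turns that fit: {remaining // avg_turn_tokens}")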
Characteristics of LLM short-term memory:
| Property | Description |
|---|---|
| Capacity | Fixed by model (4K to 200K+ tokens) |
| Duration | Single API call / session |
| Access speed | Instant (it's the input) |
| Fidelity | Perfect within window, zero outside |
| Cost | Tokens × price per token |
The Attention Problem: Lost in the Middle
Even within the context window, LLMs don't attend equally to all information. Research from Stanford and Berkeley ("Lost in the Middle: How Language Models Use Long Contexts") revealed a striking pattern: models perform best when relevant information appears at the very beginning or very end of the context. Information buried in the middle is often ignored or underweighted.
This "lost in the middle" phenomenon has profound implications:
- Document order matters: If you retrieve 10 documents, the order you present them affects whether the model uses them correctly
- Position-based strategies: Some systems deliberately place critical information at the start and end
- Retrieval quantity trade-offs: More retrieved context isn't always better—adding marginally relevant documents can actually hurt performance by pushing important information into the "dead zone"
Think of it like a student cramming for an exam. They remember the first topics they studied (primacy effect) and the last topics (recency effect), but everything in the middle blurs together. LLMs exhibit similar behavior, despite their mechanical nature.
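One common mitigation is to reorder retrieved documents so the strongest matches sit at the edges of the context rather than the middle. A minimal sketch, assuming the documents arrive already sorted by relevance:
def order_for_position_bias(docs_by_relevance: list[str]) -> list[str]:
    """
    Place the most relevant documents at the start and end of the context,
    pushing weaker matches toward the middle "dead zone".
    Assumes docs_by_relevance is sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: rank 1 -> front, rank 2 -> back, rank 3 -> front, ...
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: ranks 1..6 end up ordered 1, 3, 5, 6, 4, 2
print(order_for_position_bias(["d1", "d2", "d3", "d4", "d5", "d6"]))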
Managing Short-Term Memory
Short-term memory isn't something you "implement"—it's a constraint you work within. But you can manage it intelligently:
Eviction strategies determine what gets removed when the context fills up:
- FIFO (First-In-First-Out): Remove the oldest messages. Simple but may lose important early context.
- Importance-based: Score messages by relevance to current task, keep the important ones.
- Summary-based: Compress old messages into a summary before evicting.
- Sliding window with anchors: Keep a sliding window of recent messages plus "anchored" important ones.
Compression strategies reduce token usage without losing information:
- Summarization: Replace verbose exchanges with concise summaries.
- Entity extraction: Pull out key facts, discard the conversation that revealed them.
- Selective retention: Keep user messages, summarize assistant messages (or vice versa).
class ShortTermMemory:
"""
Short-term memory in LLMs is simply the context window.
This class manages what goes into that window.
"""
def __init__(self, max_tokens: int = 8000):
self.max_tokens = max_tokens
self.messages: list[dict] = []
def add(self, role: str, content: str) -> None:
"""Add a message to short-term memory."""
self.messages.append({"role": role, "content": content})
# Short-term memory has hard limits
# When exceeded, oldest memories must go
while self._count_tokens() > self.max_tokens:
self._evict_oldest()
def _evict_oldest(self) -> None:
"""
Remove oldest non-system message.
This is the simplest eviction strategy.
More sophisticated approaches might:
- Summarize before evicting
- Move to long-term memory
- Prioritize by importance
"""
for i, msg in enumerate(self.messages):
if msg["role"] != "system":
self.messages.pop(i)
break
def get_context(self) -> list[dict]:
"""Return current short-term memory contents."""
return self.messages.copy()
def _count_tokens(self) -> int:
# Simplified token counting
return sum(len(m["content"]) // 4 for m in self.messages)
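The _evict_oldest method above is plain FIFO. A summary-based variant might compress old messages before dropping them; here is a sketch that operates on a ShortTermMemory instance, assuming a caller-supplied summarize helper (for example, an LLM call):
def evict_with_summary(memory, summarize) -> None:
    """
    Summary-based eviction for the ShortTermMemory class above.
    `summarize` is a caller-supplied function (e.g. an LLM call) that
    condenses text; it is an assumed helper, not part of any specific API.
    """
    evictable = [m for m in memory.messages if m["role"] != "system"]
    to_compress = evictable[: max(1, len(evictable) // 3)]
    if not to_compress:
        return
    digest = summarize(
        "\n".join(f"{m['role']}: {m['content']}" for m in to_compress)
    )
    # Drop the originals, keep a single compact summary in their place
    memory.messages = [m for m in memory.messages if m not in to_compress]
    insert_at = next(
        (i for i, m in enumerate(memory.messages) if m["role"] != "system"),
        len(memory.messages),
    )
    memory.messages.insert(
        insert_at,
        {"role": "system", "content": f"Summary of earlier turns: {digest}"},
    )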
Key insight: Short-term memory is not something you "implement"—it's a constraint you work within. Every other memory type exists to compensate for short-term memory's limitations.
Working Memory: The Mental Workspace
In humans: Working memory is not just storage—it's an active workspace where you manipulate information. When you do mental arithmetic (what's 47 × 8?), you're using working memory to hold intermediate results. When you plan a route, you're holding the current location, destination, and candidate paths in working memory simultaneously. It's where thinking happens.
The classic demonstration of working memory is the "N-back" task: you're shown a sequence of items and must identify when the current item matches the one from N steps ago. This requires simultaneously storing recent items, comparing them, and updating your mental register—all hallmarks of working memory in action.
In LLMs: Working memory manifests in two primary forms:
- Chain-of-thought reasoning: When an LLM "thinks step by step," it's using its output as a scratchpad, writing down intermediate results that inform subsequent reasoning. Each "Let me think about this..." or "First, I'll consider..." is the model externalizing its working memory.
- Explicit scratchpad: A dedicated section of the prompt where the agent can write notes, track state, and maintain awareness of its current goals. This is working memory made visible and persistent.
The Difference Between Working and Short-Term Memory
These terms are often confused, but the distinction matters:
- Short-term memory is about storage—holding information temporarily
- Working memory is about processing—actively manipulating information
An analogy: Short-term memory is your desk—a limited surface where you can place things. Working memory is what you're doing at that desk—the calculations, comparisons, and reasoning you perform with the items there.
For LLMs, the context window is short-term memory (the desk). Working memory is the subset of that context dedicated to active reasoning—the scratchpad, the current goal tracking, the hypotheses being evaluated.
Why Working Memory Matters for Agents
Consider an agent tasked with debugging a complex production issue:
- It needs to hold the error message in mind
- While also remembering the hypothesis it's currently testing
- And tracking which files it has already examined
- And maintaining awareness of what it tried that didn't work
- And keeping the user's original request in focus
Without explicit working memory, agents exhibit a frustrating pattern: they lose track of what they're doing mid-task. They might re-examine a file they already looked at. They might forget a hypothesis they were testing. They might lose sight of the original goal while chasing a tangent.
This is the "goldfish problem" in agent systems—without working memory, every step feels like starting fresh. The agent has the information in its context (short-term memory) but isn't actively using it to guide behavior (working memory).
Components of LLM Working Memory
A well-designed working memory system tracks several types of information:
Goal State:
- Current high-level objective ("Debug the authentication failure")
- Active sub-goals ("Check the token validation logic")
- Completed steps ("Verified database connection is working")
- Blocked paths ("Ruled out network issues")
Observations:
- Recent tool outputs (file contents, API responses, search results)
- Environmental state (current directory, open files, running processes)
- User feedback (corrections, clarifications, preferences)
Reasoning State:
- Active hypotheses ("Token might be expiring prematurely")
- Evidence for/against each hypothesis
- Confidence levels
- Decision criteria
Meta-cognitive State:
- How many steps have been taken
- Time/budget remaining
- When to escalate or ask for help
- What strategies have been tried
The Token Budget Problem
Working memory consumes context window tokens. Every goal you track, every observation you record, every hypothesis you maintain—it all costs tokens. This creates a fundamental tension:
┌─────────────────────────────────────────────────────────────────────────┐
│ CONTEXT WINDOW BUDGET │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Fixed costs: │
│ ├── System prompt: ~1,000 tokens │
│ ├── Tool definitions: ~2,000 tokens │
│ └── User's current message: ~200 tokens │
│ │
│ Variable costs (competing for remaining space): │
│ ├── Working memory: 500-3,000 tokens │
│ ├── Retrieved context: 1,000-5,000 tokens │
│ └── Conversation history: 1,000-10,000 tokens │
│ │
│ Total budget: 8,000-128,000 tokens (depending on model) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
More working memory means better reasoning but less room for conversation history and retrieved documents. This is a design decision with no universal right answer—it depends on whether your agent needs to reason deeply (expand working memory) or remember extensively (preserve history).
Working Memory Decay and Refresh
Human working memory decays rapidly—information fades within seconds without active maintenance (that's why you repeat the phone number to yourself). LLM working memory doesn't decay within a session, but it faces a different challenge: staleness.
An agent might start with the goal "Fix the login bug" but after 20 steps of investigation, that goal might no longer be accurate. Maybe the "login bug" is actually a database configuration issue. The working memory goal is now misleading.
Effective working memory systems include mechanisms for:
- Periodic review: Revisit goals and hypotheses regularly
- Contradiction detection: Flag when observations conflict with hypotheses
- Goal refinement: Update objectives as understanding deepens
- State compression: Summarize completed work to free up space
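One way to implement the periodic review mechanism above is to ask the model itself whether the recorded goal still matches the evidence. A sketch, assuming a generic llm.generate(prompt) helper that returns plain text:
def review_goal(llm, current_goal: str, recent_observations: list[str]) -> str:
    """
    Ask the model whether the stated goal is still accurate given what has
    been observed, and return a (possibly revised) goal.
    `llm.generate` is an assumed helper that returns plain text.
    """
    obs = "\n".join(f"- {o}" for o in recent_observations[-5:])
    prompt = (
        f"Current goal: {current_goal}\n"
        f"Recent observations:\n{obs}\n\n"
        "Is the goal still accurate? If yes, repeat it verbatim. "
        "If not, state the revised goal in one sentence."
    )
    return llm.generate(prompt).strip()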
The relationship between working and short-term memory: Working memory uses short-term memory as its substrate. The scratchpad contents consume context window tokens. This creates a tension: more working memory for complex reasoning means less space for conversation history or retrieved documents.
Working memory management strategies:
| Strategy | When to Use | Trade-off |
|---|---|---|
| Full state | Complex multi-step tasks | Uses many tokens |
| Compressed state | Long-running tasks | May lose detail |
| Key-value notes | Flexible exploration | Less structured |
| Goal stack only | Simple sequential tasks | Limited reasoning |
from datetime import datetime

class WorkingMemory:
"""
Working memory is the agent's scratchpad for active reasoning.
It tracks current goals, observations, and intermediate thoughts.
"""
def __init__(self):
# Current task state
self.current_goal: str = ""
self.sub_goals: list[str] = []
self.completed_steps: list[str] = []
# Observations from the environment
self.observations: list[dict] = []
# Scratchpad for reasoning
self.notes: dict[str, str] = {}
# Hypotheses being considered
self.hypotheses: list[dict] = []
def set_goal(self, goal: str) -> None:
"""Set the current high-level goal."""
self.current_goal = goal
self.sub_goals = []
self.completed_steps = []
def decompose_goal(self, sub_goals: list[str]) -> None:
"""Break down the goal into sub-goals."""
self.sub_goals = sub_goals
def complete_step(self, step: str, result: str) -> None:
"""Mark a step as completed with its result."""
self.completed_steps.append(f"{step}: {result}")
if step in self.sub_goals:
self.sub_goals.remove(step)
def add_observation(self, source: str, content: str) -> None:
"""Record an observation from a tool or environment."""
self.observations.append({
"source": source,
"content": content[:500], # Truncate to manage space
"timestamp": datetime.now().isoformat()
})
# Keep only recent observations in working memory
if len(self.observations) > 10:
self.observations.pop(0)
def note(self, key: str, value: str) -> None:
"""Store a note for later reference."""
self.notes[key] = value
def add_hypothesis(self, hypothesis: str, confidence: float) -> None:
"""Track a hypothesis being considered."""
self.hypotheses.append({
"hypothesis": hypothesis,
"confidence": confidence,
"evidence_for": [],
"evidence_against": []
})
def to_prompt(self) -> str:
"""
Serialize working memory for inclusion in prompt.
This is how working memory enters short-term memory.
"""
sections = []
if self.current_goal:
sections.append(f"## Current Goal\n{self.current_goal}")
if self.sub_goals:
goals_str = "\n".join(f"- [ ] {g}" for g in self.sub_goals)
completed_str = "\n".join(f"- [x] {s}" for s in self.completed_steps[-3:])
sections.append(f"## Progress\n{completed_str}\n{goals_str}")
if self.observations:
recent = self.observations[-3:]
obs_str = "\n".join(f"- [{o['source']}]: {o['content'][:200]}" for o in recent)
sections.append(f"## Recent Observations\n{obs_str}")
if self.notes:
notes_str = "\n".join(f"- {k}: {v}" for k, v in self.notes.items())
sections.append(f"## Notes\n{notes_str}")
if self.hypotheses:
hyp_str = "\n".join(
f"- {h['hypothesis']} (confidence: {h['confidence']:.0%})"
for h in self.hypotheses
)
sections.append(f"## Working Hypotheses\n{hyp_str}")
return "\n\n".join(sections)
Long-Term Memory: Persistent Knowledge
In humans: Long-term memory stores information for extended periods—from hours to a lifetime. It has effectively unlimited capacity but requires effort to encode (form memories) and retrieve (recall them). Memories can be strengthened through repetition and emotional significance, or weakened through disuse.
You don't remember everything you've ever experienced—long-term memory is selective. Emotionally significant events are encoded more strongly. Information you've retrieved multiple times becomes easier to access. And memories you never revisit gradually become harder to recall, though they may still exist somewhere in the neural network.
In LLMs: Long-term memory requires external storage—databases, vector stores, file systems. The LLM itself has no persistent state between API calls. Everything it "remembers" long-term must be explicitly stored outside the model and retrieved when relevant.
This is a profound difference from human memory. Humans form long-term memories automatically—sleep consolidates important experiences, emotional events are encoded strongly, repetition strengthens recall. LLMs form no memories at all without explicit intervention. Every long-term memory must be deliberately created, stored, indexed, and retrieved by external systems.
Why Long-Term Memory Transforms Agents
Consider the difference between these two agent experiences:
Without long-term memory:
- Every conversation starts fresh
- "Who am I talking to? What have we discussed before? What are their preferences?"
- The agent is perpetually a stranger, even to a user it has "talked to" hundreds of times
With long-term memory:
- "Ah, this is Alex. They work at a healthcare startup, prefer detailed explanations, and last time we discussed migrating their database to PostgreSQL."
- The agent has continuity—a persistent relationship with the user
This is the difference between a vending machine and a trusted colleague. Both can provide assistance, but only one can say "Remember when we tried that last quarter? It didn't work because..."
The Four Challenges of Long-Term Memory
The hardest problems in LLM long-term memory aren't storage (databases are a solved problem). The challenges are:
1. What to remember (Encoding Selection)
Not everything is worth storing. An agent that stores every utterance will have retrieval problems later—the important information gets buried under trivial chatter.
Consider a 30-minute conversation. What's worth remembering?
- "The user's name is Alex" → Worth storing
- "The user said 'hmm, let me think'" → Not worth storing
- "The user prefers concise explanations" → Worth storing
- "The user asked about the weather while waiting" → Probably not worth storing
- "The user's project deadline is March 15" → Definitely worth storing
Deciding what to remember requires understanding significance—which is itself a complex judgment. Some systems use LLMs to assess importance. Others use heuristics (store all user preferences, all decisions, all facts). The optimal approach depends on your use case.
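A minimal sketch of a heuristic importance scorer along these lines; the keyword lists are illustrative assumptions, and a production system might swap in an LLM judgment instead:
def importance_score(message: str) -> float:
    """
    Crude heuristic: score a message 0-1 by how likely it is to matter later.
    The keyword lists are illustrative; an LLM-based judge is an alternative.
    """
    text = message.lower()
    score = 0.1  # baseline: most chatter is not worth storing
    if any(kw in text for kw in ("my name is", "call me", "i work at", "we use")):
        score += 0.5   # identity and environment facts
    if any(kw in text for kw in ("deadline", "due ", "by next", "launch")):
        score += 0.3   # time-sensitive commitments
    if any(kw in text for kw in ("prefer", "don't like", "always", "never")):
        score += 0.3   # stated preferences
    return min(score, 1.0)

# "hmm, let me think" scores low; "my project deadline is next month" scores higher
print(importance_score("hmm, let me think"))
print(importance_score("My project deadline is due next month"))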
2. How to encode (Representation)
The representation matters enormously. The same information can be stored in many ways:
- Raw text: "User said they work at Google on the Search team"
- Structured fact: {subject: "user", predicate: "works_at", object: "Google Search team"}
- Summary: "User is a Google Search engineer"
- Embedding only: [0.12, -0.34, 0.56, ...] (no human-readable form)
Each representation has trade-offs:
- Raw text preserves nuance but is verbose
- Structured facts enable precise queries but lose context
- Summaries balance brevity and meaning but may lose detail
- Embeddings enable semantic search but are opaque
3. When to retrieve (Relevance Detection)
How does the agent know when past information is relevant to the current query?
If the user asks "What's 2+2?", should the agent search its memory? Probably not—this is a simple factual question. But if the user asks "Should we proceed with the migration?", the agent should recall previous discussions about the migration, the user's concerns, and any relevant decisions.
This is the retrieval trigger problem. Options include:
- Always retrieve: Every query triggers memory search (simple but wasteful)
- Keyword matching: Retrieve when query contains certain terms (fast but brittle)
- LLM-driven: Ask the LLM "Is memory relevant here?" (accurate but adds latency)
- Classifier: Train a model to predict retrieval necessity (balanced approach)
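A sketch of an LLM-driven trigger with a cheap keyword pre-filter in front of it; llm.generate and the hint-word list are assumptions for illustration:
MEMORY_HINT_WORDS = ("remember", "last time", "we discussed", "my project", "again")

def should_retrieve(llm, query: str) -> bool:
    """
    Decide whether to search long-term memory for this query.
    Cheap keyword check first; fall back to asking the model.
    """
    q = query.lower()
    if any(w in q for w in MEMORY_HINT_WORDS):
        return True
    verdict = llm.generate(
        "Would answering the following question benefit from recalling past "
        f"conversations with this user? Answer yes or no.\n\nQuestion: {query}"
    )
    return verdict.strip().lower().startswith("yes")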
4. How to integrate (Context Assembly)
Retrieved memories must be incorporated into the context without overwhelming it. If you retrieve 20 relevant memories, you can't just dump them all into the prompt—you'll crowd out space for the actual conversation.
Integration strategies include:
- Selective inclusion: Only include the top-K most relevant memories
- Summarization: Combine multiple related memories into a digest
- Hierarchical: Include summaries with pointers to details
- Dynamic allocation: Adjust memory budget based on query complexity
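A sketch of selective inclusion under a fixed token budget; the memories are assumed to be dicts with content and score fields, and the four-characters-per-token estimate matches the rough counting used elsewhere in this article:
def assemble_memory_context(memories: list[dict], budget_tokens: int = 1500) -> str:
    """
    Pack the highest-scoring memories into the prompt until the budget is spent.
    Each memory is assumed to carry "content" and "score" keys.
    """
    chosen, used = [], 0
    for memory in sorted(memories, key=lambda m: m["score"], reverse=True):
        cost = len(memory["content"]) // 4  # rough token estimate
        if used + cost > budget_tokens:
            continue
        chosen.append(memory["content"])
        used += cost
    if not chosen:
        return ""
    return "## Relevant memories\n" + "\n".join(f"- {c}" for c in chosen)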
Long-Term Memory is Not "Big Short-Term Memory"
A common misconception is that long-term memory is simply short-term memory with more capacity. This is wrong in important ways:
| Property | Short-Term (Context) | Long-Term (External) |
|---|---|---|
| Presence | Always in context | Must be retrieved |
| Access | Instant, guaranteed | Search-based, probabilistic |
| Fidelity | Perfect within window | May return wrong or partial info |
| Cost | Tokens per API call | Storage + retrieval per query |
| Failure mode | Cliff edge (falls out) | Gradual degradation |
The probabilistic nature of long-term memory is crucial to understand. When you search a vector database, you get approximate matches ranked by similarity. The most relevant memory might be ranked #3. The #1 result might be tangentially related but not quite right. Sometimes relevant memories aren't retrieved at all.
This means systems using long-term memory must be designed for imperfect retrieval. They should gracefully handle missing context, ask clarifying questions when uncertain, and avoid overconfident claims based on potentially incomplete memory.
The Forgetting Problem
Human memory forgets, and this is a feature, not a bug. Forgetting prevents cognitive overload, clears out outdated information, and keeps retrieval fast by reducing the search space.
LLM long-term memory needs forgetting too, but it doesn't happen automatically. Without deliberate forgetting:
- Storage costs grow unboundedly
- Retrieval quality degrades (more noise, harder to find signal)
- Outdated information pollutes results
- Contradictory facts accumulate (old and new versions both exist)
Forgetting strategies include:
- Time-based decay: Delete memories not accessed in N days
- Importance thresholds: Remove memories below importance cutoff
- Contradiction resolution: When facts conflict, keep the newer one
- Capacity limits: When storage exceeds threshold, prune lowest-value memories
- User-initiated: Allow users to say "Forget that" or "That's outdated"
import math
import uuid
from datetime import datetime, timedelta

class LongTermMemory:
"""
Long-term memory persists across sessions and has unlimited capacity.
It requires explicit storage and retrieval operations.
"""
def __init__(self, vector_store, embedding_model):
self.vector_store = vector_store
self.embedding_model = embedding_model
# Track memory statistics
self.total_memories = 0
self.retrieval_stats = {"hits": 0, "misses": 0}
def store(
self,
content: str,
memory_type: str,
importance: float = 0.5,
metadata: dict = None
) -> str:
"""
Store a memory for long-term retention.
The encoding process:
1. Generate embedding for semantic search
2. Extract structured metadata for filtering
3. Assign importance score for retrieval ranking
4. Store with timestamp for temporal queries
"""
# Generate embedding
embedding = self.embedding_model.embed(content)
# Create memory record
memory_id = str(uuid.uuid4())
memory = {
"id": memory_id,
"content": content,
"embedding": embedding,
"memory_type": memory_type, # fact, episode, preference, etc.
"importance": importance,
"created_at": datetime.now().isoformat(),
"last_accessed": datetime.now().isoformat(),
"access_count": 0,
"metadata": metadata or {}
}
self.vector_store.insert(memory)
self.total_memories += 1
return memory_id
def retrieve(
self,
query: str,
k: int = 5,
memory_types: list[str] = None,
min_importance: float = 0.0,
recency_weight: float = 0.1
) -> list[dict]:
"""
Retrieve relevant memories for a query.
Retrieval is the critical operation:
- Too few results: Agent lacks necessary context
- Too many results: Overwhelms short-term memory
- Wrong results: Agent uses irrelevant information
We combine multiple signals:
1. Semantic similarity (embedding distance)
2. Importance score (pre-assigned weight)
3. Recency (recently accessed = more relevant)
4. Access frequency (frequently used = important)
"""
# Embed the query
query_embedding = self.embedding_model.embed(query)
# Search vector store
candidates = self.vector_store.search(
query_embedding,
k=k * 3, # Over-retrieve then re-rank
filter={"memory_type": {"$in": memory_types}} if memory_types else None
)
if not candidates:
self.retrieval_stats["misses"] += 1
return []
# Re-rank with multiple signals
now = datetime.now()
scored_memories = []
for memory in candidates:
# Base score from semantic similarity
score = memory["similarity"]
# Importance weighting
score *= (0.5 + 0.5 * memory["importance"])
# Recency boost
last_accessed = datetime.fromisoformat(memory["last_accessed"])
days_ago = (now - last_accessed).days
recency_score = math.exp(-recency_weight * days_ago)
score *= (0.7 + 0.3 * recency_score)
# Filter by minimum importance
if memory["importance"] >= min_importance:
scored_memories.append((memory, score))
# Sort by final score
scored_memories.sort(key=lambda x: x[1], reverse=True)
# Update access statistics for retrieved memories
results = []
for memory, score in scored_memories[:k]:
self._update_access(memory["id"])
results.append(memory)
self.retrieval_stats["hits"] += 1
return results
def _update_access(self, memory_id: str) -> None:
"""Update access time and count for a retrieved memory."""
self.vector_store.update(memory_id, {
"last_accessed": datetime.now().isoformat(),
"access_count": {"$inc": 1}
})
def forget(self, threshold_days: int = 90, min_importance: float = 0.3) -> int:
"""
Remove old, low-importance, unused memories.
Forgetting is essential for long-term memory systems:
- Prevents unbounded growth
- Improves retrieval quality (less noise)
- Reduces storage costs
We forget memories that are:
- Old (not accessed recently)
- Low importance (not marked as significant)
- Rarely accessed (not frequently useful)
"""
cutoff = datetime.now() - timedelta(days=threshold_days)
# Find candidates for forgetting
candidates = self.vector_store.query({
"last_accessed": {"$lt": cutoff.isoformat()},
"importance": {"$lt": min_importance},
"access_count": {"$lt": 3}
})
forgotten = 0
for memory in candidates:
self.vector_store.delete(memory["id"])
forgotten += 1
self.total_memories -= forgotten
return forgotten
Long-term memory design decisions:
| Decision | Options | Considerations |
|---|---|---|
| Storage format | Raw text, summaries, structured facts | Trade-off between fidelity and efficiency |
| Embedding model | OpenAI, Cohere, local models | Cost, latency, quality |
| Retrieval method | Pure vector, hybrid (vector + keyword), filtered | Accuracy vs. complexity |
| Importance scoring | Manual, LLM-assessed, usage-based | Effort vs. accuracy |
| Forgetting policy | Time-based, importance-based, capacity-based | Memory hygiene |
Episodic Memory: Remembering Experiences
In humans: Episodic memory stores autobiographical experiences—not just facts, but the experience of events. You remember not just that you visited Paris, but the feeling of seeing the Eiffel Tower, who you were with, what you ate. These memories are inherently temporal and contextual—they're stories with a beginning, middle, and end.
The distinction between episodic and semantic memory was first articulated by psychologist Endel Tulving in 1972. Semantic memory stores facts ("Paris is the capital of France"). Episodic memory stores experiences ("I remember visiting Paris in 2019..."). Both are long-term, but they serve different purposes and are even processed by different brain regions.
In LLMs: Episodic memory stores conversation sessions, task executions, and interactions as coherent episodes. Unlike semantic memory (which stores isolated facts), episodic memory preserves the narrative structure—what happened, in what order, what was the context, and what was the outcome.
Why Narrative Structure Matters
Consider these two ways of storing the same information:
Semantic (facts only):
- User prefers PostgreSQL over MySQL
- User's project is a healthcare startup
- User had a database migration issue
- Migration was successful after fixing charset
Episodic (narrative):
- "In our conversation on March 15, the user was struggling with a database migration. They explained that their healthcare startup needed to move from MySQL to PostgreSQL for HIPAA compliance. We tried several approaches—first the direct migration failed due to charset issues, then we discovered the problem was UTF-8 vs UTF-16 handling in patient names with accents. After implementing proper charset conversion, the migration succeeded. The user was relieved and mentioned they'd been stuck on this for days."
The episodic version captures:
- Temporal context: When it happened
- Causal structure: What led to what
- Emotional significance: The user's frustration and relief
- Problem-solving journey: What worked, what didn't, why
- Lessons learned: The charset issue for future reference
An agent with only semantic memory knows the user prefers PostgreSQL. An agent with episodic memory can say "I remember helping you with that charset issue during the migration—it was tricky because of the accented patient names."
Why Episodic Memory Matters for Agents
Episodic memory enables several capabilities that semantic memory alone cannot provide:
1. Experiential Learning
Without episodic memory, an agent is condemned to repeat mistakes. With it, the agent can recall "Last time we tried approach X, it failed because of Y—let's try something different."
This is especially valuable for coding agents, debugging assistants, and any agent that needs to learn from trial and error. The ability to remember not just outcomes but the full trajectory (what was tried, in what order, why it failed) enables genuine learning from experience.
2. Relationship Continuity
Humans build relationships through shared experiences. "Remember when we..." is the foundation of ongoing relationships. Episodic memory allows agents to reference shared history, building rapport and trust.
"You mentioned last month that you were nervous about the demo" is fundamentally different from "User has experienced nervousness." The first acknowledges shared history; the second is a clinical observation.
3. Contextualized Recommendations
With episodic memory, an agent can say "Based on how the last project went, you might want to budget more time for testing." This draws on the full narrative of past experience, not just extracted facts.
4. Personalized Problem Solving
Knowing how a user approached problems in the past helps predict how they'll want to approach them now. Did they prefer to understand the theory first, or dive into implementation? Did they want multiple options, or a single recommendation? Episodic memory captures these preferences as they manifest in actual behavior, not just stated preferences.
Episodic vs. Conversation History
A critical distinction: raw conversation history is not episodic memory. Conversation history is logs—every message stored chronologically. Episodic memory involves processing those logs to extract meaningful episodes.
| Aspect | Conversation History | Episodic Memory |
|---|---|---|
| Content | Raw messages | Processed narratives |
| Structure | Chronological log | Meaningful episodes |
| Size | Grows without bound | Summarized, bounded |
| Searchable | By time, keyword | By similarity, theme |
| Value | Full detail, high noise | Distilled meaning, low noise |
The transformation from conversation history to episodic memory typically involves:
- Episode boundary detection: Identifying where one coherent interaction ends and another begins
- Summarization: Distilling the key narrative from verbose conversation
- Key moment extraction: Identifying pivotal points (decisions, discoveries, problems)
- Outcome labeling: Did the interaction succeed? Partially? What was accomplished?
- Lesson extraction: What should be remembered for future similar situations?
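Boundary detection is often the trickiest of these steps. A simple sketch that splits a message log wherever there is a long pause, assuming each message carries an ISO-format timestamp field (topic-shift detection could be layered on top):
from datetime import datetime, timedelta

def split_into_episodes(messages: list[dict], max_gap_minutes: int = 45) -> list[list[dict]]:
    """
    Split a chronological message log into episodes wherever there is a long
    pause. Each message is assumed to have an ISO-format "timestamp" key.
    """
    episodes, current = [], []
    previous_time = None
    for msg in messages:
        ts = datetime.fromisoformat(msg["timestamp"])
        if previous_time and ts - previous_time > timedelta(minutes=max_gap_minutes):
            episodes.append(current)
            current = []
        current.append(msg)
        previous_time = ts
    if current:
        episodes.append(current)
    return episodes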
The Episode Structure
A well-formed episode typically contains:
┌─────────────────────────────────────────────────────────────────────────┐
│ EPISODE STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METADATA │
│ ├── Timestamp: When did this happen? │
│ ├── Duration: How long was the interaction? │
│ ├── Participants: Who was involved? │
│ └── Context: What was the setting/situation? │
│ │
│ NARRATIVE │
│ ├── Summary: 2-3 sentence overview │
│ ├── Goal: What was the user trying to accomplish? │
│ ├── Journey: Key steps and turns in the interaction │
│ └── Outcome: What was achieved? │
│ │
│ KEY MOMENTS │
│ ├── Decisions: Choices made and why │
│ ├── Discoveries: New information learned │
│ ├── Problems: Issues encountered │
│ └── Solutions: How problems were resolved │
│ │
│ LESSONS │
│ ├── What worked well? │
│ ├── What didn't work? │
│ └── What should be done differently next time? │
│ │
│ RETRIEVAL AIDS │
│ ├── Embedding: For semantic search │
│ ├── Tags: For categorical filtering │
│ └── Importance score: For prioritizing in retrieval │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Episodic Memory Retrieval Patterns
How do you find the right episodes when they're needed?
Similarity-based retrieval: "This situation reminds me of..." Find episodes semantically similar to the current context. Useful when the current problem might benefit from past experience with similar problems.
User-specific retrieval: "With this particular user, I remember..." Find all episodes involving a specific user, building a history of interactions. Essential for personalization and relationship continuity.
Temporal retrieval: "Last week we discussed..." Find episodes from a specific time period. Useful when users reference past conversations by time ("What did we talk about on Monday?").
Outcome-based retrieval: "When we succeeded/failed at this before..." Find episodes with specific outcomes. Useful for learning from past successes and avoiding past failures.
Thematic retrieval: "Episodes about database migrations..." Find episodes involving specific topics or task types. Useful for domain-specific experience recall.
import json
import uuid
from datetime import datetime

class EpisodicMemory:
"""
Episodic memory stores experiences as coherent narratives.
Each episode captures what happened, when, with what outcome.
"""
def __init__(self, storage, llm):
self.storage = storage
self.llm = llm # For summarization and extraction
def create_episode(
self,
conversation: list[dict],
task: str = None,
outcome: str = None,
user_id: str = None
) -> dict:
"""
Transform a conversation into an episode.
This is the encoding process for episodic memory:
1. Summarize the overall interaction
2. Extract key moments (decisions, discoveries, problems)
3. Identify the outcome (success, failure, partial)
4. Note lessons learned
"""
# Generate episode summary
summary = self._summarize_conversation(conversation)
# Extract key moments
key_moments = self._extract_key_moments(conversation)
# Identify emotional/importance peaks
importance_score = self._assess_importance(conversation, outcome)
# Extract lessons learned
lessons = self._extract_lessons(conversation, outcome)
episode = {
"id": str(uuid.uuid4()),
"timestamp": datetime.now().isoformat(),
"duration_minutes": self._estimate_duration(conversation),
"user_id": user_id,
# The narrative structure
"task": task,
"summary": summary,
"key_moments": key_moments,
"outcome": outcome,
"lessons": lessons,
# For retrieval
"embedding": self._embed_episode(summary, key_moments),
"importance": importance_score,
# Raw data (optional, for detailed recall)
"message_count": len(conversation),
"full_conversation": conversation # Or store separately
}
self.storage.insert(episode)
return episode
def _summarize_conversation(self, conversation: list[dict]) -> str:
"""Generate a narrative summary of the conversation."""
messages_text = "\n".join(
f"{m['role']}: {m['content'][:200]}"
for m in conversation
)
prompt = f"""Summarize this conversation as a brief narrative.
Focus on: what was discussed, what was accomplished, any problems encountered.
Write in past tense, 2-3 sentences.
Conversation:
{messages_text}
Summary:"""
return self.llm.generate(prompt)
def _extract_key_moments(self, conversation: list[dict]) -> list[dict]:
"""
Identify the pivotal moments in the conversation.
Key moments are:
- Decisions made
- Problems discovered
- Solutions found
- User preferences revealed
- Misunderstandings corrected
"""
messages_text = "\n".join(
f"[{i}] {m['role']}: {m['content'][:300]}"
for i, m in enumerate(conversation)
)
prompt = f"""Identify 2-4 key moments in this conversation.
For each moment, note:
- What happened
- Why it was significant
- The message index
Format as JSON array.
Conversation:
{messages_text}
Key moments:"""
response = self.llm.generate(prompt)
try:
return json.loads(response)
except:
return []
def _extract_lessons(self, conversation: list[dict], outcome: str) -> list[str]:
"""
Extract lessons learned from this episode.
Lessons help the agent improve over time:
- "User prefers detailed explanations"
- "This approach led to errors, try alternative"
- "Always confirm before making changes"
"""
prompt = f"""Based on this conversation and its outcome, what lessons should be remembered?
Outcome: {outcome}
List 1-3 actionable lessons:"""
response = self.llm.generate(prompt)
return [l.strip() for l in response.split("\n") if l.strip()]
def recall_similar_episodes(
self,
query: str,
k: int = 3,
user_id: str = None
) -> list[dict]:
"""
Find past episodes similar to the current situation.
This enables experiential reasoning:
"Last time we encountered a similar problem..."
"Based on past interactions with this user..."
"""
query_embedding = self._embed_query(query)
filters = {}
if user_id:
filters["user_id"] = user_id
episodes = self.storage.search(
query_embedding,
k=k,
filter=filters
)
return episodes
def recall_by_timeframe(
self,
start: datetime,
end: datetime,
user_id: str = None
) -> list[dict]:
"""Recall episodes from a specific time period."""
query = {"timestamp": {"$gte": start.isoformat(), "$lte": end.isoformat()}}
if user_id:
query["user_id"] = user_id
return self.storage.query(query)
def get_user_history(self, user_id: str, limit: int = 10) -> list[dict]:
"""Get recent episodes for a specific user."""
return self.storage.query(
{"user_id": user_id},
sort={"timestamp": -1},
limit=limit
)
Episodic memory enables powerful capabilities:
| Capability | Without Episodic Memory | With Episodic Memory |
|---|---|---|
| Continuity | "I don't recall our previous conversations" | "Last time we discussed your project timeline..." |
| Learning | Repeats same mistakes | "That approach didn't work before, let's try..." |
| Personalization | Generic responses | "You mentioned preferring detailed explanations" |
| Accountability | No history of actions | "Here's what I did and why" |
Semantic Memory: Facts and Knowledge
In humans: Semantic memory stores general knowledge about the world—facts, concepts, meanings. Unlike episodic memory, semantic memories are divorced from personal experience. You know that Paris is the capital of France without remembering when you learned it. You know what a "database" is without recalling the first time you encountered one.
Semantic memory is organized conceptually, not temporally. Facts are connected by meaning and relationship, forming a web of knowledge. "Paris → capital of → France → country in → Europe → continent" represents how semantic memory links concepts together.
In LLMs: The model's training data provides baseline semantic memory—it "knows" facts from training. But this knowledge is frozen at the training cutoff and can't be updated. External semantic memory (knowledge bases, RAG systems) supplements this with current, domain-specific, or private knowledge.
The Dual Nature of LLM Semantic Memory
LLMs have two sources of semantic knowledge:
1. Parametric Knowledge (in the weights)
Everything the model learned during training is encoded in its parameters. This includes:
- General world knowledge ("Paris is the capital of France")
- Language understanding ("syntax," "grammar")
- Domain knowledge ("how databases work")
- Common patterns ("typical React component structure")
This knowledge is:
- Always available (no retrieval needed)
- Fast to access (just run inference)
- Frozen at training time (cannot be updated)
- Potentially outdated or incorrect (hallucinations)
- Generic (not personalized to any user)
2. External Knowledge (retrieved at runtime)
Knowledge stored in external databases and retrieved when relevant:
- User-specific facts ("Alex works at Google")
- Current information ("Bitcoin price today")
- Private data ("Company internal documentation")
- Specialized knowledge ("Domain-specific terminology")
This knowledge:
- Must be retrieved at runtime (adds latency)
- Can be updated (stays current)
- Can be personalized (per-user facts)
- Requires storage infrastructure
- Is subject to retrieval failures
Why Semantic Memory Matters for Personalization
Consider an agent that has learned these facts about a user over multiple conversations:
User Profile (Semantic Memory):
- Name: Alex
- Company: HealthStart Inc.
- Role: CTO
- Team size: 12 developers
- Tech stack: Python, PostgreSQL, React
- Preferences: Prefers detailed explanations, likes code examples
- Timezone: PST
- Current project: HIPAA-compliant data pipeline
This semantic memory transforms every interaction:
Without semantic memory:
User: "How should we handle the data validation?" Agent: "Could you tell me more about your project and requirements?"
With semantic memory:
User: "How should we handle the data validation?" Agent: "For your HIPAA-compliant data pipeline, I'd recommend validating PHI fields with strict regex patterns before they hit PostgreSQL. Given your team's Python stack, you could use Pydantic models with custom validators. Want me to show a code example?"
The agent knows the context, the constraints, and the user's preferences—enabling responses that feel genuinely personalized rather than generic.
The Episodic-to-Semantic Transition
A fascinating aspect of human memory: episodic memories gradually become semantic over time. You might remember the specific episode when you learned Paris is the capital of France (episodic). But after years, you just know the fact without any episodic context (semantic).
The same transition happens in LLM memory systems. Consider a series of conversations:
- Episode 1: "User mentioned they work at a startup"
- Episode 2: "User said the startup is in healthcare"
- Episode 3: "User mentioned they're building a data pipeline for HIPAA"
- Episode 4: "User said they recently hired three more developers"
Over time, these episodes can be consolidated into semantic facts:
- User works at a healthcare startup
- User is building HIPAA-compliant systems
- User's team is growing
This consolidation serves several purposes:
- Reduces storage (facts are more compact than episodes)
- Improves retrieval (facts are easier to match)
- Enables inference (combining facts yields new insights)
- Maintains privacy (can retain facts while forgetting specific conversations)
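A sketch of a consolidation pass that distills accumulated episode summaries into candidate facts; llm.generate and the JSON output contract are assumptions, and the resulting triples could be handed to a fact store like the SemanticMemory class shown later:
import json

def consolidate_episodes_to_facts(llm, episode_summaries: list[str]) -> list[dict]:
    """
    Distill recurring information from episode summaries into semantic facts.
    Returns a list of {"subject", "predicate", "object"} dicts.
    """
    joined = "\n".join(f"- {s}" for s in episode_summaries)
    prompt = (
        "These are summaries of past conversations with one user:\n"
        f"{joined}\n\n"
        "List stable facts that appear across them as a JSON array of objects "
        'with keys "subject", "predicate", "object". Only include facts '
        "supported by more than one summary."
    )
    try:
        return json.loads(llm.generate(prompt))
    except (json.JSONDecodeError, TypeError):
        return []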
Structured vs. Unstructured Semantic Memory
Semantic memory can be stored in different formats:
Unstructured (text):
"Alex is the CTO of HealthStart Inc., a healthcare startup
building HIPAA-compliant data pipelines."
Semi-structured (key-value):
{
"user": {
"name": "Alex",
"role": "CTO",
"company": "HealthStart Inc.",
"industry": "healthcare"
}
}
Structured (knowledge graph/triples):
(Alex) --[is_role]--> (CTO)
(Alex) --[works_at]--> (HealthStart Inc.)
(HealthStart Inc.) --[is_in_industry]--> (Healthcare)
(HealthStart Inc.) --[is_building]--> (HIPAA Data Pipeline)
Each format has trade-offs:
| Format | Pros | Cons |
|---|---|---|
| Unstructured | Preserves nuance, easy to store | Hard to query precisely, verbose |
| Semi-structured | Queryable, flexible schema | No relationship modeling |
| Knowledge graph | Explicit relationships, enables reasoning | Complex to build and maintain |
Many production systems use a combination: unstructured text for flexible storage, with structured overlays for common query patterns.
Fact Lifecycle Management
Semantic facts have a lifecycle that must be managed:
Creation: When is a fact worth creating?
- Explicit user statements ("My name is Alex")
- Inferred from behavior (user always chooses detailed explanations)
- Extracted from episodes (consolidation)
Updates: Facts change over time
- User changes jobs ("I left Google, I'm at a startup now")
- Preferences evolve ("Actually, I'd prefer shorter responses")
- Corrections ("No, that's not right—the deadline is March, not February")
Conflicts: What happens when facts contradict?
- Old fact: "User works at Google"
- New fact: "User works at HealthStart"
- Resolution: Mark old fact as superseded, keep audit trail
Expiration: Some facts have natural lifespans
- "Project deadline is March 15" (expires after March 15)
- "Currently debugging authentication" (expires when done)
- "User's name is Alex" (permanent, or until corrected)
The relationship between semantic and episodic memory: Over time, episodic memories can become semantic. After many conversations about a user's project, the agent might distill "User is working on a healthcare startup" as a semantic fact, even without remembering which specific conversation revealed this. This consolidation is how agents develop stable "knowledge" about users from accumulated experiences.
import json
import uuid
from datetime import datetime

class SemanticMemory:
"""
Semantic memory stores facts, concepts, and knowledge.
Unlike episodic memory, these are decontextualized truths.
"""
def __init__(self, storage, embedding_model, llm):
self.storage = storage
self.embedding_model = embedding_model
self.llm = llm
def store_fact(
self,
subject: str,
predicate: str,
object: str,
confidence: float = 1.0,
source: str = None
) -> str:
"""
Store a semantic fact as a triple.
Examples:
- ("user", "works_at", "Acme Corp")
- ("project", "uses_framework", "React")
- ("deadline", "is", "March 15")
"""
# Create natural language representation
fact_text = f"{subject} {predicate.replace('_', ' ')} {object}"
fact = {
"id": str(uuid.uuid4()),
"subject": subject,
"predicate": predicate,
"object": object,
"fact_text": fact_text,
"embedding": self.embedding_model.embed(fact_text),
"confidence": confidence,
"source": source,
"created_at": datetime.now().isoformat(),
"updated_at": datetime.now().isoformat(),
"contradicted_by": None
}
# Check for existing facts about same subject-predicate
existing = self._find_existing(subject, predicate)
if existing:
# Handle potential contradiction
self._handle_update(existing, fact)
else:
self.storage.insert(fact)
return fact["id"]
def _handle_update(self, existing: dict, new: dict) -> None:
"""
Handle updating an existing fact.
Facts can change:
- User changes jobs
- Project deadlines shift
- Preferences evolve
We don't just overwrite—we may want to track history.
"""
if existing["object"] != new["object"]:
# Mark old fact as superseded
self.storage.update(existing["id"], {
"contradicted_by": new["id"],
"current": False
})
# Store new fact
new["previous_value"] = existing["object"]
self.storage.insert(new)
else:
# Same fact, update confidence/timestamp
self.storage.update(existing["id"], {
"confidence": max(existing["confidence"], new["confidence"]),
"updated_at": new["updated_at"]
})
def extract_facts_from_conversation(
self,
conversation: list[dict],
entity: str = "user"
) -> list[dict]:
"""
Extract semantic facts from a conversation.
This is the process of converting episodic to semantic memory:
- "I work at Google" → (user, works_at, Google)
- "We're using Python" → (project, uses_language, Python)
- "Call me Alex" → (user, preferred_name, Alex)
"""
messages_text = "\n".join(
f"{m['role']}: {m['content']}" for m in conversation
)
prompt = f"""Extract factual information from this conversation about {entity}.
Format as JSON array of objects with: subject, predicate, object, confidence (0-1)
Only extract explicitly stated facts, not inferences.
Conversation:
{messages_text}
Facts:"""
response = self.llm.generate(prompt)
try:
facts = json.loads(response)
# Store each extracted fact
for fact in facts:
self.store_fact(**fact)
return facts
except:
return []
def query_facts(
self,
subject: str = None,
predicate: str = None,
semantic_query: str = None
) -> list[dict]:
"""
Query semantic memory.
Can query by:
- Exact subject/predicate: "What does user work_at?"
- Semantic similarity: "What do we know about user's job?"
"""
if subject and predicate:
# Exact query
return self.storage.query({
"subject": subject,
"predicate": predicate,
"current": {"$ne": False}
})
if semantic_query:
# Semantic search
query_embedding = self.embedding_model.embed(semantic_query)
return self.storage.search(
query_embedding,
k=10,
filter={"current": {"$ne": False}}
)
if subject:
# All facts about subject
return self.storage.query({
"subject": subject,
"current": {"$ne": False}
})
return []
def get_entity_profile(self, entity: str) -> dict:
"""
Compile all known facts about an entity into a profile.
This creates a structured summary:
{
"name": "Alex",
"works_at": "Acme Corp",
"role": "Engineer",
"preferences": {...}
}
"""
facts = self.query_facts(subject=entity)
profile = {"entity": entity}
for fact in facts:
profile[fact["predicate"]] = fact["object"]
return profile
Procedural Memory: Learned Behaviors
In humans: Procedural memory stores skills and habits—how to ride a bike, type on a keyboard, or solve a particular type of math problem. These memories are implicit; you don't consciously recall them, you just execute them.
The hallmark of procedural memory is that it operates below conscious awareness. When you type, you don't think "press the 'T' key with my left index finger"—your fingers just move. When you ride a bike, you don't consciously calculate balance adjustments—your body just knows. This "knowing how" is distinct from "knowing that" (semantic memory).
Procedural memories are also remarkably durable. You might forget facts you learned in school, but skills like riding a bike or swimming tend to persist for life. The phrase "it's like riding a bike" captures this durability—procedural memories resist forgetting in ways that other memory types don't.
In LLMs: Procedural memory is primarily encoded in the model weights through training. A model "knows how" to write Python code or format JSON not through explicit memory, but through learned patterns. Fine-tuning adds new procedural knowledge.
When a model writes syntactically correct Python, it's not consulting explicit rules—the patterns are embedded in billions of parameters. When it formats a response with proper Markdown, it's not following stored instructions—it has "learned" the format through exposure to millions of examples.
The Fundamental Difference: Can't Add at Runtime
Here's what makes procedural memory fundamentally different from the other memory types we've discussed:
Semantic and episodic memory: "Hey LLM, remember that Alex works at Google." → Store it in a database, retrieve when relevant. Done.
Procedural memory: "Hey LLM, learn how to write Go code." → You can't do this at inference time. The model either knows Go from training, or it doesn't.
This asymmetry has profound implications. You can give an agent unlimited semantic memory (facts about users), unlimited episodic memory (past conversations), but its procedural memory is fixed at training time. If the model was trained on Python but not Rust, no amount of runtime memory will teach it Rust at the same fluency level.
Why This Matters for Agent Capabilities
Consider an agent helping a user with a task:
Scenario 1: User asks about their company's vacation policy.
- Agent retrieves the policy document (semantic)
- Agent recalls previous discussions about time off (episodic)
- Agent synthesizes an answer using language skills (procedural)
- Works great—the novel information was semantic, procedural skills are generic
Scenario 2: User asks to write code in an internal DSL.
- Agent retrieves documentation about the DSL (semantic)
- Agent recalls previous code in this DSL (episodic)
- Agent attempts to write DSL code (procedural)
- May struggle—even with documentation, the model lacks fluency in the DSL
The difference is that some tasks require genuine procedural skill—fluent, automatic execution—not just factual knowledge about how to do something.
Approximating Procedural Memory at Runtime
Since we can't truly add procedural memories at inference time, we use approximations:
Few-Shot Examples (In-Context Learning)
Show the model examples of the desired behavior directly in the prompt:
Here's how we format error messages in our codebase:
Example 1:
Input: File not found
Output: ERROR_FILE_001: Requested file could not be located in path
Example 2:
Input: Permission denied
Output: ERROR_AUTH_002: Insufficient permissions for requested operation
Now format this:
Input: Connection timeout
This works surprisingly well for many tasks. The model doesn't "learn" the procedure permanently, but it can follow the pattern within this conversation.
Limitations:
- Uses precious context tokens
- Only works for patterns demonstrable in few examples
- Performance degrades for complex procedures
- Knowledge doesn't persist to next conversation
Structured Instructions
Provide detailed, step-by-step instructions for the procedure:
When reviewing code, follow these steps:
1. First, check for syntax errors
2. Then, look for potential security vulnerabilities
3. Next, evaluate code style and naming conventions
4. Finally, assess performance implications
For each issue found, format your feedback as:
- Line number
- Issue type (error/warning/suggestion)
- Description
- Suggested fix
Limitations:
- Models may not follow perfectly
- Complex procedures are hard to specify completely
- No implicit knowledge transfer (just explicit rules)
Tool Use as External Procedural Memory
Instead of the model executing a procedure, define a tool that executes it:
# Instead of teaching the model to calculate shipping costs
# (complex procedure with zones, weights, discounts)...
# Define a tool that does it:
def calculate_shipping(weight: float, zone: str, priority: bool) -> float:
    """Calculate shipping cost based on complex internal rules."""
    # All the procedural knowledge lives in this function
    # (illustrative placeholder rates, not real business logic)
    base_rate = {"domestic": 5.0, "international": 15.0}.get(zone, 10.0)
    cost = base_rate + 0.5 * weight
    return cost * 1.5 if priority else cost
This externalizes procedural memory into code. The model doesn't need to "know how" to calculate shipping—it just needs to know when to call the tool and how to interpret results.
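To make the tool callable, its signature also has to be described to the model as a tool/function schema. A sketch using an OpenAI-style function definition (the exact schema shape and the descriptions here are assumptions; adjust to your provider's function-calling API):
# OpenAI-style tool schema for calculate_shipping above.
SHIPPING_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate_shipping",
        "description": "Calculate shipping cost from weight, zone, and priority.",
        "parameters": {
            "type": "object",
            "properties": {
                "weight": {"type": "number", "description": "Package weight"},
                "zone": {"type": "string", "description": "Shipping zone, e.g. 'domestic'"},
                "priority": {"type": "boolean", "description": "Expedited shipping"},
            },
            "required": ["weight", "zone", "priority"],
        },
    },
}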
Limitations:
- Requires implementing every procedure as code
- Can't handle truly novel procedures
- Model still needs procedural skill to use tools correctly
Fine-Tuning (True Procedural Learning)
Train the model on examples of the desired behavior:
Training data:
{"input": "Write a SQL query to find users", "output": "SELECT * FROM users;"}
{"input": "Write a SQL query to count orders", "output": "SELECT COUNT(*) FROM orders;"}
... thousands more examples ...
This actually modifies the model's weights, adding genuine procedural memory.
Limitations:
- Expensive (compute and data)
- Static (can't update quickly)
- Requires many examples
- Risk of catastrophic forgetting
- May require re-training for updates
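If fine-tuning is still the right call despite these trade-offs, the pairs above are typically serialized to a JSONL file. A minimal sketch, assuming OpenAI's chat fine-tuning format (other providers use similar but not identical schemas):
import json
# Illustrative pairs; real fine-tuning needs far more examples.
examples = [
    {"input": "Write a SQL query to find users", "output": "SELECT * FROM users;"},
    {"input": "Write a SQL query to count orders", "output": "SELECT COUNT(*) FROM orders;"},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "user", "content": ex["input"]},
            {"role": "assistant", "content": ex["output"]},
        ]}
        f.write(json.dumps(record) + "\n")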
The Skill Hierarchy
Not all procedural skills are equally learnable at runtime. A rough hierarchy:
┌─────────────────────────────────────────────────────────────────────────┐
│ PROCEDURAL SKILL HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EASILY APPROXIMATED AT RUNTIME (few-shot or instructions) │
│ ├── Output formatting (JSON, Markdown, specific templates) │
│ ├── Simple transformations (reformat date, change case) │
│ └── Pattern-based generation (follow a style guide) │
│ │
│ MODERATELY DIFFICULT (may need many examples or fine-tuning) │
│ ├── Domain-specific writing styles (legal, medical, technical) │
│ ├── Code in familiar languages with project conventions │
│ └── Multi-step reasoning in specialized domains │
│ │
│ DIFFICULT (usually requires fine-tuning) │
│ ├── Fluent code in rare/internal languages │
│ ├── Complex domain-specific reasoning (tax law, drug interactions) │
│ └── Maintaining long-range consistency in specialized outputs │
│ │
│ VERY DIFFICULT (may require specialized training) │
│ ├── Novel task types not seen in training │
│ ├── Combining multiple complex skills simultaneously │
│ └── Tasks requiring implicit knowledge hard to articulate │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Runtime Procedural Memory Systems
Despite the limitations, we can build systems that approximate procedural memory:
Procedure Libraries
Store procedures as retrievable documents:
Procedure: Code Review
Trigger: When user asks for code review
Steps:
1. Read the code carefully
2. Check for bugs
3. Evaluate style
...
Examples:
[Example 1]
[Example 2]
When a task matches a stored procedure, retrieve it and inject into the prompt. The model doesn't "know" the procedure permanently, but it has access when needed.
Execution Tracking
Track how well procedures work and refine them:
- Store procedure definition
- Record each execution
- Note success/failure and user feedback
- Periodically update procedures that perform poorly
This creates a feedback loop that improves procedures over time, even though the model itself isn't learning.
Workarounds for runtime procedural memory:
| Approach | How It Works | Limitations |
|---|---|---|
| Few-shot examples | Show examples in prompt | Uses context tokens |
| Instructions | Describe the procedure | May not follow perfectly |
| Tool definitions | Define tools the agent can use | Requires implementation |
| Fine-tuning | Train on examples | Expensive, static |
class ProceduralMemory:
"""
Procedural memory stores 'how to' knowledge.
In LLMs, this is approximated through examples and instructions.
"""
def __init__(self, storage):
self.storage = storage
self.procedures: dict[str, dict] = {}
def store_procedure(
self,
name: str,
description: str,
steps: list[str],
examples: list[dict],
trigger_patterns: list[str]
) -> None:
"""
Store a procedure for later retrieval.
A procedure includes:
- What it does (description)
- How to do it (steps)
- Examples of doing it (few-shot)
- When to use it (trigger patterns)
"""
procedure = {
"name": name,
"description": description,
"steps": steps,
"examples": examples,
"trigger_patterns": trigger_patterns,
"usage_count": 0,
"success_rate": None
}
self.procedures[name] = procedure
self.storage.insert(procedure)
def get_relevant_procedures(self, task: str) -> list[dict]:
"""
Find procedures relevant to the current task.
        Matches on simple trigger-pattern substring checks; a production
        version would also use semantic similarity against the description.
"""
relevant = []
for name, proc in self.procedures.items():
# Check trigger patterns
for pattern in proc["trigger_patterns"]:
if pattern.lower() in task.lower():
relevant.append(proc)
break
return relevant
def format_procedure_for_prompt(self, procedure: dict) -> str:
"""
Format a procedure for inclusion in the prompt.
This is how procedural memory enters short-term memory
at inference time.
"""
sections = [
f"## Procedure: {procedure['name']}",
f"\n{procedure['description']}",
"\n### Steps:",
"\n".join(f"{i+1}. {step}" for i, step in enumerate(procedure['steps']))
]
if procedure['examples']:
sections.append("\n### Examples:")
for ex in procedure['examples'][:2]: # Limit to save tokens
sections.append(f"\nInput: {ex['input']}")
sections.append(f"Output: {ex['output']}")
return "\n".join(sections)
def record_execution(self, procedure_name: str, success: bool) -> None:
"""Track procedure execution for learning."""
if procedure_name in self.procedures:
proc = self.procedures[procedure_name]
proc["usage_count"] += 1
# Update success rate
if proc["success_rate"] is None:
proc["success_rate"] = 1.0 if success else 0.0
else:
# Exponential moving average
proc["success_rate"] = 0.9 * proc["success_rate"] + 0.1 * (1.0 if success else 0.0)
Integrating Memory Types: A Complete System
A production agent typically uses multiple memory types together. Here's how they interact:
┌─────────────────────────────────────────────────────────────────────────┐
│ INTEGRATED MEMORY SYSTEM │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CONTEXT WINDOW │ │
│ │ (Short-Term Memory) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ System │ │ Working │ │ Recent Messages │ │ │
│ │ │ Prompt │ │ Memory │ │ (Conversation) │ │ │
│ │ │ │ │ Scratchpad │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐│ │
│ │ │ Retrieved Context ││ │
│ │ │ (Loaded from long-term memory based on relevance) ││ │
│ │ │ ││ │
│ │ │ • Semantic facts about user ││ │
│ │ │ • Relevant past episodes ││ │
│ │ │ • Applicable procedures ││ │
│ │ └─────────────────────────────────────────────────────────────┘│ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ▲ │ │
│ │ Retrieve │ Store │
│ │ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LONG-TERM STORAGE │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Semantic │ │ Episodic │ │ Procedural │ │ │
│ │ │ Memory │ │ Memory │ │ Memory │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Facts about │ │ Past │ │ How-to │ │ │
│ │ │ user, world │ │ conversations│ │ knowledge │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
class IntegratedMemorySystem:
"""
A complete memory system integrating all memory types.
"""
def __init__(
self,
llm,
embedding_model,
vector_store,
max_context_tokens: int = 8000
):
self.llm = llm
self.max_context_tokens = max_context_tokens
# Initialize all memory types
self.short_term = ShortTermMemory(max_tokens=max_context_tokens)
self.working = WorkingMemory()
self.long_term = LongTermMemory(vector_store, embedding_model)
self.episodic = EpisodicMemory(vector_store, llm)
self.semantic = SemanticMemory(vector_store, embedding_model, llm)
self.procedural = ProceduralMemory(vector_store)
def prepare_context(self, user_message: str, user_id: str = None) -> list[dict]:
"""
Prepare the full context for an LLM call.
This is where memory integration happens:
1. Start with system prompt
2. Add working memory state
3. Retrieve relevant long-term memories
4. Add recent conversation
5. Add the new user message
All while respecting token limits.
"""
context = []
token_budget = self.max_context_tokens
# 1. System prompt (always included)
system_prompt = self._build_system_prompt()
context.append({"role": "system", "content": system_prompt})
token_budget -= self._count_tokens(system_prompt)
# 2. Working memory (high priority)
working_content = self.working.to_prompt()
if working_content:
context.append({"role": "system", "content": f"Current state:\n{working_content}"})
token_budget -= self._count_tokens(working_content)
# 3. Retrieve relevant long-term memories
retrieved = self._retrieve_relevant_memories(user_message, user_id)
retrieved_content = self._format_retrieved_memories(retrieved)
if retrieved_content and token_budget > 1000:
            # Deduct what the retrieved block actually costs so the running
            # budget stays accurate for the conversation history below
            context.append({"role": "system", "content": f"Relevant context:\n{retrieved_content}"})
            token_budget -= self._count_tokens(retrieved_content)
        # 4. Recent conversation history (keep the most recent messages that fit)
        recent_messages = self.short_term.get_context()
        kept = []
        for msg in reversed(recent_messages):
            msg_tokens = self._count_tokens(msg["content"])
            if token_budget <= msg_tokens + 500:  # reserve room for the new message
                break
            kept.append(msg)
            token_budget -= msg_tokens
        context.extend(reversed(kept))
# 5. New user message
context.append({"role": "user", "content": user_message})
return context
def _retrieve_relevant_memories(self, query: str, user_id: str = None) -> dict:
"""Retrieve from all long-term memory types."""
return {
"semantic": self.semantic.query_facts(semantic_query=query)[:3],
"episodic": self.episodic.recall_similar_episodes(query, k=2, user_id=user_id),
"procedural": self.procedural.get_relevant_procedures(query)[:2]
}
def _format_retrieved_memories(self, retrieved: dict) -> str:
"""Format retrieved memories for inclusion in context."""
sections = []
if retrieved["semantic"]:
facts = "\n".join(f"- {f['fact_text']}" for f in retrieved["semantic"])
sections.append(f"Known facts:\n{facts}")
if retrieved["episodic"]:
episodes = "\n".join(f"- {e['summary']}" for e in retrieved["episodic"])
sections.append(f"Relevant past interactions:\n{episodes}")
if retrieved["procedural"]:
procs = "\n".join(
self.procedural.format_procedure_for_prompt(p)
for p in retrieved["procedural"]
)
sections.append(f"Applicable procedures:\n{procs}")
return "\n\n".join(sections)
def process_response(
self,
user_message: str,
assistant_response: str,
user_id: str = None
) -> None:
"""
Process a completed interaction for memory storage.
This happens after the response is generated:
1. Add to short-term memory
2. Extract facts for semantic memory
3. Update working memory
"""
# Update short-term memory
self.short_term.add("user", user_message)
self.short_term.add("assistant", assistant_response)
# Extract semantic facts (background processing)
conversation = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": assistant_response}
]
self.semantic.extract_facts_from_conversation(conversation)
def end_session(
self,
user_id: str = None,
task: str = None,
outcome: str = None
) -> None:
"""
End the current session and consolidate memories.
This is when episodic memories are formed.
"""
# Create episode from session
conversation = self.short_term.get_context()
if conversation:
self.episodic.create_episode(
conversation=conversation,
task=task,
outcome=outcome,
user_id=user_id
)
# Clear short-term and working memory
self.short_term = ShortTermMemory(max_tokens=self.max_context_tokens)
self.working = WorkingMemory()
def _count_tokens(self, text: str) -> int:
return len(text) // 4 # Simplified
def _build_system_prompt(self) -> str:
return "You are a helpful assistant with access to memory of past interactions."
Summary: Memory Types at a Glance
| Memory Type | Duration | Capacity | Access | LLM Implementation |
|---|---|---|---|---|
| Short-term | Seconds to minutes | Very limited (~128K tokens max) | Instant | Context window |
| Working | Current task | Limited (part of context) | Instant | Scratchpad in prompt |
| Long-term | Permanent | Unlimited | Requires retrieval | Vector DB / external storage |
| Episodic | Permanent | Unlimited | Search by similarity/time | Processed conversation logs |
| Semantic | Permanent | Unlimited | Query by entity/fact | Knowledge base / facts DB |
| Procedural | Permanent | Model-limited | Implicit or via examples | Training / few-shot |
Key principles:
- Short-term memory is the bottleneck—all other memory types exist to work around its limits
- Retrieval quality matters more than storage—storing everything is easy; finding the right thing is hard
- Memory types serve different purposes—don't try to use one type for everything
- Forgetting is a feature—without it, retrieval degrades over time
- Integration is complex—balancing multiple memory sources in limited context requires careful design
Memory Architecture Patterns
The Operating System Analogy
MemGPT introduced a powerful analogy: treat LLM memory like an operating system:
From research: "MemGPT (Memory-GPT) is a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow."
Traditional Computer → LLM Agent
─────────────────────────────────────────────
RAM (fast, limited) → Context Window
Disk (slow, unlimited) → External Storage
OS Memory Manager → MemGPT Agent
Hierarchical Memory
From MemGPT research: "MemGPT treats context windows as a constrained memory resource and implements a memory hierarchy similar to operating systems. Agents can move data between in-context core memory (analogous to RAM) and externally stored archival and recall memory (analogous to disk storage), creating an illusion of unlimited memory while working within fixed context limits."
Memory Tiers:
| Tier | Analogy | Characteristics |
|---|---|---|
| Core Memory | RAM | In-context, immediately accessible, size-limited |
| Recall Memory | Recent files | Searchable conversation history |
| Archival Memory | Disk storage | Long-term, vector-indexed, unlimited |
MemGPT Architecture
The Original Design
Why MemGPT represents a paradigm shift: Before MemGPT, LLM memory meant stuffing conversation history into the context window until it filled up, then either truncating or summarizing. MemGPT treats this as a systems problem: context windows are limited resources that need active management, just like RAM in an operating system. This reframing opens up new possibilities—rather than passively accepting context limits, the agent actively manages what information is "in memory" at any given moment.
The OS analogy is more than metaphor: Operating systems face the same fundamental challenge: programs need more memory than physically exists. The solution is virtual memory—create the illusion of unlimited memory by intelligently moving data between fast RAM and slow disk. MemGPT does the same for LLMs: the agent "sees" a small context window, but can access vast amounts of information by explicitly retrieving it from external storage. The key insight is that the LLM itself can manage these memory operations.
From research: "At its core, MemGPT features a hierarchical memory architecture closely mirroring that of a traditional OS: Primary Context (RAM) — The fixed-size prompt that the LLM can directly 'see' during inference. It consists of three partitions: static system prompt containing base instructions and function schemas; dynamic working context serving as a scratchpad for reasoning steps and intermediate results; and FIFO message buffer holding the most recent conversational turns. External Context (Disk Storage) — An effectively infinite, out-of-context layer inaccessible to the model without explicit retrieval. It includes Recall Storage (a searchable database containing the full historical record of interactions) and Archival Storage (a long-term, vector-based memory for large documents retrievable via semantic search)."
Self-Directed Memory Management
The key innovation: the LLM manages its own memory:
From research: "What makes MemGPT particularly innovative is its use of the LLM itself as the memory manager. Through self-directed memory editing via tool calling, the system can actively manage its own memory contents, deciding what to store, what to summarize, and what to forget."
Memory Tools:
# Core memory editing
memory_replace(section, old_content, new_content) # Update memory block
memory_insert(section, content) # Add to memory
memory_rethink(section, new_content) # Revise understanding
# Archival memory
archival_memory_insert(content) # Store for long-term
archival_memory_search(query, k=10) # Semantic retrieval
# Conversation search
conversation_search(query) # Search past messages
conversation_search_date(start_date, end_date) # Temporal search
Understanding each memory tool:
memory_replace and memory_rethink modify the agent's core beliefs about the user or situation. When a user says "Actually, my name is Alex, not Alexander," the agent calls memory_replace to update its stored name. memory_rethink is for deeper revisions—reconsidering an understanding based on new evidence, like updating a user profile when their preferences have clearly changed.
archival_memory_insert and archival_memory_search handle long-term storage. When the agent encounters information worth preserving indefinitely—a user's detailed project requirements, important facts from a long document—it inserts into archival memory. Later, when that information might be relevant, the agent searches archival memory semantically. This is essentially a vector database that the agent controls.
conversation_search and conversation_search_date let the agent recall past conversations. "What did we discuss about the budget last week?" triggers a search through conversation history, returning relevant messages that the agent can then use to inform its response.
The agent decides when to use these tools: This is the crucial difference from simpler memory systems where an external process manages memory. The LLM itself, during generation, decides "I need to remember this" or "I should look up what we discussed before." This self-direction enables more intelligent memory management but also introduces failure modes—the agent might forget to save important information or search for relevant context.
Multi-Step Reasoning with Heartbeats
From Letta: "MemGPT supports multi-step reasoning (allowing the agent to take multiple steps in sequence) via the concept of 'heartbeats'. Whenever the LLM outputs a tool call, it has the option to request a heartbeat by setting the keyword argument request_heartbeat to true. If the LLM requests a heartbeat, the LLM OS continues execution in a loop, allowing the LLM to 'think' again."
Letta: The MemGPT Framework
From Research to Production
From Letta: "As of September 2024, MemGPT is part of Letta. While MemGPT refers to the agent design pattern with two tiers of memory introduced in the research paper, Letta is an open-source agent framework that helps developers build persistent agents."
Letta V1 Architecture
From Letta: "At Letta, they're transitioning from the previous MemGPT-style architecture to a new Letta V1 architecture (letta_v1_agent) that follows modern patterns. In this architecture, heartbeats and the send_message tool are deprecated. Only native reasoning and direct assistant message generations from the models are supported."
Recommended for: GPT-5, Claude 4.5 Sonnet, and other advanced reasoning models.
Building with Letta
from letta import create_client, LLMConfig, EmbeddingConfig
# Create Letta client
client = create_client()
# Configure memory
agent = client.create_agent(
name="personal_assistant",
llm_config=LLMConfig(model="gpt-4o"),
embedding_config=EmbeddingConfig(model="text-embedding-3-small"),
system="You are a personal assistant with long-term memory.",
memory_blocks=[
{"label": "human", "value": "Name: Unknown\nPreferences: Unknown"},
{"label": "persona", "value": "I am a helpful assistant that remembers past conversations."}
]
)
# Interact with agent
response = client.send_message(
agent_id=agent.id,
message="My name is Alex and I prefer concise answers."
)
# Agent updates its memory automatically
# Next conversation will remember this preference
What happens under the hood: When the agent receives "My name is Alex and I prefer concise answers," it processes this through its system prompt which instructs it to update its memory blocks when learning new user information. The agent calls internal tools to modify the "human" memory block from "Name: Unknown" to "Name: Alex" and adds "Preferences: Concise answers." These updated blocks become part of the context for all future interactions with this agent.
Persistence is the key feature: The agent.id represents a persistent agent identity. All memory—conversation history, memory blocks, archival storage—is associated with this ID. When you call send_message again with the same agent_id, the agent has full access to everything it learned in previous conversations. This enables truly continuous relationships between users and AI agents.
Memory blocks vs. conversation history: Memory blocks are structured, edited information about the user ("Name: Alex, Preferences: Concise"). Conversation history is the raw record of messages exchanged. Both contribute to context, but they serve different purposes. Memory blocks capture distilled understanding; conversation history provides detailed evidence and context for that understanding.
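Because all of this state hangs off the agent ID, a later session resumes simply by messaging the same agent (continuing the snippet above):
# Days or weeks later, possibly in a new process: reuse the same agent_id
# and the agent still has its memory blocks and conversation history.
response = client.send_message(
    agent_id=agent.id,
    message="What do you remember about my preferences?"
)
# The agent can answer from its "human" memory block, e.g. that your name
# is Alex and you prefer concise answers.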
Alternative Memory Approaches
LangMem (LangChain)
LangChain's approach to agent memory:
From LangMem: "Long-term memory allows agents to remember important information across conversations. LangMem provides ways to extract meaningful details from chats, store them, and use them to improve future interactions."
Hot Path vs Subconscious:
From research: "'Hot path' active memory formation happens during the conversation, enabling immediate updates when critical context emerges. This approach is easy to implement and lets the agent itself choose how to store and update its memory. However, it adds perceptible latency to user interactions."
From research: "'Subconscious' memory formation refers to prompting an LLM to reflect on a conversation after it occurs, finding patterns and extracting insights without slowing down the immediate interaction."
A-MEM (Agentic Memory)
Highly efficient memory system:
From research: "A-MEM achieves an 85-93% reduction in token usage compared to baseline methods (LoCoMo and MemGPT with 16,900 tokens) through selective top-k retrieval mechanism. This substantial token reduction directly translates to lower operational costs, with each memory operation costing less than $0.0003 when using commercial API services."
Zep
From research: "Zep: A Temporal Knowledge Graph Architecture for Agent Memory (February 2025)"—focuses on temporal relationships between memories.
Implementation Patterns
These patterns represent progressively more sophisticated approaches to LLM memory. Start with the simplest pattern that meets your needs—complexity adds engineering burden without always adding user value.
Pattern 1: Conversation Buffer with Summary
Simplest approach—summarize when context gets too long.
When to use this pattern: This is your starting point for any conversational AI. It handles the most common need—maintaining conversation context without exceeding token limits—with minimal complexity. Use this when conversations don't need to persist across sessions and when you don't need to remember specific facts long-term.
The compression strategy matters: The code compresses by keeping the 10 most recent messages and summarizing everything older. This preserves immediate context while maintaining awareness of earlier discussion. The summarization prompt should focus on preserving important facts, decisions made, and user preferences—not on maintaining narrative flow.
class ConversationMemory:
    def __init__(self, max_tokens=4000, summarizer=None):
        self.messages = []
        self.summary = ""
        self.max_tokens = max_tokens
        # summarizer is a callable (previous_summary, old_messages) -> new_summary
        self.summarizer = summarizer or default_summarizer
    def count_tokens(self) -> int:
        # Rough ~4-characters-per-token estimate; swap in your model's tokenizer
        text = self.summary + "".join(m["content"] for m in self.messages)
        return len(text) // 4
def add_message(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
# Check if we need to summarize
if self.count_tokens() > self.max_tokens:
self._compress()
def _compress(self):
# Keep recent messages, summarize older ones
recent = self.messages[-10:]
old = self.messages[:-10]
if old:
new_summary = self.summarizer(self.summary, old)
self.summary = new_summary
self.messages = recent
def get_context(self):
context = []
if self.summary:
context.append({"role": "system", "content": f"Previous conversation summary: {self.summary}"})
context.extend(self.messages)
return context
Key implementation details:
The count_tokens() method above uses a rough 4-characters-per-token estimate; in production, use the tokenizer for your target model. Token counts vary between models—4000 tokens for GPT-4 does not cover the same text as 4000 tokens for Claude. Use tiktoken for OpenAI models or the appropriate tokenizer for others.
The _compress() method is called when limits are exceeded, not proactively. This lazy compression means most interactions don't pay the summarization cost. The trade-off: if a conversation suddenly exceeds limits by a large margin, the summarization might be slow. For latency-sensitive applications, consider proactive compression when approaching limits.
The summary is prepended as a system message. This gives it "authoritative" status in the conversation—the LLM treats it as background knowledge rather than something it said itself. Alternative approaches include prepending as an assistant message or including in the system prompt.
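For completeness, here is one way the summarizer callable could be implemented with an LLM call. A sketch assuming an OpenAI-style chat client and an assumed model name; swap in your own provider:
from openai import OpenAI
client = OpenAI()
def default_summarizer(previous_summary: str, old_messages: list[dict]) -> str:
    """Fold older messages into a running summary, preserving facts and decisions."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system", "content": (
                "Update the conversation summary. Preserve important facts, decisions, "
                "and user preferences; drop small talk."
            )},
            {"role": "user", "content": f"Current summary:\n{previous_summary}\n\nOlder messages:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content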
Pattern 2: Entity Memory
Track entities mentioned across conversations.
When to use this pattern: When your agent needs to remember specific things about users, projects, or topics across multiple sessions. A customer service agent remembering past issues, a personal assistant remembering project details, or a tutor remembering student progress all benefit from entity memory.
class EntityMemory:
def __init__(self, db):
self.db = db # Vector database
def extract_and_store(self, conversation: list[dict]):
# Extract entities with LLM
entities = self.extract_entities(conversation)
for entity in entities:
# Check if entity exists
existing = self.db.search(entity.name, k=1)
if existing and existing[0].score > 0.9:
# Update existing entity
self.merge_entity(existing[0], entity)
else:
# Create new entity
self.db.insert(entity)
def get_relevant_entities(self, query: str, k=5):
return self.db.search(query, k=k)
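The extract_entities and merge_entity helpers are left undefined above. One hedged way to implement extraction is to ask the LLM for structured JSON; the sketch below uses a hypothetical llm.generate call and a simple Entity dataclass, not a specific framework:
import json
from dataclasses import dataclass, field
@dataclass
class Entity:
    name: str
    type: str
    facts: list[str] = field(default_factory=list)
def extract_entities(llm, conversation: list[dict]) -> list[Entity]:
    """Ask the LLM to pull out entities as JSON; llm.generate is a placeholder."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    prompt = (
        "Extract the people, organizations, and projects mentioned below. "
        'Return a JSON list: [{"name": ..., "type": ..., "facts": [...]}]\n\n' + transcript
    )
    raw = llm.generate([{"role": "user", "content": prompt}])
    try:
        return [Entity(**e) for e in json.loads(raw)]
    except (json.JSONDecodeError, TypeError):
        return []  # fail closed: better no entities than a crashed pipeline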
Pattern 3: Semantic Memory with Forgetting
More sophisticated—mimics human memory:
import math
from datetime import datetime
class SemanticMemory:
def __init__(self, db, decay_rate=0.1):
self.db = db
self.decay_rate = decay_rate
def remember(self, content: str, importance: float = 0.5):
embedding = self.embed(content)
self.db.insert({
"content": content,
"embedding": embedding,
"importance": importance,
"last_accessed": datetime.now(),
"access_count": 1
})
def recall(self, query: str, k=10):
candidates = self.db.search(query, k=k*2)
# Score by relevance + recency + importance
scored = []
now = datetime.now()
for c in candidates:
age = (now - c.last_accessed).days
recency_score = math.exp(-self.decay_rate * age)
final_score = c.similarity * c.importance * recency_score
scored.append((c, final_score))
# Update access time
c.last_accessed = now
c.access_count += 1
scored.sort(key=lambda x: x[1], reverse=True)
return [c for c, _ in scored[:k]]
def forget(self, threshold=0.1):
"""Remove low-value memories"""
all_memories = self.db.get_all()
now = datetime.now()
for m in all_memories:
age = (now - m.last_accessed).days
value = m.importance * math.exp(-self.decay_rate * age)
if value < threshold:
self.db.delete(m.id)
Pattern 4: Episodic Memory
Store complete episodes/sessions:
from datetime import datetime
class EpisodicMemory:
def __init__(self, db):
self.db = db
def end_episode(self, conversation: list[dict], metadata: dict):
# Summarize episode
summary = self.summarize(conversation)
# Extract key moments
key_moments = self.extract_key_moments(conversation)
# Store episode
episode = {
"summary": summary,
"key_moments": key_moments,
"full_conversation": conversation,
"metadata": metadata,
"embedding": self.embed(summary),
"timestamp": datetime.now()
}
self.db.insert(episode)
def recall_similar_episodes(self, query: str, k=3):
return self.db.search(query, k=k)
def recall_by_time(self, start: datetime, end: datetime):
return self.db.query({"timestamp": {"$gte": start, "$lte": end}})
Memory Formation Strategies
Active (Hot Path)
Form memories during conversation:
Pros:
- Immediate availability
- Agent-controlled
- Natural flow
Cons:
- Adds latency
- Uses tokens during interaction
# In agent loop
response = llm.generate(messages)
# Check for memory updates
if should_update_memory(response):
memory_update = llm.generate([
{"role": "system", "content": "Extract key facts to remember"},
{"role": "user", "content": str(messages[-5:])}
])
memory.store(memory_update)
Passive (Background)
Form memories after conversation ends:
Pros:
- No user-facing latency
- More reflection time
- Batch processing
Cons:
- Delayed availability
- Separate processing pipeline
# After conversation ends
async def process_conversation(conversation_id: str):
conversation = db.get_conversation(conversation_id)
# Extract memories in background
memories = await extract_memories(conversation)
# Store for future sessions
for memory in memories:
memory_store.insert(memory)
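The extract_memories step is where the reflection happens. A hedged sketch of what it might look like, prompting an LLM for a JSON list of facts (the llm client and its generate_async method are assumptions):
import json
async def extract_memories(conversation: list[dict]) -> list[dict]:
    """Reflect on a finished conversation and return candidate memories."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    prompt = (
        "Review this conversation and list facts worth remembering about the user, "
        'their projects, and their preferences. Return JSON: [{"fact": ..., "importance": 0.0-1.0}]\n\n'
        + transcript
    )
    raw = await llm.generate_async([{"role": "user", "content": prompt}])  # placeholder client
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []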
Production Considerations
Storage Options
| Option | Best For | Considerations |
|---|---|---|
| Vector DB (Pinecone, Qdrant) | Semantic search | Cost at scale |
| PostgreSQL + pgvector | Integrated solution | Self-hosted complexity |
| Redis | Session memory | Persistence config |
| SQLite | Local/edge | Limited concurrency |
Memory Retrieval Latency
Memory retrieval adds latency to every request:
# Measure and optimize
import asyncio
import logging
import time
logger = logging.getLogger(__name__)
async def get_relevant_context(query: str):
start = time.time()
# Parallel retrieval
results = await asyncio.gather(
entity_memory.search(query),
episodic_memory.search(query),
recent_messages.get()
)
latency = time.time() - start
logger.info(f"Memory retrieval: {latency:.3f}s")
return merge_results(results)
Privacy and Data Retention
- User consent for memory storage
- Data retention policies
- Right to be forgotten (memory deletion; see the sketch below)
- Encryption at rest
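Implementing the right to be forgotten means deleting across every store that is keyed by user. A sketch (the store objects and their delete methods are assumptions about your own backends):
def forget_user(user_id: str) -> None:
    """Remove all memories associated with a user across every memory store."""
    # Hypothetical stores; each must support deletion filtered by user_id.
    semantic_store.delete(filter={"user_id": user_id})
    episodic_store.delete(filter={"user_id": user_id})
    conversation_log.delete(filter={"user_id": user_id})
    audit_log.record("memory_deletion", user_id=user_id)  # keep proof of deletion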
Conclusion
Effective memory transforms LLMs from stateless responders to persistent agents:
- MemGPT/Letta pioneered hierarchical memory with self-management
- Multiple memory types (core, archival, episodic) serve different needs
- Hot path vs background processing trades latency for immediacy
- Efficient retrieval (A-MEM's 85-93% token reduction) enables scale
Start simple (conversation buffer + summary), add complexity (entity memory, episodic) as your use case demands.
Related Articles
Building Agentic AI Systems: A Complete Implementation Guide
A comprehensive guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
Mastering LLM Context Windows: Strategies for Long-Context Applications
Practical techniques for managing context windows in production LLM applications—from compression to hierarchical processing to infinite context architectures.
Building Deep Research AI: From Query to Comprehensive Report
How to build AI systems that conduct thorough, multi-source research and produce comprehensive reports rivaling human analysts.