
LLM Memory Systems: From MemGPT to Long-Term Agent Memory

Understanding memory architectures for LLM agents—MemGPT's hierarchical memory, Letta's agent framework, and patterns for building agents that learn and remember across conversations.


The Stateless Problem

LLMs are fundamentally stateless. Each conversation starts fresh—no memory of past interactions, no learning from experience. This "conversational amnesia" limits their usefulness for:

  • Long-running assistants
  • Personalized applications
  • Agents that improve over time
  • Multi-session workflows

From research: "The explosive growth of large language models (LLMs) has reshaped the AI landscape. Yet their core design is still fundamentally stateless. An LLM can only operate within a limited context window and loses more signal as that window grows, making it unable to reliably carry information forward across extended interactions. This limitation remains the key blocker to building truly persistent, collaborative, and personalized AI agents."


Understanding Memory Types: From Cognitive Science to LLMs

Before diving into technical implementations, it's essential to understand the different types of memory and how they map from human cognition to AI systems. This conceptual foundation will help you design more effective memory systems for your agents.

The Cognitive Science Foundation

Human memory isn't a single system—it's a collection of interconnected systems, each serving different purposes. Cognitive scientists have identified several distinct memory types, and remarkably, the same categorizations apply to effective LLM agent design.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MEMORY TYPES: HUMANS VS LLMs                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Cognitive Term      Human Brain              LLM Implementation         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Short-term Memory   Immediate awareness      Context window             │
│                      ~7 items, seconds        ~128K tokens, one call     │
│                                                                          │
│  Working Memory      Active manipulation      Scratchpad / reasoning     │
│                      Problem-solving space    Chain-of-thought state     │
│                                                                          │
│  Long-term Memory    Permanent storage        Vector DB / external       │
│                      Lifetime capacity        Unlimited capacity         │
│                                                                          │
│  Episodic Memory     Personal experiences     Conversation history       │
│                      "What happened"          Session logs               │
│                                                                          │
│  Semantic Memory     Facts and knowledge      Knowledge base / RAG       │
│                      "What I know"            Retrieved documents        │
│                                                                          │
│  Procedural Memory   Skills and habits        Fine-tuned weights         │
│                      "How to do"              Learned behaviors          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Let's examine each memory type in detail.


Short-Term Memory: The Immediate Present

In humans: Short-term memory holds information you're currently aware of—the sentence you just read, the number someone told you, the name you're trying to remember. It's extremely limited (famously "7 ± 2 items") and decays within seconds unless actively maintained through rehearsal. When you're introduced to someone at a party and forget their name moments later, that's short-term memory failing. When you repeat a phone number to yourself while walking to find a pen, you're actively maintaining it in short-term memory.

In LLMs: The context window is the direct equivalent of short-term memory. It contains everything the model can "see" right now—the system prompt, conversation history, retrieved documents, and the current query. Once the context window is full, older information must be removed or summarized. When a conversation ends, this memory vanishes entirely—the model has no recollection that the conversation ever happened.

The fundamental constraint: Just as humans can't hold an entire book in short-term memory while reading, LLMs can't hold unlimited conversation history in their context window. This constraint shapes every memory system design. Every architectural decision in LLM memory systems ultimately traces back to this limitation.

Why Short-Term Memory Matters

Consider what happens in a typical conversation with an AI assistant:

  1. Turn 1: User asks about Python data structures
  2. Turn 5: User mentions they're building a web scraper
  3. Turn 12: User asks "Can you help me with that project?"

For the assistant to understand "that project" in turn 12, it needs to remember the web scraper mentioned in turn 5. This is short-term memory at work—maintaining conversational context across turns.

But what if the conversation has 100 turns? Or 1,000? At some point, early turns fall outside the context window. The model literally cannot see them anymore. This is the cliff edge of short-term memory—information doesn't gradually fade like human memory; it's either fully present or completely gone.

The Capacity Illusion

Modern LLMs advertise impressive context windows—128K tokens for GPT-4 Turbo, 200K for Claude 3. This sounds like a lot, but consider:

  • A typical conversation turn is 100-500 tokens
  • A retrieved document chunk is 500-2,000 tokens
  • A code file is 1,000-10,000 tokens
  • System prompts with instructions can be 500-2,000 tokens

A 128K context window sounds enormous until you're building an agent that needs conversation history, retrieved documents, tool outputs, and a detailed system prompt. Suddenly you're making hard choices about what to keep and what to discard.

Characteristics of LLM short-term memory:

| Property | Description |
|---|---|
| Capacity | Fixed by model (4K to 200K+ tokens) |
| Duration | Single API call / session |
| Access speed | Instant (it's the input) |
| Fidelity | Perfect within window, zero outside |
| Cost | Tokens × price per token |

The Attention Problem: Lost in the Middle

Even within the context window, LLMs don't attend equally to all information. Research from Stanford and Berkeley ("Lost in the Middle: How Language Models Use Long Contexts") revealed a striking pattern: models perform best when relevant information appears at the very beginning or very end of the context. Information buried in the middle is often ignored or underweighted.

This "lost in the middle" phenomenon has profound implications:

  • Document order matters: If you retrieve 10 documents, the order you present them affects whether the model uses them correctly
  • Position-based strategies: Some systems deliberately place critical information at the start and end
  • Retrieval quantity trade-offs: More retrieved context isn't always better—adding marginally relevant documents can actually hurt performance by pushing important information into the "dead zone"

Think of it like a student cramming for an exam. They remember the first topics they studied (primacy effect) and the last topics (recency effect), but everything in the middle blurs together. LLMs exhibit similar behavior, despite their mechanical nature.
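
One practical mitigation is to reorder retrieved documents so the strongest matches sit at the edges of the context and the weakest fall into the middle. A minimal sketch (the function name and the relevance-sorted input are assumptions for illustration):

Python
def order_for_attention(docs_by_relevance: list[str]) -> list[str]:
    """
    Counter the "lost in the middle" effect: alternate documents between
    the front and back of the context so the most relevant ones land at
    the edges and the least relevant end up in the middle.
    Input is assumed to be sorted most-relevant-first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Most relevant first, second-most relevant last, least relevant in the middle
print(order_for_attention(["doc1", "doc2", "doc3", "doc4", "doc5"]))
# ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']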

Managing Short-Term Memory

Short-term memory isn't something you "implement"—it's a constraint you work within. But you can manage it intelligently:

Eviction strategies determine what gets removed when the context fills up:

  • FIFO (First-In-First-Out): Remove the oldest messages. Simple but may lose important early context.
  • Importance-based: Score messages by relevance to current task, keep the important ones.
  • Summary-based: Compress old messages into a summary before evicting.
  • Sliding window with anchors: Keep a sliding window of recent messages plus "anchored" important ones.

Compression strategies reduce token usage without losing information:

  • Summarization: Replace verbose exchanges with concise summaries.
  • Entity extraction: Pull out key facts, discard the conversation that revealed them.
  • Selective retention: Keep user messages, summarize assistant messages (or vice versa).
Python
class ShortTermMemory:
    """
    Short-term memory in LLMs is simply the context window.
    This class manages what goes into that window.
    """

    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages: list[dict] = []

    def add(self, role: str, content: str) -> None:
        """Add a message to short-term memory."""
        self.messages.append({"role": role, "content": content})

        # Short-term memory has hard limits
        # When exceeded, oldest memories must go
        while self._count_tokens() > self.max_tokens:
            self._evict_oldest()

    def _evict_oldest(self) -> None:
        """
        Remove oldest non-system message.
        This is the simplest eviction strategy.
        More sophisticated approaches might:
        - Summarize before evicting
        - Move to long-term memory
        - Prioritize by importance
        """
        for i, msg in enumerate(self.messages):
            if msg["role"] != "system":
                self.messages.pop(i)
                break

    def get_context(self) -> list[dict]:
        """Return current short-term memory contents."""
        return self.messages.copy()

    def _count_tokens(self) -> int:
        # Simplified token counting
        return sum(len(m["content"]) // 4 for m in self.messages)
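
The ShortTermMemory class above evicts oldest-first. The compression strategies listed earlier can be layered on top: before old messages fall off, fold them into a single summary message. A minimal sketch, assuming a summarize callable (for example, a cheap LLM call) that turns a transcript into a short paragraph:

Python
def compress_history(messages: list[dict], summarize, keep_recent: int = 10) -> list[dict]:
    """
    Summary-based compression: fold everything except the most recent
    `keep_recent` messages into a single summary message.
    `summarize` is an assumed callable mapping a transcript string
    to a short summary string.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarize(transcript)
    return [{"role": "system", "content": f"Earlier conversation (summarized): {summary}"}] + recent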

Key insight: Short-term memory is not something you "implement"—it's a constraint you work within. Every other memory type exists to compensate for short-term memory's limitations.


Working Memory: The Mental Workspace

In humans: Working memory is not just storage—it's an active workspace where you manipulate information. When you do mental arithmetic (what's 47 × 8?), you're using working memory to hold intermediate results. When you plan a route, you're holding the current location, destination, and candidate paths in working memory simultaneously. It's where thinking happens.

The classic demonstration of working memory is the "N-back" task: you're shown a sequence of items and must identify when the current item matches the one from N steps ago. This requires simultaneously storing recent items, comparing them, and updating your mental register—all hallmarks of working memory in action.

In LLMs: Working memory manifests in two primary forms:

  1. Chain-of-thought reasoning: When an LLM "thinks step by step," it's using its output as a scratchpad, writing down intermediate results that inform subsequent reasoning. Each "Let me think about this..." or "First, I'll consider..." is the model externalizing its working memory.

  2. Explicit scratchpad: A dedicated section of the prompt where the agent can write notes, track state, and maintain awareness of its current goals. This is working memory made visible and persistent.

The Difference Between Working and Short-Term Memory

These terms are often confused, but the distinction matters:

  • Short-term memory is about storage—holding information temporarily
  • Working memory is about processing—actively manipulating information

An analogy: Short-term memory is your desk—a limited surface where you can place things. Working memory is what you're doing at that desk—the calculations, comparisons, and reasoning you perform with the items there.

For LLMs, the context window is short-term memory (the desk). Working memory is the subset of that context dedicated to active reasoning—the scratchpad, the current goal tracking, the hypotheses being evaluated.

Why Working Memory Matters for Agents

Consider an agent tasked with debugging a complex production issue:

  1. It needs to hold the error message in mind
  2. While also remembering the hypothesis it's currently testing
  3. And tracking which files it has already examined
  4. And maintaining awareness of what it tried that didn't work
  5. And keeping the user's original request in focus

Without explicit working memory, agents exhibit a frustrating pattern: they lose track of what they're doing mid-task. They might re-examine a file they already looked at. They might forget a hypothesis they were testing. They might lose sight of the original goal while chasing a tangent.

This is the "goldfish problem" in agent systems—without working memory, every step feels like starting fresh. The agent has the information in its context (short-term memory) but isn't actively using it to guide behavior (working memory).

Components of LLM Working Memory

A well-designed working memory system tracks several types of information:

Goal State:

  • Current high-level objective ("Debug the authentication failure")
  • Active sub-goals ("Check the token validation logic")
  • Completed steps ("Verified database connection is working")
  • Blocked paths ("Ruled out network issues")

Observations:

  • Recent tool outputs (file contents, API responses, search results)
  • Environmental state (current directory, open files, running processes)
  • User feedback (corrections, clarifications, preferences)

Reasoning State:

  • Active hypotheses ("Token might be expiring prematurely")
  • Evidence for/against each hypothesis
  • Confidence levels
  • Decision criteria

Meta-cognitive State:

  • How many steps have been taken
  • Time/budget remaining
  • When to escalate or ask for help
  • What strategies have been tried

The Token Budget Problem

Working memory consumes context window tokens. Every goal you track, every observation you record, every hypothesis you maintain—it all costs tokens. This creates a fundamental tension:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                     CONTEXT WINDOW BUDGET                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Fixed costs:                                                            │
│  ├── System prompt:           ~1,000 tokens                             │
│  ├── Tool definitions:        ~2,000 tokens                             │
│  └── User's current message:  ~200 tokens                               │
│                                                                          │
│  Variable costs (competing for remaining space):                        │
│  ├── Working memory:          500-3,000 tokens                          │
│  ├── Retrieved context:       1,000-5,000 tokens                        │
│  └── Conversation history:    1,000-10,000 tokens                       │
│                                                                          │
│  Total budget: 8,000-128,000 tokens (depending on model)                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

More working memory means better reasoning but less room for conversation history and retrieved documents. This is a design decision with no universal right answer—it depends on whether your agent needs to reason deeply (expand working memory) or remember extensively (preserve history).
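
One way to make this trade-off explicit is to budget the window up front: subtract the fixed costs, then split what remains between working memory, retrieval, and history. A minimal sketch with illustrative default weights (not recommendations):

Python
def allocate_context_budget(
    total_tokens: int,
    fixed_tokens: int,
    weights: dict[str, float] = None,
) -> dict[str, int]:
    """Split the variable portion of the context window by weight."""
    weights = weights or {"working_memory": 0.2, "retrieved_context": 0.3, "history": 0.5}
    remaining = max(total_tokens - fixed_tokens, 0)
    return {name: int(remaining * share) for name, share in weights.items()}

# An 8K model with ~3,200 tokens of fixed costs leaves 4,800 tokens to divide
print(allocate_context_budget(total_tokens=8_000, fixed_tokens=3_200))
# {'working_memory': 960, 'retrieved_context': 1440, 'history': 2400}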

Working Memory Decay and Refresh

Human working memory decays rapidly—information fades within seconds without active maintenance (that's why you repeat the phone number to yourself). LLM working memory doesn't decay within a session, but it faces a different challenge: staleness.

An agent might start with the goal "Fix the login bug" but after 20 steps of investigation, that goal might no longer be accurate. Maybe the "login bug" is actually a database configuration issue. The working memory goal is now misleading.

Effective working memory systems include mechanisms for:

  • Periodic review: Revisit goals and hypotheses regularly
  • Contradiction detection: Flag when observations conflict with hypotheses
  • Goal refinement: Update objectives as understanding deepens
  • State compression: Summarize completed work to free up space

The relationship between working and short-term memory: Working memory uses short-term memory as its substrate. The scratchpad contents consume context window tokens. This creates a tension: more working memory for complex reasoning means less space for conversation history or retrieved documents.

Python
from datetime import datetime

class WorkingMemory:
    """
    Working memory is the agent's scratchpad for active reasoning.
    It tracks current goals, observations, and intermediate thoughts.
    """

    def __init__(self):
        # Current task state
        self.current_goal: str = ""
        self.sub_goals: list[str] = []
        self.completed_steps: list[str] = []

        # Observations from the environment
        self.observations: list[dict] = []

        # Scratchpad for reasoning
        self.notes: dict[str, str] = {}

        # Hypotheses being considered
        self.hypotheses: list[dict] = []

    def set_goal(self, goal: str) -> None:
        """Set the current high-level goal."""
        self.current_goal = goal
        self.sub_goals = []
        self.completed_steps = []

    def decompose_goal(self, sub_goals: list[str]) -> None:
        """Break down the goal into sub-goals."""
        self.sub_goals = sub_goals

    def complete_step(self, step: str, result: str) -> None:
        """Mark a step as completed with its result."""
        self.completed_steps.append(f"{step}: {result}")
        if step in self.sub_goals:
            self.sub_goals.remove(step)

    def add_observation(self, source: str, content: str) -> None:
        """Record an observation from a tool or environment."""
        self.observations.append({
            "source": source,
            "content": content[:500],  # Truncate to manage space
            "timestamp": datetime.now().isoformat()
        })
        # Keep only recent observations in working memory
        if len(self.observations) > 10:
            self.observations.pop(0)

    def note(self, key: str, value: str) -> None:
        """Store a note for later reference."""
        self.notes[key] = value

    def add_hypothesis(self, hypothesis: str, confidence: float) -> None:
        """Track a hypothesis being considered."""
        self.hypotheses.append({
            "hypothesis": hypothesis,
            "confidence": confidence,
            "evidence_for": [],
            "evidence_against": []
        })

    def to_prompt(self) -> str:
        """
        Serialize working memory for inclusion in prompt.
        This is how working memory enters short-term memory.
        """
        sections = []

        if self.current_goal:
            sections.append(f"## Current Goal\n{self.current_goal}")

        if self.sub_goals:
            goals_str = "\n".join(f"- [ ] {g}" for g in self.sub_goals)
            completed_str = "\n".join(f"- [x] {s}" for s in self.completed_steps[-3:])
            sections.append(f"## Progress\n{completed_str}\n{goals_str}")

        if self.observations:
            recent = self.observations[-3:]
            obs_str = "\n".join(f"- [{o['source']}]: {o['content'][:200]}" for o in recent)
            sections.append(f"## Recent Observations\n{obs_str}")

        if self.notes:
            notes_str = "\n".join(f"- {k}: {v}" for k, v in self.notes.items())
            sections.append(f"## Notes\n{notes_str}")

        if self.hypotheses:
            hyp_str = "\n".join(
                f"- {h['hypothesis']} (confidence: {h['confidence']:.0%})"
                for h in self.hypotheses
            )
            sections.append(f"## Working Hypotheses\n{hyp_str}")

        return "\n\n".join(sections)

Working memory management strategies:

| Strategy | When to Use | Trade-off |
|---|---|---|
| Full state | Complex multi-step tasks | Uses many tokens |
| Compressed state | Long-running tasks | May lose detail |
| Key-value notes | Flexible exploration | Less structured |
| Goal stack only | Simple sequential tasks | Limited reasoning |

Long-Term Memory: Persistent Knowledge

In humans: Long-term memory stores information for extended periods—from hours to a lifetime. It has effectively unlimited capacity but requires effort to encode (form memories) and retrieve (recall them). Memories can be strengthened through repetition and emotional significance, or weakened through disuse.

You don't remember everything you've ever experienced—long-term memory is selective. Emotionally significant events are encoded more strongly. Information you've retrieved multiple times becomes easier to access. And memories you never revisit gradually become harder to recall, though they may still exist somewhere in the neural network.

In LLMs: Long-term memory requires external storage—databases, vector stores, file systems. The LLM itself has no persistent state between API calls. Everything it "remembers" long-term must be explicitly stored outside the model and retrieved when relevant.

This is a profound difference from human memory. Humans form long-term memories automatically—sleep consolidates important experiences, emotional events are encoded strongly, repetition strengthens recall. LLMs form no memories at all without explicit intervention. Every long-term memory must be deliberately created, stored, indexed, and retrieved by external systems.

Why Long-Term Memory Transforms Agents

Consider the difference between these two agent experiences:

Without long-term memory:

  • Every conversation starts fresh
  • "Who am I talking to? What have we discussed before? What are their preferences?"
  • The agent is perpetually a stranger, even to a user it has "talked to" hundreds of times

With long-term memory:

  • "Ah, this is Alex. They work at a healthcare startup, prefer detailed explanations, and last time we discussed migrating their database to PostgreSQL."
  • The agent has continuity—a persistent relationship with the user

This is the difference between a vending machine and a trusted colleague. Both can provide assistance, but only one can say "Remember when we tried that last quarter? It didn't work because..."

The Four Challenges of Long-Term Memory

The hardest problems in LLM long-term memory aren't storage (databases are a solved problem). The challenges are:

1. What to remember (Encoding Selection)

Not everything is worth storing. An agent that stores every utterance will have retrieval problems later—the important information gets buried under trivial chatter.

Consider a 30-minute conversation. What's worth remembering?

  • "The user's name is Alex" → Worth storing
  • "The user said 'hmm, let me think'" → Not worth storing
  • "The user prefers concise explanations" → Worth storing
  • "The user asked about the weather while waiting" → Probably not worth storing
  • "The user's project deadline is March 15" → Definitely worth storing

Deciding what to remember requires understanding significance—which is itself a complex judgment. Some systems use LLMs to assess importance. Others use heuristics (store all user preferences, all decisions, all facts). The optimal approach depends on your use case.

2. How to encode (Representation)

The representation matters enormously. The same information can be stored in many ways:

  • Raw text: "User said they work at Google on the Search team"
  • Structured fact: {subject: "user", predicate: "works_at", object: "Google Search team"}
  • Summary: "User is a Google Search engineer"
  • Embedding only: [0.12, -0.34, 0.56, ...] (no human-readable form)

Each representation has trade-offs:

  • Raw text preserves nuance but is verbose
  • Structured facts enable precise queries but lose context
  • Summaries balance brevity and meaning but may lose detail
  • Embeddings enable semantic search but are opaque

3. When to retrieve (Relevance Detection)

How does the agent know when past information is relevant to the current query?

If the user asks "What's 2+2?", should the agent search its memory? Probably not—this is a simple factual question. But if the user asks "Should we proceed with the migration?", the agent should recall previous discussions about the migration, the user's concerns, and any relevant decisions.

This is the retrieval trigger problem. Options include:

  • Always retrieve: Every query triggers memory search (simple but wasteful)
  • Keyword matching: Retrieve when query contains certain terms (fast but brittle)
  • LLM-driven: Ask the LLM "Is memory relevant here?" (accurate but adds latency)
  • Classifier: Train a model to predict retrieval necessity (balanced approach)
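
Of these, keyword matching is the cheapest to prototype. A minimal sketch of the heuristic (the trigger terms are illustrative, and simple substring matching shows exactly the brittleness noted above):

Python
MEMORY_TRIGGER_TERMS = (
    "remember", "last time", "previously", "again", "we discussed",
    "the project", "the migration",
)

def should_retrieve(query: str) -> bool:
    """Search long-term memory only when the query refers back to shared context."""
    q = query.lower()
    return any(term in q for term in MEMORY_TRIGGER_TERMS)

print(should_retrieve("What's 2+2?"))                            # False
print(should_retrieve("Should we proceed with the migration?"))  # True

A production system would typically fall back to an LLM judgment or a trained classifier when the heuristic is uncertain.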

4. How to integrate (Context Assembly)

Retrieved memories must be incorporated into the context without overwhelming it. If you retrieve 20 relevant memories, you can't just dump them all into the prompt—you'll crowd out space for the actual conversation.

Integration strategies include:

  • Selective inclusion: Only include the top-K most relevant memories
  • Summarization: Combine multiple related memories into a digest
  • Hierarchical: Include summaries with pointers to details
  • Dynamic allocation: Adjust memory budget based on query complexity
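
Selective inclusion under a token budget is the simplest of these to implement. A minimal sketch, reusing the rough four-characters-per-token estimate from earlier:

Python
def assemble_memory_context(memories: list[dict], token_budget: int) -> str:
    """
    Take memories in relevance order (most relevant first) until the
    budget is spent. Each memory is assumed to carry a 'content' string.
    """
    lines: list[str] = []
    used = 0
    for memory in memories:
        cost = len(memory["content"]) // 4  # rough token estimate
        if used + cost > token_budget:
            break
        lines.append(f"- {memory['content']}")
        used += cost
    return "\n".join(lines)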

Long-Term Memory is Not "Big Short-Term Memory"

A common misconception is that long-term memory is simply short-term memory with more capacity. This is wrong in important ways:

| Property | Short-Term (Context) | Long-Term (External) |
|---|---|---|
| Presence | Always in context | Must be retrieved |
| Access | Instant, guaranteed | Search-based, probabilistic |
| Fidelity | Perfect within window | May return wrong or partial info |
| Cost | Tokens per API call | Storage + retrieval per query |
| Failure mode | Cliff edge (falls out) | Gradual degradation |

The probabilistic nature of long-term memory is crucial to understand. When you search a vector database, you get approximate matches ranked by similarity. The most relevant memory might be ranked #3. The #1 result might be tangentially related but not quite right. Sometimes relevant memories aren't retrieved at all.

This means systems using long-term memory must be designed for imperfect retrieval. They should gracefully handle missing context, ask clarifying questions when uncertain, and avoid overconfident claims based on potentially incomplete memory.

The Forgetting Problem

Human memory forgets, and this is a feature, not a bug. Forgetting prevents cognitive overload, clears out outdated information, and keeps retrieval fast by reducing the search space.

LLM long-term memory needs forgetting too, but it doesn't happen automatically. Without deliberate forgetting:

  • Storage costs grow unboundedly
  • Retrieval quality degrades (more noise, harder to find signal)
  • Outdated information pollutes results
  • Contradictory facts accumulate (old and new versions both exist)

Forgetting strategies include:

  • Time-based decay: Delete memories not accessed in N days
  • Importance thresholds: Remove memories below importance cutoff
  • Contradiction resolution: When facts conflict, keep the newer one
  • Capacity limits: When storage exceeds threshold, prune lowest-value memories
  • User-initiated: Allow users to say "Forget that" or "That's outdated"
Python
import math
import uuid
from datetime import datetime, timedelta

class LongTermMemory:
    """
    Long-term memory persists across sessions and has unlimited capacity.
    It requires explicit storage and retrieval operations.
    """

    def __init__(self, vector_store, embedding_model):
        self.vector_store = vector_store
        self.embedding_model = embedding_model

        # Track memory statistics
        self.total_memories = 0
        self.retrieval_stats = {"hits": 0, "misses": 0}

    def store(
        self,
        content: str,
        memory_type: str,
        importance: float = 0.5,
        metadata: dict = None
    ) -> str:
        """
        Store a memory for long-term retention.

        The encoding process:
        1. Generate embedding for semantic search
        2. Extract structured metadata for filtering
        3. Assign importance score for retrieval ranking
        4. Store with timestamp for temporal queries
        """
        # Generate embedding
        embedding = self.embedding_model.embed(content)

        # Create memory record
        memory_id = str(uuid.uuid4())
        memory = {
            "id": memory_id,
            "content": content,
            "embedding": embedding,
            "memory_type": memory_type,  # fact, episode, preference, etc.
            "importance": importance,
            "created_at": datetime.now().isoformat(),
            "last_accessed": datetime.now().isoformat(),
            "access_count": 0,
            "metadata": metadata or {}
        }

        self.vector_store.insert(memory)
        self.total_memories += 1

        return memory_id

    def retrieve(
        self,
        query: str,
        k: int = 5,
        memory_types: list[str] = None,
        min_importance: float = 0.0,
        recency_weight: float = 0.1
    ) -> list[dict]:
        """
        Retrieve relevant memories for a query.

        Retrieval is the critical operation:
        - Too few results: Agent lacks necessary context
        - Too many results: Overwhelms short-term memory
        - Wrong results: Agent uses irrelevant information

        We combine multiple signals:
        1. Semantic similarity (embedding distance)
        2. Importance score (pre-assigned weight)
        3. Recency (recently accessed = more relevant)
        4. Access frequency (frequently used = important)
        """
        # Embed the query
        query_embedding = self.embedding_model.embed(query)

        # Search vector store
        candidates = self.vector_store.search(
            query_embedding,
            k=k * 3,  # Over-retrieve then re-rank
            filter={"memory_type": {"$in": memory_types}} if memory_types else None
        )

        if not candidates:
            self.retrieval_stats["misses"] += 1
            return []

        # Re-rank with multiple signals
        now = datetime.now()
        scored_memories = []

        for memory in candidates:
            # Base score from semantic similarity
            score = memory["similarity"]

            # Importance weighting
            score *= (0.5 + 0.5 * memory["importance"])

            # Recency boost
            last_accessed = datetime.fromisoformat(memory["last_accessed"])
            days_ago = (now - last_accessed).days
            recency_score = math.exp(-recency_weight * days_ago)
            score *= (0.7 + 0.3 * recency_score)

            # Filter by minimum importance
            if memory["importance"] >= min_importance:
                scored_memories.append((memory, score))

        # Sort by final score
        scored_memories.sort(key=lambda x: x[1], reverse=True)

        # Update access statistics for retrieved memories
        results = []
        for memory, score in scored_memories[:k]:
            self._update_access(memory["id"])
            results.append(memory)

        self.retrieval_stats["hits"] += 1
        return results

    def _update_access(self, memory_id: str) -> None:
        """Update access time and count for a retrieved memory."""
        self.vector_store.update(memory_id, {
            "last_accessed": datetime.now().isoformat(),
            "access_count": {"$inc": 1}
        })

    def forget(self, threshold_days: int = 90, min_importance: float = 0.3) -> int:
        """
        Remove old, low-importance, unused memories.

        Forgetting is essential for long-term memory systems:
        - Prevents unbounded growth
        - Improves retrieval quality (less noise)
        - Reduces storage costs

        We forget memories that are:
        - Old (not accessed recently)
        - Low importance (not marked as significant)
        - Rarely accessed (not frequently useful)
        """
        cutoff = datetime.now() - timedelta(days=threshold_days)

        # Find candidates for forgetting
        candidates = self.vector_store.query({
            "last_accessed": {"$lt": cutoff.isoformat()},
            "importance": {"$lt": min_importance},
            "access_count": {"$lt": 3}
        })

        forgotten = 0
        for memory in candidates:
            self.vector_store.delete(memory["id"])
            forgotten += 1

        self.total_memories -= forgotten
        return forgotten
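
How the class above might be used across a session boundary. The vector_store and embedding_model arguments are placeholders for whatever backends provide the insert/search/update/query and embed interfaces assumed in the class:

Python
# Placeholder backends implementing the interface assumed above
memory = LongTermMemory(vector_store=my_vector_store, embedding_model=my_embedder)

# During a conversation: encode something worth keeping
memory.store(
    "User prefers PostgreSQL over MySQL for new projects",
    memory_type="preference",
    importance=0.8,
)

# In a later session: pull relevant memories before answering
for m in memory.retrieve("Which database should we use?", k=3):
    print(m["content"])

# Periodically (e.g. from a scheduled job): prune stale, low-value memories
memory.forget(threshold_days=90, min_importance=0.3)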

Long-term memory design decisions:

| Decision | Options | Considerations |
|---|---|---|
| Storage format | Raw text, summaries, structured facts | Trade-off between fidelity and efficiency |
| Embedding model | OpenAI, Cohere, local models | Cost, latency, quality |
| Retrieval method | Pure vector, hybrid (vector + keyword), filtered | Accuracy vs. complexity |
| Importance scoring | Manual, LLM-assessed, usage-based | Effort vs. accuracy |
| Forgetting policy | Time-based, importance-based, capacity-based | Memory hygiene |

Episodic Memory: Remembering Experiences

In humans: Episodic memory stores autobiographical experiences—not just facts, but the experience of events. You remember not just that you visited Paris, but the feeling of seeing the Eiffel Tower, who you were with, what you ate. These memories are inherently temporal and contextual—they're stories with a beginning, middle, and end.

The distinction between episodic and semantic memory was first articulated by psychologist Endel Tulving in 1972. Semantic memory stores facts ("Paris is the capital of France"). Episodic memory stores experiences ("I remember visiting Paris in 2019..."). Both are long-term, but they serve different purposes and are even processed by different brain regions.

In LLMs: Episodic memory stores conversation sessions, task executions, and interactions as coherent episodes. Unlike semantic memory (which stores isolated facts), episodic memory preserves the narrative structure—what happened, in what order, what was the context, and what was the outcome.

Why Narrative Structure Matters

Consider these two ways of storing the same information:

Semantic (facts only):

  • User prefers PostgreSQL over MySQL
  • User's project is a healthcare startup
  • User had a database migration issue
  • Migration was successful after fixing charset

Episodic (narrative):

  • "In our conversation on March 15, the user was struggling with a database migration. They explained that their healthcare startup needed to move from MySQL to PostgreSQL for HIPAA compliance. We tried several approaches—first the direct migration failed due to charset issues, then we discovered the problem was UTF-8 vs UTF-16 handling in patient names with accents. After implementing proper charset conversion, the migration succeeded. The user was relieved and mentioned they'd been stuck on this for days."

The episodic version captures:

  • Temporal context: When it happened
  • Causal structure: What led to what
  • Emotional significance: The user's frustration and relief
  • Problem-solving journey: What worked, what didn't, why
  • Lessons learned: The charset issue for future reference

An agent with only semantic memory knows the user prefers PostgreSQL. An agent with episodic memory can say "I remember helping you with that charset issue during the migration—it was tricky because of the accented patient names."

Why Episodic Memory Matters for Agents

Episodic memory enables several capabilities that semantic memory alone cannot provide:

1. Experiential Learning

Without episodic memory, an agent is condemned to repeat mistakes. With it, the agent can recall "Last time we tried approach X, it failed because of Y—let's try something different."

This is especially valuable for coding agents, debugging assistants, and any agent that needs to learn from trial and error. The ability to remember not just outcomes but the full trajectory (what was tried, in what order, why it failed) enables genuine learning from experience.

2. Relationship Continuity

Humans build relationships through shared experiences. "Remember when we..." is the foundation of ongoing relationships. Episodic memory allows agents to reference shared history, building rapport and trust.

"You mentioned last month that you were nervous about the demo" is fundamentally different from "User has experienced nervousness." The first acknowledges shared history; the second is a clinical observation.

3. Contextualized Recommendations

With episodic memory, an agent can say "Based on how the last project went, you might want to budget more time for testing." This draws on the full narrative of past experience, not just extracted facts.

4. Personalized Problem Solving

Knowing how a user approached problems in the past helps predict how they'll want to approach them now. Did they prefer to understand the theory first, or dive into implementation? Did they want multiple options, or a single recommendation? Episodic memory captures these preferences as they manifest in actual behavior, not just stated preferences.

Episodic vs. Conversation History

A critical distinction: raw conversation history is not episodic memory. Conversation history is a log—every message stored chronologically. Episodic memory involves processing that log to extract meaningful episodes.

| Aspect | Conversation History | Episodic Memory |
|---|---|---|
| Content | Raw messages | Processed narratives |
| Structure | Chronological log | Meaningful episodes |
| Size | Grows without bound | Summarized, bounded |
| Searchable | By time, keyword | By similarity, theme |
| Value | Full detail, high noise | Distilled meaning, low noise |

The transformation from conversation history to episodic memory typically involves:

  1. Episode boundary detection: Identifying where one coherent interaction ends and another begins
  2. Summarization: Distilling the key narrative from verbose conversation
  3. Key moment extraction: Identifying pivotal points (decisions, discoveries, problems)
  4. Outcome labeling: Did the interaction succeed? Partially? What was accomplished?
  5. Lesson extraction: What should be remembered for future similar situations?

The Episode Structure

A well-formed episode typically contains:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                         EPISODE STRUCTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  METADATA                                                                │
│  ├── Timestamp: When did this happen?                                   │
│  ├── Duration: How long was the interaction?                            │
│  ├── Participants: Who was involved?                                    │
│  └── Context: What was the setting/situation?                           │
│                                                                          │
│  NARRATIVE                                                               │
│  ├── Summary: 2-3 sentence overview                                     │
│  ├── Goal: What was the user trying to accomplish?                      │
│  ├── Journey: Key steps and turns in the interaction                    │
│  └── Outcome: What was achieved?                                        │
│                                                                          │
│  KEY MOMENTS                                                             │
│  ├── Decisions: Choices made and why                                    │
│  ├── Discoveries: New information learned                               │
│  ├── Problems: Issues encountered                                       │
│  └── Solutions: How problems were resolved                              │
│                                                                          │
│  LESSONS                                                                 │
│  ├── What worked well?                                                  │
│  ├── What didn't work?                                                  │
│  └── What should be done differently next time?                         │
│                                                                          │
│  RETRIEVAL AIDS                                                          │
│  ├── Embedding: For semantic search                                     │
│  ├── Tags: For categorical filtering                                    │
│  └── Importance score: For prioritizing in retrieval                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Episodic Memory Retrieval Patterns

How do you find the right episodes when they're needed?

Similarity-based retrieval: "This situation reminds me of..." Find episodes semantically similar to the current context. Useful when the current problem might benefit from past experience with similar problems.

User-specific retrieval: "With this particular user, I remember..." Find all episodes involving a specific user, building a history of interactions. Essential for personalization and relationship continuity.

Temporal retrieval: "Last week we discussed..." Find episodes from a specific time period. Useful when users reference past conversations by time ("What did we talk about on Monday?").

Outcome-based retrieval: "When we succeeded/failed at this before..." Find episodes with specific outcomes. Useful for learning from past successes and avoiding past failures.

Thematic retrieval: "Episodes about database migrations..." Find episodes involving specific topics or task types. Useful for domain-specific experience recall.

Python
import json
import uuid
from datetime import datetime

class EpisodicMemory:
    """
    Episodic memory stores experiences as coherent narratives.
    Each episode captures what happened, when, with what outcome.
    """

    def __init__(self, storage, llm):
        self.storage = storage
        self.llm = llm  # For summarization and extraction

    def create_episode(
        self,
        conversation: list[dict],
        task: str = None,
        outcome: str = None,
        user_id: str = None
    ) -> dict:
        """
        Transform a conversation into an episode.

        This is the encoding process for episodic memory:
        1. Summarize the overall interaction
        2. Extract key moments (decisions, discoveries, problems)
        3. Identify the outcome (success, failure, partial)
        4. Note lessons learned
        """
        # Generate episode summary
        summary = self._summarize_conversation(conversation)

        # Extract key moments
        key_moments = self._extract_key_moments(conversation)

        # Identify emotional/importance peaks
        importance_score = self._assess_importance(conversation, outcome)

        # Extract lessons learned
        lessons = self._extract_lessons(conversation, outcome)

        episode = {
            "id": str(uuid.uuid4()),
            "timestamp": datetime.now().isoformat(),
            "duration_minutes": self._estimate_duration(conversation),
            "user_id": user_id,

            # The narrative structure
            "task": task,
            "summary": summary,
            "key_moments": key_moments,
            "outcome": outcome,
            "lessons": lessons,

            # For retrieval
            "embedding": self._embed_episode(summary, key_moments),
            "importance": importance_score,

            # Raw data (optional, for detailed recall)
            "message_count": len(conversation),
            "full_conversation": conversation  # Or store separately
        }

        self.storage.insert(episode)
        return episode

    def _summarize_conversation(self, conversation: list[dict]) -> str:
        """Generate a narrative summary of the conversation."""
        messages_text = "\n".join(
            f"{m['role']}: {m['content'][:200]}"
            for m in conversation
        )

        prompt = f"""Summarize this conversation as a brief narrative.
Focus on: what was discussed, what was accomplished, any problems encountered.
Write in past tense, 2-3 sentences.

Conversation:
{messages_text}

Summary:"""

        return self.llm.generate(prompt)

    def _extract_key_moments(self, conversation: list[dict]) -> list[dict]:
        """
        Identify the pivotal moments in the conversation.

        Key moments are:
        - Decisions made
        - Problems discovered
        - Solutions found
        - User preferences revealed
        - Misunderstandings corrected
        """
        messages_text = "\n".join(
            f"[{i}] {m['role']}: {m['content'][:300]}"
            for i, m in enumerate(conversation)
        )

        prompt = f"""Identify 2-4 key moments in this conversation.
For each moment, note:
- What happened
- Why it was significant
- The message index

Format as JSON array.

Conversation:
{messages_text}

Key moments:"""

        response = self.llm.generate(prompt)
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            return []

    def _extract_lessons(self, conversation: list[dict], outcome: str) -> list[str]:
        """
        Extract lessons learned from this episode.

        Lessons help the agent improve over time:
        - "User prefers detailed explanations"
        - "This approach led to errors, try alternative"
        - "Always confirm before making changes"
        """
        prompt = f"""Based on this conversation and its outcome, what lessons should be remembered?

Outcome: {outcome}

List 1-3 actionable lessons:"""

        response = self.llm.generate(prompt)
        return [l.strip() for l in response.split("\n") if l.strip()]

    def recall_similar_episodes(
        self,
        query: str,
        k: int = 3,
        user_id: str = None
    ) -> list[dict]:
        """
        Find past episodes similar to the current situation.

        This enables experiential reasoning:
        "Last time we encountered a similar problem..."
        "Based on past interactions with this user..."
        """
        query_embedding = self._embed_query(query)

        filters = {}
        if user_id:
            filters["user_id"] = user_id

        episodes = self.storage.search(
            query_embedding,
            k=k,
            filter=filters
        )

        return episodes

    def recall_by_timeframe(
        self,
        start: datetime,
        end: datetime,
        user_id: str = None
    ) -> list[dict]:
        """Recall episodes from a specific time period."""
        query = {"timestamp": {"$gte": start.isoformat(), "$lte": end.isoformat()}}
        if user_id:
            query["user_id"] = user_id
        return self.storage.query(query)

    def get_user_history(self, user_id: str, limit: int = 10) -> list[dict]:
        """Get recent episodes for a specific user."""
        return self.storage.query(
            {"user_id": user_id},
            sort={"timestamp": -1},
            limit=limit
        )
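
A sketch of the class above in use. The storage and llm arguments are placeholders for the backends assumed by the class, and helper methods such as _embed_episode, _assess_importance, and _estimate_duration are elided above:

Python
# Placeholder backends with the interface assumed above
episodes = EpisodicMemory(storage=episode_store, llm=llm)

# At the end of a session, fold the transcript into an episode
episodes.create_episode(
    conversation=session_messages,
    task="Migrate MySQL database to PostgreSQL",
    outcome="success",
    user_id="alex",
)

# At the start of a later session, recall relevant experience
for ep in episodes.recall_similar_episodes(
    "planning another database migration", k=2, user_id="alex"
):
    print(ep["summary"])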

Episodic memory enables powerful capabilities:

| Capability | Without Episodic Memory | With Episodic Memory |
|---|---|---|
| Continuity | "I don't recall our previous conversations" | "Last time we discussed your project timeline..." |
| Learning | Repeats same mistakes | "That approach didn't work before, let's try..." |
| Personalization | Generic responses | "You mentioned preferring detailed explanations" |
| Accountability | No history of actions | "Here's what I did and why" |

Semantic Memory: Facts and Knowledge

In humans: Semantic memory stores general knowledge about the world—facts, concepts, meanings. Unlike episodic memory, semantic memories are divorced from personal experience. You know that Paris is the capital of France without remembering when you learned it. You know what a "database" is without recalling the first time you encountered one.

Semantic memory is organized conceptually, not temporally. Facts are connected by meaning and relationship, forming a web of knowledge. "Paris → capital of → France → country in → Europe → continent" represents how semantic memory links concepts together.

In LLMs: The model's training data provides baseline semantic memory—it "knows" facts from training. But this knowledge is frozen at the training cutoff and can't be updated. External semantic memory (knowledge bases, RAG systems) supplements this with current, domain-specific, or private knowledge.

The Dual Nature of LLM Semantic Memory

LLMs have two sources of semantic knowledge:

1. Parametric Knowledge (in the weights)

Everything the model learned during training is encoded in its parameters. This includes:

  • General world knowledge ("Paris is the capital of France")
  • Language understanding ("syntax," "grammar")
  • Domain knowledge ("how databases work")
  • Common patterns ("typical React component structure")

This knowledge is:

  • Always available (no retrieval needed)
  • Fast to access (just run inference)
  • Frozen at training time (cannot be updated)
  • Potentially outdated or incorrect (hallucinations)
  • Generic (not personalized to any user)

2. External Knowledge (retrieved at runtime)

Knowledge stored in external databases and retrieved when relevant:

  • User-specific facts ("Alex works at Google")
  • Current information ("Bitcoin price today")
  • Private data ("Company internal documentation")
  • Specialized knowledge ("Domain-specific terminology")

This knowledge is:

  • Must be retrieved (adds latency)
  • Can be updated (always current)
  • Can be personalized (per-user facts)
  • Requires storage infrastructure
  • Subject to retrieval failures

Why Semantic Memory Matters for Personalization

Consider an agent that has learned these facts about a user over multiple conversations:

Code
User Profile (Semantic Memory):
- Name: Alex
- Company: HealthStart Inc.
- Role: CTO
- Team size: 12 developers
- Tech stack: Python, PostgreSQL, React
- Preferences: Prefers detailed explanations, likes code examples
- Timezone: PST
- Current project: HIPAA-compliant data pipeline

This semantic memory transforms every interaction:

Without semantic memory:

User: "How should we handle the data validation?" Agent: "Could you tell me more about your project and requirements?"

With semantic memory:

User: "How should we handle the data validation?" Agent: "For your HIPAA-compliant data pipeline, I'd recommend validating PHI fields with strict regex patterns before they hit PostgreSQL. Given your team's Python stack, you could use Pydantic models with custom validators. Want me to show a code example?"

The agent knows the context, the constraints, and the user's preferences—enabling responses that feel genuinely personalized rather than generic.

The Episodic-to-Semantic Transition

A fascinating aspect of human memory: episodic memories gradually become semantic over time. You might remember the specific episode when you learned Paris is the capital of France (episodic). But after years, you just know the fact without any episodic context (semantic).

The same transition happens in LLM memory systems. Consider a series of conversations:

Episode 1: "User mentioned they work at a startup" Episode 2: "User said the startup is in healthcare" Episode 3: "User mentioned they're building a data pipeline for HIPAA" Episode 4: "User said they recently hired three more developers"

Over time, these episodes can be consolidated into semantic facts:

  • User works at a healthcare startup
  • User is building HIPAA-compliant systems
  • User's team is growing

This consolidation serves several purposes:

  • Reduces storage (facts are more compact than episodes)
  • Improves retrieval (facts are easier to match)
  • Enables inference (combining facts yields new insights)
  • Maintains privacy (can retain facts while forgetting specific conversations)
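
A sketch of that consolidation step, assuming an llm.generate interface as used elsewhere in this post and a semantic_memory object exposing the store_fact method of the SemanticMemory class defined later in this section:

Python
import json

def consolidate_episodes(episode_summaries: list[str], llm, semantic_memory) -> list[dict]:
    """Distill stable facts out of a batch of episode summaries."""
    prompt = (
        "From these episode summaries, extract stable facts about the user as a "
        "JSON array of objects with subject, predicate, object, confidence:\n"
        + "\n".join(f"- {s}" for s in episode_summaries)
    )
    try:
        facts = json.loads(llm.generate(prompt))
    except json.JSONDecodeError:
        return []
    for fact in facts:
        semantic_memory.store_fact(**fact)  # see SemanticMemory below
    return facts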

Structured vs. Unstructured Semantic Memory

Semantic memory can be stored in different formats:

Unstructured (text):

Code
"Alex is the CTO of HealthStart Inc., a healthcare startup
building HIPAA-compliant data pipelines."

Semi-structured (key-value):

JSON
{
  "user": {
    "name": "Alex",
    "role": "CTO",
    "company": "HealthStart Inc.",
    "industry": "healthcare"
  }
}

Structured (knowledge graph/triples):

Code
(Alex) --[is_role]--> (CTO)
(Alex) --[works_at]--> (HealthStart Inc.)
(HealthStart Inc.) --[is_in_industry]--> (Healthcare)
(HealthStart Inc.) --[is_building]--> (HIPAA Data Pipeline)

Each format has trade-offs:

| Format | Pros | Cons |
|---|---|---|
| Unstructured | Preserves nuance, easy to store | Hard to query precisely, verbose |
| Semi-structured | Queryable, flexible schema | No relationship modeling |
| Knowledge graph | Explicit relationships, enables reasoning | Complex to build and maintain |

Many production systems use a combination: unstructured text for flexible storage, with structured overlays for common query patterns.

Fact Lifecycle Management

Semantic facts have a lifecycle that must be managed:

Creation: When is a fact worth creating?

  • Explicit user statements ("My name is Alex")
  • Inferred from behavior (user always chooses detailed explanations)
  • Extracted from episodes (consolidation)

Updates: Facts change over time

  • User changes jobs ("I left Google, I'm at a startup now")
  • Preferences evolve ("Actually, I'd prefer shorter responses")
  • Corrections ("No, that's not right—the deadline is March, not February")

Conflicts: What happens when facts contradict?

  • Old fact: "User works at Google"
  • New fact: "User works at HealthStart"
  • Resolution: Mark old fact as superseded, keep audit trail

Expiration: Some facts have natural lifespans

  • "Project deadline is March 15" (expires after March 15)
  • "Currently debugging authentication" (expires when done)
  • "User's name is Alex" (permanent, or until corrected)

The relationship between semantic and episodic memory: Over time, episodic memories can become semantic. After many conversations about a user's project, the agent might distill "User is working on a healthcare startup" as a semantic fact, even without remembering which specific conversation revealed this. This consolidation is how agents develop stable "knowledge" about users from accumulated experiences.

Python
import json
import uuid
from datetime import datetime

class SemanticMemory:
    """
    Semantic memory stores facts, concepts, and knowledge.
    Unlike episodic memory, these are decontextualized truths.
    """

    def __init__(self, storage, embedding_model, llm):
        self.storage = storage
        self.embedding_model = embedding_model
        self.llm = llm

    def store_fact(
        self,
        subject: str,
        predicate: str,
        object: str,
        confidence: float = 1.0,
        source: str = None
    ) -> str:
        """
        Store a semantic fact as a triple.

        Examples:
        - ("user", "works_at", "Acme Corp")
        - ("project", "uses_framework", "React")
        - ("deadline", "is", "March 15")
        """
        # Create natural language representation
        fact_text = f"{subject} {predicate.replace('_', ' ')} {object}"

        fact = {
            "id": str(uuid.uuid4()),
            "subject": subject,
            "predicate": predicate,
            "object": object,
            "fact_text": fact_text,
            "embedding": self.embedding_model.embed(fact_text),
            "confidence": confidence,
            "source": source,
            "created_at": datetime.now().isoformat(),
            "updated_at": datetime.now().isoformat(),
            "contradicted_by": None
        }

        # Check for existing facts about same subject-predicate
        existing = self._find_existing(subject, predicate)
        if existing:
            # Handle potential contradiction
            self._handle_update(existing, fact)
        else:
            self.storage.insert(fact)

        return fact["id"]

    def _handle_update(self, existing: dict, new: dict) -> None:
        """
        Handle updating an existing fact.

        Facts can change:
        - User changes jobs
        - Project deadlines shift
        - Preferences evolve

        We don't just overwrite—we may want to track history.
        """
        if existing["object"] != new["object"]:
            # Mark old fact as superseded
            self.storage.update(existing["id"], {
                "contradicted_by": new["id"],
                "current": False
            })

            # Store new fact
            new["previous_value"] = existing["object"]
            self.storage.insert(new)
        else:
            # Same fact, update confidence/timestamp
            self.storage.update(existing["id"], {
                "confidence": max(existing["confidence"], new["confidence"]),
                "updated_at": new["updated_at"]
            })

    def extract_facts_from_conversation(
        self,
        conversation: list[dict],
        entity: str = "user"
    ) -> list[dict]:
        """
        Extract semantic facts from a conversation.

        This is the process of converting episodic to semantic memory:
        - "I work at Google" → (user, works_at, Google)
        - "We're using Python" → (project, uses_language, Python)
        - "Call me Alex" → (user, preferred_name, Alex)
        """
        messages_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in conversation
        )

        prompt = f"""Extract factual information from this conversation about {entity}.

Format as JSON array of objects with: subject, predicate, object, confidence (0-1)

Only extract explicitly stated facts, not inferences.

Conversation:
{messages_text}

Facts:"""

        response = self.llm.generate(prompt)
        try:
            facts = json.loads(response)
            # Store each extracted fact
            for fact in facts:
                self.store_fact(**fact)
            return facts
        except (json.JSONDecodeError, TypeError):
            return []

    def query_facts(
        self,
        subject: str = None,
        predicate: str = None,
        semantic_query: str = None
    ) -> list[dict]:
        """
        Query semantic memory.

        Can query by:
        - Exact subject/predicate: "What does user work_at?"
        - Semantic similarity: "What do we know about user's job?"
        """
        if subject and predicate:
            # Exact query
            return self.storage.query({
                "subject": subject,
                "predicate": predicate,
                "current": {"$ne": False}
            })

        if semantic_query:
            # Semantic search
            query_embedding = self.embedding_model.embed(semantic_query)
            return self.storage.search(
                query_embedding,
                k=10,
                filter={"current": {"$ne": False}}
            )

        if subject:
            # All facts about subject
            return self.storage.query({
                "subject": subject,
                "current": {"$ne": False}
            })

        return []

    def get_entity_profile(self, entity: str) -> dict:
        """
        Compile all known facts about an entity into a profile.

        This creates a structured summary:
        {
            "name": "Alex",
            "works_at": "Acme Corp",
            "role": "Engineer",
            "preferences": {...}
        }
        """
        facts = self.query_facts(subject=entity)

        profile = {"entity": entity}
        for fact in facts:
            profile[fact["predicate"]] = fact["object"]

        return profile

Procedural Memory: Learned Behaviors

In humans: Procedural memory stores skills and habits—how to ride a bike, type on a keyboard, or solve a particular type of math problem. These memories are implicit; you don't consciously recall them, you just execute them.

The hallmark of procedural memory is that it operates below conscious awareness. When you type, you don't think "press the 'T' key with my left index finger"—your fingers just move. When you ride a bike, you don't consciously calculate balance adjustments—your body just knows. This "knowing how" is distinct from "knowing that" (semantic memory).

Procedural memories are also remarkably durable. You might forget facts you learned in school, but skills like riding a bike or swimming tend to persist for life. The phrase "it's like riding a bike" captures this durability—procedural memories resist forgetting in ways that other memory types don't.

In LLMs: Procedural memory is primarily encoded in the model weights through training. A model "knows how" to write Python code or format JSON not through explicit memory, but through learned patterns. Fine-tuning adds new procedural knowledge.

When a model writes syntactically correct Python, it's not consulting explicit rules—the patterns are embedded in billions of parameters. When it formats a response with proper Markdown, it's not following stored instructions—it has "learned" the format through exposure to millions of examples.

The Fundamental Difference: Can't Add at Runtime

Here's what makes procedural memory fundamentally different from the other memory types we've discussed:

Semantic and episodic memory: "Hey LLM, remember that Alex works at Google." → Store it in a database, retrieve when relevant. Done.

Procedural memory: "Hey LLM, learn how to write Go code." → You can't do this at inference time. The model either knows Go from training, or it doesn't.

This asymmetry has profound implications. You can give an agent unlimited semantic memory (facts about users), unlimited episodic memory (past conversations), but its procedural memory is fixed at training time. If the model was trained on Python but not Rust, no amount of runtime memory will teach it Rust at the same fluency level.

Why This Matters for Agent Capabilities

Consider an agent helping a user with a task:

Scenario 1: User asks about their company's vacation policy.

  • Agent retrieves the policy document (semantic)
  • Agent recalls previous discussions about time off (episodic)
  • Agent synthesizes an answer using language skills (procedural)
  • Works great—the novel information was semantic, procedural skills are generic

Scenario 2: User asks to write code in an internal DSL.

  • Agent retrieves documentation about the DSL (semantic)
  • Agent recalls previous code in this DSL (episodic)
  • Agent attempts to write DSL code (procedural)
  • May struggle—even with documentation, the model lacks fluency in the DSL

The difference is that some tasks require genuine procedural skill—fluent, automatic execution—not just factual knowledge about how to do something.

Approximating Procedural Memory at Runtime

Since we can't truly add procedural memories at inference time, we use approximations:

Few-Shot Examples (In-Context Learning)

Show the model examples of the desired behavior directly in the prompt:

Code
Here's how we format error messages in our codebase:

Example 1:
Input: File not found
Output: ERROR_FILE_001: Requested file could not be located in path

Example 2:
Input: Permission denied
Output: ERROR_AUTH_002: Insufficient permissions for requested operation

Now format this:
Input: Connection timeout

This works surprisingly well for many tasks. The model doesn't "learn" the procedure permanently, but it can follow the pattern within this conversation.

Limitations:

  • Uses precious context tokens
  • Only works for patterns demonstrable in few examples
  • Performance degrades for complex procedures
  • Knowledge doesn't persist to next conversation

Structured Instructions

Provide detailed, step-by-step instructions for the procedure:

Code
When reviewing code, follow these steps:
1. First, check for syntax errors
2. Then, look for potential security vulnerabilities
3. Next, evaluate code style and naming conventions
4. Finally, assess performance implications

For each issue found, format your feedback as:
- Line number
- Issue type (error/warning/suggestion)
- Description
- Suggested fix

Limitations:

  • Models may not follow perfectly
  • Complex procedures are hard to specify completely
  • No implicit knowledge transfer (just explicit rules)

Tool Use as External Procedural Memory

Instead of the model executing a procedure, define a tool that executes it:

Python
# Instead of teaching the model to calculate shipping costs
# (complex procedure with zones, weights, discounts)...

# Define a tool that does it:
def calculate_shipping(weight: float, zone: str, priority: bool) -> float:
    """Calculate shipping cost based on complex internal rules."""
    # All the procedural knowledge is in this function
    pass

This externalizes procedural memory into code. The model doesn't need to "know how" to calculate shipping—it just needs to know when to call the tool and how to interpret results.

Limitations:

  • Requires implementing every procedure as code
  • Can't handle truly novel procedures
  • Model still needs procedural skill to use tools correctly

Fine-Tuning (True Procedural Learning)

Train the model on examples of the desired behavior:

Code
Training data:
{"input": "Write a SQL query to find users", "output": "SELECT * FROM users;"}
{"input": "Write a SQL query to count orders", "output": "SELECT COUNT(*) FROM orders;"}
... thousands more examples ...

This actually modifies the model's weights, adding genuine procedural memory.

Limitations:

  • Expensive (compute and data)
  • Static (can't update quickly)
  • Requires many examples
  • Risk of catastrophic forgetting
  • May require re-training for updates

The Skill Hierarchy

Not all procedural skills are equally learnable at runtime. A rough hierarchy:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROCEDURAL SKILL HIERARCHY                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EASILY APPROXIMATED AT RUNTIME (few-shot or instructions)              │
│  ├── Output formatting (JSON, Markdown, specific templates)             │
│  ├── Simple transformations (reformat date, change case)                │
│  └── Pattern-based generation (follow a style guide)                    │
│                                                                          │
│  MODERATELY DIFFICULT (may need many examples or fine-tuning)           │
│  ├── Domain-specific writing styles (legal, medical, technical)         │
│  ├── Code in familiar languages with project conventions                │
│  └── Multi-step reasoning in specialized domains                        │
│                                                                          │
│  DIFFICULT (usually requires fine-tuning)                               │
│  ├── Fluent code in rare/internal languages                             │
│  ├── Complex domain-specific reasoning (tax law, drug interactions)     │
│  └── Maintaining long-range consistency in specialized outputs          │
│                                                                          │
│  VERY DIFFICULT (may require specialized training)                      │
│  ├── Novel task types not seen in training                              │
│  ├── Combining multiple complex skills simultaneously                   │
│  └── Tasks requiring implicit knowledge hard to articulate              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Runtime Procedural Memory Systems

Despite the limitations, we can build systems that approximate procedural memory:

Procedure Libraries

Store procedures as retrievable documents:

Code
Procedure: Code Review
Trigger: When user asks for code review
Steps:
1. Read the code carefully
2. Check for bugs
3. Evaluate style
...
Examples:
[Example 1]
[Example 2]

When a task matches a stored procedure, retrieve it and inject into the prompt. The model doesn't "know" the procedure permanently, but it has access when needed.

Execution Tracking

Track how well procedures work and refine them:

  • Store procedure definition
  • Record each execution
  • Note success/failure and user feedback
  • Periodically update procedures that perform poorly

This creates a feedback loop that improves procedures over time, even though the model itself isn't learning.

Workarounds for runtime procedural memory:

Approach | How It Works | Limitations
Few-shot examples | Show examples in prompt | Uses context tokens
Instructions | Describe the procedure | May not follow perfectly
Tool definitions | Define tools the agent can use | Requires implementation
Fine-tuning | Train on examples | Expensive, static
Python
class ProceduralMemory:
    """
    Procedural memory stores 'how to' knowledge.
    In LLMs, this is approximated through examples and instructions.
    """

    def __init__(self, storage):
        self.storage = storage
        self.procedures: dict[str, dict] = {}

    def store_procedure(
        self,
        name: str,
        description: str,
        steps: list[str],
        examples: list[dict],
        trigger_patterns: list[str]
    ) -> None:
        """
        Store a procedure for later retrieval.

        A procedure includes:
        - What it does (description)
        - How to do it (steps)
        - Examples of doing it (few-shot)
        - When to use it (trigger patterns)
        """
        procedure = {
            "name": name,
            "description": description,
            "steps": steps,
            "examples": examples,
            "trigger_patterns": trigger_patterns,
            "usage_count": 0,
            "success_rate": None
        }

        self.procedures[name] = procedure
        self.storage.insert(procedure)

    def get_relevant_procedures(self, task: str) -> list[dict]:
        """
        Find procedures relevant to the current task.

        This sketch matches on trigger pattern keywords only; semantic
        similarity against the description could be layered on via embeddings.
        """
        relevant = []

        for name, proc in self.procedures.items():
            # Check trigger patterns
            for pattern in proc["trigger_patterns"]:
                if pattern.lower() in task.lower():
                    relevant.append(proc)
                    break

        return relevant

    def format_procedure_for_prompt(self, procedure: dict) -> str:
        """
        Format a procedure for inclusion in the prompt.

        This is how procedural memory enters short-term memory
        at inference time.
        """
        sections = [
            f"## Procedure: {procedure['name']}",
            f"\n{procedure['description']}",
            "\n### Steps:",
            "\n".join(f"{i+1}. {step}" for i, step in enumerate(procedure['steps']))
        ]

        if procedure['examples']:
            sections.append("\n### Examples:")
            for ex in procedure['examples'][:2]:  # Limit to save tokens
                sections.append(f"\nInput: {ex['input']}")
                sections.append(f"Output: {ex['output']}")

        return "\n".join(sections)

    def record_execution(self, procedure_name: str, success: bool) -> None:
        """Track procedure execution for learning."""
        if procedure_name in self.procedures:
            proc = self.procedures[procedure_name]
            proc["usage_count"] += 1

            # Update success rate
            if proc["success_rate"] is None:
                proc["success_rate"] = 1.0 if success else 0.0
            else:
                # Exponential moving average
                proc["success_rate"] = 0.9 * proc["success_rate"] + 0.1 * (1.0 if success else 0.0)

Integrating Memory Types: A Complete System

A production agent typically uses multiple memory types together. Here's how they interact:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INTEGRATED MEMORY SYSTEM                              │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                     CONTEXT WINDOW                               │   │
│  │                   (Short-Term Memory)                            │   │
│  │                                                                  │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │   │
│  │  │   System    │  │   Working   │  │    Recent Messages      │ │   │
│  │  │   Prompt    │  │   Memory    │  │    (Conversation)       │ │   │
│  │  │             │  │  Scratchpad │  │                         │ │   │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │   │
│  │                                                                  │   │
│  │  ┌─────────────────────────────────────────────────────────────┐│   │
│  │  │              Retrieved Context                               ││   │
│  │  │  (Loaded from long-term memory based on relevance)          ││   │
│  │  │                                                              ││   │
│  │  │  • Semantic facts about user                                 ││   │
│  │  │  • Relevant past episodes                                    ││   │
│  │  │  • Applicable procedures                                     ││   │
│  │  └─────────────────────────────────────────────────────────────┘│   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│         ▲                           │                                    │
│         │ Retrieve                  │ Store                              │
│         │                           ▼                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                   LONG-TERM STORAGE                              │   │
│  │                                                                  │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐            │   │
│  │  │   Semantic   │ │   Episodic   │ │  Procedural  │            │   │
│  │  │    Memory    │ │    Memory    │ │   Memory     │            │   │
│  │  │              │ │              │ │              │            │   │
│  │  │  Facts about │ │    Past      │ │   How-to     │            │   │
│  │  │  user, world │ │ conversations│ │  knowledge   │            │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘            │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
Python
class IntegratedMemorySystem:
    """
    A complete memory system integrating all memory types.
    """

    def __init__(
        self,
        llm,
        embedding_model,
        vector_store,
        max_context_tokens: int = 8000
    ):
        self.llm = llm
        self.max_context_tokens = max_context_tokens

        # Initialize all memory types
        self.short_term = ShortTermMemory(max_tokens=max_context_tokens)
        self.working = WorkingMemory()
        self.long_term = LongTermMemory(vector_store, embedding_model)
        self.episodic = EpisodicMemory(vector_store, llm)
        self.semantic = SemanticMemory(vector_store, embedding_model, llm)
        self.procedural = ProceduralMemory(vector_store)

    def prepare_context(self, user_message: str, user_id: str = None) -> list[dict]:
        """
        Prepare the full context for an LLM call.

        This is where memory integration happens:
        1. Start with system prompt
        2. Add working memory state
        3. Retrieve relevant long-term memories
        4. Add recent conversation
        5. Add the new user message

        All while respecting token limits.
        """
        context = []
        token_budget = self.max_context_tokens

        # 1. System prompt (always included)
        system_prompt = self._build_system_prompt()
        context.append({"role": "system", "content": system_prompt})
        token_budget -= self._count_tokens(system_prompt)

        # 2. Working memory (high priority)
        working_content = self.working.to_prompt()
        if working_content:
            context.append({"role": "system", "content": f"Current state:\n{working_content}"})
            token_budget -= self._count_tokens(working_content)

        # 3. Retrieve relevant long-term memories
        retrieved = self._retrieve_relevant_memories(user_message, user_id)
        retrieved_content = self._format_retrieved_memories(retrieved)
        if retrieved_content and token_budget > 1000:
            # Deduct what the retrieved context actually costs from the budget
            retrieved_tokens = self._count_tokens(retrieved_content)
            context.append({"role": "system", "content": f"Relevant context:\n{retrieved_content}"})
            token_budget -= retrieved_tokens

        # 4. Recent conversation history (keep the most recent messages that fit)
        recent_messages = self.short_term.get_context()
        kept = []
        for msg in reversed(recent_messages):
            msg_tokens = self._count_tokens(msg["content"])
            if token_budget > msg_tokens + 500:  # Reserve for new message
                kept.append(msg)
                token_budget -= msg_tokens
            else:
                break
        context.extend(reversed(kept))

        # 5. New user message
        context.append({"role": "user", "content": user_message})

        return context

    def _retrieve_relevant_memories(self, query: str, user_id: str = None) -> dict:
        """Retrieve from all long-term memory types."""
        return {
            "semantic": self.semantic.query_facts(semantic_query=query)[:3],
            "episodic": self.episodic.recall_similar_episodes(query, k=2, user_id=user_id),
            "procedural": self.procedural.get_relevant_procedures(query)[:2]
        }

    def _format_retrieved_memories(self, retrieved: dict) -> str:
        """Format retrieved memories for inclusion in context."""
        sections = []

        if retrieved["semantic"]:
            facts = "\n".join(f"- {f['fact_text']}" for f in retrieved["semantic"])
            sections.append(f"Known facts:\n{facts}")

        if retrieved["episodic"]:
            episodes = "\n".join(f"- {e['summary']}" for e in retrieved["episodic"])
            sections.append(f"Relevant past interactions:\n{episodes}")

        if retrieved["procedural"]:
            procs = "\n".join(
                self.procedural.format_procedure_for_prompt(p)
                for p in retrieved["procedural"]
            )
            sections.append(f"Applicable procedures:\n{procs}")

        return "\n\n".join(sections)

    def process_response(
        self,
        user_message: str,
        assistant_response: str,
        user_id: str = None
    ) -> None:
        """
        Process a completed interaction for memory storage.

        This happens after the response is generated:
        1. Add to short-term memory
        2. Extract facts for semantic memory
        3. Update working memory
        """
        # Update short-term memory
        self.short_term.add("user", user_message)
        self.short_term.add("assistant", assistant_response)

        # Extract semantic facts (background processing)
        conversation = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_response}
        ]
        self.semantic.extract_facts_from_conversation(conversation)

    def end_session(
        self,
        user_id: str = None,
        task: str = None,
        outcome: str = None
    ) -> None:
        """
        End the current session and consolidate memories.

        This is when episodic memories are formed.
        """
        # Create episode from session
        conversation = self.short_term.get_context()
        if conversation:
            self.episodic.create_episode(
                conversation=conversation,
                task=task,
                outcome=outcome,
                user_id=user_id
            )

        # Clear short-term and working memory
        self.short_term = ShortTermMemory(max_tokens=self.max_context_tokens)
        self.working = WorkingMemory()

    def _count_tokens(self, text: str) -> int:
        return len(text) // 4  # Simplified

    def _build_system_prompt(self) -> str:
        return "You are a helpful assistant with access to memory of past interactions."

Summary: Memory Types at a Glance

Memory Type | Duration | Capacity | Access | LLM Implementation
Short-term | Seconds to minutes | Very limited (~128K tokens max) | Instant | Context window
Working | Current task | Limited (part of context) | Instant | Scratchpad in prompt
Long-term | Permanent | Unlimited | Requires retrieval | Vector DB / external storage
Episodic | Permanent | Unlimited | Search by similarity/time | Processed conversation logs
Semantic | Permanent | Unlimited | Query by entity/fact | Knowledge base / facts DB
Procedural | Permanent | Model-limited | Implicit or via examples | Training / few-shot

Key principles:

  1. Short-term memory is the bottleneck—all other memory types exist to work around its limits
  2. Retrieval quality matters more than storage—storing everything is easy; finding the right thing is hard
  3. Memory types serve different purposes—don't try to use one type for everything
  4. Forgetting is a feature—without it, retrieval degrades over time
  5. Integration is complex—balancing multiple memory sources in limited context requires careful design

Memory Architecture Patterns

The Operating System Analogy

MemGPT introduced a powerful analogy: treat LLM memory like an operating system:

From research: "MemGPT (Memory-GPT) is a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow."

Code
Traditional Computer          →    LLM Agent
─────────────────────────────────────────────
RAM (fast, limited)           →    Context Window
Disk (slow, unlimited)        →    External Storage
OS Memory Manager             →    MemGPT Agent

Hierarchical Memory

From MemGPT research: "MemGPT treats context windows as a constrained memory resource and implements a memory hierarchy similar to operating systems. Agents can move data between in-context core memory (analogous to RAM) and externally stored archival and recall memory (analogous to disk storage), creating an illusion of unlimited memory while working within fixed context limits."

Memory Tiers:

Tier | Analogy | Characteristics
Core Memory | RAM | In-context, immediately accessible, size-limited
Recall Memory | Recent files | Searchable conversation history
Archival Memory | Disk storage | Long-term, vector-indexed, unlimited

MemGPT Architecture

The Original Design

Why MemGPT represents a paradigm shift: Before MemGPT, LLM memory meant stuffing conversation history into the context window until it filled up, then either truncating or summarizing. MemGPT treats this as a systems problem: context windows are limited resources that need active management, just like RAM in an operating system. This reframing opens up new possibilities—rather than passively accepting context limits, the agent actively manages what information is "in memory" at any given moment.

The OS analogy is more than metaphor: Operating systems face the same fundamental challenge: programs need more memory than physically exists. The solution is virtual memory—create the illusion of unlimited memory by intelligently moving data between fast RAM and slow disk. MemGPT does the same for LLMs: the agent "sees" a small context window, but can access vast amounts of information by explicitly retrieving it from external storage. The key insight is that the LLM itself can manage these memory operations.

From research: "At its core, MemGPT features a hierarchical memory architecture closely mirroring that of a traditional OS: Primary Context (RAM) — The fixed-size prompt that the LLM can directly 'see' during inference. It consists of three partitions: static system prompt containing base instructions and function schemas; dynamic working context serving as a scratchpad for reasoning steps and intermediate results; and FIFO message buffer holding the most recent conversational turns. External Context (Disk Storage) — An effectively infinite, out-of-context layer inaccessible to the model without explicit retrieval. It includes Recall Storage (a searchable database containing the full historical record of interactions) and Archival Storage (a long-term, vector-based memory for large documents retrievable via semantic search)."

Self-Directed Memory Management

The key innovation: the LLM manages its own memory:

From research: "What makes MemGPT particularly innovative is its use of the LLM itself as the memory manager. Through self-directed memory editing via tool calling, the system can actively manage its own memory contents, deciding what to store, what to summarize, and what to forget."

Memory Tools:

Python
# Core memory editing
memory_replace(section, old_content, new_content)  # Update memory block
memory_insert(section, content)                     # Add to memory
memory_rethink(section, new_content)               # Revise understanding

# Archival memory
archival_memory_insert(content)                    # Store for long-term
archival_memory_search(query, k=10)                # Semantic retrieval

# Conversation search
conversation_search(query)                          # Search past messages
conversation_search_date(start_date, end_date)     # Temporal search

Understanding each memory tool:

memory_replace and memory_rethink modify the agent's core beliefs about the user or situation. When a user says "Actually, my name is Alex, not Alexander," the agent calls memory_replace to update its stored name. memory_rethink is for deeper revisions—reconsidering an understanding based on new evidence, like updating a user profile when their preferences have clearly changed.

archival_memory_insert and archival_memory_search handle long-term storage. When the agent encounters information worth preserving indefinitely—a user's detailed project requirements, important facts from a long document—it inserts into archival memory. Later, when that information might be relevant, the agent searches archival memory semantically. This is essentially a vector database that the agent controls.

conversation_search and conversation_search_date let the agent recall past conversations. "What did we discuss about the budget last week?" triggers a search through conversation history, returning relevant messages that the agent can then use to inform its response.

The agent decides when to use these tools: This is the crucial difference from simpler memory systems where an external process manages memory. The LLM itself, during generation, decides "I need to remember this" or "I should look up what we discussed before." This self-direction enables more intelligent memory management but also introduces failure modes—the agent might forget to save important information or search for relevant context.

Multi-Step Reasoning with Heartbeats

From Letta: "MemGPT supports multi-step reasoning (allowing the agent to take multiple steps in sequence) via the concept of 'heartbeats'. Whenever the LLM outputs a tool call, it has the option to request a heartbeat by setting the keyword argument request_heartbeat to true. If the LLM requests a heartbeat, the LLM OS continues execution in a loop, allowing the LLM to 'think' again."

Letta: The MemGPT Framework

From Research to Production

From Letta: "As of September 2024, MemGPT is part of Letta. While MemGPT refers to the agent design pattern with two tiers of memory introduced in the research paper, Letta is an open-source agent framework that helps developers build persistent agents."

Letta V1 Architecture

From Letta: "At Letta, they're transitioning from the previous MemGPT-style architecture to a new Letta V1 architecture (letta_v1_agent) that follows modern patterns. In this architecture, heartbeats and the send_message tool are deprecated. Only native reasoning and direct assistant message generations from the models are supported."

Recommended for: GPT-5, Claude 4.5 Sonnet, and other advanced reasoning models.

Building with Letta

Python
from letta import create_client, LLMConfig, EmbeddingConfig

# Create Letta client
client = create_client()

# Configure memory
agent = client.create_agent(
    name="personal_assistant",
    llm_config=LLMConfig(model="gpt-4o"),
    embedding_config=EmbeddingConfig(model="text-embedding-3-small"),
    system="You are a personal assistant with long-term memory.",
    memory_blocks=[
        {"label": "human", "value": "Name: Unknown\nPreferences: Unknown"},
        {"label": "persona", "value": "I am a helpful assistant that remembers past conversations."}
    ]
)

# Interact with agent
response = client.send_message(
    agent_id=agent.id,
    message="My name is Alex and I prefer concise answers."
)

# Agent updates its memory automatically
# Next conversation will remember this preference

What happens under the hood: When the agent receives "My name is Alex and I prefer concise answers," it processes this through its system prompt which instructs it to update its memory blocks when learning new user information. The agent calls internal tools to modify the "human" memory block from "Name: Unknown" to "Name: Alex" and adds "Preferences: Concise answers." These updated blocks become part of the context for all future interactions with this agent.

Persistence is the key feature: The agent.id represents a persistent agent identity. All memory—conversation history, memory blocks, archival storage—is associated with this ID. When you call send_message again with the same agent_id, the agent has full access to everything it learned in previous conversations. This enables truly continuous relationships between users and AI agents.
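
For example, a later session that reuses the same agent_id can build on what was learned earlier (continuing the snippet above):

Python
# Days later, in a new session: same agent_id, same memory
response = client.send_message(
    agent_id=agent.id,
    message="What do you remember about my preferences?"
)
# The "human" memory block now records "Name: Alex" and the preference for
# concise answers, so the agent can respond without being reminded.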

Memory blocks vs. conversation history: Memory blocks are structured, edited information about the user ("Name: Alex, Preferences: Concise"). Conversation history is the raw record of messages exchanged. Both contribute to context, but they serve different purposes. Memory blocks capture distilled understanding; conversation history provides detailed evidence and context for that understanding.

Alternative Memory Approaches

LangMem (LangChain)

LangChain's approach to agent memory:

From LangMem: "Long-term memory allows agents to remember important information across conversations. LangMem provides ways to extract meaningful details from chats, store them, and use them to improve future interactions."

Hot Path vs Subconscious:

From research: "'Hot path' active memory formation happens during the conversation, enabling immediate updates when critical context emerges. This approach is easy to implement and lets the agent itself choose how to store and update its memory. However, it adds perceptible latency to user interactions."

From research: "'Subconscious' memory formation refers to prompting an LLM to reflect on a conversation after it occurs, finding patterns and extracting insights without slowing down the immediate interaction."

A-MEM (Agentic Memory)

Highly efficient memory system:

From research: "A-MEM achieves an 85-93% reduction in token usage compared to baseline methods (LoCoMo and MemGPT with 16,900 tokens) through selective top-k retrieval mechanism. This substantial token reduction directly translates to lower operational costs, with each memory operation costing less than $0.0003 when using commercial API services."

Zep

From research: "Zep: A Temporal Knowledge Graph Architecture for Agent Memory (February 2025)"—focuses on temporal relationships between memories.

Implementation Patterns

These patterns represent progressively more sophisticated approaches to LLM memory. Start with the simplest pattern that meets your needs—complexity adds engineering burden without always adding user value.

Pattern 1: Conversation Buffer with Summary

Simplest approach—summarize when context gets too long.

When to use this pattern: This is your starting point for any conversational AI. It handles the most common need—maintaining conversation context without exceeding token limits—with minimal complexity. Use this when conversations don't need to persist across sessions and when you don't need to remember specific facts long-term.

The compression strategy matters: The code compresses by keeping the 10 most recent messages and summarizing everything older. This preserves immediate context while maintaining awareness of earlier discussion. The summarization prompt should focus on preserving important facts, decisions made, and user preferences—not on maintaining narrative flow.

Python
class ConversationMemory:
    def __init__(self, max_tokens=4000, summarizer=None):
        self.messages = []
        self.summary = ""
        self.max_tokens = max_tokens
        self.summarizer = summarizer or default_summarizer

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

        # Check if we need to summarize
        if self.count_tokens() > self.max_tokens:
            self._compress()

    def _compress(self):
        # Keep recent messages, summarize older ones
        recent = self.messages[-10:]
        old = self.messages[:-10]

        if old:
            new_summary = self.summarizer(self.summary, old)
            self.summary = new_summary
            self.messages = recent

    def get_context(self):
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"Previous conversation summary: {self.summary}"})
        context.extend(self.messages)
        return context

Key implementation details:

The count_tokens() method (not shown) should use the tokenizer for your target model. Token counts vary between models—a 4000-token limit for GPT-4 is different from 4000 tokens for Claude. Use tiktoken for OpenAI models or the appropriate tokenizer for others.
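
A minimal count_tokens sketch using tiktoken (assumes an OpenAI-family model; other providers ship their own tokenizers):

Python
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Approximate token count for a list of chat messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a general-purpose encoding for unknown model names
        encoding = tiktoken.get_encoding("cl100k_base")

    total = 0
    for message in messages:
        # Small per-message overhead approximates role/formatting tokens
        total += 4 + len(encoding.encode(message["content"]))
    return total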

The _compress() method is called when limits are exceeded, not proactively. This lazy compression means most interactions don't pay the summarization cost. The trade-off: if a conversation suddenly exceeds limits by a large margin, the summarization might be slow. For latency-sensitive applications, consider proactive compression when approaching limits.

The summary is prepended as a system message. This gives it "authoritative" status in the conversation—the LLM treats it as background knowledge rather than something it said itself. Alternative approaches include prepending as an assistant message or including in the system prompt.

Pattern 2: Entity Memory

Track entities mentioned across conversations.

When to use this pattern: When your agent needs to remember specific things about users, projects, or topics across multiple sessions. A customer service agent remembering past issues, a personal assistant remembering project details, or a tutor remembering student progress all benefit from entity memory.

Python
class EntityMemory:
    def __init__(self, db):
        self.db = db  # Vector database

    def extract_and_store(self, conversation: list[dict]):
        # Extract entities with LLM
        entities = self.extract_entities(conversation)

        for entity in entities:
            # Check if entity exists
            existing = self.db.search(entity.name, k=1)

            if existing and existing[0].score > 0.9:
                # Update existing entity
                self.merge_entity(existing[0], entity)
            else:
                # Create new entity
                self.db.insert(entity)

    def get_relevant_entities(self, query: str, k=5):
        return self.db.search(query, k=k)

Pattern 3: Semantic Memory with Forgetting

More sophisticated—mimics human memory:

Python
import math
from datetime import datetime


class SemanticMemory:
    def __init__(self, db, decay_rate=0.1):
        self.db = db
        self.decay_rate = decay_rate

    def remember(self, content: str, importance: float = 0.5):
        embedding = self.embed(content)
        self.db.insert({
            "content": content,
            "embedding": embedding,
            "importance": importance,
            "last_accessed": datetime.now(),
            "access_count": 1
        })

    def recall(self, query: str, k=10):
        candidates = self.db.search(query, k=k*2)

        # Score by relevance + recency + importance
        scored = []
        now = datetime.now()
        for c in candidates:
            age = (now - c.last_accessed).days
            recency_score = math.exp(-self.decay_rate * age)
            final_score = c.similarity * c.importance * recency_score
            scored.append((c, final_score))

            # Update access time
            c.last_accessed = now
            c.access_count += 1

        scored.sort(key=lambda x: x[1], reverse=True)
        return [c for c, _ in scored[:k]]

    def forget(self, threshold=0.1):
        """Remove low-value memories"""
        all_memories = self.db.get_all()
        now = datetime.now()

        for m in all_memories:
            age = (now - m.last_accessed).days
            value = m.importance * math.exp(-self.decay_rate * age)
            if value < threshold:
                self.db.delete(m.id)

Pattern 4: Episodic Memory

Store complete episodes/sessions:

Python
class EpisodicMemory:
    def __init__(self, db):
        self.db = db

    def end_episode(self, conversation: list[dict], metadata: dict):
        # Summarize episode
        summary = self.summarize(conversation)

        # Extract key moments
        key_moments = self.extract_key_moments(conversation)

        # Store episode
        episode = {
            "summary": summary,
            "key_moments": key_moments,
            "full_conversation": conversation,
            "metadata": metadata,
            "embedding": self.embed(summary),
            "timestamp": datetime.now()
        }
        self.db.insert(episode)

    def recall_similar_episodes(self, query: str, k=3):
        return self.db.search(query, k=k)

    def recall_by_time(self, start: datetime, end: datetime):
        return self.db.query({"timestamp": {"$gte": start, "$lte": end}})

Memory Formation Strategies

Active (Hot Path)

Form memories during conversation:

Pros:

  • Immediate availability
  • Agent-controlled
  • Natural flow

Cons:

  • Adds latency
  • Uses tokens during interaction
Python
# In agent loop
response = llm.generate(messages)

# Check for memory updates
if should_update_memory(response):
    memory_update = llm.generate([
        {"role": "system", "content": "Extract key facts to remember"},
        {"role": "user", "content": str(messages[-5:])}
    ])
    memory.store(memory_update)

Passive (Background)

Form memories after conversation ends:

Pros:

  • No user-facing latency
  • More reflection time
  • Batch processing

Cons:

  • Delayed availability
  • Separate processing pipeline
Python
# After conversation ends
async def process_conversation(conversation_id: str):
    conversation = db.get_conversation(conversation_id)

    # Extract memories in background
    memories = await extract_memories(conversation)

    # Store for future sessions
    for memory in memories:
        memory_store.insert(memory)

Production Considerations

Storage Options

Option | Best For | Considerations
Vector DB (Pinecone, Qdrant) | Semantic search | Cost at scale
PostgreSQL + pgvector | Integrated solution | Self-hosted complexity
Redis | Session memory | Persistence config
SQLite | Local/edge | Limited concurrency

Memory Retrieval Latency

Memory retrieval adds latency to every request:

Python
# Measure and optimize
import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def get_relevant_context(query: str):
    start = time.time()

    # Parallel retrieval
    results = await asyncio.gather(
        entity_memory.search(query),
        episodic_memory.search(query),
        recent_messages.get()
    )

    latency = time.time() - start
    logger.info(f"Memory retrieval: {latency:.3f}s")

    return merge_results(results)

Privacy and Data Retention

  • User consent for memory storage
  • Data retention policies
  • Right to be forgotten (memory deletion; a sketch follows below)
  • Encryption at rest
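
Supporting "right to be forgotten" is straightforward if every stored memory is tagged with the user it belongs to. A minimal sketch, assuming the storage backend exposes query and delete methods like the ones used earlier:

Python
def forget_user(storage, user_id: str) -> int:
    """Delete every memory associated with a user; returns how many were removed."""
    # Assumes each memory record carries a user_id field
    memories = storage.query({"user_id": user_id})
    for memory in memories:
        storage.delete(memory["id"])
    return len(memories)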

Conclusion

Effective memory transforms LLMs from stateless responders to persistent agents:

  1. MemGPT/Letta pioneered hierarchical memory with self-management
  2. Multiple memory types (core, archival, episodic) serve different needs
  3. Hot path vs background processing trades latency for immediacy
  4. Efficient retrieval (A-MEM's 85-93% token reduction) enables scale

Start simple (conversation buffer + summary), add complexity (entity memory, episodic) as your use case demands.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
