
Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach

A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.


Introduction

This post walks through building a semantic search system for your LLM conversation history. We'll combine several techniques—hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns—into a complete, working system.

Prerequisites: This is an intermediate-to-advanced post. You should be familiar with:

  • Basic RAG concepts (embeddings, vector stores, retrieval). If not, start with Building Production-Ready RAG Systems.
  • How LLM agents work (tool use, reasoning loops). See Building Agentic AI Systems for background.

What you'll learn:

  • Why standard RAG fails for long conversations
  • Hierarchical chunking with summaries + sliding windows
  • HyDE (Hypothetical Document Embedding) for better retrieval
  • Intent classification for query routing
  • Knowledge graphs for relationship-aware search
  • Agentic multi-step research patterns

Tech stack: Python, ChromaDB, Claude API, NetworkX, FastAPI


The Conversation Graveyard Problem

If you're a heavy user of ChatGPT or Claude, you've likely experienced this frustration: you had a brilliant conversation weeks ago—maybe you debugged a complex issue, made an important decision, or learned something valuable—but now you can't find it.

The built-in search only matches exact keywords. You remember the concept but not the exact words you used. And even if you find the right conversation, it's 50 messages long and you have no idea where the relevant part is.

This is the conversation graveyard problem: valuable knowledge buried in chat history, effectively lost because traditional search can't surface it.

Why Standard RAG Fails for Conversations

You might think: "Just embed the conversations and do semantic search!" But naive RAG has a critical flaw with long conversations:

Embedding dilution. When you embed a 50-message conversation as a single chunk, the embedding becomes a murky average of everything discussed. A conversation that covers database design, authentication, deployment, and UI styling produces an embedding that strongly matches none of those topics.

Consider this scenario:

Code
Conversation: "Full Stack App Planning" (50 messages)
├── Messages 1-10:  Database schema (Users, Products, Orders)
├── Messages 11-20: Authentication with JWT
├── Messages 21-30: React vs Next.js decision
├── Messages 31-40: Stripe payment integration
└── Messages 41-50: Nginx deployment + dark mode styling

If you search for "nginx reverse proxy configuration," a naive embedding of the full conversation won't rank highly—the nginx discussion is 1/5 of the content, diluted by unrelated topics.

The Solution: Hierarchical Chunking

The key insight is that conversations need multiple levels of representation. This builds on the hierarchical chunking concepts from production RAG systems, but adapted specifically for conversational data:

| Level   | Content                  | Use Case                               |
|---------|--------------------------|----------------------------------------|
| Summary | LLM-generated overview   | "What was that conversation about X?"  |
| Windows | Sliding message windows  | "Show me the exact discussion of Y"    |

This dual-level indexing lets you search by high-level topic (summary) OR find specific details (windows).

Implementation: Dual-Level Indexing

Python
from pydantic import BaseModel
from typing import List

class Message(BaseModel):
    role: str     # "user" or "assistant"
    content: str

class ConversationChunk(BaseModel):
    id: str
    parent_id: str
    type: str  # "summary" or "window"
    text: str
    metadata: dict

def create_sliding_windows(
    messages: List[Message],
    window_size: int = 10,
    overlap: int = 2
) -> List[dict]:
    """
    Creates overlapping windows of messages.
    Window size of 10 with overlap of 2 means:
    - Window 1: messages 0-9
    - Window 2: messages 8-17
    - Window 3: messages 16-25
    ...and so on
    """
    windows = []
    step = window_size - overlap

    for i in range(0, len(messages), step):
        window_msgs = messages[i : i + window_size]
        window_text = "\n".join(
            f"{msg.role}: {msg.content}"
            for msg in window_msgs
        )

        windows.append({
            "text": window_text,
            "start_index": i,
            "end_index": i + len(window_msgs) - 1
        })

        if i + window_size >= len(messages):
            break

    return windows

The overlap is crucial—without it, information at window boundaries gets lost. A question spanning messages 9-11 would be split across two windows. With 2-message overlap, both windows contain the context.
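
To make the boundary behavior concrete, here's a quick sanity check over a hypothetical 25-message conversation (a sketch, assuming the Message model and create_sliding_windows defined above):

Python
# Hypothetical 25-message conversation, just to inspect window boundaries
messages = [Message(role="user", content=f"message {i}") for i in range(25)]

for w in create_sliding_windows(messages, window_size=10, overlap=2):
    print(w["start_index"], "-", w["end_index"])

# Prints: 0 - 9, 8 - 17, 16 - 24
# Messages 8-9 and 16-17 appear in two windows each, so a question
# spanning a boundary is always fully contained in at least one window.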

Generating Summaries

For conversations longer than 5 messages, generate a summary using an LLM:

Python
from anthropic import Anthropic

def get_conversation_summary(text: str) -> str:
    """
    Uses Claude to summarize the conversation for retrieval.
    (See "Graceful Degradation" below for a no-API-key fallback.)
    """
    client = Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=300,
        temperature=0,
        system="""You are a helpful assistant that summarizes
        conversations for retrieval purposes. Capture the main
        topics, decisions, and key entities.""",
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation:\n\n{text[:100000]}"
        }]
    )

    return response.content[0].text

The summary chunk captures the semantic essence of the conversation, while windows capture specific details. When indexed together, both levels are searchable.

Putting It Together: The Ingestion Pipeline

Python
def ingest_conversation(conversation: dict) -> List[ConversationChunk]:
    """
    Creates hierarchical chunks for a conversation.
    """
    chunks = []
    conv_id = conversation['id']
    title = conversation['title']
    messages = conversation['messages']

    # Full text for summary
    full_text = f"Title: {title}\n" + "\n".join(
        f"{msg['role']}: {msg['content']}"
        for msg in messages
    )

    # 1. Summary Chunk (for conversations > 5 messages)
    if len(messages) > 5:
        summary = get_conversation_summary(full_text)
        chunks.append(ConversationChunk(
            id=f"{conv_id}_summary",
            parent_id=conv_id,
            type="summary",
            text=f"Summary of '{title}':\n{summary}",
            metadata={
                "title": title,
                "type": "summary"
            }
        ))

    # 2. Sliding Window Chunks
    # Export messages are dicts; convert to Message objects for the window helper
    windows = create_sliding_windows([Message(**m) for m in messages])
    for i, window in enumerate(windows):
        chunks.append(ConversationChunk(
            id=f"{conv_id}_window_{i}",
            parent_id=conv_id,
            type="window",
            text=f"Excerpt from '{title}' (msgs {window['start_index']}-{window['end_index']}):\n{window['text']}",
            metadata={
                "title": title,
                "type": "window",
                "start_index": window['start_index'],
                "end_index": window['end_index']
            }
        ))

    return chunks

The metadata is essential—it lets the UI highlight exactly which messages matched, not just which conversation.
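
For completeness, here's a minimal sketch of writing these chunks into ChromaDB, the vector store used throughout this post. It relies on ChromaDB's default embedding function; the collection name and the conversation variable are illustrative:

Python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("conversations")

chunks = ingest_conversation(conversation)  # a conversation dict in the export format above

# ChromaDB embeds the documents with its default embedding function;
# parent_id is stored in the metadata so results can be grouped later.
collection.add(
    ids=[c.id for c in chunks],
    documents=[c.text for c in chunks],
    metadatas=[{**c.metadata, "parent_id": c.parent_id} for c in chunks],
)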

Advanced Retrieval with HyDE

Basic semantic search has a fundamental problem: the query-document mismatch. A user searching "How do I configure nginx?" expects to find an answer, but the embedding of a question may not be close to the embedding of an answer in vector space.

Hypothetical Document Embedding (HyDE) solves this by generating a hypothetical answer and searching with that instead.

Python
class AdvancedRetriever:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store
        self.client = Anthropic()

    def expand_query(self, query: str) -> List[str]:
        """
        Generates synonyms and alternative phrasings.
        """
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            temperature=0.7,
            system="""Generate 3-5 alternative search queries that
            capture the same intent but use different keywords.
            Return ONLY the queries, one per line.""",
            messages=[{
                "role": "user",
                "content": f"Generate variations for: {query}"
            }]
        )

        variations = [
            line.strip()
            for line in response.content[0].text.split('\n')
            if line.strip()
        ]
        return [query] + variations

    def generate_hyde_document(self, query: str) -> str:
        """
        Generates a hypothetical document that WOULD answer the query.
        We then search using this document's embedding.
        """
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=300,
            temperature=0.7,
            system="""Write a hypothetical conversation snippet
            or answer that would satisfy the user's search query.
            Make it look like a real conversation or technical
            explanation.""",
            messages=[{
                "role": "user",
                "content": f"Write a hypothetical passage answering: {query}"
            }]
        )

        return response.content[0].text

    def search(self, query: str, n_results: int = 10) -> List[dict]:
        """
        Search using both the original query and HyDE document.
        """
        # Generate hypothetical document
        hyde_doc = self.generate_hyde_document(query)

        # Search with both
        results_query = self.vector_store.search(query, n_results)
        results_hyde = self.vector_store.search(hyde_doc, n_results)

        # Deduplicate and merge
        seen_ids = set()
        merged = []

        for results in [results_query, results_hyde]:
            for res in results:
                if res['id'] not in seen_ids:
                    merged.append(res)
                    seen_ids.add(res['id'])

        # Re-rank by distance
        merged.sort(key=lambda x: x['distance'])

        return merged[:n_results]

How HyDE improves retrieval:

| Query                      | Without HyDE                   | With HyDE                     |
|----------------------------|--------------------------------|-------------------------------|
| "nginx configuration"      | Matches questions about nginx  | Matches actual nginx configs  |
| "Why did I choose React?"  | Matches React mentions         | Matches decision discussions  |
| "database schema design"   | Matches schema keywords        | Matches schema explanations   |

The hypothetical document moves the query into the same embedding space as the documents that would answer it.
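
Usage looks roughly like this. The vector_store here is assumed to be a thin wrapper over the ChromaDB collection that returns dicts with id, document, metadata, and distance—the interface the retriever above expects:

Python
# Sketch: search with HyDE, then inspect the top matches
retriever = AdvancedRetriever(vector_store)

for hit in retriever.search("How did I set up the nginx reverse proxy?", n_results=5):
    print(f"{hit['distance']:.3f}  {hit['metadata']['title']}")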

Intent Classification for Better Responses

Not all queries need the same handling. Someone asking "find that chat about databases" wants different output than "how do I configure Stripe webhooks?"

Python
from typing import Literal

IntentType = Literal["RETRIEVAL", "QA", "DECISION", "CODE_LOOKUP", "SYNTHESIS"]

class IntentClassifier:
    def __init__(self):
        self.client = Anthropic()

    def classify(self, query: str) -> IntentType:
        """
        Classifies query intent:
        - RETRIEVAL: Locate a conversation
        - QA: Answer a question
        - DECISION: Explain past reasoning
        - CODE_LOOKUP: Find specific code
        - SYNTHESIS: Summarize/compare across conversations
        """
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=50,
            temperature=0,
            system="""Classify user queries into exactly ONE category.

Categories:
- RETRIEVAL: Finding a conversation ("find that chat about...")
- QA: Asking a question ("how do I...", "what is...")
- DECISION: Asking about decisions ("why did I choose...")
- CODE_LOOKUP: Looking for code ("show me the code for...")
- SYNTHESIS: Summary/comparison ("summarize what I learned...")

Respond with ONLY the category name.""",
            messages=[{
                "role": "user",
                "content": f"Classify this query:\n\n{query}"
            }]
        )

        return response.content[0].text.strip()

Each intent type gets a specialized system prompt for answer generation:

Python
def get_system_prompt(intent: IntentType) -> str:
    prompts = {
        "CODE_LOOKUP": """You are helping retrieve code from past
        conversations. Extract and present code clearly with syntax
        highlighting. Show all relevant examples.""",

        "DECISION": """You are helping recall past decisions and
        reasoning. Focus on explaining WHY choices were made.
        Quote exact reasoning when possible.""",

        "SYNTHESIS": """You are synthesizing knowledge from multiple
        conversations. Create summaries, comparisons, or overviews.
        Use tables or bullet points for clarity.""",

        "QA": """You answer questions based on past conversations.
        Be concise and direct. Quote relevant parts when appropriate.
        If the context doesn't contain the answer, say so clearly.""",

        "RETRIEVAL": """You help locate relevant conversations.
        Summarize what was discussed and highlight the most
        relevant sections."""
    }

    return prompts.get(intent, prompts["QA"])
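
To see how the classifier and the prompts fit together, here's a minimal wiring sketch. The answer_query helper and its prompt layout are illustrative, not a fixed API:

Python
def answer_query(
    query: str,
    retriever: AdvancedRetriever,
    classifier: IntentClassifier,
    client: Anthropic
) -> str:
    """Illustrative routing: classify, retrieve, then answer with an
    intent-specific system prompt."""
    intent = classifier.classify(query)
    results = retriever.search(query, n_results=5)
    context = "\n\n".join(r["document"] for r in results)

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1000,
        temperature=0,
        system=get_system_prompt(intent),
        messages=[{
            "role": "user",
            "content": f"Context from past conversations:\n{context}\n\nQuery: {query}"
        }]
    )
    return response.content[0].text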

Knowledge Graphs for Relationship-Aware Search

Conversations aren't isolated—they reference technologies, people, and concepts that relate to each other. A knowledge graph captures these relationships. This is similar to GraphRAG approaches in agentic RAG systems, but focused on extracting structure from conversational data:

Python
import json

class GraphBuilder:
    def __init__(self, graph_store: GraphStore):
        self.graph_store = graph_store
        self.client = Anthropic()

    def process_conversation(self, text: str, conv_id: str):
        """
        Extracts entities and relationships from a conversation.
        """
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1000,
            temperature=0,
            system="""You are a Knowledge Graph Extractor.
Analyze the conversation and extract key Entities and Relationships.

Entities: Tools, Technologies, People, Decisions, Concepts, Projects.
Relationships: USES, CHOSE, REJECTED, IS_A, PART_OF, RELATED_TO.

Format as JSON:
{
  "entities": [
    {"name": "Next.js", "type": "Technology"},
    {"name": "SEO", "type": "Concept"}
  ],
  "relationships": [
    {"source": "Next.js", "target": "SEO", "relation": "IMPROVES"}
  ]
}""",
            messages=[{
                "role": "user",
                "content": f"Extract knowledge graph:\n\n{text[:50000]}"
            }]
        )

        data = json.loads(response.content[0].text)

        # Update graph
        for entity in data.get("entities", []):
            self.graph_store.add_entity(
                entity["name"],
                entity["type"],
                {"source_conv": conv_id}
            )

        for rel in data.get("relationships", []):
            self.graph_store.add_relationship(
                rel["source"],
                rel["target"],
                rel["relation"],
                {"source_conv": conv_id}
            )

Graph Storage and Traversal

The GraphStore class manages the actual graph using NetworkX:

Python
import networkx as nx
import pickle
import os
from typing import List, Dict, Any

class GraphStore:
    def __init__(self, persistence_path: str = "data/knowledge_graph.pkl"):
        self.graph = nx.MultiDiGraph()
        self.persistence_path = persistence_path
        self.load()

    def add_entity(self, name: str, entity_type: str, metadata: dict = None):
        """Adds a node (entity) to the graph."""
        if not self.graph.has_node(name):
            self.graph.add_node(name, type=entity_type, **(metadata or {}))

    def add_relationship(
        self, source: str, target: str, relation: str, metadata: dict = None
    ):
        """Adds an edge (relationship) between entities."""
        # Auto-create nodes if missing
        if not self.graph.has_node(source):
            self.graph.add_node(source, type="Unknown")
        if not self.graph.has_node(target):
            self.graph.add_node(target, type="Unknown")

        self.graph.add_edge(source, target, relation=relation, **(metadata or {}))

    def search_graph(self, query_entities: List[str]) -> List[str]:
        """
        Finds paths between entities mentioned in the query.
        Returns text descriptions of connections.
        """
        connections = []
        found = [e for e in query_entities if self.graph.has_node(e)]

        # Find paths between pairs of entities
        import itertools
        for start, end in itertools.combinations(found, 2):
            try:
                path = nx.shortest_path(self.graph, start, end)
                desc = f"Connection: {start}"
                for i in range(len(path) - 1):
                    u, v = path[i], path[i + 1]
                    edge = self.graph.get_edge_data(u, v)
                    relation = edge[0].get("relation", "related to")
                    desc += f" --[{relation}]--> {v}"
                connections.append(desc)
            except nx.NetworkXNoPath:
                continue

        # Fallback: show direct neighbors
        if not connections and found:
            for entity in found:
                for neighbor in list(self.graph.neighbors(entity))[:3]:
                    edge = self.graph.get_edge_data(entity, neighbor)
                    rel = edge[0].get("relation", "related to")
                    connections.append(f"{entity} {rel} {neighbor}")

        return connections

    def save(self):
        with open(self.persistence_path, "wb") as f:
            pickle.dump(self.graph, f)

    def load(self):
        if os.path.exists(self.persistence_path):
            with open(self.persistence_path, "rb") as f:
                self.graph = pickle.load(f)

When answering questions, the graph provides additional context:

Code
Query: "Why did I choose Next.js?"

Graph Path Found:
Next.js --[IMPROVES]--> SEO
Next.js --[REPLACES]--> React
React --[REJECTED_FOR]--> SSR limitations

Enhanced Answer: "You chose Next.js because it improves SEO through
server-side rendering. Your conversations mention rejecting plain
React due to SSR limitations..."
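
One detail the snippet above glosses over is how the query entities are found in the first place. A simple approach—an assumption here, not the only option—is to match graph node names against the query text:

Python
def find_query_entities(query: str, graph_store: GraphStore) -> List[str]:
    """Naive entity linking: keep graph nodes whose names appear in the query."""
    query_lower = query.lower()
    return [
        node for node in graph_store.graph.nodes
        if str(node).lower() in query_lower
    ]

# Usage: feed the matched entities into the path search
entities = find_query_entities("Why did I choose Next.js over React?", graph_store)
graph_context = graph_store.search_graph(entities)
# e.g. ["Connection: Next.js --[IMPROVES]--> SEO", ...]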

Agentic Deep Research

For complex queries that span multiple conversations, a simple retrieve-and-answer isn't enough. An agentic research loop breaks down complex queries into sub-queries, retrieves from multiple sources, and synthesizes a comprehensive answer. This follows the plan-execute-synthesize pattern from agentic AI systems:

Python
class DeepResearchAgent:
    def __init__(self, retriever: AdvancedRetriever):
        self.retriever = retriever
        self.client = Anthropic()

    def research(self, query: str) -> dict:
        """
        Multi-step research for complex queries.
        """
        steps = []
        all_findings = []
        all_sources = set()

        # 1. Generate Research Plan
        plan = self._generate_plan(query)
        steps.append({
            "step": "Planning",
            "details": f"Generated {len(plan)} sub-queries"
        })

        # 2. Execute Each Sub-Query
        for sub_query in plan:
            results = self.retriever.search(sub_query, n_results=5)

            findings = []
            for res in results:
                title = res['metadata']['title']
                snippet = res['document']
                findings.append(f"Source: {title}\n{snippet}")
                all_sources.add(title)

            all_findings.append(
                f"--- Findings for '{sub_query}' ---\n" +
                "\n\n".join(findings)
            )

            steps.append({
                "step": "Search",
                "details": f"'{sub_query}' → {len(results)} results"
            })

        # 3. Synthesize Report
        report = self._synthesize(query, all_findings)
        steps.append({"step": "Synthesis", "details": "Compiled report"})

        return {
            "answer": report,
            "steps": steps,
            "sources": list(all_sources)
        }

    def _generate_plan(self, query: str) -> List[str]:
        """Break complex query into searchable sub-queries."""
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=300,
            temperature=0,
            system="""You are a research planner. Break down the
            user's complex query into 2-4 distinct search queries
            needed to gather all information. Return ONLY the
            queries, one per line.""",
            messages=[{
                "role": "user",
                "content": f"Plan research for: {query}"
            }]
        )

        return [
            line.strip()
            for line in response.content[0].text.split('\n')
            if line.strip()
        ]

    def _synthesize(self, query: str, findings: List[str]) -> str:
        """Synthesize findings into a comprehensive report."""
        context = "\n\n".join(findings)

        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=2000,
            temperature=0,
            system="""You are a research assistant. Write a
            comprehensive answer based on the gathered findings.
            Structure it with headings. Cite sources.""",
            messages=[{
                "role": "user",
                "content": f"""Write a research report for: {query}

Based on these findings:
{context}"""
            }]
        )

        return response.content[0].text

Example Deep Research Query:

Code
Query: "Compare my database design decisions with my deployment strategy"

Generated Plan:
1. "database schema design decisions"
2. "deployment strategy infrastructure"
3. "database deployment configuration"

Execution:
- Search 1: Found 3 chunks about PostgreSQL schema, indexes
- Search 2: Found 4 chunks about DigitalOcean, Nginx, Docker
- Search 3: Found 2 chunks about database migrations, backups

Synthesized Report:
"# Database Design vs Deployment Strategy

## Database Decisions
Your conversations show a preference for PostgreSQL with...

## Deployment Choices
You decided on DigitalOcean for hosting because...

## Integration Points
The database deployment uses Docker containers with..."

Complete Architecture

Here's the full system architecture:

Code
┌─────────────────────────────────────────────────────┐
│                 User Interface                       │
│  ┌─────────┐  ┌─────────┐  ┌──────────────┐        │
│  │ Search  │  │   Ask   │  │   Research   │        │
│  └────┬────┘  └────┬────┘  └──────┬───────┘        │
└───────┼────────────┼───────────────┼────────────────┘
        │            │               │
        ▼            ▼               ▼
┌───────────────────────────────────────────────────────┐
│                    FastAPI Backend                    │
├───────────────────────────────────────────────────────┤
│  Intent Classifier → Routes to appropriate handler   │
├───────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────┐ │
│  │            Advanced Retriever                    │ │
│  │  • HyDE Document Generation                     │ │
│  │  • Query Expansion                              │ │
│  │  • Multi-query Search + Deduplication           │ │
│  └─────────────────────────────────────────────────┘ │
├───────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌───────────────────────────┐ │
│  │   Vector Store   │  │    Knowledge Graph        │ │
│  │   (ChromaDB)     │  │    (NetworkX)             │ │
│  │                  │  │                           │ │
│  │  • Summary Chunks│  │  • Entity Relationships  │ │
│  │  • Window Chunks │  │  • Graph Traversal       │ │
│  └──────────────────┘  └───────────────────────────┘ │
├───────────────────────────────────────────────────────┤
│  QA Engine / Deep Research Agent                     │
│  • Intent-aware prompting                            │
│  • Multi-step research loops                         │
│  • Source citation                                   │
└───────────────────────────────────────────────────────┘
        ▲
        │
┌───────┴───────────────────────────────────────────────┐
│              Ingestion Pipeline                       │
│  JSON Export → Parse → Chunk → Embed → Index         │
│               ↓                                       │
│  Hierarchical Chunking:                              │
│  • Claude Summary (if > 5 messages)                  │
│  • Sliding Windows (10 msgs, 2 overlap)              │
└───────────────────────────────────────────────────────┘
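
The FastAPI layer in this diagram can stay thin. Here's a minimal sketch of a single endpoint that routes through the components built above; the route path, request model, and module-level instances are illustrative assumptions:

Python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str

# Assumed to be wired up at startup:
# classifier = IntentClassifier(), retriever = AdvancedRetriever(vector_store),
# research_agent = DeepResearchAgent(retriever), client = Anthropic()

@app.post("/ask")
def ask(req: AskRequest):
    intent = classifier.classify(req.query)
    if intent == "SYNTHESIS":
        # Cross-conversation queries go through the multi-step research loop
        return research_agent.research(req.query)
    # Everything else uses the intent-aware QA path sketched earlier
    return {
        "intent": intent,
        "answer": answer_query(req.query, retriever, classifier, client)
    }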

Results: Finding the Needle

With hierarchical chunking, searches that previously failed now work:

| Query                       | Naive RAG          | Hierarchical RAG             |
|-----------------------------|--------------------|------------------------------|
| "nginx reverse proxy"       | No match (diluted) | ✅ Window chunk (msgs 41-50) |
| "dark mode styling"         | No match (diluted) | ✅ Window chunk (msgs 41-50) |
| "full stack app planning"   | Weak match         | ✅ Summary chunk             |
| "JWT authentication setup"  | Partial match      | ✅ Window chunk (msgs 11-20) |

The window metadata enables the UI to auto-scroll and highlight the exact section that matched—no more scanning through 50 messages.

Production Considerations

Graceful Degradation

Not everyone has an API key. The system should work (with reduced quality) without one:

Python
def get_summary(text: str) -> str:
    api_key = os.environ.get("ANTHROPIC_API_KEY")

    if not api_key:
        # Fallback: extract first and last paragraphs
        lines = text.split('\n')
        return '\n'.join(lines[:3] + ['...'] + lines[-3:])

    # Use Claude for high-quality summary
    ...

Local-First Storage

ChromaDB provides persistent local storage—no cloud dependency:

Python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="conversations",
    metadata={"hnsw:space": "cosine"}
)
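
Querying the same collection is just as simple—ChromaDB embeds the query text with the collection's embedding function. The optional where filter below restricts results to window chunks:

Python
results = collection.query(
    query_texts=["nginx reverse proxy configuration"],
    n_results=5,
    where={"type": "window"},  # optional metadata filter
)

for doc_id, meta in zip(results["ids"][0], results["metadatas"][0]):
    print(doc_id, meta["title"])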

Deduplication

Multiple windows from the same conversation might match. Deduplicate by parent_id to show one result per conversation:

Python
def deduplicate_by_conversation(results: List[dict]) -> List[dict]:
    seen_parents = set()
    deduplicated = []

    for res in results:
        parent_id = res['metadata']['parent_id']
        if parent_id not in seen_parents:
            deduplicated.append(res)
            seen_parents.add(parent_id)

    return deduplicated

Key Takeaways

  1. Hierarchical chunking solves embedding dilution. Long conversations need both summaries (for topic search) and windows (for detail search).

  2. HyDE bridges the query-document gap. Generating hypothetical answers aligns queries with the documents that would answer them.

  3. Intent classification enables specialized handling. Different query types need different retrieval strategies and response formats.

  4. Knowledge graphs add relationship context. Entities and relationships extracted from conversations enable richer answers.

  5. Agentic patterns enable complex research. Multi-step planning and synthesis handle queries that span multiple conversations.

  6. Metadata enables precise highlighting. Window indices let the UI show exactly where the match occurred.

The conversation graveyard problem is solvable. With the right architecture, your past conversations become a searchable knowledge base—not a cemetery of lost insights. For more on building persistent memory into LLM applications, see LLM Memory Systems.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
