Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach
A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.
Introduction
This post walks through building a semantic search system for your LLM conversation history. We'll combine several techniques—hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns—into a complete, working system.
Prerequisites: This is an intermediate-to-advanced post. You should be familiar with:
- Basic RAG concepts (embeddings, vector stores, retrieval). If not, start with Building Production-Ready RAG Systems.
- How LLM agents work (tool use, reasoning loops). See Building Agentic AI Systems for background.
What you'll learn:
- Why standard RAG fails for long conversations
- Hierarchical chunking with summaries + sliding windows
- HyDE (Hypothetical Document Embedding) for better retrieval
- Intent classification for query routing
- Knowledge graphs for relationship-aware search
- Agentic multi-step research patterns
Tech stack: Python, ChromaDB, Claude API, NetworkX, FastAPI
The Conversation Graveyard Problem
If you're a heavy user of ChatGPT or Claude, you've likely experienced this frustration: you had a brilliant conversation weeks ago—maybe you debugged a complex issue, made an important decision, or learned something valuable—but now you can't find it.
The built-in search only matches exact keywords. You remember the concept but not the exact words you used. And even if you find the right conversation, it's 50 messages long and you have no idea where the relevant part is.
This is the conversation graveyard problem: valuable knowledge buried in chat history, effectively lost because traditional search can't surface it.
Why Standard RAG Fails for Conversations
You might think: "Just embed the conversations and do semantic search!" But naive RAG has a critical flaw with long conversations:
Embedding dilution. When you embed a 50-message conversation as a single chunk, the embedding becomes a murky average of everything discussed. A conversation that covers database design, authentication, deployment, and UI styling produces an embedding that strongly matches none of those topics.
Consider this scenario:
Conversation: "Full Stack App Planning" (50 messages)
├── Messages 1-10: Database schema (Users, Products, Orders)
├── Messages 11-20: Authentication with JWT
├── Messages 21-30: React vs Next.js decision
├── Messages 31-40: Stripe payment integration
└── Messages 41-50: Nginx deployment + dark mode styling
If you search for "nginx reverse proxy configuration," a naive embedding of the full conversation won't rank highly—the nginx discussion makes up less than a fifth of the content, diluted by unrelated topics.
The Solution: Hierarchical Chunking
The key insight is that conversations need multiple levels of representation. This builds on the hierarchical chunking concepts from production RAG systems, adapted specifically for conversational data:
| Level | Content | Use Case |
|---|---|---|
| Summary | LLM-generated overview | "What was that conversation about X?" |
| Windows | Sliding message windows | "Show me the exact discussion of Y" |
This dual-level indexing lets you search by high-level topic (summary) OR find specific details (windows).
Implementation: Dual-Level Indexing
from pydantic import BaseModel
from typing import List
class Message(BaseModel):
    role: str
    content: str

class ConversationChunk(BaseModel):
id: str
parent_id: str
type: str # "summary" or "window"
text: str
metadata: dict
def create_sliding_windows(
messages: List[Message],
window_size: int = 10,
overlap: int = 2
) -> List[dict]:
"""
Creates overlapping windows of messages.
Window size of 10 with overlap of 2 means:
- Window 1: messages 0-9
- Window 2: messages 8-17
- Window 3: messages 16-25
...and so on
"""
windows = []
step = window_size - overlap
for i in range(0, len(messages), step):
window_msgs = messages[i : i + window_size]
window_text = "\n".join(
f"{msg.role}: {msg.content}"
for msg in window_msgs
)
windows.append({
"text": window_text,
"start_index": i,
"end_index": i + len(window_msgs) - 1
})
if i + window_size >= len(messages):
break
return windows
The overlap is crucial—without it, information at window boundaries gets lost. A question spanning messages 9-11 would be split across two windows. With 2-message overlap, both windows contain the context.
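A quick sanity check of the windowing logic, using the minimal Message model defined above (a toy example, not part of the real ingestion pipeline):

msgs = [Message(role="user", content=f"message {i}") for i in range(25)]

for w in create_sliding_windows(msgs, window_size=10, overlap=2):
    print(w["start_index"], w["end_index"])

# Output:
# 0 9
# 8 17
# 16 24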
Generating Summaries
For conversations longer than 5 messages, generate a summary using an LLM:
from anthropic import Anthropic

def get_conversation_summary(text: str) -> str:
    """
    Uses Claude to summarize the conversation.
    (A truncation fallback for missing API keys is shown under
    "Graceful Degradation" below.)
    """
    client = Anthropic()
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=300,
temperature=0,
system="""You are a helpful assistant that summarizes
conversations for retrieval purposes. Capture the main
topics, decisions, and key entities.""",
messages=[{
"role": "user",
"content": f"Summarize this conversation:\n\n{text[:100000]}"
}]
)
return response.content[0].text
The summary chunk captures the semantic essence of the conversation, while windows capture specific details. When indexed together, both levels are searchable.
Putting It Together: The Ingestion Pipeline
def ingest_conversation(conversation: dict) -> List[ConversationChunk]:
"""
Creates hierarchical chunks for a conversation.
"""
chunks = []
conv_id = conversation['id']
title = conversation['title']
messages = conversation['messages']
# Full text for summary
full_text = f"Title: {title}\n" + "\n".join(
f"{msg['role']}: {msg['content']}"
for msg in messages
)
# 1. Summary Chunk (for conversations > 5 messages)
if len(messages) > 5:
summary = get_conversation_summary(full_text)
chunks.append(ConversationChunk(
id=f"{conv_id}_summary",
parent_id=conv_id,
type="summary",
text=f"Summary of '{title}':\n{summary}",
metadata={
"title": title,
"type": "summary"
}
))
# 2. Sliding Window Chunks
windows = create_sliding_windows(messages)
for i, window in enumerate(windows):
chunks.append(ConversationChunk(
id=f"{conv_id}_window_{i}",
parent_id=conv_id,
type="window",
text=f"Excerpt from '{title}' (msgs {window['start_index']}-{window['end_index']}):\n{window['text']}",
metadata={
"title": title,
"type": "window",
"start_index": window['start_index'],
"end_index": window['end_index']
}
        ))
return chunks
The metadata is essential—it lets the UI highlight exactly which messages matched, not just which conversation.
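Indexing the chunks into ChromaDB is then a single add call. A minimal sketch—note that parent_id is folded into the metadata here so the deduplication step later can group results per conversation (the collection setup is shown under Local-First Storage below):

def index_chunks(collection, chunks: List[ConversationChunk]) -> None:
    """Adds hierarchical chunks to a ChromaDB collection."""
    collection.add(
        ids=[c.id for c in chunks],
        documents=[c.text for c in chunks],
        # Fold parent_id into metadata so results can be grouped by conversation
        metadatas=[{**c.metadata, "parent_id": c.parent_id} for c in chunks],
    )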
Advanced Retrieval with HyDE
Basic semantic search has a fundamental problem: the query-document mismatch. A user searching "How do I configure nginx?" expects to find an answer, but the embedding of a question may not be close to the embedding of an answer in vector space.
Hypothetical Document Embedding (HyDE) solves this by generating a hypothetical answer and searching with that instead.
class AdvancedRetriever:
def __init__(self, vector_store: VectorStore):
self.vector_store = vector_store
self.client = Anthropic()
def expand_query(self, query: str) -> List[str]:
"""
Generates synonyms and alternative phrasings.
"""
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=100,
temperature=0.7,
system="""Generate 3-5 alternative search queries that
capture the same intent but use different keywords.
Return ONLY the queries, one per line.""",
messages=[{
"role": "user",
"content": f"Generate variations for: {query}"
}]
)
variations = [
line.strip()
for line in response.content[0].text.split('\n')
if line.strip()
]
return [query] + variations
def generate_hyde_document(self, query: str) -> str:
"""
Generates a hypothetical document that WOULD answer the query.
We then search using this document's embedding.
"""
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=300,
temperature=0.7,
system="""Write a hypothetical conversation snippet
or answer that would satisfy the user's search query.
Make it look like a real conversation or technical
explanation.""",
messages=[{
"role": "user",
"content": f"Write a hypothetical passage answering: {query}"
}]
)
return response.content[0].text
def search(self, query: str, n_results: int = 10) -> List[dict]:
"""
Search using both the original query and HyDE document.
"""
# Generate hypothetical document
hyde_doc = self.generate_hyde_document(query)
# Search with both
results_query = self.vector_store.search(query, n_results)
results_hyde = self.vector_store.search(hyde_doc, n_results)
# Deduplicate and merge
seen_ids = set()
merged = []
for results in [results_query, results_hyde]:
for res in results:
if res['id'] not in seen_ids:
merged.append(res)
seen_ids.add(res['id'])
# Re-rank by distance
merged.sort(key=lambda x: x['distance'])
return merged[:n_results]
How HyDE improves retrieval:
| Query | Without HyDE | With HyDE |
|---|---|---|
| "nginx configuration" | Matches questions about nginx | Matches actual nginx configs |
| "Why did I choose React?" | Matches React mentions | Matches decision discussions |
| "database schema design" | Matches schema keywords | Matches schema explanations |
The hypothetical document moves the query into the same embedding space as the documents that would answer it.
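Usage is the same as a plain vector search—a short sketch, assuming vector_store wraps the ChromaDB collection and returns dicts with id, document, metadata, and distance keys:

retriever = AdvancedRetriever(vector_store)

# The HyDE document, not the raw question, does most of the matching work
results = retriever.search("how did I set up the nginx reverse proxy?", n_results=5)
for res in results:
    print(res["metadata"]["title"], round(res["distance"], 3))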
Intent Classification for Better Responses
Not all queries need the same handling. Someone asking "find that chat about databases" wants different output than "how do I configure Stripe webhooks?"
from typing import Literal
IntentType = Literal["RETRIEVAL", "QA", "DECISION", "CODE_LOOKUP", "SYNTHESIS"]
class IntentClassifier:
def __init__(self):
self.client = Anthropic()
def classify(self, query: str) -> IntentType:
"""
Classifies query intent:
- RETRIEVAL: Locate a conversation
- QA: Answer a question
- DECISION: Explain past reasoning
- CODE_LOOKUP: Find specific code
- SYNTHESIS: Summarize/compare across conversations
"""
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=50,
temperature=0,
system="""Classify user queries into exactly ONE category.
Categories:
- RETRIEVAL: Finding a conversation ("find that chat about...")
- QA: Asking a question ("how do I...", "what is...")
- DECISION: Asking about decisions ("why did I choose...")
- CODE_LOOKUP: Looking for code ("show me the code for...")
- SYNTHESIS: Summary/comparison ("summarize what I learned...")
Respond with ONLY the category name.""",
messages=[{
"role": "user",
"content": f"Classify this query:\n\n{query}"
}]
)
return response.content[0].text.strip()
Each intent type gets a specialized system prompt for answer generation:
def get_system_prompt(intent: IntentType) -> str:
prompts = {
"CODE_LOOKUP": """You are helping retrieve code from past
conversations. Extract and present code clearly with syntax
highlighting. Show all relevant examples.""",
"DECISION": """You are helping recall past decisions and
reasoning. Focus on explaining WHY choices were made.
Quote exact reasoning when possible.""",
"SYNTHESIS": """You are synthesizing knowledge from multiple
conversations. Create summaries, comparisons, or overviews.
Use tables or bullet points for clarity.""",
"QA": """You answer questions based on past conversations.
Be concise and direct. Quote relevant parts when appropriate.
If the context doesn't contain the answer, say so clearly.""",
"RETRIEVAL": """You help locate relevant conversations.
Summarize what was discussed and highlight the most
relevant sections."""
}
return prompts.get(intent, prompts["QA"])
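Putting classification and retrieval together, answer generation selects the system prompt by intent. The answer_query helper below is an illustrative sketch of that wiring, not code from the original system:

def answer_query(
    query: str,
    retriever: AdvancedRetriever,
    classifier: IntentClassifier,
    client: Anthropic
) -> str:
    """Classify intent, retrieve context, answer with an intent-specific prompt."""
    intent = classifier.classify(query)
    results = retriever.search(query, n_results=8)
    context = "\n\n".join(res["document"] for res in results)

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1000,
        temperature=0,
        system=get_system_prompt(intent),
        messages=[{
            "role": "user",
            "content": f"Context from past conversations:\n{context}\n\nQuery: {query}"
        }]
    )
    return response.content[0].text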
Knowledge Graphs for Relationship-Aware Search
Conversations aren't isolated—they reference technologies, people, and concepts that relate to each other. A knowledge graph captures these relationships. This is similar to GraphRAG approaches in agentic RAG systems, but focused on extracting structure from conversational data:
import json

class GraphBuilder:
def __init__(self, graph_store: GraphStore):
self.graph_store = graph_store
self.client = Anthropic()
def process_conversation(self, text: str, conv_id: str):
"""
Extracts entities and relationships from a conversation.
"""
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1000,
temperature=0,
system="""You are a Knowledge Graph Extractor.
Analyze the conversation and extract key Entities and Relationships.
Entities: Tools, Technologies, People, Decisions, Concepts, Projects.
Relationships: USES, CHOSE, REJECTED, IS_A, PART_OF, RELATED_TO.
Format as JSON:
{
"entities": [
{"name": "Next.js", "type": "Technology"},
{"name": "SEO", "type": "Concept"}
],
"relationships": [
{"source": "Next.js", "target": "SEO", "relation": "IMPROVES"}
]
}""",
messages=[{
"role": "user",
"content": f"Extract knowledge graph:\n\n{text[:50000]}"
}]
)
data = json.loads(response.content[0].text)
# Update graph
for entity in data.get("entities", []):
self.graph_store.add_entity(
entity["name"],
entity["type"],
{"source_conv": conv_id}
)
for rel in data.get("relationships", []):
self.graph_store.add_relationship(
rel["source"],
rel["target"],
rel["relation"],
{"source_conv": conv_id}
)
Graph Storage and Traversal
The GraphStore class manages the actual graph using NetworkX:
import networkx as nx
import pickle
import os
from typing import List, Dict, Any
class GraphStore:
def __init__(self, persistence_path: str = "data/knowledge_graph.pkl"):
self.graph = nx.MultiDiGraph()
self.persistence_path = persistence_path
self.load()
def add_entity(self, name: str, entity_type: str, metadata: dict = None):
"""Adds a node (entity) to the graph."""
if not self.graph.has_node(name):
self.graph.add_node(name, type=entity_type, **(metadata or {}))
def add_relationship(
self, source: str, target: str, relation: str, metadata: dict = None
):
"""Adds an edge (relationship) between entities."""
# Auto-create nodes if missing
if not self.graph.has_node(source):
self.graph.add_node(source, type="Unknown")
if not self.graph.has_node(target):
self.graph.add_node(target, type="Unknown")
self.graph.add_edge(source, target, relation=relation, **(metadata or {}))
def search_graph(self, query_entities: List[str]) -> List[str]:
"""
Finds paths between entities mentioned in the query.
Returns text descriptions of connections.
"""
connections = []
found = [e for e in query_entities if self.graph.has_node(e)]
# Find paths between pairs of entities
import itertools
for start, end in itertools.combinations(found, 2):
try:
path = nx.shortest_path(self.graph, start, end)
desc = f"Connection: {start}"
for i in range(len(path) - 1):
u, v = path[i], path[i + 1]
edge = self.graph.get_edge_data(u, v)
relation = edge[0].get("relation", "related to")
desc += f" --[{relation}]--> {v}"
connections.append(desc)
except nx.NetworkXNoPath:
continue
# Fallback: show direct neighbors
if not connections and found:
for entity in found:
for neighbor in list(self.graph.neighbors(entity))[:3]:
edge = self.graph.get_edge_data(entity, neighbor)
rel = edge[0].get("relation", "related to")
connections.append(f"{entity} {rel} {neighbor}")
return connections
def save(self):
with open(self.persistence_path, "wb") as f:
pickle.dump(self.graph, f)
def load(self):
if os.path.exists(self.persistence_path):
with open(self.persistence_path, "rb") as f:
self.graph = pickle.load(f)
When answering questions, the graph provides additional context:
Query: "Why did I choose Next.js?"
Graph Path Found:
Next.js --[IMPROVES]--> SEO
Next.js --[REPLACES]--> React
React --[REJECTED_FOR]--> SSR limitations
Enhanced Answer: "You chose Next.js because it improves SEO through
server-side rendering. Your conversations mention rejecting plain
React due to SSR limitations..."
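To feed those paths into answer generation, the query's entities are matched against graph nodes and any discovered connections are appended to the retrieval context. A rough sketch—the capitalised-token entity matching here is deliberately naive and purely illustrative:

def get_graph_context(query: str, graph_store: GraphStore) -> str:
    """Returns relationship paths for entities mentioned in the query."""
    # Naive entity extraction: capitalised tokens that exist as graph nodes.
    # A production system would reuse the LLM-based extractor instead.
    candidates = [tok.strip("?.,!") for tok in query.split() if tok[:1].isupper()]
    connections = graph_store.search_graph(candidates)
    if not connections:
        return ""
    return "Known relationships:\n" + "\n".join(connections)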
Agentic Deep Research
For complex queries that span multiple conversations, a simple retrieve-and-answer isn't enough. An agentic research loop breaks down complex queries into sub-queries, retrieves from multiple sources, and synthesizes a comprehensive answer. This follows the plan-execute-synthesize pattern from agentic AI systems:
class DeepResearchAgent:
def __init__(self, retriever: AdvancedRetriever):
self.retriever = retriever
self.client = Anthropic()
def research(self, query: str) -> dict:
"""
Multi-step research for complex queries.
"""
steps = []
all_findings = []
all_sources = set()
# 1. Generate Research Plan
plan = self._generate_plan(query)
steps.append({
"step": "Planning",
"details": f"Generated {len(plan)} sub-queries"
})
# 2. Execute Each Sub-Query
for sub_query in plan:
results = self.retriever.search(sub_query, n_results=5)
findings = []
for res in results:
title = res['metadata']['title']
snippet = res['document']
findings.append(f"Source: {title}\n{snippet}")
all_sources.add(title)
all_findings.append(
f"--- Findings for '{sub_query}' ---\n" +
"\n\n".join(findings)
)
steps.append({
"step": "Search",
"details": f"'{sub_query}' → {len(results)} results"
})
# 3. Synthesize Report
report = self._synthesize(query, all_findings)
steps.append({"step": "Synthesis", "details": "Compiled report"})
return {
"answer": report,
"steps": steps,
"sources": list(all_sources)
}
def _generate_plan(self, query: str) -> List[str]:
"""Break complex query into searchable sub-queries."""
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=300,
temperature=0,
system="""You are a research planner. Break down the
user's complex query into 2-4 distinct search queries
needed to gather all information. Return ONLY the
queries, one per line.""",
messages=[{
"role": "user",
"content": f"Plan research for: {query}"
}]
)
return [
line.strip()
for line in response.content[0].text.split('\n')
if line.strip()
]
def _synthesize(self, query: str, findings: List[str]) -> str:
"""Synthesize findings into a comprehensive report."""
context = "\n\n".join(findings)
response = self.client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
temperature=0,
system="""You are a research assistant. Write a
comprehensive answer based on the gathered findings.
Structure it with headings. Cite sources.""",
messages=[{
"role": "user",
"content": f"""Write a research report for: {query}
Based on these findings:
{context}"""
}]
)
return response.content[0].text
Example Deep Research Query:
Query: "Compare my database design decisions with my deployment strategy"
Generated Plan:
1. "database schema design decisions"
2. "deployment strategy infrastructure"
3. "database deployment configuration"
Execution:
- Search 1: Found 3 chunks about PostgreSQL schema, indexes
- Search 2: Found 4 chunks about DigitalOcean, Nginx, Docker
- Search 3: Found 2 chunks about database migrations, backups
Synthesized Report:
"# Database Design vs Deployment Strategy
## Database Decisions
Your conversations show a preference for PostgreSQL with...
## Deployment Choices
You decided on DigitalOcean for hosting because...
## Integration Points
The database deployment uses Docker containers with..."
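Invoking the agent is a single call; a brief usage sketch:

agent = DeepResearchAgent(retriever)
result = agent.research(
    "Compare my database design decisions with my deployment strategy"
)

print(result["answer"])            # synthesized report
for step in result["steps"]:       # planning / search / synthesis trace
    print(step["step"], "-", step["details"])
print("Sources:", result["sources"])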
Complete Architecture
Here's the full system architecture:
┌─────────────────────────────────────────────────────┐
│ User Interface │
│ ┌─────────┐ ┌─────────┐ ┌──────────────┐ │
│ │ Search │ │ Ask │ │ Research │ │
│ └────┬────┘ └────┬────┘ └──────┬───────┘ │
└───────┼────────────┼───────────────┼────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────────────────────────────────────────┐
│ FastAPI Backend │
├───────────────────────────────────────────────────────┤
│ Intent Classifier → Routes to appropriate handler │
├───────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────┐ │
│ │ Advanced Retriever │ │
│ │ • HyDE Document Generation │ │
│ │ • Query Expansion │ │
│ │ • Multi-query Search + Deduplication │ │
│ └─────────────────────────────────────────────────┘ │
├───────────────────────────────────────────────────────┤
│ ┌──────────────────┐ ┌───────────────────────────┐ │
│ │ Vector Store │ │ Knowledge Graph │ │
│ │ (ChromaDB) │ │ (NetworkX) │ │
│ │ │ │ │ │
│ │ • Summary Chunks│ │ • Entity Relationships │ │
│ │ • Window Chunks │ │ • Graph Traversal │ │
│ └──────────────────┘ └───────────────────────────┘ │
├───────────────────────────────────────────────────────┤
│ QA Engine / Deep Research Agent │
│ • Intent-aware prompting │
│ • Multi-step research loops │
│ • Source citation │
└───────────────────────────────────────────────────────┘
▲
│
┌───────┴───────────────────────────────────────────────┐
│ Ingestion Pipeline │
│ JSON Export → Parse → Chunk → Embed → Index │
│ ↓ │
│ Hierarchical Chunking: │
│ • Claude Summary (if > 5 messages) │
│ • Sliding Windows (10 msgs, 2 overlap) │
└───────────────────────────────────────────────────────┘
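The FastAPI layer stays thin: each UI mode maps to one endpoint that delegates to the components above. A hedged sketch—the route names and request model are illustrative, and retriever, classifier, client, and agent (plus the answer_query helper sketched earlier) are assumed to be constructed at startup:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    query: str

@app.post("/search")
def search_endpoint(req: QueryRequest):
    # Direct HyDE-enhanced retrieval
    return {"results": retriever.search(req.query, n_results=10)}

@app.post("/ask")
def ask_endpoint(req: QueryRequest):
    # Intent-aware question answering
    return {"answer": answer_query(req.query, retriever, classifier, client)}

@app.post("/research")
def research_endpoint(req: QueryRequest):
    # Multi-step agentic research
    return agent.research(req.query)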
Results: Finding the Needle
With hierarchical chunking, searches that previously failed now work:
| Query | Naive RAG | Hierarchical RAG |
|---|---|---|
| "nginx reverse proxy" | No match (diluted) | ✅ Window 3 (msgs 21-30) |
| "dark mode styling" | No match (diluted) | ✅ Window 5 (msgs 41-50) |
| "full stack app planning" | Weak match | ✅ Summary chunk |
| "JWT authentication setup" | Partial | ✅ Window 2 (msgs 11-20) |
The window metadata enables the UI to auto-scroll and highlight the exact section that matched—no more scanning through 50 messages.
Production Considerations
Graceful Degradation
Not everyone has an API key. The system should work (with reduced quality) without one:
import os

def get_summary(text: str) -> str:
    api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
# Fallback: extract first and last paragraphs
lines = text.split('\n')
return '\n'.join(lines[:3] + ['...'] + lines[-3:])
# Use Claude for high-quality summary
...
Local-First Storage
ChromaDB provides persistent local storage—no cloud dependency:
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
name="conversations",
metadata={"hnsw:space": "cosine"}
)
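Querying the same collection at either level is just a metadata filter on the chunk type, for example:

# Topic-level search over summary chunks only
summary_hits = collection.query(
    query_texts=["full stack app planning"],
    n_results=5,
    where={"type": "summary"},
)

# Detail-level search over sliding-window chunks only
window_hits = collection.query(
    query_texts=["nginx reverse proxy configuration"],
    n_results=5,
    where={"type": "window"},
)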
Deduplication
Multiple windows from the same conversation might match. Deduplicate by parent_id to show one result per conversation:
def deduplicate_by_conversation(results: List[dict]) -> List[dict]:
seen_parents = set()
deduplicated = []
for res in results:
parent_id = res['metadata']['parent_id']
if parent_id not in seen_parents:
deduplicated.append(res)
seen_parents.add(parent_id)
return deduplicated
Key Takeaways
- Hierarchical chunking solves embedding dilution. Long conversations need both summaries (for topic search) and windows (for detail search).
- HyDE bridges the query-document gap. Generating hypothetical answers aligns queries with the documents that would answer them.
- Intent classification enables specialized handling. Different query types need different retrieval strategies and response formats.
- Knowledge graphs add relationship context. Entities and relationships extracted from conversations enable richer answers.
- Agentic patterns enable complex research. Multi-step planning and synthesis handle queries that span multiple conversations.
- Metadata enables precise highlighting. Window indices let the UI show exactly where the match occurred.
The conversation graveyard problem is solvable. With the right architecture, your past conversations become a searchable knowledge base—not a cemetery of lost insights. For more on building persistent memory into LLM applications, see LLM Memory Systems.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Building Agentic AI Systems: A Complete Implementation Guide
A comprehensive guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
LLM Memory Systems: From MemGPT to Long-Term Agent Memory
Understanding memory architectures for LLM agents—MemGPT's hierarchical memory, Letta's agent framework, and patterns for building agents that learn and remember across conversations.