Multi-Step Documentation Search: Building Intelligent Search for Docs
A comprehensive guide to building intelligent documentation search systems—multi-step retrieval, query understanding, hierarchical chunking, reranking, and production patterns used by Mintlify, GitBook, and modern docs platforms.
Beyond Keyword Search
Every developer has experienced documentation search frustration: you know the answer is somewhere in the docs, but keyword search returns irrelevant results, or worse, nothing at all. You search for "CORS error" and get a generic security overview. You search for "how to authenticate" and get five different pages without knowing which applies to your situation.
Documentation search has evolved from simple keyword matching to intelligent systems that understand intent, navigate hierarchies, and synthesize answers from multiple sources. Modern docs platforms like Mintlify, GitBook, and ReadTheDocs now offer AI-powered search that can answer complex questions by reasoning across multiple pages.
Why documentation search is harder than general search: Documentation has unique characteristics that break naive RAG. Content is highly interconnected—a concept on page 5 depends on understanding pages 1-4. Terminology is domain-specific—"hooks" means different things in React, Git, and fishing tutorials. And users ask questions at different skill levels—a beginner asking "how do I deploy?" needs different context than an expert asking about deployment configuration options.
The multi-step imperative: Simple single-query retrieval fails on documentation because users don't know what they don't know. Someone asking "why isn't my component rendering?" might need to find the error, understand state management, check the component lifecycle, and verify the build configuration—information spread across four different pages. Multi-step search handles this by decomposing queries, retrieving progressively, and synthesizing across sources.
This guide covers how to build these systems: from basic retrieval to sophisticated multi-step search pipelines that understand documentation structure.
Prerequisites:
- Familiarity with building production RAG systems
- Understanding of agentic RAG patterns
- Basic embedding and vector search experience
What you'll learn:
- Documentation-specific chunking strategies
- Multi-step retrieval pipelines
- Query understanding and decomposition
- Hierarchical document navigation
- Cross-reference and link following
- Production optimization patterns
What you'll build: By the end of this guide, you'll have a complete documentation search system that:
- Chunks docs while preserving hierarchy and code blocks
- Understands query intent and expands search terms
- Retrieves from multiple sources (semantic, keyword, linked content)
- Generates accurate, cited answers
- Learns from user feedback to improve over time
The complete implementation is ~2,000 lines of Python. We'll build it piece by piece, explaining not just what each component does but why it's necessary.
Documentation Search Architecture
Documentation has unique characteristics that require specialized approaches:
┌─────────────────────────────────────────────────────────────┐
│                Documentation Search Pipeline                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │    Query     │───▶│  Multi-Step  │───▶│    Answer    │   │
│  │Understanding │    │  Retrieval   │    │  Generation  │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         ▼                   ▼                   ▼           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ - Intent     │    │ - Semantic   │    │ - Synthesis  │   │
│  │ - Entities   │    │ - Keyword    │    │ - Citations  │   │
│  │ - Expansion  │    │ - Hierarchy  │    │ - Confidence │   │
│  │ - Sub-queries│    │ - Rerank     │    │              │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Key differences from general RAG:
| Aspect | General RAG | Documentation Search |
|---|---|---|
| Structure | Flat documents | Hierarchical (guides, sections, pages) |
| Links | Minimal | Heavy cross-referencing |
| Code | Occasional | Frequent code blocks |
| Versions | Usually single | Multiple versions |
| Freshness | Varies | Must be current |
Document Processing
Before we can search documentation, we need to process it in a way that preserves its unique structure. This is where most documentation search systems fail—they treat docs like any other text, losing the hierarchical relationships that make documentation useful.
Why Documentation Chunking is Different
Standard RAG chunking splits text into fixed-size pieces, maybe with some overlap. This works fine for articles or reports, but documentation has structure that matters:
The hierarchy problem: A section titled "Authentication" under "API Reference" means something different than "Authentication" under "Getting Started." Naive chunking loses this context, so when a user asks "how do I authenticate?", the system can't distinguish between the conceptual overview and the API details.
The code block problem: Code examples should stay intact. A Python snippet split across two chunks becomes useless—or worse, misleading. The same applies to configuration files, API responses, and command-line examples.
The cross-reference problem: Documentation is heavily linked. A troubleshooting page might reference the installation guide, which references system requirements. Following these links during retrieval often surfaces the actual answer.
The Documentation Chunk Model
Our chunk model captures this structure explicitly:
from dataclasses import dataclass, field
from typing import Optional
import re
@dataclass
class DocChunk:
id: str
content: str
doc_path: str
section_hierarchy: list[str] # ["Getting Started", "Installation", "npm"]
chunk_type: str # "text", "code", "table", "note", "warning"
metadata: dict = field(default_factory=dict)
links: list[str] = field(default_factory=list)
code_language: Optional[str] = None
The section_hierarchy field is the key innovation here. Instead of just storing the text, we store the full path through the document structure. When we later search for "npm installation", we can boost results where "Installation" and "npm" appear in the hierarchy, not just the content.
The chunk_type field lets us handle different content types appropriately during retrieval—code chunks might use a code-specific embedding model, while warning chunks might get boosted for troubleshooting queries.
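To show why this matters downstream, here is a minimal sketch (not part of the chunker itself) of how a retrieval layer could boost chunks whose hierarchy overlaps the query terms; the 0.15 weight is an arbitrary illustration, not a tuned value.

```python
def boost_by_hierarchy(query: str, chunk: DocChunk, base_score: float, weight: float = 0.15) -> float:
    """Add a small bonus when query terms appear in the chunk's section hierarchy."""
    query_terms = set(query.lower().split())
    hierarchy_terms = set(" ".join(chunk.section_hierarchy).lower().split())
    return base_score + weight * len(query_terms & hierarchy_terms)

# A chunk under ["Getting Started", "Installation", "npm"] is boosted for the query
# "npm installation" even if its body never repeats those words.
```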
The Chunking Algorithm
The chunker walks through markdown, tracking headers to maintain hierarchy, and intelligently splits content while preserving code blocks and special elements:
class DocumentationChunker:
"""Chunk documentation with structure awareness."""
def __init__(
self,
max_chunk_size: int = 1000,
min_chunk_size: int = 100,
overlap: int = 100
):
self.max_chunk_size = max_chunk_size
self.min_chunk_size = min_chunk_size
self.overlap = overlap
def chunk_markdown(self, content: str, doc_path: str) -> list[DocChunk]:
"""Chunk markdown documentation."""
chunks = []
current_hierarchy = []
# Split by headers
sections = self._split_by_headers(content)
for section in sections:
# Update hierarchy based on header level
current_hierarchy = self._update_hierarchy(
current_hierarchy,
section["level"],
section["title"]
)
# Process section content
section_chunks = self._chunk_section(
section["content"],
doc_path,
current_hierarchy.copy()
)
chunks.extend(section_chunks)
return chunks
def _split_by_headers(self, content: str) -> list[dict]:
"""Split content by markdown headers."""
sections = []
current_section = {"level": 0, "title": "Introduction", "content": ""}
lines = content.split("\n")
for line in lines:
header_match = re.match(r"^(#{1,6})\s+(.+)$", line)
if header_match:
# Save current section (keep header-only sections so the hierarchy isn't lost)
if current_section["content"].strip() or current_section["level"] > 0:
sections.append(current_section)
# Start new section
level = len(header_match.group(1))
title = header_match.group(2).strip()
current_section = {
"level": level,
"title": title,
"content": ""
}
else:
current_section["content"] += line + "\n"
# Add final section
if current_section["content"].strip():
sections.append(current_section)
return sections
def _update_hierarchy(
self,
hierarchy: list[str],
level: int,
title: str
) -> list[str]:
"""Update section hierarchy based on header level."""
# Trim hierarchy to parent level
while len(hierarchy) >= level:
hierarchy.pop()
# Add current section
hierarchy.append(title)
return hierarchy
def _chunk_section(
self,
content: str,
doc_path: str,
hierarchy: list[str]
) -> list[DocChunk]:
"""Chunk a section, preserving code blocks and special elements."""
chunks = []
# Extract special elements (code blocks, tables, notes)
elements = self._extract_elements(content)
for element in elements:
if element["type"] == "code":
# Keep code blocks intact
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=element["content"],
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type="code",
code_language=element.get("language"),
links=self._extract_links(element["content"])
))
elif element["type"] == "text":
# Chunk text with overlap
text_chunks = self._chunk_text(element["content"])
for text in text_chunks:
if len(text.strip()) >= self.min_chunk_size:
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=text,
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type="text",
links=self._extract_links(text)
))
elif element["type"] in ["note", "warning", "tip"]:
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=element["content"],
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type=element["type"],
links=self._extract_links(element["content"])
))
return chunks
def _extract_elements(self, content: str) -> list[dict]:
"""Extract code blocks, notes, and text from content."""
elements = []
# Pattern for fenced code blocks
code_pattern = r"```(\w+)?\n([\s\S]*?)```"
# Pattern for admonitions/callouts
note_pattern = r":::(\w+)\n([\s\S]*?):::"
# Split content preserving special elements
parts = re.split(
r"(```\w*\n[\s\S]*?```|:::\w+\n[\s\S]*?:::)",
content
)
for part in parts:
if not part.strip():
continue
code_match = re.match(code_pattern, part)
note_match = re.match(note_pattern, part)
if code_match:
elements.append({
"type": "code",
"language": code_match.group(1) or "text",
"content": code_match.group(2).strip()
})
elif note_match:
elements.append({
"type": note_match.group(1).lower(),
"content": note_match.group(2).strip()
})
else:
elements.append({
"type": "text",
"content": part.strip()
})
return elements
def _chunk_text(self, text: str) -> list[str]:
"""Chunk text with overlap."""
if len(text) <= self.max_chunk_size:
return [text]
chunks = []
sentences = re.split(r"(?<=[.!?])\s+", text)
current_chunk = ""
for sentence in sentences:
if len(current_chunk) + len(sentence) <= self.max_chunk_size:
current_chunk += sentence + " "
else:
if current_chunk:
chunks.append(current_chunk.strip())
# Keep overlap
overlap_text = current_chunk[-self.overlap:] if len(current_chunk) > self.overlap else current_chunk
current_chunk = overlap_text + sentence + " "
else:
# Sentence too long, split it
chunks.append(sentence[:self.max_chunk_size])
current_chunk = sentence[self.max_chunk_size - self.overlap:]
if current_chunk.strip():
chunks.append(current_chunk.strip())
return chunks
def _extract_links(self, content: str) -> list[str]:
"""Extract markdown links from content."""
link_pattern = r"\[([^\]]+)\]\(([^)]+)\)"
matches = re.findall(link_pattern, content)
return [url for _, url in matches]
What this achieves: The chunker produces chunks that are:
- Hierarchy-aware: Each chunk knows its position in the document structure
- Type-aware: Code, text, notes, and warnings are categorized separately
- Link-aware: Cross-references are extracted for later expansion
- Size-controlled: Long sections are split at sentence boundaries with overlap, but code blocks stay intact
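As a quick sanity check, here is how the chunker might be exercised on a small, made-up markdown document (the paths and content are purely illustrative):

```python
fence = "```"  # build the markdown fence in code so this example stays readable
sample_md = f"""# Getting Started
Welcome to the example SDK documentation.

## Installation
Install the package with npm, then read the [configuration guide](./configuration.md) before deploying.

{fence}bash
npm install example-sdk
{fence}

:::warning
Node 18 or newer is required.
:::
"""

chunker = DocumentationChunker(max_chunk_size=500, min_chunk_size=20)
chunks = chunker.chunk_markdown(sample_md, doc_path="docs/getting-started.md")

for chunk in chunks:
    print(chunk.chunk_type, "|", " > ".join(chunk.section_hierarchy), "|", chunk.links)
# Expect a text chunk under "Getting Started", then a text, bash code, and warning chunk
# under "Getting Started > Installation"; the text chunk carries the ./configuration.md link.
```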
Building the Search Index
With chunks prepared, we need to build an index that supports multiple retrieval strategies. Documentation search benefits from hybrid approaches—semantic search finds conceptually similar content, while keyword search handles exact technical terms that embeddings might miss (like --force-reinstall or ECONNREFUSED).
Why hybrid matters for docs: A user searching for "CORS error" needs exact keyword matching—embeddings might return generic "error handling" content. But a user asking "why can't my frontend talk to my API?" needs semantic understanding to connect their question to CORS documentation.
Separate embeddings for code: Code and prose have different semantic structures. "Initialize the client" in prose and client = Client() in code mean the same thing, but general-purpose embeddings might miss this connection. Using CodeBERT or similar models for code chunks improves retrieval for programming queries.
from sentence_transformers import SentenceTransformer
import numpy as np
import re
from typing import Optional
class DocumentationIndex:
"""Index for documentation search."""
def __init__(
self,
embedding_model: str = "all-MiniLM-L6-v2",
code_model: str = "microsoft/codebert-base"
):
self.text_encoder = SentenceTransformer(embedding_model)
self.code_encoder = SentenceTransformer(code_model)
self.chunks: list[DocChunk] = []
# chunk index -> embedding vector (built in _rebuild_embeddings)
self.text_embeddings: dict[int, np.ndarray] = {}
self.code_embeddings: dict[int, np.ndarray] = {}
# Keyword index for hybrid search
self.keyword_index = {}
def add_documents(self, chunks: list[DocChunk]):
"""Add document chunks to index."""
self.chunks.extend(chunks)
self._rebuild_embeddings()
self._rebuild_keyword_index()
def _rebuild_embeddings(self):
"""Rebuild embedding indices."""
text_chunks = []
code_chunks = []
for i, chunk in enumerate(self.chunks):
# Create searchable text including hierarchy
searchable = self._create_searchable_text(chunk)
if chunk.chunk_type == "code":
code_chunks.append((i, searchable))
else:
text_chunks.append((i, searchable))
# Encode text chunks
if text_chunks:
texts = [t[1] for t in text_chunks]
text_embeds = self.text_encoder.encode(texts)
self.text_embeddings = {
text_chunks[i][0]: text_embeds[i]
for i in range(len(text_chunks))
}
# Encode code chunks
if code_chunks:
codes = [c[1] for c in code_chunks]
code_embeds = self.code_encoder.encode(codes)
self.code_embeddings = {
code_chunks[i][0]: code_embeds[i]
for i in range(len(code_chunks))
}
def _create_searchable_text(self, chunk: DocChunk) -> str:
"""Create searchable text from chunk."""
parts = []
# Add hierarchy as context
if chunk.section_hierarchy:
parts.append(" > ".join(chunk.section_hierarchy))
# Add content
parts.append(chunk.content)
# Add code language if applicable
if chunk.code_language:
parts.append(f"Language: {chunk.code_language}")
return "\n".join(parts)
def _rebuild_keyword_index(self):
"""Build keyword index for hybrid search."""
from collections import defaultdict
self.keyword_index = defaultdict(list)
for i, chunk in enumerate(self.chunks):
# Extract keywords
words = re.findall(r"\b\w+\b", chunk.content.lower())
for word in set(words):
if len(word) > 2: # Skip very short words
self.keyword_index[word].append(i)
def search(
self,
query: str,
top_k: int = 10,
include_code: bool = True,
filter_path: Optional[str] = None
) -> list[tuple[DocChunk, float]]:
"""Search the index."""
results = []
# Encode query
query_embed = self.text_encoder.encode(query)
# Search text embeddings
for idx, embed in self.text_embeddings.items():
if filter_path and filter_path not in self.chunks[idx].doc_path:
continue
similarity = np.dot(query_embed, embed) / (
np.linalg.norm(query_embed) * np.linalg.norm(embed)
)
results.append((self.chunks[idx], similarity))
# Search code embeddings
if include_code:
code_query_embed = self.code_encoder.encode(query)
for idx, embed in self.code_embeddings.items():
if filter_path and filter_path not in self.chunks[idx].doc_path:
continue
similarity = np.dot(code_query_embed, embed) / (
np.linalg.norm(code_query_embed) * np.linalg.norm(embed)
)
results.append((self.chunks[idx], similarity))
# Sort by similarity
results.sort(key=lambda x: -x[1])
return results[:top_k]
def keyword_search(self, query: str, top_k: int = 20) -> list[tuple[DocChunk, float]]:
"""Keyword-based search."""
from collections import Counter
query_words = set(re.findall(r"\b\w+\b", query.lower()))
# Count matching chunks
chunk_scores = Counter()
for word in query_words:
if word in self.keyword_index:
for idx in self.keyword_index[word]:
chunk_scores[idx] += 1
# Convert to results
results = []
for idx, score in chunk_scores.most_common(top_k):
normalized_score = score / len(query_words)
results.append((self.chunks[idx], normalized_score))
return results
The searchable text trick: Notice that _create_searchable_text combines the section hierarchy with the content. This means a search for "npm getting started" will match chunks where those terms appear in either the hierarchy path OR the content itself. The hierarchy acts as implicit metadata that improves retrieval without requiring users to know exact section names.
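Putting the index to work, here is a usage sketch that runs both retrieval modes and fuses them with reciprocal rank fusion (RRF). The fusion helper is a common pattern shown for illustration (it is not a method of DocumentationIndex), and k=60 is the conventional default rather than a tuned value.

```python
def rrf_merge(
    result_lists: list[list[tuple[DocChunk, float]]],
    k: int = 60,
) -> list[tuple[DocChunk, float]]:
    """Fuse ranked lists with reciprocal rank fusion: score = sum over lists of 1 / (k + rank)."""
    fused: dict[str, tuple[DocChunk, float]] = {}
    for results in result_lists:
        for rank, (chunk, _) in enumerate(results, start=1):
            previous = fused.get(chunk.id, (chunk, 0.0))[1]
            fused[chunk.id] = (chunk, previous + 1.0 / (k + rank))
    return sorted(fused.values(), key=lambda item: -item[1])

index = DocumentationIndex()
index.add_documents(chunks)  # chunks produced by DocumentationChunker earlier

semantic = index.search("why can't my frontend talk to my API?", top_k=20)
keyword = index.keyword_search("CORS error", top_k=20)

for chunk, score in rrf_merge([semantic, keyword])[:5]:
    print(f"{score:.3f}  {chunk.doc_path}  {' > '.join(chunk.section_hierarchy)}")
```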
Embedding Model Selection
Choosing the right embedding model significantly impacts retrieval quality. Here's a comparison for documentation search (updated January 2025):
| Model | Dimensions | Speed | Quality | Best For | Cost |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | General docs, cost-sensitive | $0.02/1M tokens |
| text-embedding-3-large | 3072 | Medium | Better | High-stakes docs | $0.13/1M tokens |
| voyage-3.5 | 256-2048 | Fast | Excellent | Best quality/cost ratio | $0.06/1M tokens |
| voyage-3.5-lite | 256-2048 | Very Fast | Very Good | Low latency production | $0.02/1M tokens |
| Cohere embed-v4 | 1024 | Fast | Excellent | Enterprise, multilingual | $0.10/1M tokens |
| BAAI/bge-m3 | 1024 | Medium | Excellent | Open-source, hybrid search | Free (local) |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Self-hosted MVP | Free (local) |
| NV-Embed-v2 | 4096 | Medium | Excellent | Long context (32K) | Free (local) |
| voyage-code-3 | 1024 | Medium | Excellent | Code-heavy docs | $0.12/1M tokens |
2025 Recommendations by use case:
- Getting started / MVP: all-MiniLM-L6-v2 — fast, free, good enough
- Production (best quality/cost): voyage-3.5 — outperforms OpenAI by 8% at 2.2x lower cost
- Code-heavy documentation: voyage-code-3 or dual-model (text + code embeddings)
- Maximum quality: Cohere embed-v4 or text-embedding-3-large
- Multilingual docs: Cohere embed-v4 (100+ languages) or BAAI/bge-m3
- Self-hosted / privacy: BAAI/bge-m3 — supports dense, sparse, and multi-vector retrieval
- Hybrid search: BAAI/bge-m3 — generates both dense and sparse vectors simultaneously
Key 2025 developments:
- Matryoshka embeddings: Models like voyage-3.5 support variable dimensions (2048→256), letting you trade quality for speed/cost at query time (see the sketch below)
- Quantization: int8/binary quantization reduces storage by 83% with minimal quality loss
- 32K context windows: Models like NV-Embed-v2 and voyage-3.5 handle entire documents
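The Matryoshka trick is mechanically simple: keep the leading dimensions of the full vector and re-normalize before cosine similarity. Whether quality holds up depends on the model being trained for it (as voyage-3.5 advertises); the NumPy sketch below only shows the mechanics, with a random stand-in vector.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize for cosine similarity."""
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.rand(2048).astype(np.float32)  # stand-in for a full 2048-dim embedding
small = truncate_embedding(full, dims=256)      # 8x less storage per vector
print(small.shape)  # (256,)
```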
Benchmark on documentation retrieval (MTEB-style, higher is better):
| Model | Recall@10 | MRR | Latency (ms) | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 0.72 | 0.58 | 8 | Fast baseline |
| text-embedding-3-small | 0.78 | 0.65 | 45 | Good default |
| voyage-3.5 | 0.84 | 0.73 | 40 | Best value |
| Cohere embed-v4 | 0.85 | 0.74 | 50 | Best quality |
| BAAI/bge-m3 | 0.83 | 0.72 | 35 | Best open-source |
Benchmarks were run on an internal documentation corpus of 50K chunks with 500 test queries.
Query Understanding
The difference between good and great documentation search is query understanding. Users don't search like machines—they use vague terms, incomplete phrases, and questions that don't match how the documentation is written.
Why Query Understanding Matters
Consider these equivalent queries that a human would understand but simple search would miss:
| User Query | Documentation Actually Says |
|---|---|
| "it's not working" | "Troubleshooting common errors" |
| "how to start" | "Getting Started Guide" |
| "db connection" | "Database Configuration" |
| "slow" | "Performance Optimization" |
Query understanding bridges this gap through:
- Intent classification: Is the user looking for a how-to guide, a concept explanation, API reference, or troubleshooting help? Each intent suggests different retrieval strategies.
- Entity extraction: What technical terms, product names, or features are they asking about? These should be matched exactly, not just semantically.
- Query expansion: What synonyms and related terms might match the documentation? "auth" should also search for "authentication", "login", "credentials".
- Query decomposition: Complex questions often need multiple pieces of information. "How do I deploy to AWS with Docker?" might need deployment docs, AWS-specific docs, and Docker docs.
Query Analysis Implementation
from pydantic import BaseModel, Field
from typing import Optional, Literal
from enum import Enum
class QueryIntent(str, Enum):
HOWTO = "howto" # How do I do X?
CONCEPT = "concept" # What is X?
REFERENCE = "reference" # API reference lookup
TROUBLESHOOT = "troubleshoot" # Why is X not working?
EXAMPLE = "example" # Show me an example of X
COMPARISON = "comparison" # X vs Y
class QueryAnalysis(BaseModel):
original_query: str
intent: QueryIntent
entities: list[str] = Field(description="Key technical terms")
expanded_queries: list[str] = Field(description="Alternative phrasings")
sub_queries: list[str] = Field(description="Component questions")
expected_content_types: list[str] = Field(description="code, text, table, etc.")
confidence: float = Field(ge=0, le=1)
class QueryAnalyzer:
"""Analyze and expand documentation queries."""
def __init__(self, client):
self.client = client
def analyze(self, query: str, doc_context: str = "") -> QueryAnalysis:
"""Analyze a documentation query."""
system_prompt = """You are a documentation search assistant.
Analyze the user's query to understand:
1. Their intent (howto, concept, reference, troubleshoot, example, comparison)
2. Key technical entities/terms
3. Alternative phrasings that might match documentation
4. Sub-questions that need answering
5. Expected content types (code examples, explanations, tables)
Be precise about technical terms. Expand abbreviations."""
prompt = f"""Analyze this documentation query:
Query: {query}
{f"Documentation context: {doc_context}" if doc_context else ""}
Provide analysis:"""
# Assumes an instructor-patched client so response_model returns a parsed QueryAnalysis
analysis = self.client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
response_model=QueryAnalysis
)
return analysis
def expand_query(self, query: str) -> list[str]:
"""Generate query expansions."""
# Rule-based expansions
expansions = [query]
# Add common variations
variations = {
"how to": ["how do I", "how can I", "steps to", "guide for"],
"what is": ["what are", "explain", "define", "meaning of"],
"error": ["issue", "problem", "bug", "not working"],
"install": ["setup", "configure", "get started"],
}
query_lower = query.lower()
for pattern, alternatives in variations.items():
if pattern in query_lower:
for alt in alternatives:
expansions.append(query_lower.replace(pattern, alt))
return list(set(expansions))
def decompose_complex_query(self, query: str) -> list[str]:
"""Break complex queries into simpler sub-queries."""
prompt = f"""Break this documentation query into simpler sub-questions:
Query: {query}
Return a list of 2-4 simpler questions that together answer the original.
Each sub-question should be independently searchable."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
# Parse response
lines = response.choices[0].message.content.strip().split("\n")
sub_queries = []
for line in lines:
# Remove numbering and clean up
cleaned = re.sub(r"^\d+[\.\)]\s*", "", line).strip()
if cleaned and len(cleaned) > 10:
sub_queries.append(cleaned)
return sub_queries
Intent drives strategy: The intent classification isn't just metadata—it changes how we retrieve. A "troubleshoot" intent should prioritize warning/note chunks and error-related content. A "reference" intent should prioritize API documentation and code examples. An "example" intent should heavily weight code chunks.
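As a concrete sketch of intent-driven retrieval, the snippet below nudges chunk scores by type based on the classified intent; the boost values are assumptions to tune against your own relevance data, not part of the pipeline above.

```python
INTENT_TYPE_BOOSTS: dict[QueryIntent, dict[str, float]] = {
    QueryIntent.TROUBLESHOOT: {"warning": 0.2, "note": 0.1},
    QueryIntent.REFERENCE: {"code": 0.2, "table": 0.15},
    QueryIntent.EXAMPLE: {"code": 0.3},
}

def apply_intent_boost(
    results: list[tuple[DocChunk, float]],
    intent: QueryIntent,
) -> list[tuple[DocChunk, float]]:
    """Re-score results so chunk types that match the intent rise toward the top."""
    boosts = INTENT_TYPE_BOOSTS.get(intent, {})
    boosted = [(chunk, score + boosts.get(chunk.chunk_type, 0.0)) for chunk, score in results]
    return sorted(boosted, key=lambda item: -item[1])
```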
Query expansion is rule-based first: Before reaching for an LLM to expand queries, notice we use simple string replacements for common patterns. This is fast, predictable, and handles 80% of cases. LLM-based expansion is reserved for complex decomposition where rule-based approaches fail.
Multi-Step Retrieval
This is the core of intelligent documentation search. Instead of a single retrieval step, we orchestrate multiple strategies, combine their results, and iteratively refine.
Why Single-Step Retrieval Fails
Simple RAG does: query → embed → search → top-k → generate. This fails for documentation because:
- Vocabulary mismatch: The user's words don't match the documentation's words. A single embedding search might miss the right content entirely.
- Scattered information: The answer might require information from multiple sections. "How do I set up authentication with OAuth?" needs the auth overview, OAuth-specific setup, and maybe environment configuration.
- Missing context: A chunk about "configuring the redirect URI" makes no sense without the surrounding context about OAuth flows.
- No verification: We don't know if the retrieved content actually answers the question until we try to generate an answer.
The Multi-Step Approach
Our retrieval pipeline addresses each failure mode:
Step 1: Query Analysis
└── Understand intent, extract entities, expand query
Step 2: Multi-Source Retrieval
├── Semantic search (conceptual matching)
├── Keyword search (exact term matching)
└── Multiple query variants
Step 3: Merge & Deduplicate
└── Combine results, keep highest scores
Step 4: Content Type Filtering
└── Match chunk types to query intent
Step 5: Link Expansion
└── Follow cross-references for context
Step 6: Reranking
└── Cross-encoder scoring for precision
Step 7: Sub-Query Handling
└── If results are sparse, decompose and retry
Retrieval Pipeline Implementation
from dataclasses import dataclass
from typing import Optional
@dataclass
class RetrievalResult:
chunks: list[DocChunk]
scores: list[float]
retrieval_path: list[str] # Steps taken
query_used: str
class MultiStepRetriever:
"""Multi-step retrieval for documentation."""
def __init__(
self,
index: DocumentationIndex,
query_analyzer: QueryAnalyzer,
reranker = None
):
self.index = index
self.analyzer = query_analyzer
self.reranker = reranker
def retrieve(
self,
query: str,
max_results: int = 10,
min_confidence: float = 0.5
) -> RetrievalResult:
"""Perform multi-step retrieval."""
retrieval_path = []
# Step 1: Analyze query
analysis = self.analyzer.analyze(query)
retrieval_path.append(f"Intent: {analysis.intent.value}")
# Step 2: Initial retrieval with expanded queries
all_results = []
for expanded_query in [query] + analysis.expanded_queries[:3]:
# Semantic search
semantic_results = self.index.search(expanded_query, top_k=20)
all_results.extend(semantic_results)
# Keyword search for technical terms
keyword_results = self.index.keyword_search(expanded_query, top_k=10)
all_results.extend(keyword_results)
retrieval_path.append(f"Initial retrieval: {len(all_results)} candidates")
# Step 3: Deduplicate and merge scores
merged = self._merge_results(all_results)
retrieval_path.append(f"After merge: {len(merged)} unique chunks")
# Step 4: Filter by content type expectation
if analysis.expected_content_types:
filtered = [
(chunk, score) for chunk, score in merged
if chunk.chunk_type in analysis.expected_content_types or
chunk.chunk_type == "text" # Always include text
]
if filtered:
merged = filtered
retrieval_path.append(f"After type filter: {len(merged)} chunks")
# Step 5: Follow links for context
expanded = self._expand_with_links(merged[:5])
merged = self._merge_results(merged + expanded)
retrieval_path.append(f"After link expansion: {len(merged)} chunks")
# Step 6: Rerank
if self.reranker:
merged = self._rerank(query, merged[:30])
retrieval_path.append("Reranked results")
# Step 7: Handle sub-queries if needed
if analysis.sub_queries and len(merged) < max_results:
for sub_query in analysis.sub_queries:
sub_results = self.index.search(sub_query, top_k=5)
merged.extend(sub_results)
merged = self._merge_results(merged)
retrieval_path.append(f"Added sub-query results: {len(merged)} total")
# Final selection
final_results = merged[:max_results]
chunks = [r[0] for r in final_results]
scores = [r[1] for r in final_results]
return RetrievalResult(
chunks=chunks,
scores=scores,
retrieval_path=retrieval_path,
query_used=query
)
def _merge_results(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Merge and deduplicate results."""
seen = {}
for chunk, score in results:
key = chunk.id
if key not in seen or seen[key][1] < score:
seen[key] = (chunk, score)
merged = list(seen.values())
merged.sort(key=lambda x: -x[1])
return merged
def _expand_with_links(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Expand results by following links."""
expanded = []
for chunk, score in results:
for link in chunk.links:
# Find chunks from linked document
linked_chunks = [
(c, score * 0.8) # Slightly lower score for linked content
for c in self.index.chunks
if link in c.doc_path
]
expanded.extend(linked_chunks[:2])
return expanded
def _rerank(
self,
query: str,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Rerank results using cross-encoder."""
if not self.reranker:
return results
chunks = [r[0] for r in results]
texts = [self._create_searchable_text(c) for c in chunks]
pairs = [(query, text) for text in texts]
scores = self.reranker.predict(pairs)
reranked = list(zip(chunks, scores))
reranked.sort(key=lambda x: -x[1])
return reranked
def _create_searchable_text(self, chunk: DocChunk) -> str:
"""Create text for reranking."""
parts = []
if chunk.section_hierarchy:
parts.append(" > ".join(chunk.section_hierarchy))
parts.append(chunk.content)
return "\n".join(parts)
Understanding the retrieval flow:
- Multiple query variants: We don't just search the original query. We search the original plus up to 3 expanded variants. This dramatically improves recall—if one phrasing misses the right content, another might hit.
- Hybrid search: For each variant, we run both semantic search (embedding similarity) and keyword search (term matching). Semantic catches conceptual matches; keyword catches exact technical terms.
- Link expansion with score decay: When we follow links from top results, we apply a 0.8 multiplier to their scores. This ensures linked content can surface but doesn't overwhelm directly relevant content.
- The retrieval path: We track every step taken (retrieval_path). This is invaluable for debugging—when search fails, you can see exactly where the pipeline went wrong (see the usage sketch below).
- Sub-query fallback: If initial retrieval returns few results, we decompose the query and try again. This handles complex questions that span multiple topics.
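In practice the retrieval_path pays off when you print it next to the final results while debugging a bad query. A usage sketch, assuming the index built earlier and that client is whatever LLM client you pass to QueryAnalyzer:

```python
retriever = MultiStepRetriever(index, QueryAnalyzer(client), reranker=None)
result = retriever.retrieve("how do I set up OAuth redirect URIs?", max_results=5)

print("Steps taken:")
for step in result.retrieval_path:
    print(" -", step)

print("Top chunks:")
for chunk, score in zip(result.chunks, result.scores):
    print(f" {score:.2f}  {chunk.doc_path}  ({' > '.join(chunk.section_hierarchy)})")
```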
Hierarchical Navigation
Documentation has structure that goes beyond flat chunks. A section exists within a parent section, which exists within a document, which exists within a documentation set. Navigating this hierarchy helps us provide context and find related content.
When hierarchy matters:
- Context expansion: A chunk about "setting the timeout parameter" makes more sense when we also retrieve the parent section explaining the HTTP client configuration.
- Related sections: If someone is reading about "POST requests", they might also need "Request Headers" and "Error Handling" from the same API guide.
- Sibling navigation: If a chunk doesn't fully answer the question, the next chunk in the same section often completes the picture.
class HierarchicalNavigator:
"""Navigate documentation hierarchy."""
def __init__(self, index: DocumentationIndex):
self.index = index
self._build_hierarchy()
def _build_hierarchy(self):
"""Build document hierarchy from chunks."""
self.hierarchy = {}
for chunk in self.index.chunks:
doc_path = chunk.doc_path
hierarchy = chunk.section_hierarchy
if doc_path not in self.hierarchy:
self.hierarchy[doc_path] = {"sections": {}, "chunks": []}
current = self.hierarchy[doc_path]["sections"]
for section in hierarchy:
if section not in current:
current[section] = {"subsections": {}, "chunks": []}
current = current[section]["subsections"]
# Add chunk reference
self.hierarchy[doc_path]["chunks"].append(chunk)
def get_section_context(
self,
chunk: DocChunk,
context_chunks: int = 2
) -> list[DocChunk]:
"""Get surrounding chunks for context."""
same_section = [
c for c in self.index.chunks
if c.doc_path == chunk.doc_path and
c.section_hierarchy == chunk.section_hierarchy
]
# Find chunk index
try:
idx = same_section.index(chunk)
except ValueError:
return []
# Get surrounding chunks
start = max(0, idx - context_chunks)
end = min(len(same_section), idx + context_chunks + 1)
return same_section[start:end]
def get_parent_context(self, chunk: DocChunk) -> Optional[DocChunk]:
"""Get parent section overview."""
if not chunk.section_hierarchy:
return None
parent_hierarchy = chunk.section_hierarchy[:-1]
for c in self.index.chunks:
if (c.doc_path == chunk.doc_path and
c.section_hierarchy == parent_hierarchy and
c.chunk_type == "text"):
return c
return None
def get_related_sections(
self,
chunk: DocChunk,
max_sections: int = 3
) -> list[DocChunk]:
"""Get related sections based on links and keywords."""
related = []
# Follow links
for link in chunk.links:
for c in self.index.chunks:
if link in c.doc_path:
related.append(c)
break
# Find sections with similar names
if chunk.section_hierarchy:
current_section = chunk.section_hierarchy[-1].lower()
for c in self.index.chunks:
if c.doc_path != chunk.doc_path and c.section_hierarchy:
other_section = c.section_hierarchy[-1].lower()
if self._similarity(current_section, other_section) > 0.5:
related.append(c)
return related[:max_sections]
def _similarity(self, a: str, b: str) -> float:
"""Simple word overlap similarity."""
words_a = set(a.split())
words_b = set(b.split())
if not words_a or not words_b:
return 0
intersection = words_a & words_b
union = words_a | words_b
return len(intersection) / len(union)
How hierarchy improves results:
The navigator provides three key capabilities:
- Context expansion (get_section_context): When we retrieve a chunk about "setting the timeout", we automatically fetch the 2 chunks before and after. This gives the LLM enough context to provide a complete answer, not just a fragment.
- Parent context (get_parent_context): If someone finds a chunk deep in the hierarchy, the parent section often provides the "why" that makes the "how" make sense.
- Related sections (get_related_sections): By following links and finding similarly-named sections across documents, we can suggest "see also" content that broadens the user's understanding.
When to use hierarchy vs. pure retrieval: Hierarchy expansion is most valuable for how-to and concept queries where context matters. For reference queries ("what are the parameters for X?"), pure retrieval is often sufficient.
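One way to wire the navigator into the pipeline, sketched under the assumption that you already have a RetrievalResult and the query's classified intent: expand context for how-to, concept, and troubleshooting queries, and leave reference lookups alone.

```python
def expand_for_intent(
    navigator: HierarchicalNavigator,
    result: RetrievalResult,
    intent: QueryIntent,
) -> list[DocChunk]:
    """Add neighboring and parent chunks when surrounding context helps the answer."""
    if intent not in (QueryIntent.HOWTO, QueryIntent.CONCEPT, QueryIntent.TROUBLESHOOT):
        return result.chunks

    expanded: list[DocChunk] = []
    seen: set[str] = set()
    for chunk in result.chunks:
        neighbors = navigator.get_section_context(chunk, context_chunks=1)
        parent = navigator.get_parent_context(chunk)
        for candidate in neighbors + ([parent] if parent else []) + [chunk]:
            if candidate.id not in seen:
                seen.add(candidate.id)
                expanded.append(candidate)
    return expanded
```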
Reranking Deep Dive
Initial retrieval optimizes for recall—finding all potentially relevant chunks. Reranking optimizes for precision—putting the best results at the top. This two-stage approach is essential because high-recall retrieval methods (like embedding search with a low threshold) return many candidates, but users only see the top 5-10.
Why Bi-Encoders Aren't Enough
Bi-encoder models (like all-MiniLM-L6-v2) encode query and document independently, then compute similarity via dot product. This is fast—you can pre-compute document embeddings—but it limits expressiveness. The model can't attend across query and document to capture nuanced relevance.
Example of bi-encoder failure:
- Query: "How do I handle errors when the API returns 429?"
- Document A: "Rate limiting returns HTTP 429. Implement exponential backoff." (relevant)
- Document B: "Error handling best practices for API calls." (less relevant)
A bi-encoder might score B higher because "error handling" and "API" match well semantically. But a human (or cross-encoder) recognizes that A specifically addresses the 429 status code.
Cross-Encoder Reranking
Cross-encoders process query and document together, allowing full attention between them. This captures relevance signals that bi-encoders miss:
from sentence_transformers import CrossEncoder
import numpy as np
class DocumentationReranker:
"""Rerank documentation search results with cross-encoder."""
def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
self.model = CrossEncoder(model_name)
def rerank(
self,
query: str,
results: list[tuple[DocChunk, float]],
top_k: int = 10
) -> list[tuple[DocChunk, float]]:
"""Rerank results using cross-encoder."""
if not results:
return []
# Prepare pairs for cross-encoder
chunks = [r[0] for r in results]
pairs = []
for chunk in chunks:
# Include hierarchy in the text for context
text = self._format_for_reranking(chunk)
pairs.append((query, text))
# Score all pairs
scores = self.model.predict(pairs)
# Combine with original scores (optional boosting)
reranked = []
for i, (chunk, original_score) in enumerate(results):
# Cross-encoder score dominates, but original score breaks ties
combined_score = scores[i] * 0.9 + original_score * 0.1
reranked.append((chunk, combined_score))
# Sort by combined score
reranked.sort(key=lambda x: -x[1])
return reranked[:top_k]
def _format_for_reranking(self, chunk: DocChunk) -> str:
"""Format chunk for reranking, including hierarchy context."""
parts = []
# Add hierarchy path
if chunk.section_hierarchy:
parts.append(f"Section: {' > '.join(chunk.section_hierarchy)}")
# Add document path
parts.append(f"Document: {chunk.doc_path}")
# Add content
parts.append(chunk.content)
return "\n".join(parts)
def rerank_with_diversity(
self,
query: str,
results: list[tuple[DocChunk, float]],
top_k: int = 10,
diversity_weight: float = 0.3
) -> list[tuple[DocChunk, float]]:
"""Rerank with diversity to avoid redundant results."""
# First, get cross-encoder scores
reranked = self.rerank(query, results, top_k=len(results))
# Then apply MMR (Maximal Marginal Relevance) for diversity
selected = []
remaining = list(reranked)
while len(selected) < top_k and remaining:
if not selected:
# First item: highest score
selected.append(remaining.pop(0))
else:
# Subsequent items: balance relevance and diversity
best_idx = 0
best_mmr = float('-inf')
for i, (chunk, score) in enumerate(remaining):
# Relevance component
relevance = score
# Diversity component: max similarity to already selected
max_sim = max(
self._chunk_similarity(chunk, sel_chunk)
for sel_chunk, _ in selected
)
# MMR score
mmr = (1 - diversity_weight) * relevance - diversity_weight * max_sim
if mmr > best_mmr:
best_mmr = mmr
best_idx = i
selected.append(remaining.pop(best_idx))
return selected
def _chunk_similarity(self, chunk_a: DocChunk, chunk_b: DocChunk) -> float:
"""Compute similarity between chunks for diversity."""
# Simple Jaccard similarity on words
words_a = set(chunk_a.content.lower().split())
words_b = set(chunk_b.content.lower().split())
if not words_a or not words_b:
return 0
intersection = len(words_a & words_b)
union = len(words_a | words_b)
return intersection / union
Why diversity matters: Documentation often has multiple pages covering similar topics—installation guides for different operating systems, authentication methods, or deployment targets. Without diversity, results might show 5 variations of "install on Ubuntu" when the user needed "install on Mac."
When to Use Reranking
Reranking adds latency (50-200ms for 30 candidates). Use it strategically:
| Scenario | Use Reranking? | Why |
|---|---|---|
| High-stakes queries | Yes | Precision matters more than latency |
| API reference lookup | Sometimes | Exact matches might not need it |
| Troubleshooting | Yes | Finding the right error fix is critical |
| Getting started guides | No | Usually few candidates, all relevant |
| Complex multi-part queries | Yes | Cross-encoder handles nuance better |
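A small sketch of how this table might translate into code: gate the reranker on the classified intent, the candidate count, and whether the query was decomposed. The thresholds are illustrative.

```python
def should_rerank(intent: QueryIntent, candidate_count: int, has_sub_queries: bool) -> bool:
    """Decide whether a cross-encoder pass is worth the extra latency."""
    if candidate_count < 8:
        return False  # few candidates: ordering barely matters
    if intent in (QueryIntent.TROUBLESHOOT, QueryIntent.COMPARISON):
        return True   # precision-critical intents
    return has_sub_queries  # complex multi-part queries benefit too

# Inside the retrieval pipeline:
# if reranker and should_rerank(analysis.intent, len(merged), bool(analysis.sub_queries)):
#     merged = reranker.rerank(query, merged[:30])
```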
Reranking Model Selection (2025)
| Model | Latency (30 docs) | Quality | Best For | Cost |
|---|---|---|---|---|
| Cohere rerank-v4.0-fast | ~30ms | Excellent | Production speed | $0.10/1K queries |
| Cohere rerank-v4.0-pro | ~80ms | Best | Maximum quality | $0.20/1K queries |
| Cohere rerank-v3.5 | ~60ms | Excellent | Multilingual (100+ langs) | $0.10/1K queries |
| BAAI/bge-reranker-v2-gemma2-lightweight | ~40ms | Excellent | Open-source SOTA | Free (local) |
| BAAI/bge-reranker-v2-m3 | ~50ms | Very Good | Multilingual open-source | Free (local) |
| mxbai-rerank-v2 | ~45ms | Excellent | Open-source alternative | Free (local) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | ~100ms | Good | Baseline/comparison | Free (local) |
2025 Reranking Developments:
- Cohere Rerank 4.0 (newest): SOTA performance with 'fast' and 'pro' variants for different latency/quality tradeoffs
- BGE-reranker-v2-gemma2-lightweight: Based on Gemma-2-9B with token compression, achieving excellent quality while saving resources
- LLM-based reranking: 5-8% higher accuracy than cross-encoders but adds 4-6 seconds latency—use for offline/batch processing only
Reranking improves retrieval by up to 48% according to Databricks research. The quality boost is especially pronounced for complex queries where initial retrieval returns many marginally-relevant results.
Documentation-specific tuning: General rerankers are trained on web search data (MS MARCO). For best results on documentation, fine-tune on query-doc pairs from your search logs. Even 1,000 labeled examples significantly improves relevance.
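A hedged sketch of that fine-tuning loop with sentence-transformers, using (query, chunk text) pairs mined from search logs: clicked results as positives, shown-but-skipped results as negatives. The sample pairs and output path are invented for illustration.

```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# label 1.0 for clicked results, 0.0 for shown-but-skipped results
train_samples = [
    InputExample(texts=["how do I rotate API keys?", "Rotate keys from the dashboard under Settings > API."], label=1.0),
    InputExample(texts=["how do I rotate API keys?", "Error handling best practices for API calls."], label=0.0),
    # ... ideally ~1,000+ labeled pairs from your search logs
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=100)
model.save("models/docs-reranker")
```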
Answer Generation
Retrieval gets us the right content. Answer generation synthesizes it into a response that actually helps the user. For documentation, this means more than just summarizing—it means citing sources, estimating confidence, and knowing when to say "I don't know."
What Good Documentation Answers Look Like
Bad answer: "You need to configure authentication. Set up your credentials and make sure the client is initialized properly."
Good answer: "To configure authentication, add your API key to the environment variables as shown in the Configuration section [1]. Then initialize the client with Client(api_key=os.environ['API_KEY']) as demonstrated in the Quick Start [2]. If you're using OAuth instead of API keys, see the OAuth Setup guide [3]."
The difference:
- Specific: References exact configuration steps, not vague guidance
- Cited: Points to source sections so users can dive deeper
- Complete: Covers the common case (API key) and alternatives (OAuth)
- Grounded: Only states what's in the documentation
Confidence Estimation
Not all answers are equally reliable. We estimate confidence based on:
- Retrieval scores: High-scoring chunks suggest strong matches
- Uncertainty language: Phrases like "I'm not sure" or "not found" indicate gaps
- Answer completeness: Very short answers might indicate missing information
This confidence score helps the UI decide whether to show the answer prominently, suggest related searches, or recommend contacting support.
Answer Generator Implementation
class AnswerGenerator:
"""Generate answers from documentation."""
def __init__(self, client):
self.client = client
def generate(
self,
query: str,
retrieval_result: RetrievalResult,
include_citations: bool = True
) -> dict:
"""Generate answer from retrieved chunks."""
# Format context
context = self._format_context(retrieval_result.chunks)
system_prompt = """You are a documentation assistant.
Answer the user's question based ONLY on the provided documentation excerpts.
Guidelines:
1. Be accurate - only state what's in the documentation
2. Be concise - give clear, direct answers
3. Include code examples when relevant
4. If the documentation doesn't contain the answer, say so
5. Reference specific sections when helpful
{citation_instruction}
""".format(
citation_instruction="Add citations like [1], [2] referencing the doc excerpts."
if include_citations else ""
)
prompt = f"""Question: {query}
Documentation excerpts:
{context}
Answer:"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
)
answer = response.choices[0].message.content
# Extract citations
citations = self._extract_citations(
answer,
retrieval_result.chunks
) if include_citations else []
# Estimate confidence
confidence = self._estimate_confidence(
query,
answer,
retrieval_result
)
return {
"answer": answer,
"citations": citations,
"confidence": confidence,
"sources": [
{
"path": chunk.doc_path,
"section": " > ".join(chunk.section_hierarchy),
"type": chunk.chunk_type
}
for chunk in retrieval_result.chunks[:5]
]
}
def _format_context(self, chunks: list[DocChunk]) -> str:
"""Format chunks for context."""
formatted = []
for i, chunk in enumerate(chunks):
header = f"[{i+1}] {chunk.doc_path}"
if chunk.section_hierarchy:
header += f" > {' > '.join(chunk.section_hierarchy)}"
if chunk.chunk_type == "code":
content = f"```{chunk.code_language or ''}\n{chunk.content}\n```"
else:
content = chunk.content
formatted.append(f"{header}\n{content}")
return "\n\n---\n\n".join(formatted)
def _extract_citations(
self,
answer: str,
chunks: list[DocChunk]
) -> list[dict]:
"""Extract citations from answer."""
citations = []
# Find citation markers [1], [2], etc.
markers = re.findall(r"\[(\d+)\]", answer)
for marker in set(markers):
idx = int(marker) - 1
if 0 <= idx < len(chunks):
chunk = chunks[idx]
citations.append({
"marker": f"[{marker}]",
"path": chunk.doc_path,
"section": " > ".join(chunk.section_hierarchy)
})
return citations
def _estimate_confidence(
self,
query: str,
answer: str,
retrieval_result: RetrievalResult
) -> float:
"""Estimate answer confidence."""
# Factors:
# 1. Retrieval scores
avg_score = sum(retrieval_result.scores[:3]) / max(1, min(3, len(retrieval_result.scores)))  # guard against empty scores
# 2. Whether answer indicates uncertainty
uncertainty_phrases = [
"I don't have",
"not found in",
"documentation doesn't",
"unclear",
"might be",
"not sure"
]
has_uncertainty = any(p in answer.lower() for p in uncertainty_phrases)
# 3. Answer length (very short might indicate missing info)
length_factor = min(1.0, len(answer) / 200)
# Combine factors
confidence = avg_score * 0.5 + (0 if has_uncertainty else 0.3) + length_factor * 0.2
return min(1.0, max(0.0, confidence))
Key answer generation patterns:
- Context formatting matters: The _format_context method adds citation markers [1], [2] and section paths. This structured format helps the LLM produce citations and lets users trace answers back to sources.
- Grounded generation: The system prompt emphasizes "ONLY on the provided documentation excerpts." This is crucial—hallucination in documentation answers is worse than no answer. Users trust docs to be accurate.
- Confidence as a feature: The confidence score isn't just internal metadata. Surface it in the UI: high-confidence answers can be shown prominently, while low-confidence ones might show "This answer may be incomplete—try refining your search."
- When to skip generation: If retrieval_result.chunks is empty or all low-scoring, don't generate—just say "No relevant documentation found" and suggest alternative searches.
Streaming for better UX: For production, stream the answer generation. Users see tokens appear immediately rather than waiting 2-3 seconds for the full response. This dramatically improves perceived latency.
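A minimal streaming sketch with the OpenAI Python client; the system_prompt and prompt variables are the ones assembled in the generator above, and push_to_client stands in for however your app delivers tokens (SSE, websocket).

```python
def stream_answer(client, system_prompt: str, prompt: str):
    """Yield answer tokens as they arrive instead of waiting for the full response."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# for token in stream_answer(client, system_prompt, prompt):
#     push_to_client(token)  # hypothetical delivery function (SSE, websocket, etc.)
```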
Production Optimization
A documentation search system that works in development might fall over in production. Real traffic brings latency requirements, cache invalidation challenges, and the need to understand what's working and what's not.
The Latency Challenge
Documentation search has stricter latency requirements than you might expect. Users searching docs are often in the middle of a task—coding, debugging, configuring. Every second they wait is a second of lost flow state.
Target latencies:
- Autocomplete suggestions: < 100ms
- Search results: < 500ms
- AI-generated answers: < 2s (with streaming)
To hit these targets, we need caching at multiple levels.
Caching Strategy
We cache at two levels:
- Query result cache: Store retrieval results for repeated queries. Documentation queries are highly repetitive—"how to install" gets asked constantly.
- Embedding cache: Store embeddings for query strings. Computing embeddings is relatively expensive (50-100ms), and the same queries recur.
Cache invalidation is the hard part: When documentation updates, we need to invalidate cached results that included the changed content. Naive approaches (clear everything) hurt performance. Smart approaches (track which docs affect which cached queries) add complexity.
import hashlib
from datetime import datetime, timedelta
from typing import Optional
class DocumentationCache:
"""Cache for documentation search."""
def __init__(
self,
query_ttl: int = 3600,
embedding_ttl: int = 86400
):
self.query_cache = {}
self.embedding_cache = {}
self.query_ttl = query_ttl
self.embedding_ttl = embedding_ttl
def _hash_query(self, query: str, **kwargs) -> str:
"""Hash query and parameters."""
key_data = f"{query}:{sorted(kwargs.items())}"
return hashlib.md5(key_data.encode()).hexdigest()
def get_query_result(
self,
query: str,
**kwargs
) -> Optional[RetrievalResult]:
"""Get cached query result."""
key = self._hash_query(query, **kwargs)
if key in self.query_cache:
result, timestamp = self.query_cache[key]
if datetime.now() - timestamp < timedelta(seconds=self.query_ttl):
return result
del self.query_cache[key]
return None
def set_query_result(
self,
query: str,
result: RetrievalResult,
**kwargs
):
"""Cache query result."""
key = self._hash_query(query, **kwargs)
self.query_cache[key] = (result, datetime.now())
def get_embedding(self, text: str) -> Optional[np.ndarray]:
"""Get cached embedding."""
key = hashlib.md5(text.encode()).hexdigest()
if key in self.embedding_cache:
embedding, timestamp = self.embedding_cache[key]
if datetime.now() - timestamp < timedelta(seconds=self.embedding_ttl):
return embedding
del self.embedding_cache[key]
return None
def set_embedding(self, text: str, embedding: np.ndarray):
"""Cache embedding."""
key = hashlib.md5(text.encode()).hexdigest()
self.embedding_cache[key] = (embedding, datetime.now())
def invalidate_doc(self, doc_path: str):
"""Invalidate cache for a document."""
# Remove query results that include this doc
keys_to_remove = []
for key, (result, _) in self.query_cache.items():
if any(c.doc_path == doc_path for c in result.chunks):
keys_to_remove.append(key)
for key in keys_to_remove:
del self.query_cache[key]
Cache design decisions:
- TTL-based expiration: Query results expire after 1 hour (configurable). This balances freshness with performance—most docs don't change hourly.
- Content-aware invalidation: The invalidate_doc method only removes cached results that actually included the changed document. This is more surgical than clearing everything.
- In-memory vs. distributed: This implementation uses in-memory dicts, which works for single-instance deployments. For multi-instance, swap to Redis with the same interface (see the sketch below).
- What NOT to cache: Don't cache low-confidence results—they're likely to be wrong, and caching them means serving wrong answers faster.
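For the multi-instance case, a sketch of a Redis-backed query cache with the same get/set interface; it assumes the redis-py client and uses pickle, which is only appropriate for trusted internal data.

```python
import hashlib
import pickle
from typing import Optional

import redis

class RedisQueryCache:
    """Query cache backed by Redis (sketch, not hardened for production)."""

    def __init__(self, url: str = "redis://localhost:6379/0", ttl: int = 3600):
        self.client = redis.Redis.from_url(url)
        self.ttl = ttl

    def _key(self, query: str, **kwargs) -> str:
        raw = f"{query}:{sorted(kwargs.items())}"
        return "docsearch:" + hashlib.md5(raw.encode()).hexdigest()

    def get_query_result(self, query: str, **kwargs) -> Optional["RetrievalResult"]:
        data = self.client.get(self._key(query, **kwargs))
        return pickle.loads(data) if data else None

    def set_query_result(self, query: str, result: "RetrievalResult", **kwargs):
        self.client.setex(self._key(query, **kwargs), self.ttl, pickle.dumps(result))
```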
Incremental Index Updates
Documentation changes frequently—typo fixes, new sections, updated examples. Rebuilding the entire index for every change is wasteful and slow. Instead, we track which documents changed and update only those.
The challenge: When a document changes, we need to:
- Remove old chunks from that document
- Re-chunk the updated document
- Re-embed the new chunks
- Invalidate any cached queries that included the old chunks
When to full rebuild vs. incremental:
- Single doc change: Incremental update
- Few docs changed: Incremental updates
- Major restructuring (>30% of docs): Full rebuild is faster
class IncrementalIndexManager:
"""Manage incremental documentation updates."""
def __init__(self, index: DocumentationIndex, cache: DocumentationCache):
self.index = index
self.cache = cache
self.doc_hashes = {} # doc_path -> content hash
def check_updates(self, doc_paths: list[str]) -> list[str]:
"""Check which documents need updating."""
updated = []
for path in doc_paths:
with open(path, "r") as f:
content = f.read()
content_hash = hashlib.md5(content.encode()).hexdigest()
if path not in self.doc_hashes or self.doc_hashes[path] != content_hash:
updated.append(path)
self.doc_hashes[path] = content_hash
return updated
def update_documents(self, doc_paths: list[str], chunker: DocumentationChunker):
"""Update specific documents in the index."""
for path in doc_paths:
# Remove old chunks
self.index.chunks = [
c for c in self.index.chunks
if c.doc_path != path
]
# Add new chunks
with open(path, "r") as f:
content = f.read()
new_chunks = chunker.chunk_markdown(content, path)
self.index.add_documents(new_chunks)
# Invalidate cache
self.cache.invalidate_doc(path)
def rebuild_if_needed(self, updated_paths: list[str], threshold: float = 0.3) -> bool:
"""Rebuild the full index if too many documents changed."""
# If more than threshold% of tracked docs changed, a full rebuild is more efficient
updated_ratio = len(updated_paths) / max(len(self.doc_hashes), 1)
if updated_ratio > threshold:
self._full_rebuild()
return True
return False
def _full_rebuild(self):
"""Perform full index rebuild."""
self.index._rebuild_embeddings()
self.index._rebuild_keyword_index()
self.cache.query_cache.clear()
Incremental update workflow:
- On git push/deploy: CI/CD detects which markdown files changed
- Call update endpoint: POST /index/update with the changed file paths
- Manager handles the rest: Re-chunks, re-embeds, invalidates cache
Cost savings: For a 1,000-page documentation site, incremental updates when editing a single page save ~99% of embedding compute costs vs. full rebuild. At $0.0001/1K tokens, this adds up.
Gotcha—structural changes: If you rename sections or move content between pages, incremental updates might miss cross-references. Run a full rebuild weekly or when major restructuring happens.
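A sketch of the deploy-time hook: list changed markdown files with git and post them to the update endpoint defined in the service below. The commit range and the service URL are assumptions about your CI environment.

```python
import subprocess
import requests

def changed_markdown_files(base_ref: str = "HEAD~1", head_ref: str = "HEAD") -> list[str]:
    """List markdown files changed between two git refs."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.endswith((".md", ".mdx"))]

if __name__ == "__main__":
    paths = changed_markdown_files()
    if paths:
        # hypothetical internal URL for the FastAPI service shown later in this guide
        resp = requests.post("http://docs-search.internal/index/update", json=paths)
        resp.raise_for_status()
        print(f"Reindexed {len(paths)} documents")
```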
Search Analytics
You can't improve what you don't measure. Search analytics reveal what users are actually looking for, where the system fails, and what documentation is missing.
Key metrics to track:
| Metric | What It Tells You |
|---|---|
| Zero-result rate | Queries where we found nothing—documentation gaps or retrieval failures |
| Low-confidence rate | Queries where we found something but aren't sure it's right |
| Click-through rate | Whether users find results useful (requires UI integration) |
| Latency percentiles | P50 for typical experience, P95/P99 for worst cases |
| Popular queries | What users actually search for vs. what you expect |
The most valuable insight: Zero-result and low-confidence queries often reveal missing documentation. If 50 users search for "webhook setup" and get poor results, you probably need a webhook guide.
from collections import defaultdict
from datetime import datetime
class SearchAnalytics:
"""Track and analyze search patterns."""
def __init__(self):
self.queries = []
self.clicks = defaultdict(int)
self.zero_results = []
self.low_confidence = []
def log_search(
self,
query: str,
result_count: int,
confidence: float,
latency_ms: float
):
"""Log a search event."""
self.queries.append({
"query": query,
"result_count": result_count,
"confidence": confidence,
"latency_ms": latency_ms,
"timestamp": datetime.now()
})
if result_count == 0:
self.zero_results.append(query)
if confidence < 0.5:
self.low_confidence.append(query)
def log_click(self, query: str, doc_path: str):
"""Log a result click."""
self.clicks[f"{query}:{doc_path}"] += 1
def get_popular_queries(self, limit: int = 20) -> list[tuple[str, int]]:
"""Get most popular queries."""
from collections import Counter
query_counts = Counter(q["query"].lower() for q in self.queries)
return query_counts.most_common(limit)
def get_failed_queries(self) -> list[str]:
"""Get queries with no results or low confidence."""
return list(set(self.zero_results + self.low_confidence))
def get_stats(self) -> dict:
"""Get search statistics."""
if not self.queries:
return {}
latencies = [q["latency_ms"] for q in self.queries]
confidences = [q["confidence"] for q in self.queries]
return {
"total_searches": len(self.queries),
"zero_result_rate": len(self.zero_results) / len(self.queries),
"low_confidence_rate": len(self.low_confidence) / len(self.queries),
"avg_latency_ms": sum(latencies) / len(latencies),
"p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
"avg_confidence": sum(confidences) / len(confidences)
}
Making analytics actionable:
- Weekly review ritual: Every week, pull `get_failed_queries()` and review the top 10. Are these documentation gaps or retrieval failures? Either write the missing docs or tune retrieval.
- Popular queries drive priorities: `get_popular_queries()` shows what users actually search for. If your top query is "authentication" but your auth docs are weak, that's high-impact work.
- Click logging requires UI integration: The `log_click` method needs frontend instrumentation. Worth the effort—click-through rate is the most reliable signal of result quality.
- Export for deeper analysis: Periodically export `self.queries` to a data warehouse. Run cohort analysis, build ML models, correlate with user outcomes (see the sketch below).
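To make the export step concrete, here's a minimal sketch that dumps the in-memory query log to a JSONL file a warehouse loader can pick up. It is not part of SearchAnalytics itself, and the output path is a placeholder.
import json

def export_queries(analytics: SearchAnalytics, path: str = "search_log.jsonl") -> int:
    """Append the in-memory query log to a JSONL file for warehouse ingestion."""
    with open(path, "a") as f:
        for event in analytics.queries:
            record = dict(event)
            record["timestamp"] = record["timestamp"].isoformat()  # datetime -> ISO string
            f.write(json.dumps(record) + "\n")
    return len(analytics.queries)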
Complete Search Service
Putting it all together, here's how the components connect into a production API. The key insight is that each component we've built (index, cache, analyzer, retriever, generator, analytics) plugs together cleanly:
Request flow:
- Cache check: Skip expensive retrieval if we've seen this query recently
- Multi-step retrieval: Query analysis → hybrid search → link expansion → reranking
- Answer generation: Synthesize response from retrieved chunks
- Analytics logging: Track for later analysis and improvement
Why FastAPI? It's async-native (important for LLM calls), has automatic OpenAPI docs, and handles validation via Pydantic. But the pattern works with Flask, Django, or any framework.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
app = FastAPI()
# Initialize components
index = DocumentationIndex()
cache = DocumentationCache()
query_analyzer = QueryAnalyzer(client)
retriever = MultiStepRetriever(index, query_analyzer)
generator = AnswerGenerator(client)
analytics = SearchAnalytics()
class SearchRequest(BaseModel):
query: str
max_results: int = 10
include_answer: bool = True
class SearchResponse(BaseModel):
answer: Optional[str]
confidence: float
sources: list[dict]
citations: list[dict]
latency_ms: float
@app.post("/search", response_model=SearchResponse)
async def search(request: SearchRequest):
"""Search documentation."""
import time
start = time.time()
# Check cache
cached = cache.get_query_result(
request.query,
max_results=request.max_results
)
if cached:
retrieval_result = cached
else:
# Perform retrieval
retrieval_result = retriever.retrieve(
request.query,
max_results=request.max_results
)
cache.set_query_result(
request.query,
retrieval_result,
max_results=request.max_results
)
# Generate answer
if request.include_answer and retrieval_result.chunks:
answer_result = generator.generate(
request.query,
retrieval_result
)
else:
answer_result = {
"answer": None,
"confidence": 0,
"citations": [],
"sources": []
}
latency_ms = (time.time() - start) * 1000
# Log analytics
analytics.log_search(
request.query,
len(retrieval_result.chunks),
answer_result["confidence"],
latency_ms
)
return SearchResponse(
answer=answer_result["answer"],
confidence=answer_result["confidence"],
sources=answer_result["sources"],
citations=answer_result["citations"],
latency_ms=latency_ms
)
@app.post("/index/update")
async def update_index(doc_paths: list[str]):
"""Update index for specific documents."""
chunker = DocumentationChunker()
manager = IncrementalIndexManager(index, cache)
manager.update_documents(doc_paths, chunker)
return {"updated": len(doc_paths)}
@app.get("/analytics")
async def get_analytics():
"""Get search analytics."""
return analytics.get_stats()
Deployment considerations:
- Horizontal scaling: The service is stateless (cache/index can be shared via Redis/vector DB). Run multiple instances behind a load balancer.
- Index updates: The `/index/update` endpoint handles incremental updates. Trigger it from CI/CD when docs change.
- Monitoring: Export analytics to Prometheus/Grafana. Alert on latency spikes and zero-result rate increases.
- Cost control: Cache aggressively. Use smaller models for query analysis, larger models for answer generation.
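For reference, a minimal client call against the /search endpoint might look like this; the host and port are assumptions about your deployment, and the requests library is assumed to be available.
import requests

# Sketch: smoke-testing the search service from a client script.
resp = requests.post(
    "http://localhost:8000/search",
    json={"query": "how do I configure webhooks?", "max_results": 5, "include_answer": True},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["answer"])
print("confidence:", result["confidence"], "latency_ms:", result["latency_ms"])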
Documentation Versioning
Real documentation has versions. Users on v1.x need different answers than users on v2.x. A search system that ignores this returns confusing, potentially harmful results—telling a v1 user to use a v2-only feature.
The Version Challenge
Versioned documentation creates several problems:
- Version detection: How do we know which version the user needs? URL path? Explicit selector? Inference from their question?
- Cross-version search: Sometimes users want to see how things changed. "What's different in v2?" requires comparing versions.
- Version aliases: "latest", "stable", "LTS"—these need to resolve to actual versions.
- Deprecation handling: Content might exist in old versions but be deprecated. We should warn users, not just serve stale content.
Multi-Version Index Management
The key insight: maintain separate indices per version, with a routing layer that directs queries appropriately.
from dataclasses import dataclass
from typing import Optional
from datetime import datetime
@dataclass
class DocVersion:
version: str
release_date: datetime
is_latest: bool = False
is_default: bool = False
status: str = "stable" # stable, beta, deprecated, eol
class VersionedDocumentationIndex:
"""Manage documentation across multiple versions."""
def __init__(self):
self.versions: dict[str, DocVersion] = {}
self.indices: dict[str, DocumentationIndex] = {}
self.version_aliases: dict[str, str] = {} # "latest" -> "v2.0"
def add_version(
self,
version: str,
index: DocumentationIndex,
config: DocVersion
):
"""Add a new documentation version."""
self.versions[version] = config
self.indices[version] = index
# Update aliases
if config.is_latest:
self.version_aliases["latest"] = version
if config.is_default:
self.version_aliases["default"] = version
def resolve_version(self, version_query: str) -> str:
"""Resolve version query to actual version."""
# Handle aliases
if version_query in self.version_aliases:
return self.version_aliases[version_query]
# Handle version patterns (e.g., "v2.x" -> latest v2)
if version_query.endswith(".x"):
major = version_query[:-2]
matching = [
v for v in self.versions
if v.startswith(major)
]
if matching:
return sorted(matching)[-1] # Latest matching
# Direct version
if version_query in self.versions:
return version_query
# Default to latest
return self.version_aliases.get("latest", list(self.versions.keys())[0])
def search(
self,
query: str,
version: str = "default",
cross_version: bool = False,
top_k: int = 10
) -> dict:
"""Search documentation with version awareness."""
target_version = self.resolve_version(version)
results = {}
if cross_version:
# Search across all stable versions
for ver, config in self.versions.items():
if config.status in ["stable", "beta"]:
version_results = self.indices[ver].search(query, top_k=5)
results[ver] = version_results
else:
# Search single version
results[target_version] = self.indices[target_version].search(
query, top_k=top_k
)
return {
"requested_version": version,
"resolved_version": target_version,
"results": results,
"cross_version": cross_version
}
def find_version_differences(
self,
query: str,
version_a: str,
version_b: str
) -> dict:
"""Find differences in content between versions."""
results_a = self.indices[version_a].search(query, top_k=5)
results_b = self.indices[version_b].search(query, top_k=5)
# Compare content
content_a = [r[0].content for r in results_a]
content_b = [r[0].content for r in results_b]
differences = []
for chunk_a in results_a:
matching = [
r for r in results_b
if r[0].doc_path == chunk_a[0].doc_path and
r[0].section_hierarchy == chunk_a[0].section_hierarchy
]
if matching:
if chunk_a[0].content != matching[0][0].content:
differences.append({
"path": chunk_a[0].doc_path,
"section": chunk_a[0].section_hierarchy,
"version_a": chunk_a[0].content[:200],
"version_b": matching[0][0].content[:200]
})
else:
differences.append({
"path": chunk_a[0].doc_path,
"section": chunk_a[0].section_hierarchy,
"only_in": version_a
})
return {
"version_a": version_a,
"version_b": version_b,
"differences": differences
}
Version resolution patterns:
"latest"→ Most recent stable version"v2.x"→ Latest v2 minor version"stable"→ Current recommended version (might not be latest)
Cross-version search use cases:
- Migration guides: "What changed between v1 and v2?"
- Compatibility: "Does this feature exist in v1.5?"
- Regression investigation: "When did this behavior change?"
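A minimal usage sketch, assuming index_v1 and index_v2 are DocumentationIndex instances already built from the v1.x and v2.0 doc trees (datetime is imported above):
# Sketch: registering two versions and routing queries between them.
versioned = VersionedDocumentationIndex()
versioned.add_version(
    "v1.5", index_v1,
    DocVersion(version="v1.5", release_date=datetime(2023, 6, 1))
)
versioned.add_version(
    "v2.0", index_v2,
    DocVersion(version="v2.0", release_date=datetime(2024, 3, 1), is_latest=True, is_default=True)
)

print(versioned.resolve_version("latest"))  # -> "v2.0"
print(versioned.resolve_version("v1.x"))    # -> "v1.5"

results = versioned.search("configure webhooks", version="v1.x")
diff = versioned.find_version_differences("authentication", "v1.5", "v2.0")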
Migration Assistance
One of the most valuable features for versioned docs: helping users migrate. This combines version comparison with intelligent answer generation.
class VersionMigrationAssistant:
"""Help users migrate between documentation versions."""
def __init__(self, client, versioned_index: VersionedDocumentationIndex):
self.client = client
self.index = versioned_index
def generate_migration_guide(
self,
topic: str,
from_version: str,
to_version: str
) -> dict:
"""Generate a migration guide for a topic."""
# Get relevant content from both versions
old_results = self.index.indices[from_version].search(topic, top_k=5)
new_results = self.index.indices[to_version].search(topic, top_k=5)
old_content = "\n\n".join([r[0].content for r in old_results])
new_content = "\n\n".join([r[0].content for r in new_results])
prompt = f"""Compare these two versions of documentation and identify changes.
## Version {from_version}:
{old_content}
## Version {to_version}:
{new_content}
Provide:
1. Breaking changes that require code updates
2. New features or options available
3. Deprecated features to avoid
4. Step-by-step migration instructions"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"topic": topic,
"from_version": from_version,
"to_version": to_version,
"migration_guide": response.choices[0].message.content
}
def detect_deprecated_usage(
self,
code_snippet: str,
current_version: str
    ) -> dict:
"""Detect deprecated API usage in code."""
# Search for deprecation notices
deprecation_results = self.index.indices[current_version].search(
"deprecated removed breaking change",
top_k=20
)
deprecation_info = "\n\n".join([
r[0].content for r in deprecation_results
if "deprecated" in r[0].content.lower()
])
prompt = f"""Analyze this code for deprecated API usage based on the documentation.
Code:
{code_snippet}
Deprecation information from docs:
{deprecation_info}
List any deprecated APIs used and their replacements."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"code": code_snippet,
"version": current_version,
"analysis": response.choices[0].message.content
}
Migration assistance use cases:
- Upgrade planning: Before upgrading, users can see what changes affect their code
- Breaking change detection: Automatically flag API calls that won't work in the new version
- Migration guides: Generate step-by-step migration instructions from version diffs
- Deprecation warnings: Surface deprecated features before they're removed
Integration with search: When a user searches and gets results from an old version, proactively offer: "This documentation is for v1.x. You might also want to see what changed in v2."
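A short usage sketch, assuming client is the OpenAI client used throughout this guide and versioned is the VersionedDocumentationIndex from the earlier sketch; the snippet passed to detect_deprecated_usage is a toy example.
assistant = VersionMigrationAssistant(client, versioned)

# Generate a topic-scoped migration guide between two indexed versions.
guide = assistant.generate_migration_guide(
    topic="webhook configuration",
    from_version="v1.5",
    to_version="v2.0",
)
print(guide["migration_guide"])

# Check a (toy) code snippet for deprecated API usage against the current version's docs.
report = assistant.detect_deprecated_usage(
    code_snippet="client.webhooks.create(url, legacy_mode=True)",
    current_version="v2.0",
)
print(report["analysis"])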
API Reference Search
API reference documentation is different from guides and tutorials. It's highly structured—endpoints, parameters, return types, examples. Users search it differently too: they know what they're looking for (a specific method, parameter, or endpoint) and need precise answers.
Why API Reference Needs Special Treatment
General documentation search struggles with API reference because:
- Structure over prose: API docs are tables, signatures, and code—not flowing text that embeds well.
- Exact matching matters: A user searching for `POST /users` needs that exact endpoint, not semantically similar content about "creating accounts."
- Parameter-level granularity: Users often need details about a single parameter, not an entire endpoint.
- Cross-referencing: Understanding an endpoint often requires understanding related types, error codes, and authentication.
Structured API Documentation Model
We model API documentation as structured objects, not just text chunks:
from pydantic import BaseModel, Field
from typing import Optional, Literal
class APIParameter(BaseModel):
name: str
type: str
required: bool = True
default: Optional[str] = None
description: str
class APIEndpoint(BaseModel):
method: Literal["GET", "POST", "PUT", "DELETE", "PATCH"]
path: str
description: str
parameters: list[APIParameter] = Field(default_factory=list)
request_body: Optional[dict] = None
response_schema: Optional[dict] = None
examples: list[dict] = Field(default_factory=list)
auth_required: bool = True
rate_limit: Optional[str] = None
deprecated: bool = False
class APIClass(BaseModel):
name: str
module: str
description: str
methods: list["APIMethod"] = Field(default_factory=list)
properties: list[APIParameter] = Field(default_factory=list)
inheritance: list[str] = Field(default_factory=list)
examples: list[str] = Field(default_factory=list)
class APIMethod(BaseModel):
name: str
signature: str
description: str
parameters: list[APIParameter] = Field(default_factory=list)
returns: Optional[str] = None
raises: list[str] = Field(default_factory=list)
examples: list[str] = Field(default_factory=list)
async_method: bool = False
deprecated: bool = False
class APIReferenceIndex:
"""Specialized index for API reference documentation."""
def __init__(self, embedding_model):
self.encoder = embedding_model
self.endpoints: list[APIEndpoint] = []
self.classes: list[APIClass] = []
self.methods: list[APIMethod] = []
self.endpoint_embeddings = {}
self.class_embeddings = {}
self.method_embeddings = {}
def add_endpoint(self, endpoint: APIEndpoint):
"""Add an API endpoint to the index."""
self.endpoints.append(endpoint)
# Create searchable text
text = f"{endpoint.method} {endpoint.path}\n{endpoint.description}"
for param in endpoint.parameters:
text += f"\nParam: {param.name} ({param.type}) - {param.description}"
embedding = self.encoder.encode(text)
self.endpoint_embeddings[len(self.endpoints) - 1] = embedding
def add_class(self, api_class: APIClass):
"""Add an API class to the index."""
self.classes.append(api_class)
# Create searchable text
text = f"class {api_class.name}\n{api_class.description}"
for method in api_class.methods:
text += f"\nMethod: {method.name} - {method.description[:100]}"
embedding = self.encoder.encode(text)
self.class_embeddings[len(self.classes) - 1] = embedding
# Also index individual methods
for method in api_class.methods:
self._add_method(method, api_class.name)
def _add_method(self, method: APIMethod, class_name: str):
"""Add a method to the index."""
self.methods.append(method)
text = f"{class_name}.{method.name}{method.signature}\n{method.description}"
for param in method.parameters:
text += f"\nParam: {param.name} ({param.type}) - {param.description}"
embedding = self.encoder.encode(text)
self.method_embeddings[len(self.methods) - 1] = embedding
def search_endpoints(self, query: str, top_k: int = 5) -> list[tuple[APIEndpoint, float]]:
"""Search for API endpoints."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.endpoint_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.endpoints[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def search_methods(self, query: str, top_k: int = 5) -> list[tuple[APIMethod, float]]:
"""Search for API methods."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.method_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.methods[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def search_classes(self, query: str, top_k: int = 5) -> list[tuple[APIClass, float]]:
"""Search for API classes."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.class_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.classes[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def find_by_signature(self, signature_pattern: str) -> list:
"""Find methods by signature pattern."""
import re
pattern = re.compile(signature_pattern, re.IGNORECASE)
matches = []
for method in self.methods:
if pattern.search(method.signature):
matches.append(method)
return matches
def _cosine_similarity(self, a, b) -> float:
import numpy as np
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
class APISearchAgent:
"""Intelligent API reference search."""
def __init__(self, client, api_index: APIReferenceIndex, doc_index: DocumentationIndex):
self.client = client
self.api_index = api_index
self.doc_index = doc_index
def answer_api_question(self, question: str) -> dict:
"""Answer a question about the API."""
# Determine question type
question_lower = question.lower()
if any(word in question_lower for word in ["how to", "how do", "example"]):
return self._answer_howto(question)
elif any(word in question_lower for word in ["what is", "explain", "describe"]):
return self._answer_concept(question)
elif any(word in question_lower for word in ["parameter", "argument", "option"]):
return self._answer_parameter(question)
else:
return self._answer_general(question)
def _answer_howto(self, question: str) -> dict:
"""Answer how-to questions with examples."""
# Search for relevant methods
methods = self.api_index.search_methods(question, top_k=3)
# Search documentation for examples
doc_results = self.doc_index.search(question + " example", top_k=5)
# Build context
api_context = []
for method, score in methods:
api_context.append(f"Method: {method.name}\nSignature: {method.signature}\n"
f"Description: {method.description}")
if method.examples:
api_context.append(f"Examples:\n" + "\n".join(method.examples))
doc_context = [chunk.content for chunk, _ in doc_results]
prompt = f"""Question: {question}
API Reference:
{chr(10).join(api_context)}
Documentation:
{chr(10).join(doc_context)}
Provide a clear answer with code examples."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_methods": [m[0].name for m in methods],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _answer_concept(self, question: str) -> dict:
"""Answer conceptual questions."""
classes = self.api_index.search_classes(question, top_k=3)
doc_results = self.doc_index.search(question, top_k=5)
context = []
for cls, score in classes:
context.append(f"Class: {cls.name}\nModule: {cls.module}\n"
f"Description: {cls.description}")
for chunk, score in doc_results:
context.append(chunk.content)
prompt = f"""Question: {question}
Context:
{chr(10).join(context)}
Provide a clear explanation."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_classes": [c[0].name for c in classes],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _answer_parameter(self, question: str) -> dict:
"""Answer parameter-specific questions."""
methods = self.api_index.search_methods(question, top_k=5)
# Extract parameter info
param_info = []
for method, score in methods:
for param in method.parameters:
if any(word in param.name.lower() or word in param.description.lower()
for word in question.lower().split()):
param_info.append({
"method": method.name,
"parameter": param.name,
"type": param.type,
"required": param.required,
"default": param.default,
"description": param.description
})
return {
"answer": self._format_parameter_answer(param_info),
"parameters": param_info
}
def _answer_general(self, question: str) -> dict:
"""General API question answering."""
methods = self.api_index.search_methods(question, top_k=3)
endpoints = self.api_index.search_endpoints(question, top_k=3)
doc_results = self.doc_index.search(question, top_k=5)
context = []
for method, score in methods:
context.append(f"Method: {method.signature}\n{method.description}")
for endpoint, score in endpoints:
context.append(f"{endpoint.method} {endpoint.path}\n{endpoint.description}")
for chunk, score in doc_results:
context.append(chunk.content)
prompt = f"""Question: {question}
API Reference and Documentation:
{chr(10).join(context[:10])}
Provide a comprehensive answer."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_methods": [m[0].name for m in methods],
"relevant_endpoints": [f"{e[0].method} {e[0].path}" for e in endpoints],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _format_parameter_answer(self, params: list[dict]) -> str:
"""Format parameter information as readable answer."""
if not params:
return "No matching parameters found."
answer = "Here are the relevant parameters:\n\n"
for p in params:
answer += f"**{p['method']}.{p['parameter']}**\n"
answer += f"- Type: `{p['type']}`\n"
answer += f"- Required: {p['required']}\n"
if p['default']:
answer += f"- Default: `{p['default']}`\n"
answer += f"- Description: {p['description']}\n\n"
return answer
API search patterns:
The APISearchAgent demonstrates how to route questions to the right strategy:
- How-to questions → Search methods + documentation examples, emphasize code
- Concept questions → Search classes + documentation explanations, emphasize prose
- Parameter questions → Search method signatures, return structured parameter info
- General questions → Search everything, synthesize comprehensive answer
This routing is simple but effective. More sophisticated approaches use LLMs to classify questions, but rule-based routing handles 80% of cases with zero latency overhead.
Structured output is key: Instead of returning free-text answers, the API search returns structured data (relevant methods, endpoints, sources). This lets the UI render rich results—method signatures, parameter tables, "see also" links—rather than just a wall of text.
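A short usage sketch, assuming embedding_model is the encoder used elsewhere in this guide, doc_index is the DocumentationIndex built earlier, and client is the OpenAI client; the example endpoint is illustrative.
# Sketch: populating the API reference index and routing a question through the agent.
api_index = APIReferenceIndex(embedding_model)

api_index.add_endpoint(APIEndpoint(
    method="POST",
    path="/users",
    description="Create a new user account.",
    parameters=[
        APIParameter(name="email", type="string", description="Email address for the new user"),
        APIParameter(name="role", type="string", required=False, default="member",
                     description="Initial role assigned to the user"),
    ],
))

agent = APISearchAgent(client, api_index, doc_index)
result = agent.answer_api_question("how do I create a user?")  # routed to _answer_howto
print(result["answer"])
print("Methods consulted:", result.get("relevant_methods", []))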
User Feedback Integration
The best documentation search systems learn from their users. Every "this wasn't helpful" click is a signal. Every unanswered query reveals a gap. Building feedback loops turns your search system into a continuously improving system.
Why Feedback Matters
Static search systems degrade over time. Documentation changes, user expectations evolve, and new features create new query patterns. Without feedback, you're flying blind—optimizing for benchmarks that may not reflect real usage.
What feedback reveals:
| Feedback Signal | What It Means | Action |
|---|---|---|
| "Not helpful" on high-confidence result | Retrieval or generation failure | Review and fix |
| "Helpful" on low-confidence result | Confidence estimation is wrong | Adjust thresholds |
| "Incorrect" feedback | Serious issue—wrong information | Urgent review |
| "Outdated" feedback | Documentation staleness | Update docs |
| "Missing info" feedback | Documentation gap | Write new content |
Feedback Collection System
The feedback system needs to capture enough context to be actionable—not just "good/bad" but why and what the user was actually looking for.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Literal
from enum import Enum
class FeedbackType(str, Enum):
HELPFUL = "helpful"
NOT_HELPFUL = "not_helpful"
INCORRECT = "incorrect"
OUTDATED = "outdated"
MISSING_INFO = "missing_info"
@dataclass
class SearchFeedback:
query: str
answer: str
feedback_type: FeedbackType
comment: Optional[str]
user_id: Optional[str]
timestamp: datetime
search_results: list[str] # doc paths
confidence: float
class FeedbackCollector:
"""Collect and store user feedback on search results."""
def __init__(self, storage_path: str = "feedback.jsonl"):
self.storage_path = storage_path
self.feedback_buffer: list[SearchFeedback] = []
def submit_feedback(
self,
query: str,
answer: str,
feedback_type: FeedbackType,
search_results: list[str],
confidence: float,
comment: str = None,
user_id: str = None
) -> str:
"""Submit feedback for a search result."""
feedback = SearchFeedback(
query=query,
answer=answer,
feedback_type=feedback_type,
comment=comment,
user_id=user_id,
timestamp=datetime.now(),
search_results=search_results,
confidence=confidence
)
self.feedback_buffer.append(feedback)
self._persist_feedback(feedback)
return f"feedback_{len(self.feedback_buffer)}"
def _persist_feedback(self, feedback: SearchFeedback):
"""Persist feedback to storage."""
import json
with open(self.storage_path, "a") as f:
f.write(json.dumps({
"query": feedback.query,
"answer": feedback.answer[:500],
"feedback_type": feedback.feedback_type.value,
"comment": feedback.comment,
"user_id": feedback.user_id,
"timestamp": feedback.timestamp.isoformat(),
"search_results": feedback.search_results,
"confidence": feedback.confidence
}) + "\n")
def get_feedback_stats(self) -> dict:
"""Get feedback statistics."""
from collections import Counter
type_counts = Counter(f.feedback_type for f in self.feedback_buffer)
total = len(self.feedback_buffer)
return {
"total_feedback": total,
"helpful_rate": type_counts.get(FeedbackType.HELPFUL, 0) / max(total, 1),
"by_type": {t.value: c for t, c in type_counts.items()},
"avg_confidence_helpful": self._avg_confidence(FeedbackType.HELPFUL),
"avg_confidence_not_helpful": self._avg_confidence(FeedbackType.NOT_HELPFUL)
}
def _avg_confidence(self, feedback_type: FeedbackType) -> float:
matching = [f for f in self.feedback_buffer if f.feedback_type == feedback_type]
if not matching:
return 0
return sum(f.confidence for f in matching) / len(matching)
Learning from Feedback
Collecting feedback is useless if you don't act on it. The feedback learner automatically extracts signals and adjusts retrieval accordingly.
Automatic adjustments:
- Document boosting: If users consistently find results from a particular doc helpful, boost it in rankings. If results are consistently unhelpful, demote it.
- Query correction: If users provide "what they were actually looking for" in feedback, use that to expand future similar queries.
- Negative examples: Track query-answer pairs that failed so you can evaluate whether changes actually fix them.
class FeedbackLearner:
"""Learn from feedback to improve search."""
def __init__(self, feedback_collector: FeedbackCollector):
self.collector = feedback_collector
self.query_corrections: dict[str, str] = {} # query -> better query
self.doc_boosts: dict[str, float] = {} # doc_path -> boost factor
self.negative_examples: list[tuple[str, str]] = [] # (query, bad_answer)
def analyze_feedback(self):
"""Analyze feedback to extract learning signals."""
for feedback in self.collector.feedback_buffer:
if feedback.feedback_type == FeedbackType.HELPFUL:
# Boost documents that were helpful
for doc in feedback.search_results[:3]:
self.doc_boosts[doc] = self.doc_boosts.get(doc, 1.0) * 1.1
elif feedback.feedback_type == FeedbackType.NOT_HELPFUL:
# Reduce boost for unhelpful docs
for doc in feedback.search_results[:3]:
self.doc_boosts[doc] = self.doc_boosts.get(doc, 1.0) * 0.9
elif feedback.feedback_type == FeedbackType.INCORRECT:
# Track as negative example
self.negative_examples.append((feedback.query, feedback.answer))
elif feedback.feedback_type == FeedbackType.MISSING_INFO:
# Track queries that need better coverage
if feedback.comment:
# User might have provided what they were looking for
self._extract_query_improvement(feedback)
def _extract_query_improvement(self, feedback: SearchFeedback):
"""Extract query improvement from feedback comment."""
if feedback.comment and len(feedback.comment) > 10:
# Store as potential query expansion
self.query_corrections[feedback.query] = feedback.comment
def apply_doc_boosts(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Apply learned boosts to search results."""
boosted = []
for chunk, score in results:
boost = self.doc_boosts.get(chunk.doc_path, 1.0)
boosted.append((chunk, score * boost))
boosted.sort(key=lambda x: -x[1])
return boosted
def get_query_expansion(self, query: str) -> Optional[str]:
"""Get learned query expansion."""
return self.query_corrections.get(query)
def export_training_data(self) -> dict:
"""Export data for fine-tuning or evaluation."""
return {
"positive_examples": [
{"query": f.query, "docs": f.search_results, "answer": f.answer}
for f in self.collector.feedback_buffer
if f.feedback_type == FeedbackType.HELPFUL
],
"negative_examples": self.negative_examples,
"query_corrections": self.query_corrections,
"doc_boosts": self.doc_boosts
}
Active Learning Loop
The active learning loop combines feedback collection and learning into a continuous improvement cycle. It also identifies documentation gaps—queries that consistently fail despite the system's best efforts.
The improvement cycle:
User searches → System returns results → User provides feedback
       ↑                                          ↓
   Apply boosts ← Learn from feedback ← Store feedback
Gap identification: When multiple users ask similar questions and get poor results, that's a documentation gap—not a retrieval failure. The active learning loop surfaces these gaps for content teams to address.
class ActiveLearningLoop:
"""Active learning loop to continuously improve search."""
def __init__(
self,
retriever: MultiStepRetriever,
generator: AnswerGenerator,
feedback_learner: FeedbackLearner
):
self.retriever = retriever
self.generator = generator
self.learner = feedback_learner
def search_with_learning(
self,
query: str,
**kwargs
) -> dict:
"""Search with feedback-based improvements."""
# Check for query expansion from feedback
expanded_query = self.learner.get_query_expansion(query)
if expanded_query:
query = f"{query} {expanded_query}"
# Perform retrieval
retrieval_result = self.retriever.retrieve(query, **kwargs)
# Apply learned doc boosts
boosted_results = self.learner.apply_doc_boosts(
list(zip(retrieval_result.chunks, retrieval_result.scores))
)
retrieval_result.chunks = [r[0] for r in boosted_results]
retrieval_result.scores = [r[1] for r in boosted_results]
# Generate answer
answer_result = self.generator.generate(query, retrieval_result)
return {
"answer": answer_result["answer"],
"confidence": answer_result["confidence"],
"sources": answer_result["sources"],
"query_expanded": expanded_query is not None
}
def identify_gaps(self) -> list[dict]:
"""Identify documentation gaps from feedback."""
gaps = []
# Queries with low confidence but high frequency
from collections import Counter
query_counts = Counter(
f.query for f in self.learner.collector.feedback_buffer
)
for query, count in query_counts.most_common(20):
matching = [
f for f in self.learner.collector.feedback_buffer
if f.query == query
]
avg_confidence = sum(f.confidence for f in matching) / len(matching)
if avg_confidence < 0.5 and count >= 3:
gaps.append({
"query": query,
"frequency": count,
"avg_confidence": avg_confidence,
"feedback_types": [f.feedback_type.value for f in matching]
})
return gaps
def generate_improvement_report(self) -> str:
"""Generate a report of suggested improvements."""
gaps = self.identify_gaps()
stats = self.learner.collector.get_feedback_stats()
report = "# Documentation Search Improvement Report\n\n"
report += "## Overall Statistics\n"
report += f"- Total feedback: {stats['total_feedback']}\n"
report += f"- Helpful rate: {stats['helpful_rate']:.1%}\n"
report += f"- By type: {stats['by_type']}\n\n"
if gaps:
report += "## Documentation Gaps\n"
report += "These queries frequently return low-confidence results:\n\n"
for gap in gaps[:10]:
report += f"- **{gap['query']}** (asked {gap['frequency']}x, "
report += f"avg confidence: {gap['avg_confidence']:.2f})\n"
if self.learner.query_corrections:
report += "\n## Query Corrections\n"
report += "Users suggested these query improvements:\n\n"
for original, correction in list(self.learner.query_corrections.items())[:10]:
report += f"- \"{original}\" → \"{correction}\"\n"
return report
Real-Time Search Quality Monitoring
In production, you need to know when search quality degrades—before users complain. Real-time monitoring catches issues like:
- Latency spikes: Maybe an embedding service is slow, or the index got too large
- Confidence drops: A documentation update might have broken something
- Zero-result increases: New queries the system can't handle
- Feedback pattern changes: Sudden increase in "not helpful" feedback
The monitoring system tracks rolling windows of metrics and alerts when thresholds are exceeded.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import statistics
@dataclass
class SearchQualityMetric:
timestamp: datetime
query: str
latency_ms: float
result_count: int
confidence: float
had_feedback: bool
feedback_positive: Optional[bool]
class SearchQualityMonitor:
"""Monitor search quality in real-time."""
def __init__(self, window_minutes: int = 60):
self.metrics: list[SearchQualityMetric] = []
self.window_minutes = window_minutes
self.alerts: list[dict] = []
def record_search(
self,
query: str,
latency_ms: float,
result_count: int,
confidence: float
):
"""Record a search event."""
self.metrics.append(SearchQualityMetric(
timestamp=datetime.now(),
query=query,
latency_ms=latency_ms,
result_count=result_count,
confidence=confidence,
had_feedback=False,
feedback_positive=None
))
# Clean old metrics
self._cleanup_old_metrics()
# Check for alerts
self._check_alerts()
def record_feedback(self, query: str, positive: bool):
"""Record feedback for a recent search."""
# Find matching recent metric
for metric in reversed(self.metrics):
if metric.query == query and not metric.had_feedback:
metric.had_feedback = True
metric.feedback_positive = positive
break
def _cleanup_old_metrics(self):
"""Remove metrics outside the window."""
cutoff = datetime.now() - timedelta(minutes=self.window_minutes)
self.metrics = [m for m in self.metrics if m.timestamp > cutoff]
def _check_alerts(self):
"""Check for quality degradation."""
if len(self.metrics) < 10:
return
recent = self.metrics[-10:]
# Check latency
avg_latency = statistics.mean(m.latency_ms for m in recent)
if avg_latency > 2000: # 2 second threshold
self._add_alert("high_latency", f"Average latency {avg_latency:.0f}ms")
# Check confidence
avg_confidence = statistics.mean(m.confidence for m in recent)
if avg_confidence < 0.4:
self._add_alert("low_confidence", f"Average confidence {avg_confidence:.2f}")
# Check zero results
zero_result_rate = sum(1 for m in recent if m.result_count == 0) / len(recent)
if zero_result_rate > 0.2:
self._add_alert("high_zero_results", f"Zero result rate {zero_result_rate:.1%}")
def _add_alert(self, alert_type: str, message: str):
"""Add an alert."""
self.alerts.append({
"type": alert_type,
"message": message,
"timestamp": datetime.now().isoformat()
})
def get_dashboard_metrics(self) -> dict:
"""Get metrics for dashboard display."""
if not self.metrics:
return {"error": "No data"}
recent = self.metrics[-100:]
# Feedback metrics
with_feedback = [m for m in recent if m.had_feedback]
positive_feedback = [m for m in with_feedback if m.feedback_positive]
return {
"window_minutes": self.window_minutes,
"total_searches": len(recent),
"avg_latency_ms": statistics.mean(m.latency_ms for m in recent),
"p95_latency_ms": sorted(m.latency_ms for m in recent)[int(len(recent) * 0.95)],
"avg_confidence": statistics.mean(m.confidence for m in recent),
"zero_result_rate": sum(1 for m in recent if m.result_count == 0) / len(recent),
"feedback_rate": len(with_feedback) / len(recent) if recent else 0,
"positive_feedback_rate": len(positive_feedback) / len(with_feedback) if with_feedback else 0,
"active_alerts": self.alerts[-5:]
}
def get_slow_queries(self, threshold_ms: float = 1000) -> list[dict]:
"""Get queries that are consistently slow."""
from collections import defaultdict
query_latencies = defaultdict(list)
for m in self.metrics:
query_latencies[m.query].append(m.latency_ms)
slow = []
for query, latencies in query_latencies.items():
if len(latencies) >= 2 and statistics.mean(latencies) > threshold_ms:
slow.append({
"query": query,
"avg_latency_ms": statistics.mean(latencies),
"count": len(latencies)
})
return sorted(slow, key=lambda x: -x["avg_latency_ms"])[:10]
def get_low_quality_queries(self, threshold: float = 0.4) -> list[dict]:
"""Get queries with consistently low confidence."""
from collections import defaultdict
query_confidences = defaultdict(list)
for m in self.metrics:
query_confidences[m.query].append(m.confidence)
low_quality = []
for query, confidences in query_confidences.items():
if len(confidences) >= 2 and statistics.mean(confidences) < threshold:
low_quality.append({
"query": query,
"avg_confidence": statistics.mean(confidences),
"count": len(confidences)
})
return sorted(low_quality, key=lambda x: x["avg_confidence"])[:10]
Using the monitor effectively:
- Dashboard integration: Call `get_dashboard_metrics()` periodically and display in your ops dashboard. The metrics give you at-a-glance system health.
- Alerting thresholds: The `_check_alerts` method fires when recent searches show degradation. Wire these to PagerDuty/Slack for on-call notification (see the sketch after the alert table below).
- Proactive debugging: `get_slow_queries()` and `get_low_quality_queries()` identify specific problem areas. Use these weekly to prioritize improvements.
- Feedback correlation: By tracking both search metrics and feedback on the same queries, you can identify whether low confidence actually correlates with user dissatisfaction—calibrating your confidence estimation.
What to do with alerts:
| Alert | Likely Cause | Action |
|---|---|---|
| `high_latency` | Embedding service slow, index too large | Check external services, consider index sharding |
| `low_confidence` | Doc update broke retrieval, new query patterns | Review recent changes, check failed queries |
| `high_zero_results` | New terminology, missing docs | Add query expansions, write missing content |
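One way to act on these alerts is to forward them to a chat channel. A minimal sketch, assuming an incoming-webhook style endpoint (ALERT_WEBHOOK_URL is a placeholder environment variable) and the requests library:
import os
import requests

def forward_alerts(monitor: SearchQualityMonitor):
    """Push the most recent monitor alerts to a chat webhook."""
    webhook_url = os.environ.get("ALERT_WEBHOOK_URL")
    if not webhook_url or not monitor.alerts:
        return
    for alert in monitor.alerts[-5:]:  # most recent alerts only
        requests.post(
            webhook_url,
            json={"text": f"[search-quality] {alert['type']}: {alert['message']}"},
            timeout=10,
        )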
Advanced Techniques
Once you have a basic documentation search working, these advanced techniques can significantly improve quality for specific use cases.
HyDE: Hypothetical Document Embeddings
HyDE addresses the query-document mismatch problem. Instead of embedding the user's question directly, we first generate a hypothetical answer, then embed that. The hypothesis is closer to how documentation is actually written.
Why it works: When a user asks "why is my app crashing?", they use question phrasing. But documentation says "Common causes of crashes include...". HyDE generates a hypothetical document that bridges this gap.
import numpy as np

class HyDEQueryExpander:
"""Expand queries using Hypothetical Document Embeddings."""
def __init__(self, client, encoder):
self.client = client
self.encoder = encoder
def expand_with_hyde(self, query: str, doc_context: str = "") -> np.ndarray:
"""Generate hypothetical document and embed it."""
prompt = f"""You are a technical documentation writer.
Given this user question, write a short documentation excerpt (2-3 sentences)
that would answer it. Write it as if it's from official documentation—
factual, direct, no hedging.
Question: {query}
{f"Documentation context: {doc_context}" if doc_context else ""}
Documentation excerpt:"""
response = self.client.chat.completions.create(
model="gpt-4o-mini", # Fast model is fine for this
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
hypothetical_doc = response.choices[0].message.content
# Embed the hypothetical document instead of the query
hyde_embedding = self.encoder.encode(hypothetical_doc)
return hyde_embedding
def expand_with_multiple_hypotheses(
self,
query: str,
num_hypotheses: int = 3
) -> np.ndarray:
"""Generate multiple hypotheses and average embeddings."""
embeddings = []
for i in range(num_hypotheses):
prompt = f"""You are a technical documentation writer.
Write a documentation excerpt that answers this question.
Variation {i+1}: {"Focus on getting started." if i == 0 else "Focus on troubleshooting." if i == 1 else "Focus on advanced configuration."}
Question: {query}
Documentation excerpt:"""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
hypo = response.choices[0].message.content
embeddings.append(self.encoder.encode(hypo))
# Average embeddings
return np.mean(embeddings, axis=0)
When to use HyDE:
- High vocabulary mismatch (user questions vs. documentation style)
- Low initial retrieval quality
- Willing to accept ~200ms additional latency per query
When NOT to use HyDE:
- API reference queries (exact terms matter)
- Very short queries (not enough signal for generation)
- Latency-critical applications
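A usage sketch showing how the HyDE embedding can be scored against pre-computed chunk embeddings. Here chunks and chunk_embeddings are assumed to be parallel lists produced at index time; in practice you would query your vector store with the HyDE vector instead.
import numpy as np

expander = HyDEQueryExpander(client, encoder)
hyde_vec = expander.expand_with_hyde("why is my app crashing on startup?")

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by similarity to the hypothetical document, not the raw question.
scores = [(chunk, cosine(hyde_vec, emb)) for chunk, emb in zip(chunks, chunk_embeddings)]
top = sorted(scores, key=lambda x: -x[1])[:5]
for chunk, score in top:
    print(f"{score:.3f}  {chunk.doc_path}")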
HyPE: The 2025 Evolution
HyPE (Hypothetical Passage Embeddings) is a newer technique that flips HyDE's approach:
| Technique | When It Runs | What It Generates | Latency Impact |
|---|---|---|---|
| HyDE | Query time | Hypothetical document from query | +200ms per query |
| HyPE | Index time | Hypothetical queries from documents | Zero query-time cost |
How HyPE works: Instead of generating a hypothetical document for each query, HyPE pre-generates hypothetical queries that each document could answer during indexing. At query time, you're matching query-to-query rather than query-to-document.
HyPE results: Research shows HyPE improves retrieval precision by up to 42 percentage points and recall by up to 45 points on certain datasets—without any query-time generation cost.
class HyPEIndexer:
"""Index documents with hypothetical query embeddings."""
def __init__(self, client, encoder):
self.client = client
self.encoder = encoder
def generate_hypothetical_queries(self, doc_chunk: str, num_queries: int = 3) -> list[str]:
"""Generate queries this document could answer."""
prompt = f"""Given this documentation excerpt, generate {num_queries} questions
that a user might ask that this content would answer.
Documentation:
{doc_chunk}
Generate realistic search queries (not full sentences). Return one per line."""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
queries = response.choices[0].message.content.strip().split("\n")
return [q.strip() for q in queries if q.strip()][:num_queries]
def index_with_hype(self, chunks: list[DocChunk]) -> dict:
"""Create index with both document and hypothetical query embeddings."""
index = {"doc_embeddings": [], "query_embeddings": [], "chunks": []}
for chunk in chunks:
# Standard document embedding
doc_emb = self.encoder.encode(chunk.content)
index["doc_embeddings"].append(doc_emb)
# Generate and embed hypothetical queries
hypo_queries = self.generate_hypothetical_queries(chunk.content)
query_embs = [self.encoder.encode(q) for q in hypo_queries]
# Average the query embeddings
avg_query_emb = np.mean(query_embs, axis=0)
index["query_embeddings"].append(avg_query_emb)
index["chunks"].append(chunk)
return index
When to use HyPE over HyDE:
- Query latency is critical
- You have compute budget for indexing
- Documents are relatively static (re-indexing isn't frequent)
Combine both: Use HyPE for the initial retrieval stage, then HyDE for query expansion on complex queries that return few results.
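At query time, a HyPE lookup simply compares the query embedding against the stored hypothetical-query embeddings, with no generation in the loop. A minimal sketch against the index dict returned by index_with_hype (hype_index and encoder are assumptions):
import numpy as np

def hype_search(index: dict, query: str, encoder, top_k: int = 5):
    """Match the query embedding against pre-generated hypothetical-query embeddings."""
    query_vec = encoder.encode(query)
    scored = []
    for chunk, q_emb in zip(index["chunks"], index["query_embeddings"]):
        sim = float(np.dot(query_vec, q_emb) / (np.linalg.norm(query_vec) * np.linalg.norm(q_emb)))
        scored.append((chunk, sim))
    return sorted(scored, key=lambda x: -x[1])[:top_k]

results = hype_search(hype_index, "how do I rotate API keys?", encoder)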
Semantic Chunking
Instead of fixed-size chunks, semantic chunking splits at natural boundaries where the topic changes:
from sklearn.metrics.pairwise import cosine_similarity
class SemanticChunker:
"""Chunk documents based on semantic similarity between sentences."""
def __init__(self, encoder, similarity_threshold: float = 0.5):
self.encoder = encoder
self.similarity_threshold = similarity_threshold
def chunk_semantically(self, content: str, doc_path: str) -> list[DocChunk]:
"""Chunk content based on semantic breaks."""
# Split into sentences
sentences = self._split_sentences(content)
if len(sentences) < 2:
return [self._create_chunk(content, doc_path, 0)]
# Embed all sentences
embeddings = self.encoder.encode(sentences)
# Find semantic breaks
chunks = []
current_chunk = [sentences[0]]
current_start = 0
for i in range(1, len(sentences)):
# Compare current sentence to previous
sim = cosine_similarity(
[embeddings[i]],
[embeddings[i-1]]
)[0][0]
if sim < self.similarity_threshold:
# Semantic break detected
chunks.append(self._create_chunk(
" ".join(current_chunk),
doc_path,
current_start
))
current_chunk = [sentences[i]]
current_start = i
else:
current_chunk.append(sentences[i])
# Add final chunk
if current_chunk:
chunks.append(self._create_chunk(
" ".join(current_chunk),
doc_path,
current_start
))
return chunks
def _split_sentences(self, text: str) -> list[str]:
"""Split text into sentences."""
import re
sentences = re.split(r'(?<=[.!?])\s+', text)
return [s.strip() for s in sentences if s.strip()]
def _create_chunk(self, content: str, doc_path: str, index: int) -> DocChunk:
"""Create a DocChunk from content."""
return DocChunk(
id=f"{doc_path}:semantic:{index}",
content=content,
doc_path=doc_path,
section_hierarchy=[], # Would need to be passed in
chunk_type="text",
links=[]
)
When semantic chunking helps:
- Long-form documentation with multiple topics per section
- FAQ pages where each Q&A pair should be a chunk
- Changelog/release notes with distinct entries
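Usage has the same shape as the other chunkers in this guide; a quick sketch (encoder is the sentence-embedding model used throughout, and the file path is illustrative):
# Sketch: semantic chunking a long page before indexing.
chunker = SemanticChunker(encoder, similarity_threshold=0.5)
with open("docs/changelog.md") as f:
    content = f.read()
chunks = chunker.chunk_semantically(content, doc_path="docs/changelog.md")
print(f"Produced {len(chunks)} semantic chunks")
for chunk in chunks[:3]:
    print(chunk.id, "-", chunk.content[:80])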
Query Rewriting with LLMs
Sometimes the user's query needs transformation before search. Query rewriting handles ambiguity, expands context, and fixes common mistakes:
class QueryRewriter:
"""Rewrite queries for better retrieval."""
def __init__(self, client):
self.client = client
def rewrite(self, query: str, conversation_history: list = None) -> str:
"""Rewrite query for improved retrieval."""
system_prompt = """You are a search query optimizer for technical documentation.
Rewrite the user's query to be more searchable:
1. Expand abbreviations (e.g., "auth" → "authentication")
2. Add implicit context from conversation
3. Fix typos and unclear phrasing
4. Convert questions to documentation-style phrases
5. Keep technical terms precise
Return ONLY the rewritten query, nothing else."""
messages = [{"role": "system", "content": system_prompt}]
# Add conversation context if available
if conversation_history:
context = "\n".join([
f"{msg['role']}: {msg['content']}"
for msg in conversation_history[-3:] # Last 3 exchanges
])
messages.append({
"role": "user",
"content": f"Conversation context:\n{context}\n\nQuery to rewrite: {query}"
})
else:
messages.append({"role": "user", "content": f"Query to rewrite: {query}"})
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
max_tokens=100
)
return response.choices[0].message.content.strip()
def generate_search_variants(self, query: str, num_variants: int = 3) -> list[str]:
"""Generate multiple search query variants."""
prompt = f"""Generate {num_variants} different ways to search for documentation about:
"{query}"
Requirements:
- Each variant should capture a different aspect or phrasing
- Keep technical accuracy
- Focus on terms likely to appear in documentation
Return one query per line, no numbering or explanation."""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
variants = response.choices[0].message.content.strip().split("\n")
return [v.strip() for v in variants if v.strip()][:num_variants]
Query rewriting best practices:
- Preserve intent: Rewriting should clarify, not change the meaning. "fast API" → "FastAPI framework", not "high-performance API design".
- Use conversation context: If the user previously asked about Python and now asks "how do I install it?", expand to "how do I install FastAPI in Python".
- Don't over-expand: Adding too many terms dilutes the query. Limit to 2-3 key expansions.
- Cache rewrites: The same queries recur. Cache rewritten versions to avoid repeated LLM calls (see the sketch after this list).
- A/B test: Rewriting can hurt as well as help. Test whether it improves your specific metrics before deploying.
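A minimal caching wrapper around the rewriter above; the in-memory dict is a stand-in for Redis or another shared cache.
class CachedQueryRewriter:
    """Memoize query rewrites to avoid repeated LLM calls for recurring queries."""

    def __init__(self, rewriter: QueryRewriter):
        self.rewriter = rewriter
        self._cache: dict[str, str] = {}

    def rewrite(self, query: str) -> str:
        # Context-dependent rewrites (with conversation history) should bypass
        # this cache, since the same surface query can resolve differently.
        key = query.strip().lower()
        if key not in self._cache:
            self._cache[key] = self.rewriter.rewrite(query)
        return self._cache[key]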
Evaluation & Benchmarking
You can't improve what you don't measure. Systematic evaluation is essential for documentation search.
Key Metrics
| Metric | Definition | Target | How to Measure |
|---|---|---|---|
| Recall@k | % of relevant docs in top-k results | >80% | Human-labeled test set |
| MRR | Mean Reciprocal Rank of first relevant result | >0.7 | Human-labeled test set |
| Answer correctness | % of generated answers that are factually correct | >90% | Human review or LLM-as-judge |
| Answer completeness | % of answers that fully address the query | >75% | Human review |
| Latency P95 | 95th percentile response time | <2s | Production monitoring |
| Zero-result rate | % of queries with no results | <5% | Production monitoring |
Building a Test Set
A good test set is your most valuable evaluation asset. Build it from real queries:
from dataclasses import dataclass
from typing import Optional
@dataclass
class TestCase:
query: str
relevant_doc_ids: list[str] # Ground truth relevant documents
expected_answer_contains: list[str] # Key phrases answer should include
expected_intent: str
difficulty: str # "easy", "medium", "hard"
category: str # "howto", "reference", "troubleshoot", etc.
class TestSetBuilder:
"""Build and manage documentation search test sets."""
def __init__(self):
self.test_cases: list[TestCase] = []
def add_from_search_logs(
self,
queries: list[str],
clicked_docs: dict[str, list[str]],
min_clicks: int = 2
):
"""Create test cases from search logs with click data."""
for query in queries:
if query in clicked_docs and len(clicked_docs[query]) >= min_clicks:
# Assume clicked docs are relevant
self.test_cases.append(TestCase(
query=query,
relevant_doc_ids=clicked_docs[query],
expected_answer_contains=[], # Fill manually
expected_intent="unknown",
difficulty="medium",
category="from_logs"
))
def add_synthetic(
self,
doc_chunk: DocChunk,
num_queries: int = 3,
client = None
):
"""Generate synthetic test queries from a document chunk."""
prompt = f"""Given this documentation excerpt, generate {num_queries} realistic
user questions that this content would answer.
Documentation:
{doc_chunk.content}
Generate questions that a real user might ask, varying in:
- Phrasing (some formal, some casual)
- Specificity (some general, some very specific)
- Intent (how-to, what-is, troubleshooting)
Return one question per line."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
questions = response.choices[0].message.content.strip().split("\n")
for q in questions[:num_queries]:
self.test_cases.append(TestCase(
query=q.strip(),
relevant_doc_ids=[doc_chunk.id],
expected_answer_contains=[],
expected_intent="synthetic",
difficulty="medium",
category="synthetic"
))
def to_dict(self) -> list[dict]:
"""Export test set to JSON-serializable format."""
return [
{
"query": tc.query,
"relevant_doc_ids": tc.relevant_doc_ids,
"expected_answer_contains": tc.expected_answer_contains,
"expected_intent": tc.expected_intent,
"difficulty": tc.difficulty,
"category": tc.category
}
for tc in self.test_cases
]
Building effective test sets:
- Source from real queries: Search logs with click data are gold. Users who clicked and stayed found what they needed.
- Include failure cases: Deliberately add queries that historically failed. These catch regressions.
- Cover all intents: Ensure representation across how-to, reference, troubleshooting, and concept queries.
- Vary difficulty: Include obvious queries ("how to install") and hard ones ("why does X fail when Y is configured with Z?").
- Synthetic augmentation: Use LLMs to generate query variations from your best docs. Helps coverage but shouldn't replace real queries.
Recommended test set size: Start with 100-200 queries. Aim for 500+ for production systems. More is better for statistical significance.
Running Evaluations
from dataclasses import dataclass
@dataclass
class EvaluationResult:
recall_at_5: float
recall_at_10: float
mrr: float
avg_latency_ms: float
zero_result_rate: float
per_category_recall: dict[str, float]
class SearchEvaluator:
"""Evaluate documentation search quality."""
def __init__(self, retriever: MultiStepRetriever):
self.retriever = retriever
def evaluate(self, test_cases: list[TestCase]) -> EvaluationResult:
"""Run evaluation on test set."""
import time
recalls_5 = []
recalls_10 = []
reciprocal_ranks = []
latencies = []
zero_results = 0
category_recalls = {}
for tc in test_cases:
start = time.time()
result = self.retriever.retrieve(tc.query, max_results=10)
latency = (time.time() - start) * 1000
latencies.append(latency)
retrieved_ids = [c.id for c in result.chunks]
# Recall@5
relevant_in_5 = len(set(retrieved_ids[:5]) & set(tc.relevant_doc_ids))
recall_5 = relevant_in_5 / len(tc.relevant_doc_ids) if tc.relevant_doc_ids else 0
recalls_5.append(recall_5)
# Recall@10
relevant_in_10 = len(set(retrieved_ids[:10]) & set(tc.relevant_doc_ids))
recall_10 = relevant_in_10 / len(tc.relevant_doc_ids) if tc.relevant_doc_ids else 0
recalls_10.append(recall_10)
# MRR
rr = 0
for i, doc_id in enumerate(retrieved_ids):
if doc_id in tc.relevant_doc_ids:
rr = 1 / (i + 1)
break
reciprocal_ranks.append(rr)
# Zero results
if not result.chunks:
zero_results += 1
# Per-category tracking
if tc.category not in category_recalls:
category_recalls[tc.category] = []
category_recalls[tc.category].append(recall_10)
return EvaluationResult(
recall_at_5=sum(recalls_5) / len(recalls_5),
recall_at_10=sum(recalls_10) / len(recalls_10),
mrr=sum(reciprocal_ranks) / len(reciprocal_ranks),
avg_latency_ms=sum(latencies) / len(latencies),
zero_result_rate=zero_results / len(test_cases),
per_category_recall={
cat: sum(vals) / len(vals)
for cat, vals in category_recalls.items()
}
)
def compare_configs(
self,
test_cases: list[TestCase],
configs: dict[str, MultiStepRetriever]
) -> dict[str, EvaluationResult]:
"""Compare multiple retriever configurations."""
results = {}
for name, retriever in configs.items():
self.retriever = retriever
results[name] = self.evaluate(test_cases)
return results
Using the evaluator effectively:
- Run regularly: Evaluate weekly or after any retrieval changes. Catch regressions early.
- Track per-category: Overall metrics hide category-specific problems. If troubleshooting queries degrade while how-to improves, you need to know.
- Compare configurations: Use `compare_configs` to A/B test changes before deploying. Does adding reranking help? Does a new embedding model improve recall?
- Set alerts: If Recall@10 drops below 0.70 or latency exceeds 2s, alert the team.
- Investigate outliers: Look at specific queries that fail. Often a handful of edge cases tank your metrics.
LLM-as-Judge for Answer Quality
For evaluating generated answers, use an LLM judge:
class AnswerEvaluator:
"""Evaluate answer quality using LLM-as-judge."""
def __init__(self, client):
self.client = client
def evaluate_answer(
self,
query: str,
answer: str,
ground_truth_docs: list[str],
expected_contains: list[str] = None
) -> dict:
"""Evaluate a single answer."""
prompt = f"""Evaluate this documentation search answer.
Question: {query}
Answer provided:
{answer}
Ground truth documentation:
{chr(10).join(ground_truth_docs)}
Evaluate on these criteria (score 1-5 each):
1. **Correctness**: Is the answer factually accurate based on the documentation?
2. **Completeness**: Does it fully answer the question?
3. **Relevance**: Does it stay on topic without unnecessary information?
4. **Clarity**: Is it easy to understand?
5. **Grounding**: Are claims supported by the documentation (no hallucination)?
Also note:
- Any factual errors
- Missing important information
- Hallucinated content not in the docs
Return JSON:
{{
"correctness": <1-5>,
"completeness": <1-5>,
"relevance": <1-5>,
"clarity": <1-5>,
"grounding": <1-5>,
"errors": ["<error1>", ...],
"missing": ["<missing1>", ...],
"hallucinations": ["<hallucination1>", ...]
}}"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
import json
return json.loads(response.choices[0].message.content)
LLM-as-judge considerations:
- Cost: GPT-4o evaluation costs ~$0.01-0.05 per answer. Budget for 100-500 evaluations per release.
- Consistency: LLM judges can be inconsistent. Run each evaluation 2-3 times and average (a sketch appears below), or use a rubric-based prompt.
- Calibration: Validate LLM judgments against human ratings on a sample. Adjust prompts if they diverge.
- Specific criteria: Generic "rate this answer" is less useful than specific criteria (correctness, completeness, grounding).
- Failure analysis: The `errors`, `missing`, and `hallucinations` fields are actionable. Use them to identify systematic issues.
When to use human evaluation instead: For high-stakes changes, sample 50-100 answers for human review. LLM judges are good for scale but miss subtle quality issues.
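To smooth out judge inconsistency, run each evaluation a few times and average the scores, as suggested above. A minimal sketch:
def evaluate_with_repeats(
    evaluator: AnswerEvaluator,
    query: str,
    answer: str,
    ground_truth_docs: list[str],
    runs: int = 3,
) -> dict:
    """Average repeated LLM-as-judge runs to reduce variance in the scores."""
    criteria = ["correctness", "completeness", "relevance", "clarity", "grounding"]
    results = [evaluator.evaluate_answer(query, answer, ground_truth_docs) for _ in range(runs)]
    return {c: sum(r[c] for r in results) / runs for c in criteria}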
Regression Testing in CI/CD
Add search quality checks to your deployment pipeline:
# test_search_quality.py
import json

import pytest

from search_evaluator import SearchEvaluator, TestCase


@pytest.fixture
def test_set():
    """Load test set from file."""
    with open("test_cases.json") as f:
        data = json.load(f)
    return [TestCase(**tc) for tc in data]


@pytest.fixture
def evaluator():
    """Initialize evaluator with production config."""
    from search_service import retriever
    return SearchEvaluator(retriever)


def test_recall_at_10(evaluator, test_set):
    """Recall@10 should be above threshold."""
    result = evaluator.evaluate(test_set)
    assert result.recall_at_10 >= 0.75, f"Recall@10 dropped to {result.recall_at_10}"


def test_mrr(evaluator, test_set):
    """MRR should be above threshold."""
    result = evaluator.evaluate(test_set)
    assert result.mrr >= 0.65, f"MRR dropped to {result.mrr}"


def test_latency(evaluator, test_set):
    """Average latency should be under 2 seconds."""
    result = evaluator.evaluate(test_set)
    assert result.avg_latency_ms < 2000, f"Latency increased to {result.avg_latency_ms}ms"


def test_zero_results(evaluator, test_set):
    """Zero-result rate should be low."""
    result = evaluator.evaluate(test_set)
    assert result.zero_result_rate < 0.05, f"Zero-result rate is {result.zero_result_rate}"
CI/CD integration tips:
- Fast feedback: Run a small test set (50 queries) on every PR. Full evaluation on merge to main.
- Baseline tracking: Store baseline metrics. Alert when new code degrades by >5% on any metric (see the sketch after the workflow below).
- Block deploys: Make tests required. Don't deploy if search quality regresses.
- Separate index tests: Test indexing separately from retrieval. Catch chunking bugs before they affect search.
- Environment parity: Run tests against a staging index that mirrors production. Mock LLMs with cached responses for speed.
# Example GitHub Actions workflow
name: Search Quality Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        # Assumes dependencies are pinned in a requirements.txt at the repo root.
        run: pip install -r requirements.txt
      - name: Run search quality tests
        run: pytest tests/test_search_quality.py -v
      - name: Upload metrics
        if: github.ref == 'refs/heads/main'
        run: python scripts/upload_metrics.py
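The upload step above is a placeholder. One hedged way to implement baseline tracking is a small script that compares the current run against stored main-branch metrics and fails the job on a >5% degradation; the file paths and metric names here are assumptions:

# scripts/compare_to_baseline.py (sketch)
import json
import sys

TOLERANCE = 0.05  # fail if any metric is more than 5% worse than baseline

def compare(current_path="metrics/current.json", baseline_path="metrics/baseline.json"):
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    failures = []
    for key in ("recall_at_10", "mrr"):  # higher is better
        if current[key] < baseline[key] * (1 - TOLERANCE):
            failures.append(f"{key}: {baseline[key]:.3f} -> {current[key]:.3f}")
    if current["avg_latency_ms"] > baseline["avg_latency_ms"] * (1 + TOLERANCE):  # lower is better
        failures.append(
            f"avg_latency_ms: {baseline['avg_latency_ms']:.0f} -> {current['avg_latency_ms']:.0f}"
        )

    if failures:
        print("Search quality regressed:")
        print("\n".join(failures))
        sys.exit(1)
    print("No regression against baseline.")

if __name__ == "__main__":
    compare()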
Worked Example: End-to-End Query
Let's trace a real query through the entire system to see how all components work together.
The Query
User asks: "Why is my API returning 401 even though I set the authentication header?"
Step 1: Query Analysis
analysis = query_analyzer.analyze(query)
# Returns:
# QueryAnalysis(
# original_query="Why is my API returning 401...",
# intent=QueryIntent.TROUBLESHOOT,
# entities=["API", "401", "authentication header"],
# expanded_queries=[
# "authentication header not working",
# "401 unauthorized error",
# "API authentication troubleshooting"
# ],
# sub_queries=[
# "What causes 401 errors?",
# "How to set authentication headers correctly?",
# "Common authentication mistakes"
# ],
# expected_content_types=["text", "code", "warning"],
# confidence=0.85
# )
What happened: The analyzer:
- Identified this as a troubleshooting query
- Extracted key entities: "401", "authentication header"
- Generated alternative phrasings that might match docs better
- Created sub-queries to find related information
Step 2: Multi-Source Retrieval
# For each expanded query, we search:
# Original: "Why is my API returning 401..."
# Expanded 1: "authentication header not working"
# Expanded 2: "401 unauthorized error"
# Expanded 3: "API authentication troubleshooting"
# Each query gets:
semantic_results = index.search(query, top_k=20) # ~50 results total
keyword_results = index.keyword_search(query, top_k=10) # ~30 results
# Combined: ~80 candidates before deduplication
Step 3: Merge & Deduplicate
# After merging, we have ~35 unique chunks
# Top candidates:
# 1. "Troubleshooting > Authentication > 401 Errors" (score: 0.89)
# 2. "API Reference > Authentication > Headers" (score: 0.85)
# 3. "Getting Started > Authentication" (score: 0.78)
# 4. "FAQ > Why am I getting 401?" (score: 0.76)
Step 4: Content Type Filtering
# Intent is TROUBLESHOOT, so we prioritize:
# - warning/note chunks (common mistakes)
# - code chunks (correct usage examples)
# - text chunks explaining error causes
# Filter keeps 28 chunks, drops 7 unrelated tables/diagrams
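A hedged sketch of that filter, assuming each chunk carries a `content_type` field and the intent is available as a lowercase string (the mapping here is illustrative, not the system's actual taxonomy):

PREFERRED_TYPES = {
    "troubleshoot": {"text", "code", "warning", "note"},
    "how_to": {"text", "code"},
    "reference": {"text", "code", "table"},
}

def filter_by_intent(chunks, intent):
    """Keep chunks whose content type matches the query intent."""
    allowed = PREFERRED_TYPES.get(intent)
    if not allowed:
        return chunks
    filtered = [c for c in chunks if c.content_type in allowed]
    # Never filter everything away; fall back to the unfiltered set.
    return filtered or chunks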
Step 5: Link Expansion
# Top result "401 Errors" links to:
# - "Token Expiration" section
# - "API Key vs OAuth" comparison
# - "Environment Variables" setup
# We fetch these linked chunks with 0.8 score decay
# Now have 35 candidates
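Link expansion with score decay can be as simple as the sketch below; the `links` attribute and the `index.get_chunk` helper are assumptions about the index interface:

def expand_links(index, top_results, decay=0.8, max_links_per_result=5):
    """Pull in chunks linked from top results, with decayed scores."""
    seen = {r.chunk_id for r in top_results}
    expanded = []
    for result in top_results:
        for linked_id in getattr(result, "links", [])[:max_links_per_result]:
            if linked_id in seen:
                continue
            linked = index.get_chunk(linked_id)  # assumed lookup by chunk ID
            if linked is not None:
                linked.score = result.score * decay  # linked content is supporting, not primary
                expanded.append(linked)
                seen.add(linked_id)
    return top_results + expanded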
Step 6: Reranking
reranker = DocumentationReranker()
reranked = reranker.rerank(query, merged[:30])
# After reranking, order changes:
# 1. "FAQ > Why am I getting 401?" (was #4, now #1)
# - Cross-encoder recognized this directly answers the question
# 2. "Troubleshooting > 401 Errors" (was #1, now #2)
# 3. "Common Mistakes > Authentication" (was #8, now #3)
# - Contains "forgetting to add Bearer prefix" - highly relevant
Step 7: Answer Generation
answer = generator.generate(query, retrieval_result)
# Generated answer:
"""
A 401 error despite setting the authentication header usually indicates one of these issues:
1. **Missing "Bearer" prefix**: The header should be `Authorization: Bearer <token>`, not just the token [1].
2. **Expired token**: Tokens expire after 1 hour by default. Check `exp` claim in your JWT [2].
3. **Wrong header name**: Use `Authorization`, not `Auth` or `X-Auth-Token` [1].
4. **Environment mismatch**: Make sure you're using production keys for production API, not test keys [3].
To debug, check the response body—it includes a specific error code indicating which check failed [2].
"""
# Citations:
# [1] FAQ > Why am I getting 401?
# [2] Troubleshooting > Authentication > 401 Errors
# [3] Common Mistakes > Authentication
# Confidence: 0.87 (high - multiple relevant sources, specific answer)
What Made This Work
- Query expansion found "401 unauthorized" even though the user said "returning 401"
- Hybrid search caught the exact "401" keyword that semantic search might have generalized
- Reranking promoted the FAQ entry that directly addressed the question
- Citation tracking let us point to specific sources for each claim
What Could Go Wrong
| Failure Mode | Symptom | Fix |
|---|---|---|
| No results for "401" | User wrote "4O1" (letter O) | Add fuzzy matching for error codes |
| Wrong auth method | Docs have OAuth, user uses API keys | Add sub-query for auth method detection |
| Outdated info | Token expiry changed from 1hr to 24hr | Ensure index is updated with doc changes |
| Missing context | User is on v1, docs default to v2 | Add version detection/filtering |
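For the first failure mode, a small normalization pass over the query can rescue mistyped status codes before retrieval; this sketch is standalone and not tied to any component above:

import re

def normalize_error_codes(query: str) -> str:
    """Fix common typos in HTTP status codes, e.g. '4O1' (letter O) becomes '401'."""
    def fix(match):
        return match.group(0).upper().replace("O", "0").replace("I", "1").replace("L", "1")
    # Three-character tokens starting with 4 or 5 that mix digits with look-alike letters.
    return re.sub(r"\b[45][0-9OoIl]{2}\b", fix, query)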
Conclusion
Building intelligent documentation search requires combining multiple techniques that each solve a specific problem:
| Technique | Problem It Solves |
|---|---|
| Documentation-aware chunking | Preserves structure, code blocks, and cross-references that naive chunking destroys |
| Query understanding | Bridges the vocabulary gap between how users ask and how docs are written |
| Multi-step retrieval | Handles complex questions that need information from multiple sections |
| Hybrid search | Catches both conceptual matches (semantic) and exact technical terms (keyword) |
| Reranking | Improves precision after recall-optimized initial retrieval |
| Answer generation | Synthesizes coherent, cited responses from scattered chunks |
| Feedback loops | Continuously improves quality based on real user signals |
Implementation Roadmap
Phase 1 - Foundation (1-2 weeks):
- Implement documentation-aware chunking with hierarchy preservation
- Build basic semantic search index
- Create simple answer generation
Phase 2 - Quality (2-3 weeks):
- Add query understanding (intent classification, entity extraction)
- Implement hybrid search (semantic + keyword)
- Add reranking with cross-encoder
- Build basic analytics
Phase 3 - Production (2-3 weeks):
- Add caching layer
- Implement incremental index updates
- Build feedback collection system
- Set up monitoring and alerting
Phase 4 - Optimization (ongoing):
- Analyze failed queries and documentation gaps
- Tune retrieval parameters based on feedback
- A/B test improvements
- Add version support if needed
Key Metrics to Track
Measure success with these metrics (a minimal tracking sketch follows the list):
- Answer helpfulness rate: % of answers users mark as helpful (target: >70%)
- Zero-result rate: % of queries with no results (target: <5%)
- P95 latency: 95th percentile response time (target: <2s with answer)
- Retrieval precision@5: % of top-5 results that are relevant (target: >60%)
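These numbers fall out of a simple log of search events; a hedged sketch, where the event fields are assumptions about what you record per query:

from dataclasses import dataclass

@dataclass
class SearchEvent:
    """One logged search. Field names are illustrative."""
    latency_ms: float
    num_results: int
    feedback: str | None = None  # "helpful", "not_helpful", or None

def summarize(events: list[SearchEvent]) -> dict:
    """Compute helpfulness rate, zero-result rate, and P95 latency from logged events."""
    rated = [e for e in events if e.feedback is not None]
    latencies = sorted(e.latency_ms for e in events)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "helpfulness_rate": sum(e.feedback == "helpful" for e in rated) / len(rated) if rated else None,
        "zero_result_rate": sum(e.num_results == 0 for e in events) / len(events) if events else 0.0,
        "p95_latency_ms": p95,
        # Precision@5 needs relevance labels; compute it offline with the evaluator above.
    }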
Start with basic semantic search on well-chunked docs. Add query analysis, hybrid search, and reranking as you validate the approach. Monitor failed queries to identify gaps in documentation or retrieval. The best documentation search systems aren't built—they're grown through continuous iteration based on user feedback.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach
A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.