Multi-Step Documentation Search: Building Intelligent Search for Docs
A comprehensive guide to building intelligent documentation search systems—multi-step retrieval, query understanding, hierarchical chunking, reranking, and production patterns used by Mintlify, GitBook, and modern docs platforms.
Beyond Keyword Search
Every developer has experienced documentation search frustration: you know the answer is somewhere in the docs, but keyword search returns irrelevant results, or worse, nothing at all. You search for "CORS error" and get a generic security overview. You search for "how to authenticate" and get five different pages without knowing which applies to your situation.
Documentation search has evolved from simple keyword matching to intelligent systems that understand intent, navigate hierarchies, and synthesize answers from multiple sources. Modern docs platforms like Mintlify, GitBook, and ReadTheDocs now offer AI-powered search that can answer complex questions by reasoning across multiple pages.
Why documentation search is harder than general search: Documentation has unique characteristics that break naive RAG. Content is highly interconnected—a concept on page 5 depends on understanding pages 1-4. Terminology is domain-specific—"hooks" means different things in React, Git, and fishing tutorials. And users ask questions at different skill levels—a beginner asking "how do I deploy?" needs different context than an expert asking about deployment configuration options.
The multi-step imperative: Simple single-query retrieval fails on documentation because users don't know what they don't know. Someone asking "why isn't my component rendering?" might need to find the error, understand state management, check the component lifecycle, and verify the build configuration—information spread across four different pages. Multi-step search handles this by decomposing queries, retrieving progressively, and synthesizing across sources.
This guide covers how to build these systems: from basic retrieval to sophisticated multi-step search pipelines that understand documentation structure.
Prerequisites:
- Familiarity with building production RAG systems
- Understanding of agentic RAG patterns
- Basic embedding and vector search experience
What you'll learn:
- Documentation-specific chunking strategies
- Multi-step retrieval pipelines
- Query understanding and decomposition
- Hierarchical document navigation
- Cross-reference and link following
- Production optimization patterns
What you'll build: By the end of this guide, you'll have a complete documentation search system that:
- Chunks docs while preserving hierarchy and code blocks
- Understands query intent and expands search terms
- Retrieves from multiple sources (semantic, keyword, linked content)
- Generates accurate, cited answers
- Learns from user feedback to improve over time
The complete implementation is ~2,000 lines of Python. We'll build it piece by piece, explaining not just what each component does but why it's necessary.
Documentation Search Architecture
Documentation has unique characteristics that require specialized approaches:
┌─────────────────────────────────────────────────────────────┐
│                Documentation Search Pipeline                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │    Query     │───▶│  Multi-Step  │───▶│    Answer    │   │
│  │Understanding │    │  Retrieval   │    │  Generation  │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         ▼                   ▼                   ▼           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ - Intent     │    │ - Semantic   │    │ - Synthesis  │   │
│  │ - Entities   │    │ - Keyword    │    │ - Citations  │   │
│  │ - Expansion  │    │ - Hierarchy  │    │ - Confidence │   │
│  │ - Sub-queries│    │ - Rerank     │    │              │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Key differences from general RAG:
| Aspect | General RAG | Documentation Search |
|---|---|---|
| Structure | Flat documents | Hierarchical (guides, sections, pages) |
| Links | Minimal | Heavy cross-referencing |
| Code | Occasional | Frequent code blocks |
| Versions | Usually single | Multiple versions |
| Freshness | Varies | Must be current |
Document Processing
Before we can search documentation, we need to process it in a way that preserves its unique structure. This is where most documentation search systems fail—they treat docs like any other text, losing the hierarchical relationships that make documentation useful.
Why Documentation Chunking is Different
Standard RAG chunking splits text into fixed-size pieces, maybe with some overlap. This works fine for articles or reports, but documentation has structure that matters:
The hierarchy problem: A section titled "Authentication" under "API Reference" means something different than "Authentication" under "Getting Started." Naive chunking loses this context, so when a user asks "how do I authenticate?", the system can't distinguish between the conceptual overview and the API details.
The code block problem: Code examples should stay intact. A Python snippet split across two chunks becomes useless—or worse, misleading. The same applies to configuration files, API responses, and command-line examples.
The cross-reference problem: Documentation is heavily linked. A troubleshooting page might reference the installation guide, which references system requirements. Following these links during retrieval often surfaces the actual answer.
The Documentation Chunk Model
Our chunk model captures this structure explicitly:
from dataclasses import dataclass, field
from typing import Optional
import re
@dataclass
class DocChunk:
id: str
content: str
doc_path: str
section_hierarchy: list[str] # ["Getting Started", "Installation", "npm"]
chunk_type: str # "text", "code", "table", "note", "warning"
metadata: dict = field(default_factory=dict)
links: list[str] = field(default_factory=list)
code_language: Optional[str] = None
The section_hierarchy field is the key innovation here. Instead of just storing the text, we store the full path through the document structure. When we later search for "npm installation", we can boost results where "Installation" and "npm" appear in the hierarchy, not just the content.
The chunk_type field lets us handle different content types appropriately during retrieval—code chunks might use a code-specific embedding model, while warning chunks might get boosted for troubleshooting queries.
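To show why this matters downstream, here is a minimal sketch (not part of the chunker itself) of how a retrieval layer could boost chunks whose hierarchy overlaps the query terms; the 0.15 weight is an arbitrary illustration, not a tuned value.

```python
def boost_by_hierarchy(query: str, chunk: DocChunk, base_score: float, weight: float = 0.15) -> float:
    """Add a small bonus when query terms appear in the chunk's section hierarchy."""
    query_terms = set(query.lower().split())
    hierarchy_terms = set(" ".join(chunk.section_hierarchy).lower().split())
    return base_score + weight * len(query_terms & hierarchy_terms)

# A chunk under ["Getting Started", "Installation", "npm"] is boosted for the query
# "npm installation" even if its body never repeats those words.
```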
The Chunking Algorithm
The chunker walks through markdown, tracking headers to maintain hierarchy, and intelligently splits content while preserving code blocks and special elements:
class DocumentationChunker:
"""Chunk documentation with structure awareness."""
def __init__(
self,
max_chunk_size: int = 1000,
min_chunk_size: int = 100,
overlap: int = 100
):
self.max_chunk_size = max_chunk_size
self.min_chunk_size = min_chunk_size
self.overlap = overlap
def chunk_markdown(self, content: str, doc_path: str) -> list[DocChunk]:
"""Chunk markdown documentation."""
chunks = []
current_hierarchy = []
# Split by headers
sections = self._split_by_headers(content)
for section in sections:
# Update hierarchy based on header level
current_hierarchy = self._update_hierarchy(
current_hierarchy,
section["level"],
section["title"]
)
# Process section content
section_chunks = self._chunk_section(
section["content"],
doc_path,
current_hierarchy.copy()
)
chunks.extend(section_chunks)
return chunks
def _split_by_headers(self, content: str) -> list[dict]:
"""Split content by markdown headers."""
sections = []
current_section = {"level": 0, "title": "Introduction", "content": ""}
lines = content.split("\n")
for line in lines:
header_match = re.match(r"^(#{1,6})\s+(.+)$", line)
if header_match:
# Save current section (keep header-only sections so the hierarchy isn't lost)
if current_section["content"].strip() or current_section["level"] > 0:
sections.append(current_section)
# Start new section
level = len(header_match.group(1))
title = header_match.group(2).strip()
current_section = {
"level": level,
"title": title,
"content": ""
}
else:
current_section["content"] += line + "\n"
# Add final section
if current_section["content"].strip():
sections.append(current_section)
return sections
def _update_hierarchy(
self,
hierarchy: list[str],
level: int,
title: str
) -> list[str]:
"""Update section hierarchy based on header level."""
# Trim hierarchy to parent level
while len(hierarchy) >= level:
hierarchy.pop()
# Add current section
hierarchy.append(title)
return hierarchy
def _chunk_section(
self,
content: str,
doc_path: str,
hierarchy: list[str]
) -> list[DocChunk]:
"""Chunk a section, preserving code blocks and special elements."""
chunks = []
# Extract special elements (code blocks, tables, notes)
elements = self._extract_elements(content)
for element in elements:
if element["type"] == "code":
# Keep code blocks intact
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=element["content"],
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type="code",
code_language=element.get("language"),
links=self._extract_links(element["content"])
))
elif element["type"] == "text":
# Chunk text with overlap
text_chunks = self._chunk_text(element["content"])
for text in text_chunks:
if len(text.strip()) >= self.min_chunk_size:
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=text,
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type="text",
links=self._extract_links(text)
))
elif element["type"] in ["note", "warning", "tip"]:
chunks.append(DocChunk(
id=f"{doc_path}:{len(chunks)}",
content=element["content"],
doc_path=doc_path,
section_hierarchy=hierarchy,
chunk_type=element["type"],
links=self._extract_links(element["content"])
))
return chunks
def _extract_elements(self, content: str) -> list[dict]:
"""Extract code blocks, notes, and text from content."""
elements = []
# Pattern for fenced code blocks
code_pattern = r"```(\w+)?\n([\s\S]*?)```"
# Pattern for admonitions/callouts
note_pattern = r":::(\w+)\n([\s\S]*?):::"
# Split content preserving special elements
parts = re.split(
r"(```\w*\n[\s\S]*?```|:::\w+\n[\s\S]*?:::)",
content
)
for part in parts:
if not part.strip():
continue
code_match = re.match(code_pattern, part)
note_match = re.match(note_pattern, part)
if code_match:
elements.append({
"type": "code",
"language": code_match.group(1) or "text",
"content": code_match.group(2).strip()
})
elif note_match:
elements.append({
"type": note_match.group(1).lower(),
"content": note_match.group(2).strip()
})
else:
elements.append({
"type": "text",
"content": part.strip()
})
return elements
def _chunk_text(self, text: str) -> list[str]:
"""Chunk text with overlap."""
if len(text) <= self.max_chunk_size:
return [text]
chunks = []
sentences = re.split(r"(?<=[.!?])\s+", text)
current_chunk = ""
for sentence in sentences:
if len(current_chunk) + len(sentence) <= self.max_chunk_size:
current_chunk += sentence + " "
else:
if current_chunk:
chunks.append(current_chunk.strip())
# Keep overlap
overlap_text = current_chunk[-self.overlap:] if len(current_chunk) > self.overlap else current_chunk
current_chunk = overlap_text + sentence + " "
else:
# Sentence too long, split it
chunks.append(sentence[:self.max_chunk_size])
current_chunk = sentence[self.max_chunk_size - self.overlap:]
if current_chunk.strip():
chunks.append(current_chunk.strip())
return chunks
def _extract_links(self, content: str) -> list[str]:
"""Extract markdown links from content."""
link_pattern = r"\[([^\]]+)\]\(([^)]+)\)"
matches = re.findall(link_pattern, content)
return [url for _, url in matches]
What this achieves: The chunker produces chunks that are:
- Hierarchy-aware: Each chunk knows its position in the document structure
- Type-aware: Code, text, notes, and warnings are categorized separately
- Link-aware: Cross-references are extracted for later expansion
- Size-controlled: Long sections are split at sentence boundaries with overlap, but code blocks stay intact
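As a quick sanity check, here is how the chunker might be exercised on a small, made-up markdown document (the paths and content are purely illustrative):

```python
fence = "```"  # build the markdown fence in code so this example stays readable
sample_md = f"""# Getting Started
Welcome to the example SDK documentation.

## Installation
Install the package with npm, then read the [configuration guide](./configuration.md) before deploying.

{fence}bash
npm install example-sdk
{fence}

:::warning
Node 18 or newer is required.
:::
"""

chunker = DocumentationChunker(max_chunk_size=500, min_chunk_size=20)
chunks = chunker.chunk_markdown(sample_md, doc_path="docs/getting-started.md")

for chunk in chunks:
    print(chunk.chunk_type, "|", " > ".join(chunk.section_hierarchy), "|", chunk.links)
# Expect a text chunk under "Getting Started", then a text, bash code, and warning chunk
# under "Getting Started > Installation"; the text chunk carries the ./configuration.md link.
```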
Building the Search Index
With chunks prepared, we need to build an index that supports multiple retrieval strategies. Documentation search benefits from hybrid approaches—semantic search finds conceptually similar content, while keyword search handles exact technical terms that embeddings might miss (like --force-reinstall or ECONNREFUSED).
Why hybrid matters for docs: A user searching for "CORS error" needs exact keyword matching—embeddings might return generic "error handling" content. But a user asking "why can't my frontend talk to my API?" needs semantic understanding to connect their question to CORS documentation.
Separate embeddings for code: Code and prose have different semantic structures. "Initialize the client" in prose and client = Client() in code mean the same thing, but general-purpose embeddings might miss this connection. Using CodeBERT or similar models for code chunks improves retrieval for programming queries.
from sentence_transformers import SentenceTransformer
import numpy as np
import re
from typing import Optional
class DocumentationIndex:
"""Index for documentation search."""
def __init__(
self,
embedding_model: str = "all-MiniLM-L6-v2",
code_model: str = "microsoft/codebert-base"
):
self.text_encoder = SentenceTransformer(embedding_model)
self.code_encoder = SentenceTransformer(code_model)
self.chunks: list[DocChunk] = []
# chunk index -> embedding vector (built in _rebuild_embeddings)
self.text_embeddings: dict[int, np.ndarray] = {}
self.code_embeddings: dict[int, np.ndarray] = {}
# Keyword index for hybrid search
self.keyword_index = {}
def add_documents(self, chunks: list[DocChunk]):
"""Add document chunks to index."""
self.chunks.extend(chunks)
self._rebuild_embeddings()
self._rebuild_keyword_index()
def _rebuild_embeddings(self):
"""Rebuild embedding indices."""
text_chunks = []
code_chunks = []
for i, chunk in enumerate(self.chunks):
# Create searchable text including hierarchy
searchable = self._create_searchable_text(chunk)
if chunk.chunk_type == "code":
code_chunks.append((i, searchable))
else:
text_chunks.append((i, searchable))
# Encode text chunks
if text_chunks:
texts = [t[1] for t in text_chunks]
text_embeds = self.text_encoder.encode(texts)
self.text_embeddings = {
text_chunks[i][0]: text_embeds[i]
for i in range(len(text_chunks))
}
# Encode code chunks
if code_chunks:
codes = [c[1] for c in code_chunks]
code_embeds = self.code_encoder.encode(codes)
self.code_embeddings = {
code_chunks[i][0]: code_embeds[i]
for i in range(len(code_chunks))
}
def _create_searchable_text(self, chunk: DocChunk) -> str:
"""Create searchable text from chunk."""
parts = []
# Add hierarchy as context
if chunk.section_hierarchy:
parts.append(" > ".join(chunk.section_hierarchy))
# Add content
parts.append(chunk.content)
# Add code language if applicable
if chunk.code_language:
parts.append(f"Language: {chunk.code_language}")
return "\n".join(parts)
def _rebuild_keyword_index(self):
"""Build keyword index for hybrid search."""
from collections import defaultdict
self.keyword_index = defaultdict(list)
for i, chunk in enumerate(self.chunks):
# Extract keywords
words = re.findall(r"\b\w+\b", chunk.content.lower())
for word in set(words):
if len(word) > 2: # Skip very short words
self.keyword_index[word].append(i)
def search(
self,
query: str,
top_k: int = 10,
include_code: bool = True,
filter_path: Optional[str] = None
) -> list[tuple[DocChunk, float]]:
"""Search the index."""
results = []
# Encode query
query_embed = self.text_encoder.encode(query)
# Search text embeddings
for idx, embed in self.text_embeddings.items():
if filter_path and filter_path not in self.chunks[idx].doc_path:
continue
similarity = np.dot(query_embed, embed) / (
np.linalg.norm(query_embed) * np.linalg.norm(embed)
)
results.append((self.chunks[idx], similarity))
# Search code embeddings
if include_code:
code_query_embed = self.code_encoder.encode(query)
for idx, embed in self.code_embeddings.items():
if filter_path and filter_path not in self.chunks[idx].doc_path:
continue
similarity = np.dot(code_query_embed, embed) / (
np.linalg.norm(code_query_embed) * np.linalg.norm(embed)
)
results.append((self.chunks[idx], similarity))
# Sort by similarity
results.sort(key=lambda x: -x[1])
return results[:top_k]
def keyword_search(self, query: str, top_k: int = 20) -> list[tuple[DocChunk, float]]:
"""Keyword-based search."""
from collections import Counter
query_words = set(re.findall(r"\b\w+\b", query.lower()))
# Count matching chunks
chunk_scores = Counter()
for word in query_words:
if word in self.keyword_index:
for idx in self.keyword_index[word]:
chunk_scores[idx] += 1
# Convert to results
results = []
for idx, score in chunk_scores.most_common(top_k):
normalized_score = score / len(query_words)
results.append((self.chunks[idx], normalized_score))
return results
The searchable text trick: Notice that _create_searchable_text combines the section hierarchy with the content. This means a search for "npm getting started" will match chunks where those terms appear in either the hierarchy path OR the content itself. The hierarchy acts as implicit metadata that improves retrieval without requiring users to know exact section names.
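Putting the index to work, here is a usage sketch that runs both retrieval modes and fuses them with reciprocal rank fusion (RRF). The fusion helper is a common pattern shown for illustration (it is not a method of DocumentationIndex), and k=60 is the conventional default rather than a tuned value.

```python
def rrf_merge(
    result_lists: list[list[tuple[DocChunk, float]]],
    k: int = 60,
) -> list[tuple[DocChunk, float]]:
    """Fuse ranked lists with reciprocal rank fusion: score = sum over lists of 1 / (k + rank)."""
    fused: dict[str, tuple[DocChunk, float]] = {}
    for results in result_lists:
        for rank, (chunk, _) in enumerate(results, start=1):
            previous = fused.get(chunk.id, (chunk, 0.0))[1]
            fused[chunk.id] = (chunk, previous + 1.0 / (k + rank))
    return sorted(fused.values(), key=lambda item: -item[1])

index = DocumentationIndex()
index.add_documents(chunks)  # chunks produced by DocumentationChunker earlier

semantic = index.search("why can't my frontend talk to my API?", top_k=20)
keyword = index.keyword_search("CORS error", top_k=20)

for chunk, score in rrf_merge([semantic, keyword])[:5]:
    print(f"{score:.3f}  {chunk.doc_path}  {' > '.join(chunk.section_hierarchy)}")
```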
Embedding Model Selection
Choosing the right embedding model significantly impacts retrieval quality. Here's a comparison for documentation search (updated January 2025):
| Model | Dimensions | Speed | Quality | Best For | Cost |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | General docs, cost-sensitive | $0.02/1M tokens |
| text-embedding-3-large | 3072 | Medium | Better | High-stakes docs | $0.13/1M tokens |
| voyage-3.5 | 256-2048 | Fast | Excellent | Best quality/cost ratio | $0.06/1M tokens |
| voyage-3.5-lite | 256-2048 | Very Fast | Very Good | Low latency production | $0.02/1M tokens |
| Cohere embed-v4 | 1024 | Fast | Excellent | Enterprise, multilingual | $0.10/1M tokens |
| BAAI/bge-m3 | 1024 | Medium | Excellent | Open-source, hybrid search | Free (local) |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Self-hosted MVP | Free (local) |
| NV-Embed-v2 | 4096 | Medium | Excellent | Long context (32K) | Free (local) |
| voyage-code-3 | 1024 | Medium | Excellent | Code-heavy docs | $0.12/1M tokens |
2025 Recommendations by use case:
- Getting started / MVP: all-MiniLM-L6-v2 — fast, free, good enough
- Production (best quality/cost): voyage-3.5 — outperforms OpenAI by 8% at 2.2x lower cost
- Code-heavy documentation: voyage-code-3 or dual-model (text + code embeddings)
- Maximum quality: Cohere embed-v4 or text-embedding-3-large
- Multilingual docs: Cohere embed-v4 (100+ languages) or BAAI/bge-m3
- Self-hosted / privacy: BAAI/bge-m3 — supports dense, sparse, and multi-vector retrieval
- Hybrid search: BAAI/bge-m3 — generates both dense and sparse vectors simultaneously
Key 2025 developments:
- Matryoshka embeddings: Models like voyage-3.5 support variable dimensions (2048→256), letting you trade quality for speed/cost at query time (see the sketch below)
- Quantization: int8/binary quantization reduces storage by 83% with minimal quality loss
- 32K context windows: Models like NV-Embed-v2 and voyage-3.5 handle entire documents
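The Matryoshka trick is mechanically simple: keep the leading dimensions of the full vector and re-normalize before cosine similarity. Whether quality holds up depends on the model being trained for it (as voyage-3.5 advertises); the NumPy sketch below only shows the mechanics, with a random stand-in vector.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize for cosine similarity."""
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.rand(2048).astype(np.float32)  # stand-in for a full 2048-dim embedding
small = truncate_embedding(full, dims=256)      # 8x less storage per vector
print(small.shape)  # (256,)
```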
Benchmark on documentation retrieval (MTEB-style, higher is better):
| Model | Recall@10 | MRR | Latency (ms) | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 0.72 | 0.58 | 8 | Fast baseline |
| text-embedding-3-small | 0.78 | 0.65 | 45 | Good default |
| voyage-3.5 | 0.84 | 0.73 | 40 | Best value |
| Cohere embed-v4 | 0.85 | 0.74 | 50 | Best quality |
| BAAI/bge-m3 | 0.83 | 0.72 | 35 | Best open-source |
Benchmarks were run on an internal documentation corpus of 50K chunks with 500 test queries.
Query Understanding
The difference between good and great documentation search is query understanding. Users don't search like machines—they use vague terms, incomplete phrases, and questions that don't match how the documentation is written.
Why Query Understanding Matters
Consider these equivalent queries that a human would understand but simple search would miss:
| User Query | Documentation Actually Says |
|---|---|
| "it's not working" | "Troubleshooting common errors" |
| "how to start" | "Getting Started Guide" |
| "db connection" | "Database Configuration" |
| "slow" | "Performance Optimization" |
Query understanding bridges this gap through:
- Intent classification: Is the user looking for a how-to guide, a concept explanation, API reference, or troubleshooting help? Each intent suggests different retrieval strategies.
- Entity extraction: What technical terms, product names, or features are they asking about? These should be matched exactly, not just semantically.
- Query expansion: What synonyms and related terms might match the documentation? "auth" should also search for "authentication", "login", "credentials".
- Query decomposition: Complex questions often need multiple pieces of information. "How do I deploy to AWS with Docker?" might need deployment docs, AWS-specific docs, and Docker docs.
Query Analysis Implementation
from pydantic import BaseModel, Field
from typing import Optional, Literal
from enum import Enum
class QueryIntent(str, Enum):
HOWTO = "howto" # How do I do X?
CONCEPT = "concept" # What is X?
REFERENCE = "reference" # API reference lookup
TROUBLESHOOT = "troubleshoot" # Why is X not working?
EXAMPLE = "example" # Show me an example of X
COMPARISON = "comparison" # X vs Y
class QueryAnalysis(BaseModel):
original_query: str
intent: QueryIntent
entities: list[str] = Field(description="Key technical terms")
expanded_queries: list[str] = Field(description="Alternative phrasings")
sub_queries: list[str] = Field(description="Component questions")
expected_content_types: list[str] = Field(description="code, text, table, etc.")
confidence: float = Field(ge=0, le=1)
class QueryAnalyzer:
"""Analyze and expand documentation queries."""
def __init__(self, client):
self.client = client
def analyze(self, query: str, doc_context: str = "") -> QueryAnalysis:
"""Analyze a documentation query."""
system_prompt = """You are a documentation search assistant.
Analyze the user's query to understand:
1. Their intent (howto, concept, reference, troubleshoot, example, comparison)
2. Key technical entities/terms
3. Alternative phrasings that might match documentation
4. Sub-questions that need answering
5. Expected content types (code examples, explanations, tables)
Be precise about technical terms. Expand abbreviations."""
prompt = f"""Analyze this documentation query:
Query: {query}
{f"Documentation context: {doc_context}" if doc_context else ""}
Provide analysis:"""
# Assumes an instructor-patched client so response_model returns a parsed QueryAnalysis
analysis = self.client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
response_model=QueryAnalysis
)
return analysis
def expand_query(self, query: str) -> list[str]:
"""Generate query expansions."""
# Rule-based expansions
expansions = [query]
# Add common variations
variations = {
"how to": ["how do I", "how can I", "steps to", "guide for"],
"what is": ["what are", "explain", "define", "meaning of"],
"error": ["issue", "problem", "bug", "not working"],
"install": ["setup", "configure", "get started"],
}
query_lower = query.lower()
for pattern, alternatives in variations.items():
if pattern in query_lower:
for alt in alternatives:
expansions.append(query_lower.replace(pattern, alt))
return list(set(expansions))
def decompose_complex_query(self, query: str) -> list[str]:
"""Break complex queries into simpler sub-queries."""
prompt = f"""Break this documentation query into simpler sub-questions:
Query: {query}
Return a list of 2-4 simpler questions that together answer the original.
Each sub-question should be independently searchable."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
# Parse response
lines = response.choices[0].message.content.strip().split("\n")
sub_queries = []
for line in lines:
# Remove numbering and clean up
cleaned = re.sub(r"^\d+[\.\)]\s*", "", line).strip()
if cleaned and len(cleaned) > 10:
sub_queries.append(cleaned)
return sub_queries
Intent drives strategy: The intent classification isn't just metadata—it changes how we retrieve. A "troubleshoot" intent should prioritize warning/note chunks and error-related content. A "reference" intent should prioritize API documentation and code examples. An "example" intent should heavily weight code chunks.
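As a concrete sketch of intent-driven retrieval, the snippet below nudges chunk scores by type based on the classified intent; the boost values are assumptions to tune against your own relevance data, not part of the pipeline above.

```python
INTENT_TYPE_BOOSTS: dict[QueryIntent, dict[str, float]] = {
    QueryIntent.TROUBLESHOOT: {"warning": 0.2, "note": 0.1},
    QueryIntent.REFERENCE: {"code": 0.2, "table": 0.15},
    QueryIntent.EXAMPLE: {"code": 0.3},
}

def apply_intent_boost(
    results: list[tuple[DocChunk, float]],
    intent: QueryIntent,
) -> list[tuple[DocChunk, float]]:
    """Re-score results so chunk types that match the intent rise toward the top."""
    boosts = INTENT_TYPE_BOOSTS.get(intent, {})
    boosted = [(chunk, score + boosts.get(chunk.chunk_type, 0.0)) for chunk, score in results]
    return sorted(boosted, key=lambda item: -item[1])
```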
Query expansion is rule-based first: Before reaching for an LLM to expand queries, notice we use simple string replacements for common patterns. This is fast, predictable, and handles 80% of cases. LLM-based expansion is reserved for complex decomposition where rule-based approaches fail.
Multi-Step Retrieval
This is the core of intelligent documentation search. Instead of a single retrieval step, we orchestrate multiple strategies, combine their results, and iteratively refine.
Why Single-Step Retrieval Fails
Simple RAG does: query → embed → search → top-k → generate. This fails for documentation because:
- Vocabulary mismatch: The user's words don't match the documentation's words. A single embedding search might miss the right content entirely.
- Scattered information: The answer might require information from multiple sections. "How do I set up authentication with OAuth?" needs the auth overview, OAuth-specific setup, and maybe environment configuration.
- Missing context: A chunk about "configuring the redirect URI" makes no sense without the surrounding context about OAuth flows.
- No verification: We don't know if the retrieved content actually answers the question until we try to generate an answer.
The Multi-Step Approach
Our retrieval pipeline addresses each failure mode:
Step 1: Query Analysis
└── Understand intent, extract entities, expand query
Step 2: Multi-Source Retrieval
├── Semantic search (conceptual matching)
├── Keyword search (exact term matching)
└── Multiple query variants
Step 3: Merge & Deduplicate
└── Combine results, keep highest scores
Step 4: Content Type Filtering
└── Match chunk types to query intent
Step 5: Link Expansion
└── Follow cross-references for context
Step 6: Reranking
└── Cross-encoder scoring for precision
Step 7: Sub-Query Handling
└── If results are sparse, decompose and retry
Retrieval Pipeline Implementation
from dataclasses import dataclass
from typing import Optional
@dataclass
class RetrievalResult:
chunks: list[DocChunk]
scores: list[float]
retrieval_path: list[str] # Steps taken
query_used: str
class MultiStepRetriever:
"""Multi-step retrieval for documentation."""
def __init__(
self,
index: DocumentationIndex,
query_analyzer: QueryAnalyzer,
reranker = None
):
self.index = index
self.analyzer = query_analyzer
self.reranker = reranker
def retrieve(
self,
query: str,
max_results: int = 10,
min_confidence: float = 0.5
) -> RetrievalResult:
"""Perform multi-step retrieval."""
retrieval_path = []
# Step 1: Analyze query
analysis = self.analyzer.analyze(query)
retrieval_path.append(f"Intent: {analysis.intent.value}")
# Step 2: Initial retrieval with expanded queries
all_results = []
for expanded_query in [query] + analysis.expanded_queries[:3]:
# Semantic search
semantic_results = self.index.search(expanded_query, top_k=20)
all_results.extend(semantic_results)
# Keyword search for technical terms
keyword_results = self.index.keyword_search(expanded_query, top_k=10)
all_results.extend(keyword_results)
retrieval_path.append(f"Initial retrieval: {len(all_results)} candidates")
# Step 3: Deduplicate and merge scores
merged = self._merge_results(all_results)
retrieval_path.append(f"After merge: {len(merged)} unique chunks")
# Step 4: Filter by content type expectation
if analysis.expected_content_types:
filtered = [
(chunk, score) for chunk, score in merged
if chunk.chunk_type in analysis.expected_content_types or
chunk.chunk_type == "text" # Always include text
]
if filtered:
merged = filtered
retrieval_path.append(f"After type filter: {len(merged)} chunks")
# Step 5: Follow links for context
expanded = self._expand_with_links(merged[:5])
merged = self._merge_results(merged + expanded)
retrieval_path.append(f"After link expansion: {len(merged)} chunks")
# Step 6: Rerank
if self.reranker:
merged = self._rerank(query, merged[:30])
retrieval_path.append("Reranked results")
# Step 7: Handle sub-queries if needed
if analysis.sub_queries and len(merged) < max_results:
for sub_query in analysis.sub_queries:
sub_results = self.index.search(sub_query, top_k=5)
merged.extend(sub_results)
merged = self._merge_results(merged)
retrieval_path.append(f"Added sub-query results: {len(merged)} total")
# Final selection
final_results = merged[:max_results]
chunks = [r[0] for r in final_results]
scores = [r[1] for r in final_results]
return RetrievalResult(
chunks=chunks,
scores=scores,
retrieval_path=retrieval_path,
query_used=query
)
def _merge_results(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Merge and deduplicate results."""
seen = {}
for chunk, score in results:
key = chunk.id
if key not in seen or seen[key][1] < score:
seen[key] = (chunk, score)
merged = list(seen.values())
merged.sort(key=lambda x: -x[1])
return merged
def _expand_with_links(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Expand results by following links."""
expanded = []
for chunk, score in results:
for link in chunk.links:
# Find chunks from linked document
linked_chunks = [
(c, score * 0.8) # Slightly lower score for linked content
for c in self.index.chunks
if link in c.doc_path
]
expanded.extend(linked_chunks[:2])
return expanded
def _rerank(
self,
query: str,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Rerank results using cross-encoder."""
if not self.reranker:
return results
chunks = [r[0] for r in results]
texts = [self._create_searchable_text(c) for c in chunks]
pairs = [(query, text) for text in texts]
scores = self.reranker.predict(pairs)
reranked = list(zip(chunks, scores))
reranked.sort(key=lambda x: -x[1])
return reranked
def _create_searchable_text(self, chunk: DocChunk) -> str:
"""Create text for reranking."""
parts = []
if chunk.section_hierarchy:
parts.append(" > ".join(chunk.section_hierarchy))
parts.append(chunk.content)
return "\n".join(parts)
Understanding the retrieval flow:
- Multiple query variants: We don't just search the original query. We search the original plus up to 3 expanded variants. This dramatically improves recall—if one phrasing misses the right content, another might hit.
- Hybrid search: For each variant, we run both semantic search (embedding similarity) and keyword search (term matching). Semantic catches conceptual matches; keyword catches exact technical terms.
- Link expansion with score decay: When we follow links from top results, we apply a 0.8 multiplier to their scores. This ensures linked content can surface but doesn't overwhelm directly relevant content.
- The retrieval path: We track every step taken (retrieval_path). This is invaluable for debugging—when search fails, you can see exactly where the pipeline went wrong (see the usage sketch below).
- Sub-query fallback: If initial retrieval returns few results, we decompose the query and try again. This handles complex questions that span multiple topics.
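In practice the retrieval_path pays off when you print it next to the final results while debugging a bad query. A usage sketch, assuming the index built earlier and that client is whatever LLM client you pass to QueryAnalyzer:

```python
retriever = MultiStepRetriever(index, QueryAnalyzer(client), reranker=None)
result = retriever.retrieve("how do I set up OAuth redirect URIs?", max_results=5)

print("Steps taken:")
for step in result.retrieval_path:
    print(" -", step)

print("Top chunks:")
for chunk, score in zip(result.chunks, result.scores):
    print(f" {score:.2f}  {chunk.doc_path}  ({' > '.join(chunk.section_hierarchy)})")
```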
Hierarchical Navigation
Documentation has structure that goes beyond flat chunks. A section exists within a parent section, which exists within a document, which exists within a documentation set. Navigating this hierarchy helps us provide context and find related content.
When hierarchy matters:
- Context expansion: A chunk about "setting the timeout parameter" makes more sense when we also retrieve the parent section explaining the HTTP client configuration.
- Related sections: If someone is reading about "POST requests", they might also need "Request Headers" and "Error Handling" from the same API guide.
- Sibling navigation: If a chunk doesn't fully answer the question, the next chunk in the same section often completes the picture.
class HierarchicalNavigator:
"""Navigate documentation hierarchy."""
def __init__(self, index: DocumentationIndex):
self.index = index
self._build_hierarchy()
def _build_hierarchy(self):
"""Build document hierarchy from chunks."""
self.hierarchy = {}
for chunk in self.index.chunks:
doc_path = chunk.doc_path
hierarchy = chunk.section_hierarchy
if doc_path not in self.hierarchy:
self.hierarchy[doc_path] = {"sections": {}, "chunks": []}
current = self.hierarchy[doc_path]["sections"]
for section in hierarchy:
if section not in current:
current[section] = {"subsections": {}, "chunks": []}
current = current[section]["subsections"]
# Add chunk reference
self.hierarchy[doc_path]["chunks"].append(chunk)
def get_section_context(
self,
chunk: DocChunk,
context_chunks: int = 2
) -> list[DocChunk]:
"""Get surrounding chunks for context."""
same_section = [
c for c in self.index.chunks
if c.doc_path == chunk.doc_path and
c.section_hierarchy == chunk.section_hierarchy
]
# Find chunk index
try:
idx = same_section.index(chunk)
except ValueError:
return []
# Get surrounding chunks
start = max(0, idx - context_chunks)
end = min(len(same_section), idx + context_chunks + 1)
return same_section[start:end]
def get_parent_context(self, chunk: DocChunk) -> Optional[DocChunk]:
"""Get parent section overview."""
if not chunk.section_hierarchy:
return None
parent_hierarchy = chunk.section_hierarchy[:-1]
for c in self.index.chunks:
if (c.doc_path == chunk.doc_path and
c.section_hierarchy == parent_hierarchy and
c.chunk_type == "text"):
return c
return None
def get_related_sections(
self,
chunk: DocChunk,
max_sections: int = 3
) -> list[DocChunk]:
"""Get related sections based on links and keywords."""
related = []
# Follow links
for link in chunk.links:
for c in self.index.chunks:
if link in c.doc_path:
related.append(c)
break
# Find sections with similar names
if chunk.section_hierarchy:
current_section = chunk.section_hierarchy[-1].lower()
for c in self.index.chunks:
if c.doc_path != chunk.doc_path and c.section_hierarchy:
other_section = c.section_hierarchy[-1].lower()
if self._similarity(current_section, other_section) > 0.5:
related.append(c)
return related[:max_sections]
def _similarity(self, a: str, b: str) -> float:
"""Simple word overlap similarity."""
words_a = set(a.split())
words_b = set(b.split())
if not words_a or not words_b:
return 0
intersection = words_a & words_b
union = words_a | words_b
return len(intersection) / len(union)
How hierarchy improves results:
The navigator provides three key capabilities:
- Context expansion (get_section_context): When we retrieve a chunk about "setting the timeout", we automatically fetch the 2 chunks before and after. This gives the LLM enough context to provide a complete answer, not just a fragment.
- Parent context (get_parent_context): If someone finds a chunk deep in the hierarchy, the parent section often provides the "why" that makes the "how" make sense.
- Related sections (get_related_sections): By following links and finding similarly-named sections across documents, we can suggest "see also" content that broadens the user's understanding.
When to use hierarchy vs. pure retrieval: Hierarchy expansion is most valuable for how-to and concept queries where context matters. For reference queries ("what are the parameters for X?"), pure retrieval is often sufficient.
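One way to wire the navigator into the pipeline, sketched under the assumption that you already have a RetrievalResult and the query's classified intent: expand context for how-to, concept, and troubleshooting queries, and leave reference lookups alone.

```python
def expand_for_intent(
    navigator: HierarchicalNavigator,
    result: RetrievalResult,
    intent: QueryIntent,
) -> list[DocChunk]:
    """Add neighboring and parent chunks when surrounding context helps the answer."""
    if intent not in (QueryIntent.HOWTO, QueryIntent.CONCEPT, QueryIntent.TROUBLESHOOT):
        return result.chunks

    expanded: list[DocChunk] = []
    seen: set[str] = set()
    for chunk in result.chunks:
        neighbors = navigator.get_section_context(chunk, context_chunks=1)
        parent = navigator.get_parent_context(chunk)
        for candidate in neighbors + ([parent] if parent else []) + [chunk]:
            if candidate.id not in seen:
                seen.add(candidate.id)
                expanded.append(candidate)
    return expanded
```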
Reranking Deep Dive
Initial retrieval optimizes for recall—finding all potentially relevant chunks. Reranking optimizes for precision—putting the best results at the top. This two-stage approach is essential because high-recall retrieval methods (like embedding search with a low threshold) return many candidates, but users only see the top 5-10.
Why Bi-Encoders Aren't Enough
Bi-encoder models (like all-MiniLM-L6-v2) encode query and document independently, then compute similarity via dot product. This is fast—you can pre-compute document embeddings—but it limits expressiveness. The model can't attend across query and document to capture nuanced relevance.
Example of bi-encoder failure:
- Query: "How do I handle errors when the API returns 429?"
- Document A: "Rate limiting returns HTTP 429. Implement exponential backoff." (relevant)
- Document B: "Error handling best practices for API calls." (less relevant)
A bi-encoder might score B higher because "error handling" and "API" match well semantically. But a human (or cross-encoder) recognizes that A specifically addresses the 429 status code.
Cross-Encoder Reranking
Cross-encoders process query and document together, allowing full attention between them. This captures relevance signals that bi-encoders miss:
from sentence_transformers import CrossEncoder
import numpy as np
class DocumentationReranker:
"""Rerank documentation search results with cross-encoder."""
def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
self.model = CrossEncoder(model_name)
def rerank(
self,
query: str,
results: list[tuple[DocChunk, float]],
top_k: int = 10
) -> list[tuple[DocChunk, float]]:
"""Rerank results using cross-encoder."""
if not results:
return []
# Prepare pairs for cross-encoder
chunks = [r[0] for r in results]
pairs = []
for chunk in chunks:
# Include hierarchy in the text for context
text = self._format_for_reranking(chunk)
pairs.append((query, text))
# Score all pairs
scores = self.model.predict(pairs)
# Combine with original scores (optional boosting)
reranked = []
for i, (chunk, original_score) in enumerate(results):
# Cross-encoder score dominates, but original score breaks ties
combined_score = scores[i] * 0.9 + original_score * 0.1
reranked.append((chunk, combined_score))
# Sort by combined score
reranked.sort(key=lambda x: -x[1])
return reranked[:top_k]
def _format_for_reranking(self, chunk: DocChunk) -> str:
"""Format chunk for reranking, including hierarchy context."""
parts = []
# Add hierarchy path
if chunk.section_hierarchy:
parts.append(f"Section: {' > '.join(chunk.section_hierarchy)}")
# Add document path
parts.append(f"Document: {chunk.doc_path}")
# Add content
parts.append(chunk.content)
return "\n".join(parts)
def rerank_with_diversity(
self,
query: str,
results: list[tuple[DocChunk, float]],
top_k: int = 10,
diversity_weight: float = 0.3
) -> list[tuple[DocChunk, float]]:
"""Rerank with diversity to avoid redundant results."""
# First, get cross-encoder scores
reranked = self.rerank(query, results, top_k=len(results))
# Then apply MMR (Maximal Marginal Relevance) for diversity
selected = []
remaining = list(reranked)
while len(selected) < top_k and remaining:
if not selected:
# First item: highest score
selected.append(remaining.pop(0))
else:
# Subsequent items: balance relevance and diversity
best_idx = 0
best_mmr = float('-inf')
for i, (chunk, score) in enumerate(remaining):
# Relevance component
relevance = score
# Diversity component: max similarity to already selected
max_sim = max(
self._chunk_similarity(chunk, sel_chunk)
for sel_chunk, _ in selected
)
# MMR score
mmr = (1 - diversity_weight) * relevance - diversity_weight * max_sim
if mmr > best_mmr:
best_mmr = mmr
best_idx = i
selected.append(remaining.pop(best_idx))
return selected
def _chunk_similarity(self, chunk_a: DocChunk, chunk_b: DocChunk) -> float:
"""Compute similarity between chunks for diversity."""
# Simple Jaccard similarity on words
words_a = set(chunk_a.content.lower().split())
words_b = set(chunk_b.content.lower().split())
if not words_a or not words_b:
return 0
intersection = len(words_a & words_b)
union = len(words_a | words_b)
return intersection / union
Why diversity matters: Documentation often has multiple pages covering similar topics—installation guides for different operating systems, authentication methods, or deployment targets. Without diversity, results might show 5 variations of "install on Ubuntu" when the user needed "install on Mac."
When to Use Reranking
Reranking adds latency (50-200ms for 30 candidates). Use it strategically:
| Scenario | Use Reranking? | Why |
|---|---|---|
| High-stakes queries | Yes | Precision matters more than latency |
| API reference lookup | Sometimes | Exact matches might not need it |
| Troubleshooting | Yes | Finding the right error fix is critical |
| Getting started guides | No | Usually few candidates, all relevant |
| Complex multi-part queries | Yes | Cross-encoder handles nuance better |
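A small sketch of how this table might translate into code: gate the reranker on the classified intent, the candidate count, and whether the query was decomposed. The thresholds are illustrative.

```python
def should_rerank(intent: QueryIntent, candidate_count: int, has_sub_queries: bool) -> bool:
    """Decide whether a cross-encoder pass is worth the extra latency."""
    if candidate_count < 8:
        return False  # few candidates: ordering barely matters
    if intent in (QueryIntent.TROUBLESHOOT, QueryIntent.COMPARISON):
        return True   # precision-critical intents
    return has_sub_queries  # complex multi-part queries benefit too

# Inside the retrieval pipeline:
# if reranker and should_rerank(analysis.intent, len(merged), bool(analysis.sub_queries)):
#     merged = reranker.rerank(query, merged[:30])
```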
Reranking Model Selection (2025)
| Model | Latency (30 docs) | Quality | Best For | Cost |
|---|---|---|---|---|
| Cohere rerank-v4.0-fast | ~30ms | Excellent | Production speed | $0.10/1K queries |
| Cohere rerank-v4.0-pro | ~80ms | Best | Maximum quality | $0.20/1K queries |
| Cohere rerank-v3.5 | ~60ms | Excellent | Multilingual (100+ langs) | $0.10/1K queries |
| BAAI/bge-reranker-v2-gemma2-lightweight | ~40ms | Excellent | Open-source SOTA | Free (local) |
| BAAI/bge-reranker-v2-m3 | ~50ms | Very Good | Multilingual open-source | Free (local) |
| mxbai-rerank-v2 | ~45ms | Excellent | Open-source alternative | Free (local) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | ~100ms | Good | Baseline/comparison | Free (local) |
2025 Reranking Developments:
- Cohere Rerank 4.0 (newest): SOTA performance with 'fast' and 'pro' variants for different latency/quality tradeoffs
- BGE-reranker-v2-gemma2-lightweight: Based on Gemma-2-9B with token compression, achieving excellent quality while saving resources
- LLM-based reranking: 5-8% higher accuracy than cross-encoders but adds 4-6 seconds latency—use for offline/batch processing only
Reranking improves retrieval by up to 48% according to Databricks research. The quality boost is especially pronounced for complex queries where initial retrieval returns many marginally-relevant results.
Documentation-specific tuning: General rerankers are trained on web search data (MS MARCO). For best results on documentation, fine-tune on query-doc pairs from your search logs. Even 1,000 labeled examples significantly improves relevance.
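A hedged sketch of that fine-tuning loop with sentence-transformers, using (query, chunk text) pairs mined from search logs: clicked results as positives, shown-but-skipped results as negatives. The sample pairs and output path are invented for illustration.

```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# label 1.0 for clicked results, 0.0 for shown-but-skipped results
train_samples = [
    InputExample(texts=["how do I rotate API keys?", "Rotate keys from the dashboard under Settings > API."], label=1.0),
    InputExample(texts=["how do I rotate API keys?", "Error handling best practices for API calls."], label=0.0),
    # ... ideally ~1,000+ labeled pairs from your search logs
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=100)
model.save("models/docs-reranker")
```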
Answer Generation
Retrieval gets us the right content. Answer generation synthesizes it into a response that actually helps the user. For documentation, this means more than just summarizing—it means citing sources, estimating confidence, and knowing when to say "I don't know."
What Good Documentation Answers Look Like
Bad answer: "You need to configure authentication. Set up your credentials and make sure the client is initialized properly."
Good answer: "To configure authentication, add your API key to the environment variables as shown in the Configuration section [1]. Then initialize the client with Client(api_key=os.environ['API_KEY']) as demonstrated in the Quick Start [2]. If you're using OAuth instead of API keys, see the OAuth Setup guide [3]."
The difference:
- Specific: References exact configuration steps, not vague guidance
- Cited: Points to source sections so users can dive deeper
- Complete: Covers the common case (API key) and alternatives (OAuth)
- Grounded: Only states what's in the documentation
Confidence Estimation
Not all answers are equally reliable. We estimate confidence based on:
- Retrieval scores: High-scoring chunks suggest strong matches
- Uncertainty language: Phrases like "I'm not sure" or "not found" indicate gaps
- Answer completeness: Very short answers might indicate missing information
This confidence score helps the UI decide whether to show the answer prominently, suggest related searches, or recommend contacting support.
Answer Generator Implementation
class AnswerGenerator:
"""Generate answers from documentation."""
def __init__(self, client):
self.client = client
def generate(
self,
query: str,
retrieval_result: RetrievalResult,
include_citations: bool = True
) -> dict:
"""Generate answer from retrieved chunks."""
# Format context
context = self._format_context(retrieval_result.chunks)
system_prompt = """You are a documentation assistant.
Answer the user's question based ONLY on the provided documentation excerpts.
Guidelines:
1. Be accurate - only state what's in the documentation
2. Be concise - give clear, direct answers
3. Include code examples when relevant
4. If the documentation doesn't contain the answer, say so
5. Reference specific sections when helpful
{citation_instruction}
""".format(
citation_instruction="Add citations like [1], [2] referencing the doc excerpts."
if include_citations else ""
)
prompt = f"""Question: {query}
Documentation excerpts:
{context}
Answer:"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
)
answer = response.choices[0].message.content
# Extract citations
citations = self._extract_citations(
answer,
retrieval_result.chunks
) if include_citations else []
# Estimate confidence
confidence = self._estimate_confidence(
query,
answer,
retrieval_result
)
return {
"answer": answer,
"citations": citations,
"confidence": confidence,
"sources": [
{
"path": chunk.doc_path,
"section": " > ".join(chunk.section_hierarchy),
"type": chunk.chunk_type
}
for chunk in retrieval_result.chunks[:5]
]
}
def _format_context(self, chunks: list[DocChunk]) -> str:
"""Format chunks for context."""
formatted = []
for i, chunk in enumerate(chunks):
header = f"[{i+1}] {chunk.doc_path}"
if chunk.section_hierarchy:
header += f" > {' > '.join(chunk.section_hierarchy)}"
if chunk.chunk_type == "code":
content = f"```{chunk.code_language or ''}\n{chunk.content}\n```"
else:
content = chunk.content
formatted.append(f"{header}\n{content}")
return "\n\n---\n\n".join(formatted)
def _extract_citations(
self,
answer: str,
chunks: list[DocChunk]
) -> list[dict]:
"""Extract citations from answer."""
citations = []
# Find citation markers [1], [2], etc.
markers = re.findall(r"\[(\d+)\]", answer)
for marker in set(markers):
idx = int(marker) - 1
if 0 <= idx < len(chunks):
chunk = chunks[idx]
citations.append({
"marker": f"[{marker}]",
"path": chunk.doc_path,
"section": " > ".join(chunk.section_hierarchy)
})
return citations
def _estimate_confidence(
self,
query: str,
answer: str,
retrieval_result: RetrievalResult
) -> float:
"""Estimate answer confidence."""
# Factors:
# 1. Retrieval scores
avg_score = sum(retrieval_result.scores[:3]) / max(1, min(3, len(retrieval_result.scores)))  # guard against empty scores
# 2. Whether answer indicates uncertainty
uncertainty_phrases = [
"I don't have",
"not found in",
"documentation doesn't",
"unclear",
"might be",
"not sure"
]
has_uncertainty = any(p in answer.lower() for p in uncertainty_phrases)
# 3. Answer length (very short might indicate missing info)
length_factor = min(1.0, len(answer) / 200)
# Combine factors
confidence = avg_score * 0.5 + (0 if has_uncertainty else 0.3) + length_factor * 0.2
return min(1.0, max(0.0, confidence))
Key answer generation patterns:
- Context formatting matters: The _format_context method adds citation markers [1], [2] and section paths. This structured format helps the LLM produce citations and lets users trace answers back to sources.
- Grounded generation: The system prompt emphasizes "ONLY on the provided documentation excerpts." This is crucial—hallucination in documentation answers is worse than no answer. Users trust docs to be accurate.
- Confidence as a feature: The confidence score isn't just internal metadata. Surface it in the UI: high-confidence answers can be shown prominently, while low-confidence ones might show "This answer may be incomplete—try refining your search."
- When to skip generation: If retrieval_result.chunks is empty or all low-scoring, don't generate—just say "No relevant documentation found" and suggest alternative searches.
Streaming for better UX: For production, stream the answer generation. Users see tokens appear immediately rather than waiting 2-3 seconds for the full response. This dramatically improves perceived latency.
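A minimal streaming sketch with the OpenAI Python client; the system_prompt and prompt variables are the ones assembled in the generator above, and push_to_client stands in for however your app delivers tokens (SSE, websocket).

```python
def stream_answer(client, system_prompt: str, prompt: str):
    """Yield answer tokens as they arrive instead of waiting for the full response."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# for token in stream_answer(client, system_prompt, prompt):
#     push_to_client(token)  # hypothetical delivery function (SSE, websocket, etc.)
```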
Production Optimization
A documentation search system that works in development might fall over in production. Real traffic brings latency requirements, cache invalidation challenges, and the need to understand what's working and what's not.
The Latency Challenge
Documentation search has stricter latency requirements than you might expect. Users searching docs are often in the middle of a task—coding, debugging, configuring. Every second they wait is a second of lost flow state.
Target latencies:
- Autocomplete suggestions: < 100ms
- Search results: < 500ms
- AI-generated answers: < 2s (with streaming)
To hit these targets, we need caching at multiple levels.
Caching Strategy
We cache at two levels:
- Query result cache: Store retrieval results for repeated queries. Documentation queries are highly repetitive—"how to install" gets asked constantly.
- Embedding cache: Store embeddings for query strings. Computing embeddings is relatively expensive (50-100ms), and the same queries recur.
Cache invalidation is the hard part: When documentation updates, we need to invalidate cached results that included the changed content. Naive approaches (clear everything) hurt performance. Smart approaches (track which docs affect which cached queries) add complexity.
import hashlib
from datetime import datetime, timedelta
from typing import Optional
class DocumentationCache:
"""Cache for documentation search."""
def __init__(
self,
query_ttl: int = 3600,
embedding_ttl: int = 86400
):
self.query_cache = {}
self.embedding_cache = {}
self.query_ttl = query_ttl
self.embedding_ttl = embedding_ttl
def _hash_query(self, query: str, **kwargs) -> str:
"""Hash query and parameters."""
key_data = f"{query}:{sorted(kwargs.items())}"
return hashlib.md5(key_data.encode()).hexdigest()
def get_query_result(
self,
query: str,
**kwargs
) -> Optional[RetrievalResult]:
"""Get cached query result."""
key = self._hash_query(query, **kwargs)
if key in self.query_cache:
result, timestamp = self.query_cache[key]
if datetime.now() - timestamp < timedelta(seconds=self.query_ttl):
return result
del self.query_cache[key]
return None
def set_query_result(
self,
query: str,
result: RetrievalResult,
**kwargs
):
"""Cache query result."""
key = self._hash_query(query, **kwargs)
self.query_cache[key] = (result, datetime.now())
def get_embedding(self, text: str) -> Optional[np.ndarray]:
"""Get cached embedding."""
key = hashlib.md5(text.encode()).hexdigest()
if key in self.embedding_cache:
embedding, timestamp = self.embedding_cache[key]
if datetime.now() - timestamp < timedelta(seconds=self.embedding_ttl):
return embedding
del self.embedding_cache[key]
return None
def set_embedding(self, text: str, embedding: np.ndarray):
"""Cache embedding."""
key = hashlib.md5(text.encode()).hexdigest()
self.embedding_cache[key] = (embedding, datetime.now())
def invalidate_doc(self, doc_path: str):
"""Invalidate cache for a document."""
# Remove query results that include this doc
keys_to_remove = []
for key, (result, _) in self.query_cache.items():
if any(c.doc_path == doc_path for c in result.chunks):
keys_to_remove.append(key)
for key in keys_to_remove:
del self.query_cache[key]
Cache design decisions:
- TTL-based expiration: Query results expire after 1 hour (configurable). This balances freshness with performance—most docs don't change hourly.
- Content-aware invalidation: The invalidate_doc method only removes cached results that actually included the changed document. This is more surgical than clearing everything.
- In-memory vs. distributed: This implementation uses in-memory dicts, which works for single-instance deployments. For multi-instance, swap to Redis with the same interface (see the sketch below).
- What NOT to cache: Don't cache low-confidence results—they're likely to be wrong, and caching them means serving wrong answers faster.
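For the multi-instance case, a sketch of a Redis-backed query cache with the same get/set interface; it assumes the redis-py client and uses pickle, which is only appropriate for trusted internal data.

```python
import hashlib
import pickle
from typing import Optional

import redis

class RedisQueryCache:
    """Query cache backed by Redis (sketch, not hardened for production)."""

    def __init__(self, url: str = "redis://localhost:6379/0", ttl: int = 3600):
        self.client = redis.Redis.from_url(url)
        self.ttl = ttl

    def _key(self, query: str, **kwargs) -> str:
        raw = f"{query}:{sorted(kwargs.items())}"
        return "docsearch:" + hashlib.md5(raw.encode()).hexdigest()

    def get_query_result(self, query: str, **kwargs) -> Optional["RetrievalResult"]:
        data = self.client.get(self._key(query, **kwargs))
        return pickle.loads(data) if data else None

    def set_query_result(self, query: str, result: "RetrievalResult", **kwargs):
        self.client.setex(self._key(query, **kwargs), self.ttl, pickle.dumps(result))
```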
Incremental Index Updates
Documentation changes frequently—typo fixes, new sections, updated examples. Rebuilding the entire index for every change is wasteful and slow. Instead, we track which documents changed and update only those.
The challenge: When a document changes, we need to:
- Remove old chunks from that document
- Re-chunk the updated document
- Re-embed the new chunks
- Invalidate any cached queries that included the old chunks
When to full rebuild vs. incremental:
- Single doc change: Incremental update
- Few docs changed: Incremental updates
- Major restructuring (>30% of docs): Full rebuild is faster
class IncrementalIndexManager:
"""Manage incremental documentation updates."""
def __init__(self, index: DocumentationIndex, cache: DocumentationCache):
self.index = index
self.cache = cache
self.doc_hashes = {} # doc_path -> content hash
def check_updates(self, doc_paths: list[str]) -> list[str]:
"""Check which documents need updating."""
updated = []
for path in doc_paths:
with open(path, "r") as f:
content = f.read()
content_hash = hashlib.md5(content.encode()).hexdigest()
if path not in self.doc_hashes or self.doc_hashes[path] != content_hash:
updated.append(path)
self.doc_hashes[path] = content_hash
return updated
def update_documents(self, doc_paths: list[str], chunker: DocumentationChunker):
"""Update specific documents in the index."""
for path in doc_paths:
# Remove old chunks
self.index.chunks = [
c for c in self.index.chunks
if c.doc_path != path
]
# Add new chunks
with open(path, "r") as f:
content = f.read()
new_chunks = chunker.chunk_markdown(content, path)
self.index.add_documents(new_chunks)
# Invalidate cache
self.cache.invalidate_doc(path)
def rebuild_if_needed(self, updated_paths: list[str], threshold: float = 0.3) -> bool:
"""Rebuild the full index if too many documents changed."""
# If more than threshold% of tracked docs changed, a full rebuild is more efficient
updated_ratio = len(updated_paths) / max(len(self.doc_hashes), 1)
if updated_ratio > threshold:
self._full_rebuild()
return True
return False
def _full_rebuild(self):
"""Perform full index rebuild."""
self.index._rebuild_embeddings()
self.index._rebuild_keyword_index()
self.cache.query_cache.clear()
Incremental update workflow:
- On git push/deploy: CI/CD detects which markdown files changed
- Call update endpoint: POST /index/update with the changed file paths
- Manager handles the rest: Re-chunks, re-embeds, invalidates cache
Cost savings: For a 1,000-page documentation site, incremental updates when editing a single page save ~99% of embedding compute costs vs. full rebuild. At $0.0001/1K tokens, this adds up.
Gotcha—structural changes: If you rename sections or move content between pages, incremental updates might miss cross-references. Run a full rebuild weekly or when major restructuring happens.
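A sketch of the deploy-time hook: list changed markdown files with git and post them to the update endpoint defined in the service below. The commit range and the service URL are assumptions about your CI environment.

```python
import subprocess
import requests

def changed_markdown_files(base_ref: str = "HEAD~1", head_ref: str = "HEAD") -> list[str]:
    """List markdown files changed between two git refs."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.endswith((".md", ".mdx"))]

if __name__ == "__main__":
    paths = changed_markdown_files()
    if paths:
        # hypothetical internal URL for the FastAPI service shown later in this guide
        resp = requests.post("http://docs-search.internal/index/update", json=paths)
        resp.raise_for_status()
        print(f"Reindexed {len(paths)} documents")
```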
Search Analytics
You can't improve what you don't measure. Search analytics reveal what users are actually looking for, where the system fails, and what documentation is missing.
Key metrics to track:
| Metric | What It Tells You |
|---|---|
| Zero-result rate | Queries where we found nothing—documentation gaps or retrieval failures |
| Low-confidence rate | Queries where we found something but aren't sure it's right |
| Click-through rate | Whether users find results useful (requires UI integration) |
| Latency percentiles | P50 for typical experience, P95/P99 for worst cases |
| Popular queries | What users actually search for vs. what you expect |
The most valuable insight: Zero-result and low-confidence queries often reveal missing documentation. If 50 users search for "webhook setup" and get poor results, you probably need a webhook guide.
from collections import defaultdict
from datetime import datetime
class SearchAnalytics:
"""Track and analyze search patterns."""
def __init__(self):
self.queries = []
self.clicks = defaultdict(int)
self.zero_results = []
self.low_confidence = []
def log_search(
self,
query: str,
result_count: int,
confidence: float,
latency_ms: float
):
"""Log a search event."""
self.queries.append({
"query": query,
"result_count": result_count,
"confidence": confidence,
"latency_ms": latency_ms,
"timestamp": datetime.now()
})
if result_count == 0:
self.zero_results.append(query)
if confidence < 0.5:
self.low_confidence.append(query)
def log_click(self, query: str, doc_path: str):
"""Log a result click."""
self.clicks[f"{query}:{doc_path}"] += 1
def get_popular_queries(self, limit: int = 20) -> list[tuple[str, int]]:
"""Get most popular queries."""
from collections import Counter
query_counts = Counter(q["query"].lower() for q in self.queries)
return query_counts.most_common(limit)
def get_failed_queries(self) -> list[str]:
"""Get queries with no results or low confidence."""
return list(set(self.zero_results + self.low_confidence))
def get_stats(self) -> dict:
"""Get search statistics."""
if not self.queries:
return {}
latencies = [q["latency_ms"] for q in self.queries]
confidences = [q["confidence"] for q in self.queries]
return {
"total_searches": len(self.queries),
"zero_result_rate": len(self.zero_results) / len(self.queries),
"low_confidence_rate": len(self.low_confidence) / len(self.queries),
"avg_latency_ms": sum(latencies) / len(latencies),
"p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
"avg_confidence": sum(confidences) / len(confidences)
}
Making analytics actionable:
- Weekly review ritual: Every week, pull `get_failed_queries()` and review the top 10. Are these documentation gaps or retrieval failures? Either write the missing docs or tune retrieval.
- Popular queries drive priorities: `get_popular_queries()` shows what users actually search for. If your top query is "authentication" but your auth docs are weak, that's high-impact work.
- Click logging requires UI integration: The `log_click` method needs frontend instrumentation. Worth the effort—click-through rate is the most reliable signal of result quality.
- Export for deeper analysis: Periodically export `self.queries` to a data warehouse. Run cohort analysis, build ML models, correlate with user outcomes (see the sketch below).
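To make the export step concrete, here's a minimal sketch that dumps the in-memory query log to a JSONL file a warehouse loader can pick up. It is not part of SearchAnalytics itself, and the output path is a placeholder.
import json

def export_queries(analytics: SearchAnalytics, path: str = "search_log.jsonl") -> int:
    """Append the in-memory query log to a JSONL file for warehouse ingestion."""
    with open(path, "a") as f:
        for event in analytics.queries:
            record = dict(event)
            record["timestamp"] = record["timestamp"].isoformat()  # datetime -> ISO string
            f.write(json.dumps(record) + "\n")
    return len(analytics.queries)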
Complete Search Service
Putting it all together, here's how the components connect into a production API. The key insight is that each component we've built (index, cache, analyzer, retriever, generator, analytics) plugs together cleanly:
Request flow:
- Cache check: Skip expensive retrieval if we've seen this query recently
- Multi-step retrieval: Query analysis → hybrid search → link expansion → reranking
- Answer generation: Synthesize response from retrieved chunks
- Analytics logging: Track for later analysis and improvement
Why FastAPI? It's async-native (important for LLM calls), has automatic OpenAPI docs, and handles validation via Pydantic. But the pattern works with Flask, Django, or any framework.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
app = FastAPI()
# Initialize components
index = DocumentationIndex()
cache = DocumentationCache()
query_analyzer = QueryAnalyzer(client)
retriever = MultiStepRetriever(index, query_analyzer)
generator = AnswerGenerator(client)
analytics = SearchAnalytics()
class SearchRequest(BaseModel):
query: str
max_results: int = 10
include_answer: bool = True
class SearchResponse(BaseModel):
answer: Optional[str]
confidence: float
sources: list[dict]
citations: list[dict]
latency_ms: float
@app.post("/search", response_model=SearchResponse)
async def search(request: SearchRequest):
"""Search documentation."""
import time
start = time.time()
# Check cache
cached = cache.get_query_result(
request.query,
max_results=request.max_results
)
if cached:
retrieval_result = cached
else:
# Perform retrieval
retrieval_result = retriever.retrieve(
request.query,
max_results=request.max_results
)
cache.set_query_result(
request.query,
retrieval_result,
max_results=request.max_results
)
# Generate answer
if request.include_answer and retrieval_result.chunks:
answer_result = generator.generate(
request.query,
retrieval_result
)
else:
answer_result = {
"answer": None,
"confidence": 0,
"citations": [],
"sources": []
}
latency_ms = (time.time() - start) * 1000
# Log analytics
analytics.log_search(
request.query,
len(retrieval_result.chunks),
answer_result["confidence"],
latency_ms
)
return SearchResponse(
answer=answer_result["answer"],
confidence=answer_result["confidence"],
sources=answer_result["sources"],
citations=answer_result["citations"],
latency_ms=latency_ms
)
@app.post("/index/update")
async def update_index(doc_paths: list[str]):
"""Update index for specific documents."""
chunker = DocumentationChunker()
manager = IncrementalIndexManager(index, cache)
manager.update_documents(doc_paths, chunker)
return {"updated": len(doc_paths)}
@app.get("/analytics")
async def get_analytics():
"""Get search analytics."""
return analytics.get_stats()
Deployment considerations:
- Horizontal scaling: The service is stateless (cache/index can be shared via Redis/vector DB). Run multiple instances behind a load balancer.
- Index updates: The `/index/update` endpoint handles incremental updates. Trigger it from CI/CD when docs change.
- Monitoring: Export analytics to Prometheus/Grafana. Alert on latency spikes and zero-result rate increases.
- Cost control: Cache aggressively. Use smaller models for query analysis, larger models for answer generation.
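For reference, a minimal client call against the /search endpoint might look like this; the host and port are assumptions about your deployment, and the requests library is assumed to be available.
import requests

# Sketch: smoke-testing the search service from a client script.
resp = requests.post(
    "http://localhost:8000/search",
    json={"query": "how do I configure webhooks?", "max_results": 5, "include_answer": True},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["answer"])
print("confidence:", result["confidence"], "latency_ms:", result["latency_ms"])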
Documentation Versioning
Real documentation has versions. Users on v1.x need different answers than users on v2.x. A search system that ignores this returns confusing, potentially harmful results—telling a v1 user to use a v2-only feature.
The Version Challenge
Versioned documentation creates several problems:
- Version detection: How do we know which version the user needs? URL path? Explicit selector? Inference from their question?
- Cross-version search: Sometimes users want to see how things changed. "What's different in v2?" requires comparing versions.
- Version aliases: "latest", "stable", "LTS"—these need to resolve to actual versions.
- Deprecation handling: Content might exist in old versions but be deprecated. We should warn users, not just serve stale content.
Multi-Version Index Management
The key insight: maintain separate indices per version, with a routing layer that directs queries appropriately.
from dataclasses import dataclass
from typing import Optional
from datetime import datetime
@dataclass
class DocVersion:
version: str
release_date: datetime
is_latest: bool = False
is_default: bool = False
status: str = "stable" # stable, beta, deprecated, eol
class VersionedDocumentationIndex:
"""Manage documentation across multiple versions."""
def __init__(self):
self.versions: dict[str, DocVersion] = {}
self.indices: dict[str, DocumentationIndex] = {}
self.version_aliases: dict[str, str] = {} # "latest" -> "v2.0"
def add_version(
self,
version: str,
index: DocumentationIndex,
config: DocVersion
):
"""Add a new documentation version."""
self.versions[version] = config
self.indices[version] = index
# Update aliases
if config.is_latest:
self.version_aliases["latest"] = version
if config.is_default:
self.version_aliases["default"] = version
def resolve_version(self, version_query: str) -> str:
"""Resolve version query to actual version."""
# Handle aliases
if version_query in self.version_aliases:
return self.version_aliases[version_query]
# Handle version patterns (e.g., "v2.x" -> latest v2)
if version_query.endswith(".x"):
major = version_query[:-2]
matching = [
v for v in self.versions
if v.startswith(major)
]
if matching:
return sorted(matching)[-1] # Latest matching
# Direct version
if version_query in self.versions:
return version_query
# Default to latest
return self.version_aliases.get("latest", list(self.versions.keys())[0])
def search(
self,
query: str,
version: str = "default",
cross_version: bool = False,
top_k: int = 10
) -> dict:
"""Search documentation with version awareness."""
target_version = self.resolve_version(version)
results = {}
if cross_version:
# Search across all stable versions
for ver, config in self.versions.items():
if config.status in ["stable", "beta"]:
version_results = self.indices[ver].search(query, top_k=5)
results[ver] = version_results
else:
# Search single version
results[target_version] = self.indices[target_version].search(
query, top_k=top_k
)
return {
"requested_version": version,
"resolved_version": target_version,
"results": results,
"cross_version": cross_version
}
def find_version_differences(
self,
query: str,
version_a: str,
version_b: str
) -> dict:
"""Find differences in content between versions."""
results_a = self.indices[version_a].search(query, top_k=5)
results_b = self.indices[version_b].search(query, top_k=5)
# Compare content
content_a = [r[0].content for r in results_a]
content_b = [r[0].content for r in results_b]
differences = []
for chunk_a in results_a:
matching = [
r for r in results_b
if r[0].doc_path == chunk_a[0].doc_path and
r[0].section_hierarchy == chunk_a[0].section_hierarchy
]
if matching:
if chunk_a[0].content != matching[0][0].content:
differences.append({
"path": chunk_a[0].doc_path,
"section": chunk_a[0].section_hierarchy,
"version_a": chunk_a[0].content[:200],
"version_b": matching[0][0].content[:200]
})
else:
differences.append({
"path": chunk_a[0].doc_path,
"section": chunk_a[0].section_hierarchy,
"only_in": version_a
})
return {
"version_a": version_a,
"version_b": version_b,
"differences": differences
}
Version resolution patterns:
"latest"→ Most recent stable version"v2.x"→ Latest v2 minor version"stable"→ Current recommended version (might not be latest)
Cross-version search use cases:
- Migration guides: "What changed between v1 and v2?"
- Compatibility: "Does this feature exist in v1.5?"
- Regression investigation: "When did this behavior change?"
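A minimal usage sketch, assuming index_v1 and index_v2 are DocumentationIndex instances already built from the v1.x and v2.0 doc trees (datetime is imported above):
# Sketch: registering two versions and routing queries between them.
versioned = VersionedDocumentationIndex()
versioned.add_version(
    "v1.5", index_v1,
    DocVersion(version="v1.5", release_date=datetime(2023, 6, 1))
)
versioned.add_version(
    "v2.0", index_v2,
    DocVersion(version="v2.0", release_date=datetime(2024, 3, 1), is_latest=True, is_default=True)
)

print(versioned.resolve_version("latest"))  # -> "v2.0"
print(versioned.resolve_version("v1.x"))    # -> "v1.5"

results = versioned.search("configure webhooks", version="v1.x")
diff = versioned.find_version_differences("authentication", "v1.5", "v2.0")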
Migration Assistance
One of the most valuable features for versioned docs: helping users migrate. This combines version comparison with intelligent answer generation.
class VersionMigrationAssistant:
"""Help users migrate between documentation versions."""
def __init__(self, client, versioned_index: VersionedDocumentationIndex):
self.client = client
self.index = versioned_index
def generate_migration_guide(
self,
topic: str,
from_version: str,
to_version: str
) -> dict:
"""Generate a migration guide for a topic."""
# Get relevant content from both versions
old_results = self.index.indices[from_version].search(topic, top_k=5)
new_results = self.index.indices[to_version].search(topic, top_k=5)
old_content = "\n\n".join([r[0].content for r in old_results])
new_content = "\n\n".join([r[0].content for r in new_results])
prompt = f"""Compare these two versions of documentation and identify changes.
## Version {from_version}:
{old_content}
## Version {to_version}:
{new_content}
Provide:
1. Breaking changes that require code updates
2. New features or options available
3. Deprecated features to avoid
4. Step-by-step migration instructions"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"topic": topic,
"from_version": from_version,
"to_version": to_version,
"migration_guide": response.choices[0].message.content
}
def detect_deprecated_usage(
self,
code_snippet: str,
current_version: str
    ) -> dict:
"""Detect deprecated API usage in code."""
# Search for deprecation notices
deprecation_results = self.index.indices[current_version].search(
"deprecated removed breaking change",
top_k=20
)
deprecation_info = "\n\n".join([
r[0].content for r in deprecation_results
if "deprecated" in r[0].content.lower()
])
prompt = f"""Analyze this code for deprecated API usage based on the documentation.
Code:
{code_snippet}
Deprecation information from docs:
{deprecation_info}
List any deprecated APIs used and their replacements."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"code": code_snippet,
"version": current_version,
"analysis": response.choices[0].message.content
}
Migration assistance use cases:
- Upgrade planning: Before upgrading, users can see what changes affect their code
- Breaking change detection: Automatically flag API calls that won't work in the new version
- Migration guides: Generate step-by-step migration instructions from version diffs
- Deprecation warnings: Surface deprecated features before they're removed
Integration with search: When a user searches and gets results from an old version, proactively offer: "This documentation is for v1.x. You might also want to see what changed in v2."
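A short usage sketch, assuming client is the OpenAI client used throughout this guide and versioned is the VersionedDocumentationIndex from the earlier sketch; the snippet passed to detect_deprecated_usage is a toy example.
assistant = VersionMigrationAssistant(client, versioned)

# Generate a topic-scoped migration guide between two indexed versions.
guide = assistant.generate_migration_guide(
    topic="webhook configuration",
    from_version="v1.5",
    to_version="v2.0",
)
print(guide["migration_guide"])

# Check a (toy) code snippet for deprecated API usage against the current version's docs.
report = assistant.detect_deprecated_usage(
    code_snippet="client.webhooks.create(url, legacy_mode=True)",
    current_version="v2.0",
)
print(report["analysis"])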
API Reference Search
API reference documentation is different from guides and tutorials. It's highly structured—endpoints, parameters, return types, examples. Users search it differently too: they know what they're looking for (a specific method, parameter, or endpoint) and need precise answers.
Why API Reference Needs Special Treatment
General documentation search struggles with API reference because:
- Structure over prose: API docs are tables, signatures, and code—not flowing text that embeds well.
- Exact matching matters: A user searching for `POST /users` needs that exact endpoint, not semantically similar content about "creating accounts."
- Parameter-level granularity: Users often need details about a single parameter, not an entire endpoint.
- Cross-referencing: Understanding an endpoint often requires understanding related types, error codes, and authentication.
Structured API Documentation Model
We model API documentation as structured objects, not just text chunks:
from pydantic import BaseModel, Field
from typing import Optional, Literal
class APIParameter(BaseModel):
name: str
type: str
required: bool = True
default: Optional[str] = None
description: str
class APIEndpoint(BaseModel):
method: Literal["GET", "POST", "PUT", "DELETE", "PATCH"]
path: str
description: str
parameters: list[APIParameter] = Field(default_factory=list)
request_body: Optional[dict] = None
response_schema: Optional[dict] = None
examples: list[dict] = Field(default_factory=list)
auth_required: bool = True
rate_limit: Optional[str] = None
deprecated: bool = False
class APIClass(BaseModel):
name: str
module: str
description: str
methods: list["APIMethod"] = Field(default_factory=list)
properties: list[APIParameter] = Field(default_factory=list)
inheritance: list[str] = Field(default_factory=list)
examples: list[str] = Field(default_factory=list)
class APIMethod(BaseModel):
name: str
signature: str
description: str
parameters: list[APIParameter] = Field(default_factory=list)
returns: Optional[str] = None
raises: list[str] = Field(default_factory=list)
examples: list[str] = Field(default_factory=list)
async_method: bool = False
deprecated: bool = False
class APIReferenceIndex:
"""Specialized index for API reference documentation."""
def __init__(self, embedding_model):
self.encoder = embedding_model
self.endpoints: list[APIEndpoint] = []
self.classes: list[APIClass] = []
self.methods: list[APIMethod] = []
self.endpoint_embeddings = {}
self.class_embeddings = {}
self.method_embeddings = {}
def add_endpoint(self, endpoint: APIEndpoint):
"""Add an API endpoint to the index."""
self.endpoints.append(endpoint)
# Create searchable text
text = f"{endpoint.method} {endpoint.path}\n{endpoint.description}"
for param in endpoint.parameters:
text += f"\nParam: {param.name} ({param.type}) - {param.description}"
embedding = self.encoder.encode(text)
self.endpoint_embeddings[len(self.endpoints) - 1] = embedding
def add_class(self, api_class: APIClass):
"""Add an API class to the index."""
self.classes.append(api_class)
# Create searchable text
text = f"class {api_class.name}\n{api_class.description}"
for method in api_class.methods:
text += f"\nMethod: {method.name} - {method.description[:100]}"
embedding = self.encoder.encode(text)
self.class_embeddings[len(self.classes) - 1] = embedding
# Also index individual methods
for method in api_class.methods:
self._add_method(method, api_class.name)
def _add_method(self, method: APIMethod, class_name: str):
"""Add a method to the index."""
self.methods.append(method)
text = f"{class_name}.{method.name}{method.signature}\n{method.description}"
for param in method.parameters:
text += f"\nParam: {param.name} ({param.type}) - {param.description}"
embedding = self.encoder.encode(text)
self.method_embeddings[len(self.methods) - 1] = embedding
def search_endpoints(self, query: str, top_k: int = 5) -> list[tuple[APIEndpoint, float]]:
"""Search for API endpoints."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.endpoint_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.endpoints[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def search_methods(self, query: str, top_k: int = 5) -> list[tuple[APIMethod, float]]:
"""Search for API methods."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.method_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.methods[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def search_classes(self, query: str, top_k: int = 5) -> list[tuple[APIClass, float]]:
"""Search for API classes."""
query_embed = self.encoder.encode(query)
results = []
for idx, embed in self.class_embeddings.items():
similarity = self._cosine_similarity(query_embed, embed)
results.append((self.classes[idx], similarity))
results.sort(key=lambda x: -x[1])
return results[:top_k]
def find_by_signature(self, signature_pattern: str) -> list:
"""Find methods by signature pattern."""
import re
pattern = re.compile(signature_pattern, re.IGNORECASE)
matches = []
for method in self.methods:
if pattern.search(method.signature):
matches.append(method)
return matches
def _cosine_similarity(self, a, b) -> float:
import numpy as np
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
class APISearchAgent:
"""Intelligent API reference search."""
def __init__(self, client, api_index: APIReferenceIndex, doc_index: DocumentationIndex):
self.client = client
self.api_index = api_index
self.doc_index = doc_index
def answer_api_question(self, question: str) -> dict:
"""Answer a question about the API."""
# Determine question type
question_lower = question.lower()
if any(word in question_lower for word in ["how to", "how do", "example"]):
return self._answer_howto(question)
elif any(word in question_lower for word in ["what is", "explain", "describe"]):
return self._answer_concept(question)
elif any(word in question_lower for word in ["parameter", "argument", "option"]):
return self._answer_parameter(question)
else:
return self._answer_general(question)
def _answer_howto(self, question: str) -> dict:
"""Answer how-to questions with examples."""
# Search for relevant methods
methods = self.api_index.search_methods(question, top_k=3)
# Search documentation for examples
doc_results = self.doc_index.search(question + " example", top_k=5)
# Build context
api_context = []
for method, score in methods:
api_context.append(f"Method: {method.name}\nSignature: {method.signature}\n"
f"Description: {method.description}")
if method.examples:
api_context.append(f"Examples:\n" + "\n".join(method.examples))
doc_context = [chunk.content for chunk, _ in doc_results]
prompt = f"""Question: {question}
API Reference:
{chr(10).join(api_context)}
Documentation:
{chr(10).join(doc_context)}
Provide a clear answer with code examples."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_methods": [m[0].name for m in methods],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _answer_concept(self, question: str) -> dict:
"""Answer conceptual questions."""
classes = self.api_index.search_classes(question, top_k=3)
doc_results = self.doc_index.search(question, top_k=5)
context = []
for cls, score in classes:
context.append(f"Class: {cls.name}\nModule: {cls.module}\n"
f"Description: {cls.description}")
for chunk, score in doc_results:
context.append(chunk.content)
prompt = f"""Question: {question}
Context:
{chr(10).join(context)}
Provide a clear explanation."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_classes": [c[0].name for c in classes],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _answer_parameter(self, question: str) -> dict:
"""Answer parameter-specific questions."""
methods = self.api_index.search_methods(question, top_k=5)
# Extract parameter info
param_info = []
for method, score in methods:
for param in method.parameters:
if any(word in param.name.lower() or word in param.description.lower()
for word in question.lower().split()):
param_info.append({
"method": method.name,
"parameter": param.name,
"type": param.type,
"required": param.required,
"default": param.default,
"description": param.description
})
return {
"answer": self._format_parameter_answer(param_info),
"parameters": param_info
}
def _answer_general(self, question: str) -> dict:
"""General API question answering."""
methods = self.api_index.search_methods(question, top_k=3)
endpoints = self.api_index.search_endpoints(question, top_k=3)
doc_results = self.doc_index.search(question, top_k=5)
context = []
for method, score in methods:
context.append(f"Method: {method.signature}\n{method.description}")
for endpoint, score in endpoints:
context.append(f"{endpoint.method} {endpoint.path}\n{endpoint.description}")
for chunk, score in doc_results:
context.append(chunk.content)
prompt = f"""Question: {question}
API Reference and Documentation:
{chr(10).join(context[:10])}
Provide a comprehensive answer."""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return {
"answer": response.choices[0].message.content,
"relevant_methods": [m[0].name for m in methods],
"relevant_endpoints": [f"{e[0].method} {e[0].path}" for e in endpoints],
"sources": [{"path": r[0].doc_path, "section": r[0].section_hierarchy}
for r in doc_results[:3]]
}
def _format_parameter_answer(self, params: list[dict]) -> str:
"""Format parameter information as readable answer."""
if not params:
return "No matching parameters found."
answer = "Here are the relevant parameters:\n\n"
for p in params:
answer += f"**{p['method']}.{p['parameter']}**\n"
answer += f"- Type: `{p['type']}`\n"
answer += f"- Required: {p['required']}\n"
if p['default']:
answer += f"- Default: `{p['default']}`\n"
answer += f"- Description: {p['description']}\n\n"
return answer
API search patterns:
The APISearchAgent demonstrates how to route questions to the right strategy:
- How-to questions → Search methods + documentation examples, emphasize code
- Concept questions → Search classes + documentation explanations, emphasize prose
- Parameter questions → Search method signatures, return structured parameter info
- General questions → Search everything, synthesize comprehensive answer
This routing is simple but effective. More sophisticated approaches use LLMs to classify questions, but rule-based routing handles 80% of cases with zero latency overhead.
Structured output is key: Instead of returning free-text answers, the API search returns structured data (relevant methods, endpoints, sources). This lets the UI render rich results—method signatures, parameter tables, "see also" links—rather than just a wall of text.
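A short usage sketch, assuming embedding_model is the encoder used elsewhere in this guide, doc_index is the DocumentationIndex built earlier, and client is the OpenAI client; the example endpoint is illustrative.
# Sketch: populating the API reference index and routing a question through the agent.
api_index = APIReferenceIndex(embedding_model)

api_index.add_endpoint(APIEndpoint(
    method="POST",
    path="/users",
    description="Create a new user account.",
    parameters=[
        APIParameter(name="email", type="string", description="Email address for the new user"),
        APIParameter(name="role", type="string", required=False, default="member",
                     description="Initial role assigned to the user"),
    ],
))

agent = APISearchAgent(client, api_index, doc_index)
result = agent.answer_api_question("how do I create a user?")  # routed to _answer_howto
print(result["answer"])
print("Methods consulted:", result.get("relevant_methods", []))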
User Feedback Integration
The best documentation search systems learn from their users. Every "this wasn't helpful" click is a signal. Every unanswered query reveals a gap. Building feedback loops turns your search system into a continuously improving system.
Why Feedback Matters
Static search systems degrade over time. Documentation changes, user expectations evolve, and new features create new query patterns. Without feedback, you're flying blind—optimizing for benchmarks that may not reflect real usage.
What feedback reveals:
| Feedback Signal | What It Means | Action |
|---|---|---|
| "Not helpful" on high-confidence result | Retrieval or generation failure | Review and fix |
| "Helpful" on low-confidence result | Confidence estimation is wrong | Adjust thresholds |
| "Incorrect" feedback | Serious issue—wrong information | Urgent review |
| "Outdated" feedback | Documentation staleness | Update docs |
| "Missing info" feedback | Documentation gap | Write new content |
Feedback Collection System
The feedback system needs to capture enough context to be actionable—not just "good/bad" but why and what the user was actually looking for.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Literal
from enum import Enum
class FeedbackType(str, Enum):
HELPFUL = "helpful"
NOT_HELPFUL = "not_helpful"
INCORRECT = "incorrect"
OUTDATED = "outdated"
MISSING_INFO = "missing_info"
@dataclass
class SearchFeedback:
query: str
answer: str
feedback_type: FeedbackType
comment: Optional[str]
user_id: Optional[str]
timestamp: datetime
search_results: list[str] # doc paths
confidence: float
class FeedbackCollector:
"""Collect and store user feedback on search results."""
def __init__(self, storage_path: str = "feedback.jsonl"):
self.storage_path = storage_path
self.feedback_buffer: list[SearchFeedback] = []
def submit_feedback(
self,
query: str,
answer: str,
feedback_type: FeedbackType,
search_results: list[str],
confidence: float,
comment: str = None,
user_id: str = None
) -> str:
"""Submit feedback for a search result."""
feedback = SearchFeedback(
query=query,
answer=answer,
feedback_type=feedback_type,
comment=comment,
user_id=user_id,
timestamp=datetime.now(),
search_results=search_results,
confidence=confidence
)
self.feedback_buffer.append(feedback)
self._persist_feedback(feedback)
return f"feedback_{len(self.feedback_buffer)}"
def _persist_feedback(self, feedback: SearchFeedback):
"""Persist feedback to storage."""
import json
with open(self.storage_path, "a") as f:
f.write(json.dumps({
"query": feedback.query,
"answer": feedback.answer[:500],
"feedback_type": feedback.feedback_type.value,
"comment": feedback.comment,
"user_id": feedback.user_id,
"timestamp": feedback.timestamp.isoformat(),
"search_results": feedback.search_results,
"confidence": feedback.confidence
}) + "\n")
def get_feedback_stats(self) -> dict:
"""Get feedback statistics."""
from collections import Counter
type_counts = Counter(f.feedback_type for f in self.feedback_buffer)
total = len(self.feedback_buffer)
return {
"total_feedback": total,
"helpful_rate": type_counts.get(FeedbackType.HELPFUL, 0) / max(total, 1),
"by_type": {t.value: c for t, c in type_counts.items()},
"avg_confidence_helpful": self._avg_confidence(FeedbackType.HELPFUL),
"avg_confidence_not_helpful": self._avg_confidence(FeedbackType.NOT_HELPFUL)
}
def _avg_confidence(self, feedback_type: FeedbackType) -> float:
matching = [f for f in self.feedback_buffer if f.feedback_type == feedback_type]
if not matching:
return 0
return sum(f.confidence for f in matching) / len(matching)
Learning from Feedback
Collecting feedback is useless if you don't act on it. The feedback learner automatically extracts signals and adjusts retrieval accordingly.
Automatic adjustments:
- Document boosting: If users consistently find results from a particular doc helpful, boost it in rankings. If results are consistently unhelpful, demote it.
- Query correction: If users provide "what they were actually looking for" in feedback, use that to expand future similar queries.
- Negative examples: Track query-answer pairs that failed so you can evaluate whether changes actually fix them.
class FeedbackLearner:
"""Learn from feedback to improve search."""
def __init__(self, feedback_collector: FeedbackCollector):
self.collector = feedback_collector
self.query_corrections: dict[str, str] = {} # query -> better query
self.doc_boosts: dict[str, float] = {} # doc_path -> boost factor
self.negative_examples: list[tuple[str, str]] = [] # (query, bad_answer)
def analyze_feedback(self):
"""Analyze feedback to extract learning signals."""
for feedback in self.collector.feedback_buffer:
if feedback.feedback_type == FeedbackType.HELPFUL:
# Boost documents that were helpful
for doc in feedback.search_results[:3]:
self.doc_boosts[doc] = self.doc_boosts.get(doc, 1.0) * 1.1
elif feedback.feedback_type == FeedbackType.NOT_HELPFUL:
# Reduce boost for unhelpful docs
for doc in feedback.search_results[:3]:
self.doc_boosts[doc] = self.doc_boosts.get(doc, 1.0) * 0.9
elif feedback.feedback_type == FeedbackType.INCORRECT:
# Track as negative example
self.negative_examples.append((feedback.query, feedback.answer))
elif feedback.feedback_type == FeedbackType.MISSING_INFO:
# Track queries that need better coverage
if feedback.comment:
# User might have provided what they were looking for
self._extract_query_improvement(feedback)
def _extract_query_improvement(self, feedback: SearchFeedback):
"""Extract query improvement from feedback comment."""
if feedback.comment and len(feedback.comment) > 10:
# Store as potential query expansion
self.query_corrections[feedback.query] = feedback.comment
def apply_doc_boosts(
self,
results: list[tuple[DocChunk, float]]
) -> list[tuple[DocChunk, float]]:
"""Apply learned boosts to search results."""
boosted = []
for chunk, score in results:
boost = self.doc_boosts.get(chunk.doc_path, 1.0)
boosted.append((chunk, score * boost))
boosted.sort(key=lambda x: -x[1])
return boosted
def get_query_expansion(self, query: str) -> Optional[str]:
"""Get learned query expansion."""
return self.query_corrections.get(query)
def export_training_data(self) -> dict:
"""Export data for fine-tuning or evaluation."""
return {
"positive_examples": [
{"query": f.query, "docs": f.search_results, "answer": f.answer}
for f in self.collector.feedback_buffer
if f.feedback_type == FeedbackType.HELPFUL
],
"negative_examples": self.negative_examples,
"query_corrections": self.query_corrections,
"doc_boosts": self.doc_boosts
}
Active Learning Loop
The active learning loop combines feedback collection and learning into a continuous improvement cycle. It also identifies documentation gaps—queries that consistently fail despite the system's best efforts.
The improvement cycle:
User searches → System returns results → User provides feedback
       ↑                                          ↓
   Apply boosts ← Learn from feedback ← Store feedback
Gap identification: When multiple users ask similar questions and get poor results, that's a documentation gap—not a retrieval failure. The active learning loop surfaces these gaps for content teams to address.
class ActiveLearningLoop:
"""Active learning loop to continuously improve search."""
def __init__(
self,
retriever: MultiStepRetriever,
generator: AnswerGenerator,
feedback_learner: FeedbackLearner
):
self.retriever = retriever
self.generator = generator
self.learner = feedback_learner
def search_with_learning(
self,
query: str,
**kwargs
) -> dict:
"""Search with feedback-based improvements."""
# Check for query expansion from feedback
expanded_query = self.learner.get_query_expansion(query)
if expanded_query:
query = f"{query} {expanded_query}"
# Perform retrieval
retrieval_result = self.retriever.retrieve(query, **kwargs)
# Apply learned doc boosts
boosted_results = self.learner.apply_doc_boosts(
list(zip(retrieval_result.chunks, retrieval_result.scores))
)
retrieval_result.chunks = [r[0] for r in boosted_results]
retrieval_result.scores = [r[1] for r in boosted_results]
# Generate answer
answer_result = self.generator.generate(query, retrieval_result)
return {
"answer": answer_result["answer"],
"confidence": answer_result["confidence"],
"sources": answer_result["sources"],
"query_expanded": expanded_query is not None
}
def identify_gaps(self) -> list[dict]:
"""Identify documentation gaps from feedback."""
gaps = []
# Queries with low confidence but high frequency
from collections import Counter
query_counts = Counter(
f.query for f in self.learner.collector.feedback_buffer
)
for query, count in query_counts.most_common(20):
matching = [
f for f in self.learner.collector.feedback_buffer
if f.query == query
]
avg_confidence = sum(f.confidence for f in matching) / len(matching)
if avg_confidence < 0.5 and count >= 3:
gaps.append({
"query": query,
"frequency": count,
"avg_confidence": avg_confidence,
"feedback_types": [f.feedback_type.value for f in matching]
})
return gaps
def generate_improvement_report(self) -> str:
"""Generate a report of suggested improvements."""
gaps = self.identify_gaps()
stats = self.learner.collector.get_feedback_stats()
report = "# Documentation Search Improvement Report\n\n"
report += "## Overall Statistics\n"
report += f"- Total feedback: {stats['total_feedback']}\n"
report += f"- Helpful rate: {stats['helpful_rate']:.1%}\n"
report += f"- By type: {stats['by_type']}\n\n"
if gaps:
report += "## Documentation Gaps\n"
report += "These queries frequently return low-confidence results:\n\n"
for gap in gaps[:10]:
report += f"- **{gap['query']}** (asked {gap['frequency']}x, "
report += f"avg confidence: {gap['avg_confidence']:.2f})\n"
if self.learner.query_corrections:
report += "\n## Query Corrections\n"
report += "Users suggested these query improvements:\n\n"
for original, correction in list(self.learner.query_corrections.items())[:10]:
report += f"- \"{original}\" → \"{correction}\"\n"
return report
Real-Time Search Quality Monitoring
In production, you need to know when search quality degrades—before users complain. Real-time monitoring catches issues like:
- Latency spikes: Maybe an embedding service is slow, or the index got too large
- Confidence drops: A documentation update might have broken something
- Zero-result increases: New queries the system can't handle
- Feedback pattern changes: Sudden increase in "not helpful" feedback
The monitoring system tracks rolling windows of metrics and alerts when thresholds are exceeded.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import statistics
@dataclass
class SearchQualityMetric:
timestamp: datetime
query: str
latency_ms: float
result_count: int
confidence: float
had_feedback: bool
feedback_positive: Optional[bool]
class SearchQualityMonitor:
"""Monitor search quality in real-time."""
def __init__(self, window_minutes: int = 60):
self.metrics: list[SearchQualityMetric] = []
self.window_minutes = window_minutes
self.alerts: list[dict] = []
def record_search(
self,
query: str,
latency_ms: float,
result_count: int,
confidence: float
):
"""Record a search event."""
self.metrics.append(SearchQualityMetric(
timestamp=datetime.now(),
query=query,
latency_ms=latency_ms,
result_count=result_count,
confidence=confidence,
had_feedback=False,
feedback_positive=None
))
# Clean old metrics
self._cleanup_old_metrics()
# Check for alerts
self._check_alerts()
def record_feedback(self, query: str, positive: bool):
"""Record feedback for a recent search."""
# Find matching recent metric
for metric in reversed(self.metrics):
if metric.query == query and not metric.had_feedback:
metric.had_feedback = True
metric.feedback_positive = positive
break
def _cleanup_old_metrics(self):
"""Remove metrics outside the window."""
cutoff = datetime.now() - timedelta(minutes=self.window_minutes)
self.metrics = [m for m in self.metrics if m.timestamp > cutoff]
def _check_alerts(self):
"""Check for quality degradation."""
if len(self.metrics) < 10:
return
recent = self.metrics[-10:]
# Check latency
avg_latency = statistics.mean(m.latency_ms for m in recent)
if avg_latency > 2000: # 2 second threshold
self._add_alert("high_latency", f"Average latency {avg_latency:.0f}ms")
# Check confidence
avg_confidence = statistics.mean(m.confidence for m in recent)
if avg_confidence < 0.4:
self._add_alert("low_confidence", f"Average confidence {avg_confidence:.2f}")
# Check zero results
zero_result_rate = sum(1 for m in recent if m.result_count == 0) / len(recent)
if zero_result_rate > 0.2:
self._add_alert("high_zero_results", f"Zero result rate {zero_result_rate:.1%}")
def _add_alert(self, alert_type: str, message: str):
"""Add an alert."""
self.alerts.append({
"type": alert_type,
"message": message,
"timestamp": datetime.now().isoformat()
})
def get_dashboard_metrics(self) -> dict:
"""Get metrics for dashboard display."""
if not self.metrics:
return {"error": "No data"}
recent = self.metrics[-100:]
# Feedback metrics
with_feedback = [m for m in recent if m.had_feedback]
positive_feedback = [m for m in with_feedback if m.feedback_positive]
return {
"window_minutes": self.window_minutes,
"total_searches": len(recent),
"avg_latency_ms": statistics.mean(m.latency_ms for m in recent),
"p95_latency_ms": sorted(m.latency_ms for m in recent)[int(len(recent) * 0.95)],
"avg_confidence": statistics.mean(m.confidence for m in recent),
"zero_result_rate": sum(1 for m in recent if m.result_count == 0) / len(recent),
"feedback_rate": len(with_feedback) / len(recent) if recent else 0,
"positive_feedback_rate": len(positive_feedback) / len(with_feedback) if with_feedback else 0,
"active_alerts": self.alerts[-5:]
}
def get_slow_queries(self, threshold_ms: float = 1000) -> list[dict]:
"""Get queries that are consistently slow."""
from collections import defaultdict
query_latencies = defaultdict(list)
for m in self.metrics:
query_latencies[m.query].append(m.latency_ms)
slow = []
for query, latencies in query_latencies.items():
if len(latencies) >= 2 and statistics.mean(latencies) > threshold_ms:
slow.append({
"query": query,
"avg_latency_ms": statistics.mean(latencies),
"count": len(latencies)
})
return sorted(slow, key=lambda x: -x["avg_latency_ms"])[:10]
def get_low_quality_queries(self, threshold: float = 0.4) -> list[dict]:
"""Get queries with consistently low confidence."""
from collections import defaultdict
query_confidences = defaultdict(list)
for m in self.metrics:
query_confidences[m.query].append(m.confidence)
low_quality = []
for query, confidences in query_confidences.items():
if len(confidences) >= 2 and statistics.mean(confidences) < threshold:
low_quality.append({
"query": query,
"avg_confidence": statistics.mean(confidences),
"count": len(confidences)
})
return sorted(low_quality, key=lambda x: x["avg_confidence"])[:10]
Using the monitor effectively:
- Dashboard integration: Call `get_dashboard_metrics()` periodically and display in your ops dashboard. The metrics give you at-a-glance system health.
- Alerting thresholds: The `_check_alerts` method fires when recent searches show degradation. Wire these to PagerDuty/Slack for on-call notification (see the sketch after the alert table below).
- Proactive debugging: `get_slow_queries()` and `get_low_quality_queries()` identify specific problem areas. Use these weekly to prioritize improvements.
- Feedback correlation: By tracking both search metrics and feedback on the same queries, you can identify whether low confidence actually correlates with user dissatisfaction—calibrating your confidence estimation.
What to do with alerts:
| Alert | Likely Cause | Action |
|---|---|---|
| `high_latency` | Embedding service slow, index too large | Check external services, consider index sharding |
| `low_confidence` | Doc update broke retrieval, new query patterns | Review recent changes, check failed queries |
| `high_zero_results` | New terminology, missing docs | Add query expansions, write missing content |
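One way to act on these alerts is to forward them to a chat channel. A minimal sketch, assuming an incoming-webhook style endpoint (ALERT_WEBHOOK_URL is a placeholder environment variable) and the requests library:
import os
import requests

def forward_alerts(monitor: SearchQualityMonitor):
    """Push the most recent monitor alerts to a chat webhook."""
    webhook_url = os.environ.get("ALERT_WEBHOOK_URL")
    if not webhook_url or not monitor.alerts:
        return
    for alert in monitor.alerts[-5:]:  # most recent alerts only
        requests.post(
            webhook_url,
            json={"text": f"[search-quality] {alert['type']}: {alert['message']}"},
            timeout=10,
        )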
Advanced Techniques
Once you have a basic documentation search working, these advanced techniques can significantly improve quality for specific use cases.
HyDE: Hypothetical Document Embeddings
HyDE addresses the query-document mismatch problem. Instead of embedding the user's question directly, we first generate a hypothetical answer, then embed that. The hypothesis is closer to how documentation is actually written.
Why it works: When a user asks "why is my app crashing?", they use question phrasing. But documentation says "Common causes of crashes include...". HyDE generates a hypothetical document that bridges this gap.
import numpy as np

class HyDEQueryExpander:
"""Expand queries using Hypothetical Document Embeddings."""
def __init__(self, client, encoder):
self.client = client
self.encoder = encoder
def expand_with_hyde(self, query: str, doc_context: str = "") -> np.ndarray:
"""Generate hypothetical document and embed it."""
prompt = f"""You are a technical documentation writer.
Given this user question, write a short documentation excerpt (2-3 sentences)
that would answer it. Write it as if it's from official documentation—
factual, direct, no hedging.
Question: {query}
{f"Documentation context: {doc_context}" if doc_context else ""}
Documentation excerpt:"""
response = self.client.chat.completions.create(
model="gpt-4o-mini", # Fast model is fine for this
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
hypothetical_doc = response.choices[0].message.content
# Embed the hypothetical document instead of the query
hyde_embedding = self.encoder.encode(hypothetical_doc)
return hyde_embedding
def expand_with_multiple_hypotheses(
self,
query: str,
num_hypotheses: int = 3
) -> np.ndarray:
"""Generate multiple hypotheses and average embeddings."""
embeddings = []
for i in range(num_hypotheses):
prompt = f"""You are a technical documentation writer.
Write a documentation excerpt that answers this question.
Variation {i+1}: {"Focus on getting started." if i == 0 else "Focus on troubleshooting." if i == 1 else "Focus on advanced configuration."}
Question: {query}
Documentation excerpt:"""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
hypo = response.choices[0].message.content
embeddings.append(self.encoder.encode(hypo))
# Average embeddings
return np.mean(embeddings, axis=0)
When to use HyDE:
- High vocabulary mismatch (user questions vs. documentation style)
- Low initial retrieval quality
- Willing to accept ~200ms additional latency per query
When NOT to use HyDE:
- API reference queries (exact terms matter)
- Very short queries (not enough signal for generation)
- Latency-critical applications
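A usage sketch showing how the HyDE embedding can be scored against pre-computed chunk embeddings. Here chunks and chunk_embeddings are assumed to be parallel lists produced at index time; in practice you would query your vector store with the HyDE vector instead.
import numpy as np

expander = HyDEQueryExpander(client, encoder)
hyde_vec = expander.expand_with_hyde("why is my app crashing on startup?")

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by similarity to the hypothetical document, not the raw question.
scores = [(chunk, cosine(hyde_vec, emb)) for chunk, emb in zip(chunks, chunk_embeddings)]
top = sorted(scores, key=lambda x: -x[1])[:5]
for chunk, score in top:
    print(f"{score:.3f}  {chunk.doc_path}")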
HyPE: The 2025 Evolution
HyPE (Hypothetical Passage Embeddings) is a newer technique that flips HyDE's approach:
| Technique | When It Runs | What It Generates | Latency Impact |
|---|---|---|---|
| HyDE | Query time | Hypothetical document from query | +200ms per query |
| HyPE | Index time | Hypothetical queries from documents | Zero query-time cost |
How HyPE works: Instead of generating a hypothetical document for each query, HyPE pre-generates hypothetical queries that each document could answer during indexing. At query time, you're matching query-to-query rather than query-to-document.
HyPE results: Research shows HyPE improves retrieval precision by up to 42 percentage points and recall by up to 45 points on certain datasets—without any query-time generation cost.
class HyPEIndexer:
"""Index documents with hypothetical query embeddings."""
def __init__(self, client, encoder):
self.client = client
self.encoder = encoder
def generate_hypothetical_queries(self, doc_chunk: str, num_queries: int = 3) -> list[str]:
"""Generate queries this document could answer."""
prompt = f"""Given this documentation excerpt, generate {num_queries} questions
that a user might ask that this content would answer.
Documentation:
{doc_chunk}
Generate realistic search queries (not full sentences). Return one per line."""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
queries = response.choices[0].message.content.strip().split("\n")
return [q.strip() for q in queries if q.strip()][:num_queries]
def index_with_hype(self, chunks: list[DocChunk]) -> dict:
"""Create index with both document and hypothetical query embeddings."""
index = {"doc_embeddings": [], "query_embeddings": [], "chunks": []}
for chunk in chunks:
# Standard document embedding
doc_emb = self.encoder.encode(chunk.content)
index["doc_embeddings"].append(doc_emb)
# Generate and embed hypothetical queries
hypo_queries = self.generate_hypothetical_queries(chunk.content)
query_embs = [self.encoder.encode(q) for q in hypo_queries]
# Average the query embeddings
avg_query_emb = np.mean(query_embs, axis=0)
index["query_embeddings"].append(avg_query_emb)
index["chunks"].append(chunk)
return index
When to use HyPE over HyDE:
- Query latency is critical
- You have compute budget for indexing
- Documents are relatively static (re-indexing isn't frequent)
Combine both: Use HyPE for the initial retrieval stage, then HyDE for query expansion on complex queries that return few results.
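At query time, a HyPE lookup simply compares the query embedding against the stored hypothetical-query embeddings, with no generation in the loop. A minimal sketch against the index dict returned by index_with_hype (hype_index and encoder are assumptions):
import numpy as np

def hype_search(index: dict, query: str, encoder, top_k: int = 5):
    """Match the query embedding against pre-generated hypothetical-query embeddings."""
    query_vec = encoder.encode(query)
    scored = []
    for chunk, q_emb in zip(index["chunks"], index["query_embeddings"]):
        sim = float(np.dot(query_vec, q_emb) / (np.linalg.norm(query_vec) * np.linalg.norm(q_emb)))
        scored.append((chunk, sim))
    return sorted(scored, key=lambda x: -x[1])[:top_k]

results = hype_search(hype_index, "how do I rotate API keys?", encoder)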
Semantic Chunking
Instead of fixed-size chunks, semantic chunking splits at natural boundaries where the topic changes:
from sklearn.metrics.pairwise import cosine_similarity
class SemanticChunker:
"""Chunk documents based on semantic similarity between sentences."""
def __init__(self, encoder, similarity_threshold: float = 0.5):
self.encoder = encoder
self.similarity_threshold = similarity_threshold
def chunk_semantically(self, content: str, doc_path: str) -> list[DocChunk]:
"""Chunk content based on semantic breaks."""
# Split into sentences
sentences = self._split_sentences(content)
if len(sentences) < 2:
return [self._create_chunk(content, doc_path, 0)]
# Embed all sentences
embeddings = self.encoder.encode(sentences)
# Find semantic breaks
chunks = []
current_chunk = [sentences[0]]
current_start = 0
for i in range(1, len(sentences)):
# Compare current sentence to previous
sim = cosine_similarity(
[embeddings[i]],
[embeddings[i-1]]
)[0][0]
if sim < self.similarity_threshold:
# Semantic break detected
chunks.append(self._create_chunk(
" ".join(current_chunk),
doc_path,
current_start
))
current_chunk = [sentences[i]]
current_start = i
else:
current_chunk.append(sentences[i])
# Add final chunk
if current_chunk:
chunks.append(self._create_chunk(
" ".join(current_chunk),
doc_path,
current_start
))
return chunks
def _split_sentences(self, text: str) -> list[str]:
"""Split text into sentences."""
import re
sentences = re.split(r'(?<=[.!?])\s+', text)
return [s.strip() for s in sentences if s.strip()]
def _create_chunk(self, content: str, doc_path: str, index: int) -> DocChunk:
"""Create a DocChunk from content."""
return DocChunk(
id=f"{doc_path}:semantic:{index}",
content=content,
doc_path=doc_path,
section_hierarchy=[], # Would need to be passed in
chunk_type="text",
links=[]
)
When semantic chunking helps:
- Long-form documentation with multiple topics per section
- FAQ pages where each Q&A pair should be a chunk
- Changelog/release notes with distinct entries
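Usage has the same shape as the other chunkers in this guide; a quick sketch (encoder is the sentence-embedding model used throughout, and the file path is illustrative):
# Sketch: semantic chunking a long page before indexing.
chunker = SemanticChunker(encoder, similarity_threshold=0.5)
with open("docs/changelog.md") as f:
    content = f.read()
chunks = chunker.chunk_semantically(content, doc_path="docs/changelog.md")
print(f"Produced {len(chunks)} semantic chunks")
for chunk in chunks[:3]:
    print(chunk.id, "-", chunk.content[:80])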
Query Rewriting with LLMs
Sometimes the user's query needs transformation before search. Query rewriting handles ambiguity, expands context, and fixes common mistakes:
class QueryRewriter:
"""Rewrite queries for better retrieval."""
def __init__(self, client):
self.client = client
def rewrite(self, query: str, conversation_history: list = None) -> str:
"""Rewrite query for improved retrieval."""
system_prompt = """You are a search query optimizer for technical documentation.
Rewrite the user's query to be more searchable:
1. Expand abbreviations (e.g., "auth" → "authentication")
2. Add implicit context from conversation
3. Fix typos and unclear phrasing
4. Convert questions to documentation-style phrases
5. Keep technical terms precise
Return ONLY the rewritten query, nothing else."""
messages = [{"role": "system", "content": system_prompt}]
# Add conversation context if available
if conversation_history:
context = "\n".join([
f"{msg['role']}: {msg['content']}"
for msg in conversation_history[-3:] # Last 3 exchanges
])
messages.append({
"role": "user",
"content": f"Conversation context:\n{context}\n\nQuery to rewrite: {query}"
})
else:
messages.append({"role": "user", "content": f"Query to rewrite: {query}"})
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
max_tokens=100
)
return response.choices[0].message.content.strip()
def generate_search_variants(self, query: str, num_variants: int = 3) -> list[str]:
"""Generate multiple search query variants."""
prompt = f"""Generate {num_variants} different ways to search for documentation about:
"{query}"
Requirements:
- Each variant should capture a different aspect or phrasing
- Keep technical accuracy
- Focus on terms likely to appear in documentation
Return one query per line, no numbering or explanation."""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
variants = response.choices[0].message.content.strip().split("\n")
return [v.strip() for v in variants if v.strip()][:num_variants]
Query rewriting best practices:
- Preserve intent: Rewriting should clarify, not change the meaning. "fast API" → "FastAPI framework", not "high-performance API design".
- Use conversation context: If the user previously asked about Python and now asks "how do I install it?", expand to "how do I install FastAPI in Python".
- Don't over-expand: Adding too many terms dilutes the query. Limit to 2-3 key expansions.
- Cache rewrites: The same queries recur. Cache rewritten versions to avoid repeated LLM calls (see the sketch after this list).
- A/B test: Rewriting can hurt as well as help. Test whether it improves your specific metrics before deploying.
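A minimal caching wrapper around the rewriter above; the in-memory dict is a stand-in for Redis or another shared cache.
class CachedQueryRewriter:
    """Memoize query rewrites to avoid repeated LLM calls for recurring queries."""

    def __init__(self, rewriter: QueryRewriter):
        self.rewriter = rewriter
        self._cache: dict[str, str] = {}

    def rewrite(self, query: str) -> str:
        # Context-dependent rewrites (with conversation history) should bypass
        # this cache, since the same surface query can resolve differently.
        key = query.strip().lower()
        if key not in self._cache:
            self._cache[key] = self.rewriter.rewrite(query)
        return self._cache[key]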
Evaluation & Benchmarking
You can't improve what you don't measure. Systematic evaluation is essential for documentation search.
Key Metrics
| Metric | Definition | Target | How to Measure |
|---|---|---|---|
| Recall@k | % of relevant docs in top-k results | >80% | Human-labeled test set |
| MRR | Mean Reciprocal Rank of first relevant result | >0.7 | Human-labeled test set |
| Answer correctness | % of generated answers that are factually correct | >90% | Human review or LLM-as-judge |
| Answer completeness | % of answers that fully address the query | >75% | Human review |
| Latency P95 | 95th percentile response time | <2s | Production monitoring |
| Zero-result rate | % of queries with no results | <5% | Production monitoring |
Building a Test Set
A good test set is your most valuable evaluation asset. Build it from real queries:
from dataclasses import dataclass
from typing import Optional
@dataclass
class TestCase:
query: str
relevant_doc_ids: list[str] # Ground truth relevant documents
expected_answer_contains: list[str] # Key phrases answer should include
expected_intent: str
difficulty: str # "easy", "medium", "hard"
category: str # "howto", "reference", "troubleshoot", etc.
class TestSetBuilder:
"""Build and manage documentation search test sets."""
def __init__(self):
self.test_cases: list[TestCase] = []
def add_from_search_logs(
self,
queries: list[str],
clicked_docs: dict[str, list[str]],
min_clicks: int = 2
):
"""Create test cases from search logs with click data."""
for query in queries:
if query in clicked_docs and len(clicked_docs[query]) >= min_clicks:
# Assume clicked docs are relevant
self.test_cases.append(TestCase(
query=query,
relevant_doc_ids=clicked_docs[query],
expected_answer_contains=[], # Fill manually
expected_intent="unknown",
difficulty="medium",
category="from_logs"
))
def add_synthetic(
self,
doc_chunk: DocChunk,
num_queries: int = 3,
client = None
):
"""Generate synthetic test queries from a document chunk."""
prompt = f"""Given this documentation excerpt, generate {num_queries} realistic
user questions that this content would answer.
Documentation:
{doc_chunk.content}
Generate questions that a real user might ask, varying in:
- Phrasing (some formal, some casual)
- Specificity (some general, some very specific)
- Intent (how-to, what-is, troubleshooting)
Return one question per line."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
questions = response.choices[0].message.content.strip().split("\n")
for q in questions[:num_queries]:
self.test_cases.append(TestCase(
query=q.strip(),
relevant_doc_ids=[doc_chunk.id],
expected_answer_contains=[],
expected_intent="synthetic",
difficulty="medium",
category="synthetic"
))
def to_dict(self) -> list[dict]:
"""Export test set to JSON-serializable format."""
return [
{
"query": tc.query,
"relevant_doc_ids": tc.relevant_doc_ids,
"expected_answer_contains": tc.expected_answer_contains,
"expected_intent": tc.expected_intent,
"difficulty": tc.difficulty,
"category": tc.category
}
for tc in self.test_cases
]
Building effective test sets:
- Source from real queries: Search logs with click data are gold. Users who clicked and stayed found what they needed.
- Include failure cases: Deliberately add queries that historically failed. These catch regressions.
- Cover all intents: Ensure representation across how-to, reference, troubleshooting, and concept queries.
- Vary difficulty: Include obvious queries ("how to install") and hard ones ("why does X fail when Y is configured with Z?").
- Synthetic augmentation: Use LLMs to generate query variations from your best docs. Helps coverage but shouldn't replace real queries.
Recommended test set size: Start with 100-200 queries. Aim for 500+ for production systems. More is better for statistical significance.
Running Evaluations
from dataclasses import dataclass
@dataclass
class EvaluationResult:
recall_at_5: float
recall_at_10: float
mrr: float
avg_latency_ms: float
zero_result_rate: float
per_category_recall: dict[str, float]
class SearchEvaluator:
"""Evaluate documentation search quality."""
def __init__(self, retriever: MultiStepRetriever):
self.retriever = retriever
def evaluate(self, test_cases: list[TestCase]) -> EvaluationResult:
"""Run evaluation on test set."""
import time
recalls_5 = []
recalls_10 = []
reciprocal_ranks = []
latencies = []
zero_results = 0
category_recalls = {}
for tc in test_cases:
start = time.time()
result = self.retriever.retrieve(tc.query, max_results=10)
latency = (time.time() - start) * 1000
latencies.append(latency)
retrieved_ids = [c.id for c in result.chunks]
# Recall@5
relevant_in_5 = len(set(retrieved_ids[:5]) & set(tc.relevant_doc_ids))
recall_5 = relevant_in_5 / len(tc.relevant_doc_ids) if tc.relevant_doc_ids else 0
recalls_5.append(recall_5)
# Recall@10
relevant_in_10 = len(set(retrieved_ids[:10]) & set(tc.relevant_doc_ids))
recall_10 = relevant_in_10 / len(tc.relevant_doc_ids) if tc.relevant_doc_ids else 0
recalls_10.append(recall_10)
# MRR
rr = 0
for i, doc_id in enumerate(retrieved_ids):
if doc_id in tc.relevant_doc_ids:
rr = 1 / (i + 1)
break
reciprocal_ranks.append(rr)
# Zero results
if not result.chunks:
zero_results += 1
# Per-category tracking
if tc.category not in category_recalls:
category_recalls[tc.category] = []
category_recalls[tc.category].append(recall_10)
return EvaluationResult(
recall_at_5=sum(recalls_5) / len(recalls_5),
recall_at_10=sum(recalls_10) / len(recalls_10),
mrr=sum(reciprocal_ranks) / len(reciprocal_ranks),
avg_latency_ms=sum(latencies) / len(latencies),
zero_result_rate=zero_results / len(test_cases),
per_category_recall={
cat: sum(vals) / len(vals)
for cat, vals in category_recalls.items()
}
)
def compare_configs(
self,
test_cases: list[TestCase],
configs: dict[str, MultiStepRetriever]
) -> dict[str, EvaluationResult]:
"""Compare multiple retriever configurations."""
results = {}
for name, retriever in configs.items():
self.retriever = retriever
results[name] = self.evaluate(test_cases)
return results
Using the evaluator effectively:
- Run regularly: Evaluate weekly or after any retrieval changes. Catch regressions early.
- Track per-category: Overall metrics hide category-specific problems. If troubleshooting queries degrade while how-to improves, you need to know.
- Compare configurations: Use `compare_configs` to A/B test changes before deploying. Does adding reranking help? Does a new embedding model improve recall?
- Set alerts: If Recall@10 drops below 0.70 or latency exceeds 2s, alert the team.
- Investigate outliers: Look at specific queries that fail. Often a handful of edge cases tank your metrics.
LLM-as-Judge for Answer Quality
For evaluating generated answers, use an LLM judge:
class AnswerEvaluator:
"""Evaluate answer quality using LLM-as-judge."""
def __init__(self, client):
self.client = client
def evaluate_answer(
self,
query: str,
answer: str,
ground_truth_docs: list[str],
expected_contains: list[str] = None
) -> dict:
"""Evaluate a single answer."""
prompt = f"""Evaluate this documentation search answer.
Question: {query}
Answer provided:
{answer}
Ground truth documentation:
{chr(10).join(ground_truth_docs)}
Evaluate on these criteria (score 1-5 each):
1. **Correctness**: Is the answer factually accurate based on the documentation?
2. **Completeness**: Does it fully answer the question?
3. **Relevance**: Does it stay on topic without unnecessary information?
4. **Clarity**: Is it easy to understand?
5. **Grounding**: Are claims supported by the documentation (no hallucination)?
Also note:
- Any factual errors
- Missing important information
- Hallucinated content not in the docs
Return JSON:
{{
"correctness": <1-5>,
"completeness": <1-5>,
"relevance": <1-5>,
"clarity": <1-5>,
"grounding": <1-5>,
"errors": ["<error1>", ...],
"missing": ["<missing1>", ...],
"hallucinations": ["<hallucination1>", ...]
}}"""
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
import json
return json.loads(response.choices[0].message.content)
LLM-as-judge considerations:
- Cost: GPT-4o evaluation costs ~$0.01-0.05 per answer. Budget for 100-500 evaluations per release.
- Consistency: LLM judges can be inconsistent. Run each evaluation 2-3 times and average (a sketch appears below), or use a rubric-based prompt.
- Calibration: Validate LLM judgments against human ratings on a sample. Adjust prompts if they diverge.
- Specific criteria: Generic "rate this answer" is less useful than specific criteria (correctness, completeness, grounding).
- Failure analysis: The `errors`, `missing`, and `hallucinations` fields are actionable. Use them to identify systematic issues.
When to use human evaluation instead: For high-stakes changes, sample 50-100 answers for human review. LLM judges are good for scale but miss subtle quality issues.
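To smooth out judge inconsistency, run each evaluation a few times and average the scores, as suggested above. A minimal sketch:
def evaluate_with_repeats(
    evaluator: AnswerEvaluator,
    query: str,
    answer: str,
    ground_truth_docs: list[str],
    runs: int = 3,
) -> dict:
    """Average repeated LLM-as-judge runs to reduce variance in the scores."""
    criteria = ["correctness", "completeness", "relevance", "clarity", "grounding"]
    results = [evaluator.evaluate_answer(query, answer, ground_truth_docs) for _ in range(runs)]
    return {c: sum(r[c] for r in results) / runs for c in criteria}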
Regression Testing in CI/CD
Add search quality checks to your deployment pipeline:
# test_search_quality.py
import json

import pytest

from search_evaluator import SearchEvaluator, TestCase


@pytest.fixture
def test_set():
    """Load test set from file."""
    with open("test_cases.json") as f:
        data = json.load(f)
    return [TestCase(**tc) for tc in data]


@pytest.fixture
def evaluator():
    """Initialize evaluator with production config."""
    from search_service import retriever
    return SearchEvaluator(retriever)


def test_recall_at_10(evaluator, test_set):
    """Recall@10 should be above threshold."""
    result = evaluator.evaluate(test_set)
    assert result.recall_at_10 >= 0.75, f"Recall@10 dropped to {result.recall_at_10}"


def test_mrr(evaluator, test_set):
    """MRR should be above threshold."""
    result = evaluator.evaluate(test_set)
    assert result.mrr >= 0.65, f"MRR dropped to {result.mrr}"


def test_latency(evaluator, test_set):
    """Average latency should be under 2 seconds."""
    result = evaluator.evaluate(test_set)
    assert result.avg_latency_ms < 2000, f"Latency increased to {result.avg_latency_ms}ms"


def test_zero_results(evaluator, test_set):
    """Zero-result rate should be low."""
    result = evaluator.evaluate(test_set)
    assert result.zero_result_rate < 0.05, f"Zero-result rate is {result.zero_result_rate}"
CI/CD integration tips:
- Fast feedback: Run a small test set (50 queries) on every PR. Full evaluation on merge to main.
- Baseline tracking: Store baseline metrics. Alert when new code degrades by >5% on any metric (see the sketch after the workflow below).
- Block deploys: Make tests required. Don't deploy if search quality regresses.
- Separate index tests: Test indexing separately from retrieval. Catch chunking bugs before they affect search.
- Environment parity: Run tests against a staging index that mirrors production. Mock LLMs with cached responses for speed.
# Example GitHub Actions workflow
name: Search Quality Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        # Assumes dependencies are pinned in a requirements.txt at the repo root.
        run: pip install -r requirements.txt
      - name: Run search quality tests
        run: pytest tests/test_search_quality.py -v
      - name: Upload metrics
        if: github.ref == 'refs/heads/main'
        run: python scripts/upload_metrics.py
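The upload step above is a placeholder. One hedged way to implement baseline tracking is a small script that compares the current run against stored main-branch metrics and fails the job on a >5% degradation; the file paths and metric names here are assumptions:

# scripts/compare_to_baseline.py (sketch)
import json
import sys

TOLERANCE = 0.05  # fail if any metric is more than 5% worse than baseline

def compare(current_path="metrics/current.json", baseline_path="metrics/baseline.json"):
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    failures = []
    for key in ("recall_at_10", "mrr"):  # higher is better
        if current[key] < baseline[key] * (1 - TOLERANCE):
            failures.append(f"{key}: {baseline[key]:.3f} -> {current[key]:.3f}")
    if current["avg_latency_ms"] > baseline["avg_latency_ms"] * (1 + TOLERANCE):  # lower is better
        failures.append(
            f"avg_latency_ms: {baseline['avg_latency_ms']:.0f} -> {current['avg_latency_ms']:.0f}"
        )

    if failures:
        print("Search quality regressed:")
        print("\n".join(failures))
        sys.exit(1)
    print("No regression against baseline.")

if __name__ == "__main__":
    compare()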
Worked Example: End-to-End Query
Let's trace a real query through the entire system to see how all components work together.
The Query
User asks: "Why is my API returning 401 even though I set the authentication header?"
Step 1: Query Analysis
analysis = query_analyzer.analyze(query)
# Returns:
# QueryAnalysis(
# original_query="Why is my API returning 401...",
# intent=QueryIntent.TROUBLESHOOT,
# entities=["API", "401", "authentication header"],
# expanded_queries=[
# "authentication header not working",
# "401 unauthorized error",
# "API authentication troubleshooting"
# ],
# sub_queries=[
# "What causes 401 errors?",
# "How to set authentication headers correctly?",
# "Common authentication mistakes"
# ],
# expected_content_types=["text", "code", "warning"],
# confidence=0.85
# )
What happened: The analyzer:
- Identified this as a troubleshooting query
- Extracted key entities: "401", "authentication header"
- Generated alternative phrasings that might match docs better
- Created sub-queries to find related information
Step 2: Multi-Source Retrieval
# For each expanded query, we search:
# Original: "Why is my API returning 401..."
# Expanded 1: "authentication header not working"
# Expanded 2: "401 unauthorized error"
# Expanded 3: "API authentication troubleshooting"
# Each query gets:
semantic_results = index.search(query, top_k=20) # ~50 results total
keyword_results = index.keyword_search(query, top_k=10) # ~30 results
# Combined: ~80 candidates before deduplication
Step 3: Merge & Deduplicate
# After merging, we have ~35 unique chunks
# Top candidates:
# 1. "Troubleshooting > Authentication > 401 Errors" (score: 0.89)
# 2. "API Reference > Authentication > Headers" (score: 0.85)
# 3. "Getting Started > Authentication" (score: 0.78)
# 4. "FAQ > Why am I getting 401?" (score: 0.76)
Step 4: Content Type Filtering
# Intent is TROUBLESHOOT, so we prioritize:
# - warning/note chunks (common mistakes)
# - code chunks (correct usage examples)
# - text chunks explaining error causes
# Filter keeps 28 chunks, drops 7 unrelated tables/diagrams
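A hedged sketch of that filter, assuming each chunk carries a `content_type` field and the intent is available as a lowercase string (the mapping here is illustrative, not the system's actual taxonomy):

PREFERRED_TYPES = {
    "troubleshoot": {"text", "code", "warning", "note"},
    "how_to": {"text", "code"},
    "reference": {"text", "code", "table"},
}

def filter_by_intent(chunks, intent):
    """Keep chunks whose content type matches the query intent."""
    allowed = PREFERRED_TYPES.get(intent)
    if not allowed:
        return chunks
    filtered = [c for c in chunks if c.content_type in allowed]
    # Never filter everything away; fall back to the unfiltered set.
    return filtered or chunks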
Step 5: Link Expansion
# Top result "401 Errors" links to:
# - "Token Expiration" section
# - "API Key vs OAuth" comparison
# - "Environment Variables" setup
# We fetch these linked chunks with 0.8 score decay
# Now have 35 candidates
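Link expansion with score decay can be as simple as the sketch below; the `links` attribute and the `index.get_chunk` helper are assumptions about the index interface:

def expand_links(index, top_results, decay=0.8, max_links_per_result=5):
    """Pull in chunks linked from top results, with decayed scores."""
    seen = {r.chunk_id for r in top_results}
    expanded = []
    for result in top_results:
        for linked_id in getattr(result, "links", [])[:max_links_per_result]:
            if linked_id in seen:
                continue
            linked = index.get_chunk(linked_id)  # assumed lookup by chunk ID
            if linked is not None:
                linked.score = result.score * decay  # linked content is supporting, not primary
                expanded.append(linked)
                seen.add(linked_id)
    return top_results + expanded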
Step 6: Reranking
reranker = DocumentationReranker()
reranked = reranker.rerank(query, merged[:30])
# After reranking, order changes:
# 1. "FAQ > Why am I getting 401?" (was #4, now #1)
# - Cross-encoder recognized this directly answers the question
# 2. "Troubleshooting > 401 Errors" (was #1, now #2)
# 3. "Common Mistakes > Authentication" (was #8, now #3)
# - Contains "forgetting to add Bearer prefix" - highly relevant
Step 7: Answer Generation
answer = generator.generate(query, retrieval_result)
# Generated answer:
"""
A 401 error despite setting the authentication header usually indicates one of these issues:
1. **Missing "Bearer" prefix**: The header should be `Authorization: Bearer <token>`, not just the token [1].
2. **Expired token**: Tokens expire after 1 hour by default. Check `exp` claim in your JWT [2].
3. **Wrong header name**: Use `Authorization`, not `Auth` or `X-Auth-Token` [1].
4. **Environment mismatch**: Make sure you're using production keys for production API, not test keys [3].
To debug, check the response body—it includes a specific error code indicating which check failed [2].
"""
# Citations:
# [1] FAQ > Why am I getting 401?
# [2] Troubleshooting > Authentication > 401 Errors
# [3] Common Mistakes > Authentication
# Confidence: 0.87 (high - multiple relevant sources, specific answer)
What Made This Work
- Query expansion found "401 unauthorized" even though the user said "returning 401"
- Hybrid search caught the exact "401" keyword that semantic search might have generalized
- Reranking promoted the FAQ entry that directly addressed the question
- Citation tracking let us point to specific sources for each claim
What Could Go Wrong
| Failure Mode | Symptom | Fix |
|---|---|---|
| No results for "401" | User wrote "4O1" (letter O) | Add fuzzy matching for error codes |
| Wrong auth method | Docs have OAuth, user uses API keys | Add sub-query for auth method detection |
| Outdated info | Token expiry changed from 1hr to 24hr | Ensure index is updated with doc changes |
| Missing context | User is on v1, docs default to v2 | Add version detection/filtering |
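For the first failure mode, a small normalization pass over the query can rescue mistyped status codes before retrieval; this sketch is standalone and not tied to any component above:

import re

def normalize_error_codes(query: str) -> str:
    """Fix common typos in HTTP status codes, e.g. '4O1' (letter O) becomes '401'."""
    def fix(match):
        return match.group(0).upper().replace("O", "0").replace("I", "1").replace("L", "1")
    # Three-character tokens starting with 4 or 5 that mix digits with look-alike letters.
    return re.sub(r"\b[45][0-9OoIl]{2}\b", fix, query)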
Conclusion
Building intelligent documentation search requires combining multiple techniques that each solve a specific problem:
| Technique | Problem It Solves |
|---|---|
| Documentation-aware chunking | Preserves structure, code blocks, and cross-references that naive chunking destroys |
| Query understanding | Bridges the vocabulary gap between how users ask and how docs are written |
| Multi-step retrieval | Handles complex questions that need information from multiple sections |
| Hybrid search | Catches both conceptual matches (semantic) and exact technical terms (keyword) |
| Reranking | Improves precision after recall-optimized initial retrieval |
| Answer generation | Synthesizes coherent, cited responses from scattered chunks |
| Feedback loops | Continuously improves quality based on real user signals |
Implementation Roadmap
Phase 1 - Foundation (1-2 weeks):
- Implement documentation-aware chunking with hierarchy preservation
- Build basic semantic search index
- Create simple answer generation
Phase 2 - Quality (2-3 weeks):
- Add query understanding (intent classification, entity extraction)
- Implement hybrid search (semantic + keyword)
- Add reranking with cross-encoder
- Build basic analytics
Phase 3 - Production (2-3 weeks):
- Add caching layer
- Implement incremental index updates
- Build feedback collection system
- Set up monitoring and alerting
Phase 4 - Optimization (ongoing):
- Analyze failed queries and documentation gaps
- Tune retrieval parameters based on feedback
- A/B test improvements
- Add version support if needed
Key Metrics to Track
Measure success with these metrics (a minimal tracking sketch follows the list):
- Answer helpfulness rate: % of answers users mark as helpful (target: >70%)
- Zero-result rate: % of queries with no results (target: <5%)
- P95 latency: 95th percentile response time (target: <2s with answer)
- Retrieval precision@5: % of top-5 results that are relevant (target: >60%)
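These numbers fall out of a simple log of search events; a hedged sketch, where the event fields are assumptions about what you record per query:

from dataclasses import dataclass

@dataclass
class SearchEvent:
    """One logged search. Field names are illustrative."""
    latency_ms: float
    num_results: int
    feedback: str | None = None  # "helpful", "not_helpful", or None

def summarize(events: list[SearchEvent]) -> dict:
    """Compute helpfulness rate, zero-result rate, and P95 latency from logged events."""
    rated = [e for e in events if e.feedback is not None]
    latencies = sorted(e.latency_ms for e in events)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "helpfulness_rate": sum(e.feedback == "helpful" for e in rated) / len(rated) if rated else None,
        "zero_result_rate": sum(e.num_results == 0 for e in events) / len(events) if events else 0.0,
        "p95_latency_ms": p95,
        # Precision@5 needs relevance labels; compute it offline with the evaluator above.
    }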
Start with basic semantic search on well-chunked docs. Add query analysis, hybrid search, and reranking as you validate the approach. Monitor failed queries to identify gaps in documentation or retrieval. The best documentation search systems aren't built—they're grown through continuous iteration based on user feedback.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach
A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.