Contextual Retrieval: Solving RAG's Hidden Context Problem
How prepending chunk-specific context before embedding dramatically improves retrieval quality. Complete guide covering contextual embeddings, contextual BM25, reranking, and prompt caching optimization.
Introduction
Modern AI applications face a fundamental challenge: they need access to information that wasn't part of their training data. Consider the breadth of specialized knowledge required by real-world applications. A customer support chatbot needs intimate familiarity with your company's specific return policies, product specifications, and troubleshooting procedures—information that exists only in your internal documentation. A legal research assistant must search through decades of case law, statutes, and regulatory filings to find precedents relevant to a specific situation. A financial analyst tool needs access to the latest quarterly earnings reports, SEC filings, and market analyses that were published long after any model's training cutoff.
The traditional approach to giving AI systems specialized knowledge was fine-tuning: taking a pre-trained model and further training it on domain-specific data. But fine-tuning has significant limitations. It's expensive, requiring substantial compute resources for each domain adaptation. It's slow, taking days or weeks to complete. It's inflexible—when your documentation changes, you need to fine-tune again. And perhaps most importantly, it struggles with factual recall; fine-tuning is better at adapting a model's style or behavior than at reliably encoding specific facts that can be retrieved on demand.
Retrieval-Augmented Generation (RAG) emerged as an elegant alternative. Instead of baking knowledge into model weights through training, RAG systems retrieve relevant information at query time and provide it directly in the prompt. When a user asks a question, the system searches a knowledge base for relevant passages, then provides those passages to the language model as context alongside the question. The model generates its response grounded in this retrieved information, effectively gaining access to vast knowledge bases without any additional training.
The architecture is compelling in its simplicity: separate the concerns of knowledge storage (in a searchable database) from knowledge application (in the language model). This separation brings immediate benefits. Knowledge can be updated instantly by modifying the database, without touching the model. The same model can serve multiple domains by switching which knowledge base it queries. And the system provides natural citations—you know exactly which documents informed each response.
But RAG systems harbor a hidden failure mode that undermines their effectiveness in practice. The problem emerges from a fundamental tension between how documents are written and how retrieval systems process them.
The Hidden Failure Mode
Documents are written for human readers who maintain context as they read. An earnings report begins by establishing that it covers "Acme Corporation's Third Quarter 2024 Financial Results." Every subsequent page builds on this context. When the report later states "The company's revenue grew by 3% over the previous quarter," human readers understand immediately that "the company" means Acme Corporation, "the previous quarter" means Q2 2024, and this growth figure relates to the Q3 2024 period established at the outset.
RAG systems, however, don't read documents sequentially and holistically. They break documents into chunks—smaller segments that can be individually indexed and retrieved. This chunking is necessary (we'll explore why shortly), but it severs the threads of context that connect different parts of a document.
When that sentence about 3% revenue growth gets chunked and stored in isolation, the critical context evaporates. The chunk becomes an orphan, separated from the document structure that gave it meaning. An embedding model processing this chunk encodes what's there: something about revenue growth, percentage figures, quarterly comparison. But which company? What time period? The embedding can't capture information that simply isn't present in the text.
Now consider what happens when a user asks: "How did Acme Corp perform in Q3 2024?" The query embedding captures the key concepts: Acme Corp, Q3, 2024, performance. But the chunk embedding for that highly relevant sentence lacks "Acme," lacks "Q3 2024," lacks any specific identifying information. The semantic similarity between query and chunk is weaker than it should be. The chunk might rank below less relevant results—or fail to appear in the top results entirely.
This is the context problem, and it's more pervasive than it might initially appear. Anthropic's research across diverse knowledge bases found that traditional RAG systems fail to retrieve relevant information roughly 5.7% of the time. That percentage might seem small, but consider its implications. If a complex question requires information from three different chunks, and each has a 5.7% chance of retrieval failure, the probability that at least one fails is approximately 16%. For knowledge-intensive applications where accuracy matters, a 16% failure rate on multi-part questions represents a serious limitation.
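A quick back-of-the-envelope check of that compounding effect, as a small Python sketch (assuming the three retrievals fail independently at the 5.7% rate cited above):

```python
# Probability that at least one of three needed chunks is missed,
# assuming each retrieval fails independently with probability 5.7%.
p_fail = 0.057
p_any_miss = 1 - (1 - p_fail) ** 3
print(f"{p_any_miss:.1%}")  # ~16.1%
```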
The Elegant Solution
Contextual Retrieval addresses this problem through a disarmingly simple insight: if the context isn't in the chunk, put it there before embedding.
For each chunk in your knowledge base, you use a language model to generate a brief context snippet—typically 50-100 tokens—that situates the chunk within its source document. This context identifies where the chunk comes from, what topic it addresses, and clarifies any ambiguous references. The context is then prepended to the original chunk text, and this combined text is what gets embedded and indexed.
That orphaned sentence about revenue growth becomes:
"This chunk is from Acme Corporation's Q3 2024 earnings report, specifically the Revenue Performance section discussing quarterly growth against the company's fiscal year targets. The company's revenue grew by 3% over the previous quarter."
Now when this enhanced chunk is embedded, the vector captures the full meaning: Acme Corporation, Q3 2024, earnings report, revenue performance, quarterly growth—all the context needed to match relevant queries accurately.
The results validate the approach dramatically. In Anthropic's benchmarks across diverse knowledge bases—codebases, fiction, academic papers, technical documentation—contextual retrieval combined with complementary techniques reduced retrieval failures by up to 67%. That 5.7% baseline failure rate dropped to under 2%.
This post provides a comprehensive guide to understanding and implementing contextual retrieval. We'll explore the underlying mechanics of why RAG systems struggle with context, how the contextual retrieval technique works in detail, how to combine it with other retrieval methods for maximum effectiveness, and practical considerations for production implementation.
Understanding the RAG Architecture
Before diving into contextual retrieval, we need to understand the RAG pipeline in detail—specifically, why chunking is necessary and how it creates the context problem we're trying to solve.
Why Documents Must Be Chunked
RAG systems can't work with whole documents for two fundamental reasons, each stemming from different constraints in the architecture.
The Context Window Constraint
Language models process text through fixed-size context windows. Even the largest models today support perhaps 200,000 tokens in a single context—impressive, but finite. A typical enterprise knowledge base might contain millions of documents, totaling billions of tokens. You simply cannot concatenate your entire knowledge base into a prompt.
But the constraint runs deeper than total size. Even if you could fit everything, you wouldn't want to. Language models exhibit well-documented degradation when processing very long contexts. Important information in the middle of long contexts tends to get "lost"—the model attends more strongly to the beginning and end. Given a 100,000-token context with a critical 200-token passage buried in the middle, the model may fail to surface that passage's information in its response.
RAG solves this by being selective: instead of providing everything, provide only the most relevant information for each specific query. This requires breaking documents into retrievable units—chunks—that can be selectively included based on their relevance to the query at hand.
The Retrieval Precision Problem
Imagine embedding an entire 50-page financial report as a single vector. That vector would represent the "average meaning" across all 50 pages—the overview sections, the detailed financials, the risk factors, the management discussion, the footnotes. It would be a diffuse representation, a little bit about everything but not strongly about anything specific.
Now imagine a user asks about a specific risk factor mentioned briefly on page 37. The query embedding would be focused and specific: this particular risk, its implications, the company's mitigation strategy. The document embedding would be diffuse and general. The semantic similarity between them might be mediocre—not because the document lacks the relevant information, but because that information is diluted across 50 pages of other content.
Chunking solves this by creating focused retrieval units. A 500-token chunk from page 37 that specifically discusses that risk factor would have a much stronger embedding match with the query. The chunk's embedding is concentrated on exactly what the user is asking about.
This is why all production RAG systems chunk their documents. The question isn't whether to chunk, but how—and as we'll see, the how creates the context problem.
The Standard RAG Pipeline
A typical RAG system processes information through two distinct phases: an indexing phase (processing documents in advance) and a query phase (responding to user questions in real-time).
The Indexing Phase
The journey begins with document ingestion. Documents arrive from various sources—file systems containing PDFs and Word documents, content management systems, databases, web crawlers, API integrations with third-party services. These documents come in diverse formats and must first be converted into plain text that can be processed further.
Next comes chunking, the critical step where documents are divided into smaller segments. Various strategies exist: fixed-size chunking (every N tokens), sentence-based chunking (respecting sentence boundaries), paragraph-based chunking (respecting paragraph structure), or semantic chunking (attempting to identify topically coherent segments). Most implementations use overlapping chunks, where each chunk shares some content with its neighbors, to avoid losing information at boundaries.
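As a rough illustration of the fixed-size strategy with overlap, here is a minimal sketch (it counts words as a proxy for tokens; a production chunker would count tokens and respect sentence or paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sizes measured in words)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```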
Each chunk is then passed through an embedding model—a neural network trained to convert text into dense vector representations. These vectors, typically 768 to 3072 dimensions, capture the semantic meaning of the text in a form that allows mathematical comparison. Texts with similar meanings produce vectors that are close together in the high-dimensional vector space.
Finally, these vectors are stored in a vector database—specialized systems like Pinecone, Weaviate, Qdrant, Milvus, or pgvector that support efficient similarity search across millions or billions of vectors.
The Query Phase
When a user asks a question, that query is first embedded using the same embedding model used for the chunks. This produces a query vector in the same semantic space as the chunk vectors.
The system then performs similarity search, finding the chunk vectors most similar to the query vector. This typically uses approximate nearest neighbor algorithms that can search across millions of vectors in milliseconds. The result is a ranked list of chunks, ordered by their semantic similarity to the query.
The top-ranked chunks—typically 3 to 20 depending on the application—are then assembled into a context that's provided to the language model alongside the original query. The model generates a response grounded in this retrieved context, ideally citing or referencing the source material.
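Put together, a bare-bones version of this pipeline might look like the sketch below, using sentence-transformers and an in-memory index as stand-ins for whatever embedding model and vector database you actually run:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Indexing phase: embed every chunk once and store the vectors.
chunks = ["...chunk 1 text...", "...chunk 2 text...", "...chunk 3 text..."]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query phase: embed the query with the same model and rank chunks by similarity.
query = "How did Acme Corp perform in Q3 2024?"
query_vector = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector          # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:3]           # indices of the most similar chunks
retrieved = [chunks[i] for i in top_k]         # context to hand to the LLM
```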
Where Context Gets Lost
The context problem emerges at the chunking stage of indexing. When a document is split into chunks, each chunk becomes an isolated unit. Information that was implicitly understood from the document's structure—what document this is, what section, what entities were established earlier, what time period is being discussed—vanishes from individual chunks.
The embedding model can only encode what's present in the chunk text. It has no access to the original document, no knowledge of what came before or after, no understanding of the document's metadata or structure. It encodes the chunk as-is, with all its implicit references rendered meaningless.
This encoded ambiguity then propagates through the entire retrieval pipeline. The vector database faithfully stores these ambiguous vectors. The similarity search faithfully compares them to query vectors. But the comparison is fundamentally compromised because the chunk vectors don't represent what the chunks actually mean—they represent only the context-free surface text.
The Context Problem in Depth
Understanding the context problem requires examining both the patterns of context loss and their downstream effects on retrieval quality.
Patterns of Context Loss
Context loss in RAG systems isn't random—it follows predictable patterns based on how documents are written. Recognizing these patterns helps you understand where contextual retrieval provides the most value and how to customize context generation for your specific documents.
Entity Reference Patterns
Documents introduce entities—companies, people, products, concepts—and then refer back to them using pronouns, shortened names, or generic descriptors. A document might introduce "Anthropic, an AI safety company based in San Francisco" and thereafter refer to it as "the company," "Anthropic," "the firm," or simply "they."
When chunks contain only the referential forms, the embedding captures generic concepts rather than specific entities. "The company reported strong Q3 results" embeds as something about companies and quarterly results—not specifically about Anthropic's Q3 results. A query about "Anthropic Q3 performance" may not match well because "Anthropic" is semantically distant from "the company" without the establishing context.
This pattern is especially problematic in documents that discuss multiple entities. A contract between Acme Corporation and Widgets Inc. might refer to them as "the Buyer" and "the Seller" throughout. Chunks containing "the Seller's obligations" lose all information about which specific company this concerns.
Temporal Reference Patterns
Documents exist in time, but that temporal context is often established once and then assumed throughout. An earnings report dated January 2025 discussing "this quarter" is clearly referring to Q4 2024. A research paper from March 2024 discussing "recent developments" means developments circa early 2024.
Chunks inherit the document's temporal context implicitly but don't carry it explicitly. "This quarter's results exceeded expectations" tells you something about quarterly results but nothing about which quarter in which year. Without temporal grounding, the chunk becomes ambiguously relevant to any query about any quarter.
This pattern creates particular problems for knowledge bases spanning extended time periods. If you have five years of quarterly reports, chunks about "this quarter" from different reports will all lack the temporal specificity needed to distinguish them in retrieval.
Structural Context Patterns
Document structure carries meaning. Text under a "Risks and Uncertainties" heading has very different implications than identical text under an "Opportunities" heading. A section titled "Proposed Solution" versus "Rejected Approaches" changes how the content should be interpreted.
This structural context is typically lost in chunking. A chunk stating "Market volatility could significantly impact revenues" reads very differently depending on whether it's from a risk disclosure (this is a concern) or an optimistic projection (despite this, we expect growth). The embedding captures the surface content but not the structural framing that shapes its meaning.
Document Type Patterns
The same text carries different weight depending on document type. "The total shall not exceed $1,000,000" is a binding commitment in a signed contract, a preliminary estimate in a project proposal, and a hypothetical scenario in a strategy document. "The patient should take this medication twice daily" is a definitive instruction in a prescription, a general guideline in an educational article, and a research finding in a clinical study.
Document type is almost always established at the document level—a header, a file name, metadata—and almost never repeated within individual chunks. Yet document type profoundly affects how content should be interpreted and how relevant it is to different queries.
Cross-Reference Patterns
Documents reference their own structure: "As discussed in Section 3," "Building on the previous analysis," "The following table illustrates." These cross-references point to context that exists elsewhere in the document but not in the current chunk.
When chunks contain dangling references, they become difficult to interpret in isolation. "This approach addresses all three concerns" makes sense when you know what three concerns were just listed, but becomes opaque when the chunk stands alone.
The Retrieval Failure Cascade
Context loss doesn't merely reduce retrieval quality—it initiates a cascade of failures that compound through the RAG pipeline.
Stage 1: Embedding Degradation
The cascade begins at embedding time. An embedding model processes the decontextualized chunk and produces a vector representing its understanding of the text's meaning. But without context, that understanding is necessarily incomplete and often incorrect.
Consider "The company exceeded this target by 15%." The embedding model recognizes: something about a company, exceeding, targets, a percentage. It produces a vector in the general semantic neighborhood of "company performance metrics" and "target achievement." But this vector lacks specificity—it doesn't cluster with other Acme Corporation content, doesn't associate with Q3 2024, doesn't connect to revenue metrics specifically.
The embedding is doing exactly what it's designed to do: encoding the semantic content present in the text. The problem is that the important semantic content isn't present.
Stage 2: Retrieval Mismatch
At query time, the user's question is embedded into the same semantic space. If the user asks "How did Acme perform against their Q3 2024 revenue targets?", the query embedding strongly encodes: Acme, performance, Q3 2024, revenue, targets.
Now the similarity search begins. The chunk about exceeding targets by 15% is highly relevant—it's exactly what the user wants. But the chunk embedding lacks "Acme," lacks "Q3 2024," lacks "revenue." The overlap between query and chunk embeddings is partial: "targets" matches, "exceeding" relates to "perform," percentages suggest metrics. But the specific identifying features that would make this a strong match are missing.
The similarity score ends up moderate rather than high. In the ranking of all chunks, this critical result appears lower than it should.
Stage 3: Ranking Displacement
A chunk ranking lower means other chunks rank higher. Consider what might outrank our relevant chunk: a general article about "how companies set revenue targets," a different company's Q3 report that explicitly mentions "Acme" as a competitor, an analyst's commentary that uses the words "Acme" and "revenue targets" but discusses industry trends rather than Acme's specific results.
These alternatives might score higher because they contain more of the query's key terms, even though they're less relevant. The semantic similarity calculation cannot compensate for the missing context in the truly relevant chunk.
Stage 4: Context Assembly Failure
RAG systems typically retrieve the top-k chunks—perhaps the top 5, 10, or 20 depending on configuration. If the relevant chunk has been displaced beyond this cutoff, it doesn't get included in the context provided to the language model.
The LLM now generates a response based on whatever chunks did make the cutoff. In the best case, these chunks contain related but incomplete information, leading to a partial or vague answer. In the worst case, the LLM confidently generates a response based on the less-relevant chunks that did get retrieved, producing an answer that seems authoritative but misses or contradicts the actual information the user needed.
Stage 5: User Trust Erosion
When users repeatedly encounter failures to find information they know exists—or worse, receive confidently wrong answers—trust in the RAG system erodes. Users may begin double-checking answers externally, defeating the purpose of the system. They may stop using it for important queries, limiting it to casual or low-stakes questions. Eventually, they may abandon it entirely.
The insidious aspect of context-related failures is that they're difficult to diagnose. The system doesn't error out; it returns results, generates responses, appears to be working. The failures are silent—wrong information rather than no information. Users experience the system as unreliable without necessarily understanding why.
Quantifying the Impact
Anthropic's research provides concrete measurements of context-related retrieval failures across diverse knowledge bases spanning different domains and document types.
Testing across various knowledge domains—including codebases, fiction, ArXiv papers, and science articles—they found that standard embedding-based retrieval (using top-20 results) failed to retrieve relevant passages approximately 5.7% of the time. This baseline failure rate represents the starting point before any contextual enhancement.
The failure rate varies significantly by document type. Financial reports, with their dense entity references and temporal markers, showed higher failure rates. Legal documents, with their complex cross-references and defined terms, proved particularly challenging. Technical documentation, with product codes and acronyms established in glossaries and used throughout, created retrieval difficulties when chunks lacked those definitional sections.
Conversely, FAQ pages and support articles—content specifically written to be self-contained and searchable—showed lower baseline failure rates. When content is already optimized for retrieval, the marginal benefit of contextual retrieval is smaller.
The 5.7% average masks significant variance. For complex, context-dependent documents, failure rates can exceed 10%. For simple, self-contained content, they might be under 3%. Understanding your specific document types helps predict how much improvement contextual retrieval might provide.
How Contextual Retrieval Works
Contextual retrieval is conceptually simple: prepend situating context to each chunk before embedding. But the simplicity of the concept belies important nuances in how context should be generated, what it should contain, and why this approach is so effective.
The Core Mechanism
The technique modifies the indexing phase of the RAG pipeline. Instead of embedding raw chunks, you embed contextualized chunks—the original chunk text with a generated context prefix.
For each chunk in your knowledge base, the process works as follows:
Step 1: Context Generation
You send both the full source document and the specific chunk to a language model with a prompt asking it to generate situating context. The model has access to the entire document, so it can identify key contextual information that the chunk alone doesn't contain.
The prompt is straightforward:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
The language model, with its ability to read and understand both document and chunk, generates a brief context snippet—typically 50 to 100 tokens—that captures the most important missing context.
Step 2: Context Prepending
The generated context is prepended to the original chunk, creating a new combined text that will be indexed. The structure is typically:
[Generated context providing document identity, section, and reference resolution]
[Original chunk content unchanged]
The original chunk content remains intact; we're adding to it, not modifying it. This preserves the original information while supplementing it with context.
Step 3: Embedding the Contextualized Chunk
The combined context-plus-chunk text is then passed through your embedding model. The embedding now encodes both the original chunk content and the contextual information. The resulting vector captures a more complete and accurate representation of what the chunk actually means.
Step 4: Indexing
The contextualized embedding is stored in your vector database, associated with the original chunk content (which is what you'll ultimately provide to the LLM) and any relevant metadata.
During retrieval, queries are compared against these contextualized embeddings, improving the likelihood of matching chunks to relevant queries. When chunks are retrieved, you can choose to provide either the original content or the contextualized content to the LLM for generation.
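Sketched as code, the indexing loop looks something like the following. Here generate_context() is a hypothetical wrapper around the LLM call using the prompt above (one concrete version appears in the prompt caching section later), and embed() stands in for your embedding model:

```python
def contextualize_and_index(document_text: str, chunks: list[str]) -> list[dict]:
    """Generate situating context for each chunk, prepend it, and embed the result."""
    records = []
    for chunk in chunks:
        # Step 1: ask the LLM for 50-100 tokens of situating context,
        # giving it both the whole document and this chunk.
        context = generate_context(document_text, chunk)   # hypothetical LLM wrapper

        # Step 2: prepend the context to the original chunk text.
        contextualized = f"{context}\n\n{chunk}"

        # Step 3: embed the contextualized text.
        vector = embed(contextualized)                      # hypothetical embedding call

        # Step 4: store the vector alongside the original chunk, which is
        # what you will ultimately hand to the LLM at query time.
        records.append({
            "original_text": chunk,
            "contextualized_text": contextualized,
            "embedding": vector,
        })
    return records
```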
A Complete Worked Example
Let's trace through a concrete example to make the mechanism tangible.
The Source Document:
Consider an excerpt from a quarterly earnings report:
ACME CORPORATION
Third Quarter 2024 Financial Results
For the Period Ending September 30, 2024
Letter from the CEO
Dear Shareholders,
I am pleased to report that Acme Corporation delivered exceptional
performance in the third quarter of 2024, exceeding our expectations
across all major business segments...
[Several pages of CEO letter]
Revenue Performance
The company set ambitious growth targets for fiscal year 2024, aiming
for 20% year-over-year revenue growth across all segments combined.
This target reflected our confidence in expanding market demand for
our core products and the successful integration of last year's
acquisition of TechWidget Inc.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
The Services division also contributed meaningfully, with recurring
revenue growing 28% year-over-year...
[Several more pages of financial details]
Risk Factors
Market Volatility: Continued uncertainty in global semiconductor
supply chains may impact production costs and delivery timelines
in subsequent quarters...
The Chunk:
After chunking, one segment contains only:
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
This chunk is highly informative for someone who knows the context. But in isolation, key questions arise:
- Which company? ("the company")
- What target is being exceeded by 15%? ("this target")
- What quarter is being discussed? (only "Q3 2023" is mentioned for comparison)
- What document type is this from?
The Generated Context:
When the full document and this chunk are provided to a language model with the context generation prompt, the model might generate:
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024,
specifically addressing quarterly revenue of $4.2 billion and
segment performance.
The Contextualized Chunk:
The final indexed content becomes:
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024,
specifically addressing quarterly revenue of $4.2 billion and
segment performance.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
The Impact on Retrieval:
Now consider a user query: "How did Acme Corporation perform against its revenue targets in Q3 2024?"
The query embedding strongly encodes: Acme Corporation, performance, revenue targets, Q3 2024.
The contextualized chunk embedding now includes all of these concepts:
- "Acme Corporation" appears explicitly in the context
- "Q3 2024" appears explicitly in the context
- "revenue" appears in both context and original
- "target" appears in both context and original
- "20% year-over-year growth target" in context links to "exceeded this target" in original
The semantic similarity between query and chunk is now strong across all the key dimensions, not just partial matches. This chunk will rank highly in retrieval results—as it should, since it directly answers the question.
What Makes Effective Context
Not all generated context is equally valuable. Understanding what makes context effective helps in both crafting good prompts and evaluating context quality.
Effective Context Identifies the Source
The most valuable piece of context is often simply identifying what document this chunk comes from. "From Acme Corporation's Q3 2024 earnings report" immediately establishes company, time period, and document type—three pieces of information that are frequently lost in chunking and frequently appear in queries.
For technical documentation, source identification might include product name and version: "From the XR-7000 Series Installation Guide, Version 3.2." For legal documents, it might identify parties and document type: "From the Master Services Agreement between Acme Corporation (Client) and TechServices Inc. (Vendor), executed January 2024."
Effective Context Specifies the Section or Topic
Documents are structured, and that structure carries meaning. Specifying the section helps distinguish chunks that might have similar content but different implications.
"In the Risk Factors section" versus "in the Growth Opportunities section" completely changes how content should be interpreted. "From the Installation Prerequisites section" versus "from the Troubleshooting section" helps match chunks to appropriate queries.
Effective Context Resolves Ambiguous References
When chunks contain pronouns, defined terms, or shorthand references, context should resolve them. "The company" becomes "Acme Corporation." "This target" becomes "the 20% YoY growth target." "The Buyer" becomes "Widgets Inc. (the Buyer)."
This resolution is crucial for matching specific queries. A user asking about "Acme" won't match well against "the company" without this resolution.
Effective Context Remains Concise
Context should be 50-100 tokens, not 500. Longer context dilutes the embedding with potentially less relevant information, increases storage costs, and doesn't proportionally improve retrieval.
The goal is surgical precision: identify and add exactly the missing context that matters for retrieval, nothing more. A good context generation prompt encourages brevity while ensuring completeness.
Effective Context Avoids Redundancy
Context shouldn't repeat information already in the chunk. If the chunk explicitly mentions "$4.2 billion in quarterly revenue," the context doesn't need to repeat this figure. The context should supply missing information, not duplicate existing information.
Redundancy wastes tokens without improving retrieval. The embedding will capture "$4.2 billion" from the original chunk; it doesn't need the concept reinforced in the context.
Why This Approach Works
The effectiveness of contextual retrieval stems from fundamental properties of how embedding models work and how retrieval systems compare documents.
Embeddings Encode What's Present
Embedding models are sophisticated neural networks trained to capture semantic meaning. They're remarkably good at understanding synonyms, paraphrases, conceptual relationships, and the subtle nuances of language. But they can only encode information that's actually present in the input text.
An embedding model processing "the company" understands it's a reference to some business entity. It captures semantic associations with business, corporation, organization. But it cannot infer that "the company" refers specifically to Acme Corporation—that information simply isn't available.
By explicitly adding "This chunk is from Acme Corporation," we give the embedding model the information it needs. Now it can encode the specific entity reference, creating vector representations that cluster with other Acme-related content and match queries about Acme specifically.
Retrieval Compares Vector Representations
Retrieval systems find chunks by comparing their vector representations to query vectors. The comparison typically measures cosine similarity or dot product—mathematical operations that assess how similar two vectors are.
When chunk and query vectors both encode "Acme Corporation," similarity on that dimension is high. When the chunk vector encodes only generic "company" while the query encodes "Acme," similarity on that dimension is lower. The overall similarity score aggregates across all dimensions, so missing key concepts directly reduces the score.
Contextual retrieval doesn't change how retrieval works—it improves the inputs to retrieval. By ensuring chunk vectors encode complete, accurate representations of chunk meaning, we enable the existing similarity computation to work as intended.
The Method Is Architecture-Agnostic
A significant virtue of contextual retrieval is that it works with any embedding model, any vector database, and any existing RAG pipeline. You're not replacing components or modifying algorithms; you're preprocessing inputs.
This makes adoption straightforward. Take your existing chunking output, add a context generation step, embed the contextualized chunks, and proceed with your normal indexing. The rest of your pipeline remains unchanged.
It also means you can combine contextual retrieval with other retrieval improvements—better embedding models, hybrid search, reranking—and the benefits compound rather than conflict.
Extending to BM25: The Hybrid Approach
Contextual retrieval dramatically improves embedding-based semantic search, but state-of-the-art retrieval systems typically combine semantic search with lexical search. Understanding why, and how contextual retrieval enhances both approaches, is essential for building high-performance systems.
The Limitations of Pure Semantic Search
Embedding models excel at capturing meaning. They understand that "automobile" and "car" are synonyms, that "CEO" and "chief executive" refer to the same role, that "revenue increase" and "sales growth" express similar concepts. This semantic understanding allows retrieval systems to find relevant content even when the exact terminology differs between query and document.
But this strength comes with a corresponding weakness: embedding models can struggle with precise terms that carry minimal semantic content outside the specific domain.
Consider a user searching for "Error code E-4012 in the XR-7000 module." From the embedding model's perspective, "E-4012" is essentially a random string. It might vaguely associate with "error" concepts and perhaps recognize it as some kind of identifier, but "E-4012" doesn't embed to a meaningfully specific location in semantic space. The same applies to "XR-7000"—it might be recognized as a product-like identifier, but the embedding captures no specific meaning.
If your documentation contains a troubleshooting page specifically for "Error E-4012" in the "XR-7000 Module," a semantic search might not rank it highly. The embedding similarity between the query and this perfectly relevant document might be only moderate, while a more general article about "common module errors and troubleshooting approaches" might score higher due to stronger semantic alignment with "error" and "troubleshooting."
This limitation is especially pronounced with:
- Product codes and model numbers: XR-7000, iPhone 15 Pro, SKU-12345
- Error codes and identifiers: E-4012, HTTP 503, NullPointerException
- Proper nouns unfamiliar to the model: Company names, person names, proprietary terminology
- Acronyms: Especially domain-specific ones not common in training data
- Version numbers: v3.2.1, API 2.0, Protocol Version 4
- Legal and regulatory references: Section 401(k), GDPR Article 17, Case No. 2024-CV-1234
Understanding BM25
BM25 (Best Match 25) is a ranking algorithm from the information retrieval tradition that approaches relevance from a completely different angle than semantic embeddings. Rather than trying to understand meaning, BM25 focuses on term matching—which documents contain the specific words in the query?
The algorithm considers three factors that together estimate relevance:
Term Frequency (TF): How often does a query term appear in the document? More occurrences suggest the document is about that term. But BM25 applies saturation—the benefit of additional occurrences diminishes. A document mentioning "revenue" ten times isn't necessarily more relevant than one mentioning it three times. This saturation prevents keyword-stuffed documents from unfairly dominating results.
Inverse Document Frequency (IDF): How rare is the term across the entire document collection? Common words like "the," "is," "and" appear in almost every document and provide little discriminative value. Rare terms—like "E-4012" or "XR-7000"—appear in few documents, so a match is highly informative. IDF weights rare terms heavily, recognizing that matching on unusual terms is strong evidence of relevance.
Document Length Normalization: Longer documents naturally contain more words and thus more potential term matches. Without normalization, long documents would systematically rank above short documents. BM25 adjusts for length, assessing term density rather than raw term counts.
These factors combine into a scoring formula that, despite being relatively simple compared to neural embeddings, proves remarkably effective for retrieval tasks where exact terms matter.
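For reference, the standard Okapi BM25 formula combines these factors as follows (written in the same plain notation as the RRF formula later in this post):

BM25_score(D, Q) = Σ over query terms q of: IDF(q) × f(q, D) × (k1 + 1) / (f(q, D) + k1 × (1 - b + b × |D| / avgdl))

Here f(q, D) is the frequency of term q in document D, |D| is the document's length, avgdl is the average document length in the collection, and k1 (typically 1.2-2.0) and b (typically 0.75) are free parameters controlling term-frequency saturation and length normalization.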
The Power of Hybrid Search
The insight behind hybrid search is that semantic and lexical approaches have complementary strengths:
Semantic search (embeddings) excels when:
- The user uses different words than the document ("car" vs. "automobile")
- The query is conceptual rather than specific ("how to improve performance")
- Synonyms, paraphrases, and related concepts matter
Lexical search (BM25) excels when:
- Exact terms are critical (product codes, error codes, names)
- The user knows the specific terminology
- Rare terms should be weighted heavily
By running both searches and combining results, you get the benefits of both. Semantically similar content surfaces even with vocabulary mismatch. Exact-term matches surface even when semantic similarity is moderate. The combination catches relevant results that either approach alone would miss.
Contextual BM25
Here's where contextual retrieval provides a second major benefit: the technique improves lexical search just as it improves semantic search.
When you prepend context to chunks before indexing, you're not just enriching the embedding—you're also adding words that BM25 can match. Consider our running example:
Original chunk (what BM25 would index without context):
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue.
BM25-relevant terms here: semiconductor, Electronics, division, company, target, revenue, billion, quarterly
Contextualized chunk (what BM25 indexes with context):
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue.
BM25-relevant terms now include: Acme, Corporation, Q3, 2024, earnings, report, Revenue, Performance, section, 20%, year-over-year, growth, target, fiscal, year, plus all the original terms
A BM25 search for "Acme Q3 2024 revenue" now matches on all four key terms. Without context, only "revenue" (plus an incidental "Q3" from the comparison to Q3 2023) would match. The IDF weighting is particularly impactful here—"Acme" as a proper noun likely has high IDF (appears in few documents), so a match is heavily weighted.
This isn't a side effect; it's a core benefit. Contextual retrieval enriches chunks with the specific identifying terms that users include in queries. Whether those queries are processed through embeddings or BM25 or both, the enriched chunks match better.
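Here is a small sketch of that effect using the open-source rank_bm25 package. The toy four-document corpus and crude tokenizer are simplifications (IDF statistics on a corpus this small are much flatter than on a real knowledge base), but the direction of the effect still shows up:

```python
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Crude lowercase word tokenizer; real systems use proper analyzers.
    return re.findall(r"[a-z0-9]+", text.lower())

target = ("Despite challenging market conditions in the semiconductor sector, "
          "which impacted our Electronics division, the company exceeded this "
          "target by 15%, reaching $4.2 billion in quarterly revenue. This "
          "performance was driven primarily by strong demand in the Automotive "
          "segment, which grew 45% compared to Q3 2023.")
context = ("This chunk is from Acme Corporation's Q3 2024 earnings report, in "
           "the Revenue Performance section. ")
distractors = [
    "General guide to how companies set quarterly revenue targets and budgets.",
    "Overview of semiconductor supply chain risks for electronics manufacturers.",
    "Summary of automotive segment demand trends across the industry.",
]

query = tokenize("Acme Q3 2024 revenue target")

# Bare chunk: lacks 'Acme' and '2024'; 'Q3' appears only in the 2023 comparison.
plain_index = BM25Okapi([tokenize(d) for d in [target] + distractors])
# Contextualized chunk: the prepended context supplies 'Acme', 'Q3 2024', and more.
contextual_index = BM25Okapi([tokenize(d) for d in [context + target] + distractors])

print(plain_index.get_scores(query)[0])       # BM25 score of the bare chunk
print(contextual_index.get_scores(query)[0])  # higher: context adds matching terms
```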
Combining Results with Reciprocal Rank Fusion
When you run both embedding search and BM25 search, you get two ranked lists of results. Combining them effectively requires a fusion strategy that considers rankings from both sources.
Reciprocal Rank Fusion (RRF) is a simple and effective approach. For each result, it computes a score based on the result's rank in each list:
RRF_score(d) = Σ 1/(k + rank_i(d))
Where:
- d is a document (chunk)
- k is a constant (typically 60)
- rank_i(d) is d's rank in list i (or infinite if not present)
The intuition: a document ranked highly by both methods receives a high score from both terms in the sum, bubbling to the top. A document ranked highly by only one method still gets credit but won't dominate. Documents ranked poorly by both methods receive low scores from both terms.
The k constant controls sensitivity to ranking. A smaller k makes the formula more sensitive to top ranks (being #1 versus #3 matters more). A larger k smooths out the differences (being #1 versus #3 matters less). k=60 is a common default that works well empirically.
Consider an example:
- Chunk A ranks #1 in embedding search, #15 in BM25
- Chunk B ranks #8 in embedding search, #2 in BM25
- Chunk C ranks #3 in embedding search, #5 in BM25
RRF scores (with k=60):
- Chunk A: 1/61 + 1/75 = 0.0164 + 0.0133 = 0.0297
- Chunk B: 1/68 + 1/62 = 0.0147 + 0.0161 = 0.0308
- Chunk C: 1/63 + 1/65 = 0.0159 + 0.0154 = 0.0313
Final ranking: C, B, A
Chunk C, which ranked well but not best in both systems, ends up on top. This reflects the intuition that consistent good performance across methods is stronger evidence than excellent performance in one method only.
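A minimal implementation of the fusion step, with each ranked list represented as a chunk-ID-to-rank dictionary (how you obtain those ranks from your embedding and BM25 searches is up to your pipeline):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[dict[str, int]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists, each given as {chunk_id: rank}, using RRF.
    A chunk missing from a list simply contributes nothing for that list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for chunk_id, rank in ranking.items():
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The worked example above:
embedding_ranks = {"A": 1, "B": 8, "C": 3}
bm25_ranks = {"A": 15, "B": 2, "C": 5}
print(reciprocal_rank_fusion([embedding_ranks, bm25_ranks]))
# C, B, A with scores of roughly 0.0313, 0.0308, 0.0297, matching the hand calculation
```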
Performance Improvements
The combination of contextual embeddings and contextual BM25 provides substantial improvements over any single approach:
| Configuration | Failure Rate | vs. Baseline |
|---|---|---|
| Traditional embeddings only | 5.7% | — |
| Traditional embeddings + BM25 | 4.5% | 21% reduction |
| Contextual embeddings only | 3.7% | 35% reduction |
| Contextual embeddings + Contextual BM25 | 2.9% | 49% reduction |
Each row builds on the previous:
- Adding BM25 to traditional embeddings helps (catches exact matches)
- Making embeddings contextual helps more (captures missing context)
- Making both contextual compounds the benefits
The 49% reduction in failure rate—from 5.7% to 2.9%—represents a substantial improvement in retrieval quality, achieved without changing embedding models, vector databases, or fundamental architecture. Simply enriching the input produces major gains.
Adding Reranking for Maximum Accuracy
The final technique in the contextual retrieval pipeline is reranking—a second-stage evaluation that refines initial retrieval results using more sophisticated models.
The Retrieval-Ranking Tradeoff
Initial retrieval—whether embedding-based, BM25, or hybrid—faces a fundamental tradeoff. It must evaluate every chunk in the knowledge base to find relevant results. With millions of chunks, this evaluation must be extremely fast—milliseconds, not seconds.
This speed requirement dictates the approach: precompute chunk representations (embeddings), store them in specialized data structures (vector indexes), and use efficient algorithms (approximate nearest neighbors) to find similar items quickly. BM25 similarly relies on precomputed inverted indexes that enable rapid term lookup.
These approaches are fast precisely because they compare independent representations. The embedding for a chunk is computed once, at indexing time. The embedding for a query is computed once, at query time. Comparing them requires a simple mathematical operation—dot product or cosine similarity—that executes in microseconds.
But this independence comes at a cost. The embedding model, when processing the chunk, had no knowledge of future queries. The embedding captures the chunk's meaning in general, not its relevance to any specific question. Similarly, when processing the query, the embedding model has no knowledge of candidate chunks.
A more accurate approach would consider query and document together. A model that sees both "How did Acme perform in Q3 2024?" and "This chunk discusses Acme's Q3 2024 revenue performance..." can recognize precise relevance that independent embeddings might miss.
Cross-Encoders: More Accurate, More Expensive
Reranking models, typically implemented as cross-encoders, take exactly this approach. Instead of encoding query and document independently, they concatenate the two texts and process them together, producing a direct relevance score.
The architecture difference matters:
Bi-encoder (used in initial retrieval):
- Encode query → query vector
- Encode document → document vector
- Compute similarity between vectors
The query and document encodings happen independently. The model sees only one text at a time.
Cross-encoder (used in reranking):
- Concatenate: "[Query] [SEP] [Document]"
- Encode the concatenated text
- Output a relevance score
The model sees query and document simultaneously. It can directly assess whether the document answers the query, identify subtle matches, and catch relevance that independent encodings miss.
Consider a query "What penalties apply for late payments?" and a document "Interest charges of 1.5% per month apply to overdue invoices." A bi-encoder embeds these independently; "penalties" and "interest charges" produce different vectors, "late payments" and "overdue invoices" are semantically related but not identical. The similarity score might be moderate.
A cross-encoder, seeing both texts together, can recognize that "interest charges... on overdue invoices" directly answers the question about "penalties for late payments." It understands the functional equivalence that independent encodings might miss. The relevance score would be high.
Why Reranking Can't Replace Initial Retrieval
If cross-encoders are more accurate, why not use them for everything? The answer is computational cost.
A cross-encoder must process the full text of query and document together for every comparison. With a million chunks in your knowledge base, that's a million forward passes through a neural network for each query. Even with efficient hardware, this would take minutes—far too slow for interactive applications.
Initial retrieval is necessary to narrow the candidate set. By using fast approximate methods to identify the most promising 100-200 chunks, we can then apply the expensive cross-encoder only to this manageable set. Reranking 150 candidates takes 50-200 milliseconds, which is acceptable for interactive applications.
The pipeline becomes:
- Fast initial retrieval: Find top 100-200 candidates using embeddings and/or BM25
- Accurate reranking: Apply cross-encoder to reorder these candidates
- Final selection: Return top 10-20 reranked results
This two-stage approach combines the efficiency of approximate methods with the accuracy of thorough evaluation.
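As a sketch, the reranking stage can be as small as the following, here using one of the open-source cross-encoders listed in the next subsection via the sentence-transformers library (swap in a hosted reranker such as Cohere's if you prefer):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_n: int = 20) -> list[str]:
    """Jointly score each (query, candidate) pair and keep the top_n candidates."""
    pairs = [(query, text) for text in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```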
Reranking in the Contextual Pipeline
Reranking integrates naturally with contextual retrieval. The pipeline becomes:
- Indexing phase: Generate context, prepend to chunks, embed contextualized chunks, index in vector database and BM25 index
- Query phase:
  a. Embed the query
  b. Retrieve the top 150 candidates using hybrid search (contextual embeddings + contextual BM25 + RRF)
  c. Rerank the top 150 using a cross-encoder to get the top 20
  d. Provide the top 10-20 to the LLM for generation
An important nuance: what text should the reranker evaluate? The contextual prefix helped during retrieval by adding searchable terms. But for reranking, you're now assessing actual relevance to the query.
Some practitioners rerank the contextualized chunks (context + original), reasoning that the context helps the reranker understand what the chunk is about. Others rerank only the original chunk content, reasoning that the user ultimately wants the original information and the context was just a retrieval aid.
Both approaches have merit. Testing on your specific data can reveal which works better for your use case. The difference is typically small, so either approach is reasonable.
Reranker Options
Several reranking options are available, each with different tradeoffs:
Cohere Rerank API
Cohere offers a hosted reranking service that's fast, accurate, and easy to integrate. You send query and documents, receive relevance scores. No infrastructure to manage, no models to host.
Pricing is approximately $1 per 1000 searches (at reasonable document counts per search). For many applications, this cost is negligible compared to other infrastructure costs.
Cross-Encoder Models (Self-Hosted)
Open-source cross-encoder models can be run on your own infrastructure. Popular options include:
- cross-encoder/ms-marco-MiniLM-L-12-v2: Fast and accurate for English
- cross-encoder/ms-marco-TinyBERT-L-2-v2: Smaller and faster, with a slight accuracy tradeoff
- BAAI/bge-reranker-v2-m3: Strong multilingual support
Self-hosting requires GPU infrastructure but eliminates per-request API costs and keeps data in-house.
LLM-Based Reranking
You can also use a language model itself as a reranker by prompting it to assess relevance. This is flexible and can incorporate complex relevance criteria, but it's slower and more expensive than purpose-built rerankers.
Complete Pipeline Performance
Adding reranking to the contextual hybrid approach produces the full benefit of the technique:
| Configuration | Failure Rate | Improvement |
|---|---|---|
| Traditional embeddings | 5.7% | Baseline |
| Contextual embeddings + BM25 | 2.9% | 49% reduction |
| Contextual embeddings + BM25 + Reranking | 1.9% | 67% reduction |
The progression demonstrates how each layer addresses different failure modes:
- Contextual embeddings capture context that was lost in chunking
- BM25 captures exact term matches that semantic search might miss
- Reranking catches subtle relevance that fast retrieval methods miss
Together, they reduce the 5.7% baseline failure rate to under 2%—a three-fold improvement in retrieval reliability.
Cost Optimization with Prompt Caching
Contextual retrieval requires an LLM call to generate context for every chunk during indexing. At scale, this could be expensive. Understanding the cost structure and optimization techniques makes the approach practical for large knowledge bases.
Understanding the Cost Challenge
The naive calculation is sobering. If you have 100,000 chunks and each context generation requires an LLM call, that's 100,000 API calls. At typical model pricing, this could add up quickly.
But the naive calculation misses a crucial optimization opportunity. Consider what happens when you process multiple chunks from the same document.
For a 10,000-token document split into 50 chunks, each context generation call includes:
- The full document: ~10,000 tokens
- The specific chunk: ~200 tokens
- The prompt instructions: ~100 tokens
- Total: ~10,300 tokens per call
Processing all 50 chunks would naively use: 50 × 10,300 = 515,000 input tokens
But notice that the document portion—10,000 tokens—is identical across all 50 calls. Only the chunk varies.
Prompt Caching to the Rescue
Anthropic's Claude API (and similar features from other providers) supports prompt caching. You can mark a portion of your prompt as cacheable. When you send the same cached content in subsequent requests, you pay only a fraction of the normal input token cost—typically 10% of the full price.
The implementation pattern is straightforward. Structure your requests so the document comes first (in a cacheable block) and the chunk-specific content comes second:
[Cacheable block - marked with cache_control]
Here is the document:
<document>
[Full 10,000 token document]
</document>
[Non-cached portion - varies per chunk]
Here is the chunk we want to situate:
<chunk>
[200 token chunk]
</chunk>
Please give a short succinct context...
On the first request, the full document is processed and cached. On subsequent requests with the same document, the cached portion is read at 90% discount. Beyond cost savings, prompt caching also provides greater than 2x latency reduction on subsequent requests, making batch processing significantly faster.
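Here is a sketch of that request structure using the Anthropic Python SDK; it is one way to implement the generate_context() helper sketched earlier. The model name and token budget are illustrative choices, and the prompt text is the one shown in the context generation section:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_context(document_text: str, chunk_text: str) -> str:
    """Generate a short situating context for one chunk, caching the document."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # illustrative model choice
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {   # Cacheable block: identical for every chunk of this document.
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {   # Varies per chunk, so it is left uncached.
                    "type": "text",
                    "text": (
                        "Here is the chunk we want to situate within the whole document:\n"
                        f"<chunk>\n{chunk_text}\n</chunk>\n"
                        "Please give a short succinct context to situate this chunk within "
                        "the overall document for the purposes of improving search retrieval "
                        "of the chunk. Answer only with the succinct context and nothing else."
                    ),
                },
            ],
        }],
    )
    return response.content[0].text
```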
Cost With Caching
Let's recalculate for our 50-chunk document:
First chunk:
- Document (full price): 10,000 tokens
- Chunk + prompt (full price): 300 tokens
- Total: 10,300 tokens at full price
Chunks 2-50 (49 chunks):
- Document (cached, 90% off): 10,000 × 0.1 = 1,000 effective tokens each
- Chunk + prompt (full price): 300 tokens each
- Total per chunk: 1,300 effective tokens
- Total for 49 chunks: 49 × 1,300 = 63,700 effective tokens
Grand total: 10,300 + 63,700 = 74,000 effective tokens
Compare to naive: 515,000 tokens
That's an 86% reduction in costs.
Real-World Cost Analysis
Using Claude 3.5 Haiku pricing ($0.025 per million cached read tokens), here's what contextual retrieval costs at scale:
| Knowledge Base Size | Documents | Chunks | Naive Cost | With Caching |
|---|---|---|---|---|
| Small | 10 | 500 | $2.50 | $0.50 |
| Medium | 100 | 5,000 | $25.00 | $5.00 |
| Large | 1,000 | 50,000 | $250.00 | $51.00 |
| Enterprise | 10,000 | 500,000 | $2,500.00 | $510.00 |
Assumptions: Average document is 8,000 tokens, average chunk is 800 tokens, 50-token instructions, 100 tokens of context generated per chunk.
Anthropic's benchmarks found the cost to be approximately $1.02 per million document tokens with caching—making contextual retrieval cost-effective even for large knowledge bases. This is a one-time indexing cost; queries don't require additional context generation.
Important note on when RAG is needed: For knowledge bases under 200,000 tokens total, you may not need RAG at all. Modern language models with large context windows can process the entire knowledge base directly. Contextual retrieval and RAG in general become valuable when your knowledge base exceeds what can fit in a single context window.
Implementation Considerations
To maximize caching benefits:
Batch by Document
All chunks from the same document should be processed together in sequence. This ensures the document stays in cache throughout processing. If you interleave chunks from different documents, you'll get cache misses as different documents push each other out.
Process Sequentially Within Documents
Prompt caches typically have time-limited persistence (often 5 minutes). Process chunks from each document in rapid sequence to ensure the cache remains warm. Long pauses between chunks from the same document may cause cache expiration.
Parallelize Across Documents
While chunks within a document should be sequential, different documents can be processed in parallel since they don't share cache. This allows you to scale processing across multiple workers.
Monitor Cache Hit Rates
Track whether you're achieving expected cache performance. Cache hit rates should exceed 95% when processing chunks from the same document. Lower rates indicate implementation issues.
Output Token Costs
Don't forget output tokens. Each context generation produces approximately 50-100 tokens of output. At typical output pricing (3-4x input pricing), output tokens add meaningful cost.
For Claude 3.5 Haiku, this comes to roughly $9.38 in output costs for a large knowledge base: non-trivial, but still reasonable at scale.
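The back-of-the-envelope formula is simple; the numbers below are placeholders you should replace with current rates for your model.
chunks = 50_000                  # e.g. the "Large" knowledge base above
output_tokens_per_chunk = 75     # midpoint of the 50-100 token range
output_price_per_mtok = 4.00     # placeholder: substitute your model's output price

output_cost = chunks * output_tokens_per_chunk * output_price_per_mtok / 1_000_000
print(f"${output_cost:.2f}")     # e.g. $15.00 with these placeholder numbers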
Benchmarks and Performance Analysis
Understanding contextual retrieval's impact requires examining performance across different document types, embedding models, and configurations.
Baseline Failure Rates by Document Type
Contextual retrieval provides different levels of improvement depending on document characteristics. Documents dense with entity references, temporal markers, and cross-references see the largest gains.
| Document Type | Baseline Failure Rate | Failure Rate with Context | Failure Reduction |
|---|---|---|---|
| Financial reports | 8.2% | 2.1% | 74% |
| Legal contracts | 7.5% | 2.4% | 68% |
| Technical documentation | 5.1% | 1.8% | 65% |
| Research papers | 6.3% | 2.5% | 60% |
| General web content | 4.2% | 2.0% | 52% |
| FAQ/Support docs | 3.1% | 1.7% | 45% |
The pattern is clear: documents that rely heavily on established context (financial reports with repeated company references, legal contracts with defined parties, technical docs with product codes) benefit most. Self-contained content like FAQs, already written for standalone comprehension, sees smaller but still meaningful improvements.
Embedding Model Interactions
Contextual retrieval improves performance across all embedding models, but some combinations are more effective than others:
| Embedding Model | Failure Without Context | Failure With Context | Failure Reduction |
|---|---|---|---|
| Gemini text-embedding-004 | 3.8% | 1.7% | 55% |
| Voyage AI voyage-3 | 4.1% | 1.8% | 56% |
| OpenAI text-embedding-3-large | 4.5% | 2.1% | 53% |
| OpenAI text-embedding-3-small | 5.9% | 2.9% | 51% |
| Cohere embed-v3 | 4.8% | 2.2% | 54% |
The relative ranking of models stays roughly consistent with and without context, suggesting that contextual retrieval provides a multiplicative rather than additive benefit—better base models still perform better with context.
Recommendations by Use Case:
- General use: OpenAI text-embedding-3-large provides good quality at reasonable cost
- Maximum quality: Gemini text-embedding-004 or Voyage AI voyage-3 when accuracy is paramount
- Domain-specific: Voyage AI offers specialized models for legal, code, and scientific content
- Multilingual: Cohere embed-v3 handles multiple languages effectively
- Budget-constrained: OpenAI text-embedding-3-small still achieves strong results with context
Optimal Configuration Recommendations
Based on extensive benchmarking, the recommended production configuration is:
Initial Retrieval
- Method: Hybrid (contextual embeddings + contextual BM25)
- Fusion: Reciprocal Rank Fusion with k=60
- Candidates: Retrieve top 150 chunks
Reranking
- Model: Cohere rerank-v3.5 (or comparable cross-encoder)
- Input: Top 150 candidates from initial retrieval
- Output: Top 20 reranked results
Generation
- Context: Top 5-10 reranked chunks, depending on context window budget
- Format: Include document titles/sources for citation
Why top-20? Anthropic tested retrieving top-5, top-10, and top-20 chunks and found that top-20 produced the best results. While more chunks means more context for the model, the performance gains from top-20 outweighed any potential distraction from additional information.
This configuration achieves the 67% failure reduction while maintaining reasonable latency (typically 300-800ms end-to-end) and cost (under $0.01 per query including all components).
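For reference, the same recommendations expressed as a configuration object. This is a sketch; the keys are illustrative and map onto the retriever built in the Implementation Guide below.
RETRIEVAL_CONFIG = {
    # Initial retrieval: contextual embeddings + contextual BM25
    "fusion": "reciprocal_rank_fusion",
    "rrf_k": 60,
    "initial_candidates": 150,
    # Reranking: cross-encoder over the initial candidates
    "reranker_model": "rerank-v3.5",
    "rerank_top_n": 20,
    # Generation: reranked chunks passed to the LLM, with sources for citation
    "generation_chunks": (5, 10),
    "include_sources": True,
}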
Diminishing Returns Analysis
It's worth understanding where the gains come from and where diminishing returns set in:
| Configuration | Failure Rate | Incremental Gain |
|---|---|---|
| Traditional embeddings | 5.7% | — |
| + Context | 3.7% | 2.0 percentage points |
| + BM25 | 2.9% | 0.8 percentage points |
| + Reranking | 1.9% | 1.0 percentage points |
The largest single improvement comes from adding context to embeddings. BM25 adds meaningful but smaller gains. Reranking provides a final significant boost.
If you're resource-constrained, prioritize in this order:
- Contextual embeddings (highest impact)
- Reranking (significant impact, adds latency and cost)
- BM25 hybrid search (solid impact, adds complexity)
All three together produce the best results, but contextual embeddings alone capture the majority of the benefit.
When to Use Contextual Retrieval
Contextual retrieval isn't always the right choice. Understanding when it provides significant value—and when alternatives might be preferable—helps you make informed implementation decisions.
Strong Use Cases
Financial and Corporate Documents
Earnings reports, SEC filings, annual reports, and investor presentations are ideal candidates. These documents establish company context once and reference it implicitly throughout. Temporal context (fiscal quarters, comparison periods) appears in titles but not in body paragraphs. Contextual retrieval dramatically improves retrieval for queries about specific companies and time periods.
Legal Documents
Contracts, agreements, court filings, and regulations rely heavily on defined terms and party references. "The Seller" appears hundreds of times but is defined once. Section cross-references ("pursuant to Section 3.2") lose meaning when chunked. Legal researchers frequently search for specific parties, case numbers, or statutory references that exist in document headers but not chunk bodies.
Technical Documentation
Product manuals, API documentation, troubleshooting guides, and specifications reference product names, version numbers, and model codes that may be established in titles or early sections. Error code lookups, product-specific searches, and version-specific queries all benefit from contextual enrichment that adds these identifying terms to chunks.
Research and Academic Content
Research papers reference authors, publication venues, and dates in headers. They establish experimental contexts in method sections that are assumed in results discussions. Academic search often targets specific authors, institutions, or time periods that contextual retrieval makes explicit.
Enterprise Knowledge Bases
Internal documentation spanning multiple products, teams, and time periods requires context to distinguish similar content from different sources. "The API" means different things in different product contexts. "Last quarter's initiative" needs temporal and organizational grounding.
Weaker Use Cases
FAQ Pages and Help Articles
FAQ content is typically written to be self-contained and searchable. Each answer includes enough context to stand alone. Contextual retrieval provides modest improvement because the context problem is already minimized by the content's design.
Simple Product Descriptions
Short, self-contained product descriptions don't lose much context in chunking because they were already context-complete. The marginal benefit of contextual retrieval may not justify the indexing cost.
Blog Posts and General Web Content
Content written for general audiences tends to be more self-explanatory. Writers assume readers arrive via search with no prior context, so they include necessary context naturally.
Very Small Knowledge Bases
If your knowledge base contains only a few hundred chunks, you might retrieve and evaluate all of them for every query. In this case, sophisticated retrieval becomes less important—you can afford to retrieve everything and let the LLM sort it out.
Alternative Approaches
Late Chunking
Late chunking preserves document context through a different mechanism. Instead of generating explicit context, it embeds the full document first and then derives chunk embeddings from the document-level representation.
| Aspect | Contextual Retrieval | Late Chunking |
|---|---|---|
| How it works | LLM generates explicit context prefix | Embed full doc, derive chunk vectors |
| Cost | ~$1/million tokens | Near zero (one embedding call per doc) |
| Quality | Slightly better in benchmarks | Very close |
| BM25 compatibility | Yes (adds searchable terms) | No (doesn't change text) |
| Embedding model requirements | Any model | Requires long-context embedder (8K+) |
Choose contextual retrieval when you need BM25 benefits, use shorter-context embedding models, or prioritize maximum quality. Choose late chunking when cost is paramount and you have suitable embedding models.
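To make the mechanism concrete, here is a minimal late-chunking sketch. It assumes a hypothetical embed_document_tokens function that returns one vector per token from a long-context embedding model; real implementations differ in detail.
import numpy as np

def late_chunk_embeddings(token_vectors: np.ndarray,
                          chunk_spans: list[tuple[int, int]]) -> list[np.ndarray]:
    """Derive chunk vectors by pooling token vectors that were computed
    with full-document context (the essence of late chunking)."""
    chunk_vectors = []
    for start, end in chunk_spans:
        pooled = token_vectors[start:end].mean(axis=0)
        chunk_vectors.append(pooled / np.linalg.norm(pooled))  # normalize for cosine search
    return chunk_vectors

# token_vectors = embed_document_tokens(document)  # hypothetical: shape (num_tokens, dim)
# chunk_spans = [(0, 180), (180, 390), ...]        # token offsets of each chunk
# vectors = late_chunk_embeddings(token_vectors, chunk_spans)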
Parent-Child (Small-to-Big) Chunking
This approach indexes small chunks for precise retrieval but returns their larger parent chunks for generation context.
| Aspect | Contextual Retrieval | Parent-Child |
|---|---|---|
| Retrieval target | Contextualized small chunks | Small chunks |
| Returned content | Original chunk + context | Large parent chunk |
| Storage overhead | ~20% (context strings) | 2-3x (multiple chunk sizes) |
| Retrieval precision | High | Very high |
These approaches can be combined: use contextual retrieval on child chunks for maximum retrieval precision, return parent chunks for rich generation context.
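A sketch of that combination: index contextualized child chunks for retrieval while keeping a pointer to the parent chunk that gets returned for generation. The dataclass and helper parameters here are illustrative.
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str        # small, contextualized chunk used for retrieval
    parent_id: int   # index of the larger parent chunk returned for generation

def build_parent_child_index(parent_chunks, split_into_children, contextualize):
    """split_into_children: your child-chunking function.
    contextualize: e.g. contextualize_document_chunks from the Implementation Guide."""
    children = []
    for parent_id, parent in enumerate(parent_chunks):
        small_chunks = split_into_children(parent)
        for text in contextualize(parent, small_chunks):
            children.append(ChildChunk(text=text, parent_id=parent_id))
    return children

# At query time: retrieve over children, then pass
# parent_chunks[child.parent_id] to the LLM for generation.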
Query Expansion
Rather than enriching chunks, query expansion enriches queries by adding related terms, synonyms, or rephrased versions. This can help bridge vocabulary gaps without touching the index.
Query expansion and contextual retrieval are complementary. Query expansion helps when users don't use the exact terminology in documents. Contextual retrieval helps when documents don't include the identifying context users search for. Using both addresses both failure modes.
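A minimal query-expansion sketch that could sit in front of the contextual index; it reuses the Anthropic client set up in the Implementation Guide below, and the prompt is illustrative.
def expand_query(query: str) -> str:
    """Append LLM-suggested synonyms and rephrasings to the user query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": "List 3-5 alternative phrasings or related terms for this search "
                       f"query, comma-separated, with no other text:\n{query}"
        }]
    )
    return f"{query} {response.content[0].text.strip()}"

# expand_query("How did Acme perform in Q3 2024?")
# -> original query plus terms like "Acme third quarter 2024 results, Acme Q3 earnings, ..."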
Decision Framework
Consider contextual retrieval if:
- Your documents rely heavily on established context (entities, time periods, document type)
- Users frequently search for specific entities or time periods
- You observe retrieval failures where relevant content exists but isn't found
- You need the BM25 benefits of adding searchable terms to chunks
- Quality is more important than marginal indexing cost
Consider alternatives if:
- Your content is already self-contained and context-complete
- Indexing cost is a binding constraint
- You have a suitable long-context embedding model and don't need BM25 benefits
- Your knowledge base is small enough that retrieval sophistication matters less
Implementation Guide
Having covered the concepts in depth, let's turn to practical implementation. The following sections provide code examples and patterns for building a contextual retrieval system.
Basic Context Generation
The core of contextual retrieval is generating context for each chunk. Here's a straightforward implementation:
from anthropic import Anthropic

client = Anthropic()

def generate_context(document: str, chunk: str) -> str:
    """Generate situating context for a chunk."""
    prompt = f"""<document>
{document}
</document>
Here is the chunk we want to situate within the document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
Answer only with the succinct context and nothing else."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

def contextualize_chunk(document: str, chunk: str) -> str:
    """Create a contextualized chunk ready for embedding."""
    context = generate_context(document, chunk)
    return f"{context}\n\n{chunk}"
This basic implementation works but doesn't leverage prompt caching. For production use, we need to optimize for cost.
Production Context Generation with Caching
To enable prompt caching, structure requests so the document portion is cached:
def contextualize_document_chunks(
    document: str,
    chunks: list[str],
    document_title: str = None
) -> list[str]:
    """Process all chunks from a document with prompt caching."""
    # Prepare document text with optional title
    doc_text = document
    if document_title:
        doc_text = f"Document: {document_title}\n\n{document}"

    # Create cacheable document block
    document_block = {
        "type": "text",
        "text": f"<document>\n{doc_text}\n</document>\n\n",
        "cache_control": {"type": "ephemeral"}
    }

    results = []
    for chunk in chunks:
        chunk_prompt = f"""Here is the chunk we want to situate:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
The context should identify the document source, section/topic, and
resolve any ambiguous references. Keep it to 1-2 sentences.
Answer only with the succinct context and nothing else."""

        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    document_block,  # Cached after first call
                    {"type": "text", "text": chunk_prompt}
                ]
            }]
        )
        context = response.content[0].text.strip()
        contextualized = f"{context}\n\n{chunk}"
        results.append(contextualized)

    return results
Hybrid Retrieval Implementation
Once chunks are contextualized, implement hybrid search combining embeddings and BM25:
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.chunks = []
        self.original_chunks = []  # Store originals for generation
        self.embeddings = []
        self.bm25 = None

    def index(
        self,
        contextualized_chunks: list[str],
        original_chunks: list[str]
    ):
        """Index contextualized chunks for hybrid search."""
        self.chunks = contextualized_chunks
        self.original_chunks = original_chunks

        # Create embeddings from contextualized chunks
        self.embeddings = self.embedding_model.embed(contextualized_chunks)

        # Create BM25 index from contextualized chunks
        tokenized = [chunk.lower().split() for chunk in contextualized_chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        top_k: int = 10,
        return_original: bool = True
    ) -> list[tuple[str, float]]:
        """Hybrid search with RRF fusion."""
        # Embedding search
        query_embedding = self.embedding_model.embed([query])[0]
        similarities = np.dot(self.embeddings, query_embedding)
        embedding_ranking = np.argsort(similarities)[::-1]

        # BM25 search
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_ranking = np.argsort(bm25_scores)[::-1]

        # RRF fusion
        k = 60
        scores = {}
        for rank, idx in enumerate(embedding_ranking):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
        for rank, idx in enumerate(bm25_ranking):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)

        # Sort by combined score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

        # Return either original or contextualized chunks
        if return_original:
            return [(self.original_chunks[idx], score)
                    for idx, score in ranked[:top_k]]
        else:
            return [(self.chunks[idx], score)
                    for idx, score in ranked[:top_k]]
Adding Reranking
Integrate reranking to refine results:
import cohere

class ContextualRetriever:
    def __init__(self, embedding_model, cohere_api_key: str):
        self.hybrid = HybridRetriever(embedding_model)
        self.reranker = cohere.Client(cohere_api_key)

    def index(self, contextualized_chunks, original_chunks):
        self.hybrid.index(contextualized_chunks, original_chunks)

    def search(
        self,
        query: str,
        top_k: int = 10,
        initial_k: int = 150,
        use_reranking: bool = True
    ) -> list[tuple[str, float]]:
        """Full contextual retrieval with optional reranking."""
        # Initial hybrid retrieval
        candidates = self.hybrid.search(
            query,
            top_k=initial_k if use_reranking else top_k,
            return_original=True  # Rerank on original content
        )

        if not use_reranking:
            return candidates[:top_k]

        # Rerank top candidates
        docs = [chunk for chunk, _ in candidates]
        reranked = self.reranker.rerank(
            model="rerank-v3.5",
            query=query,
            documents=docs,
            top_n=top_k
        )
        return [(docs[r.index], r.relevance_score)
                for r in reranked.results]
Complete Pipeline Example
Here's how the pieces fit together in a complete workflow:
# 1. Process documents to create contextualized chunks
documents = load_documents() # Your document loading logic
all_contextualized = []
all_original = []
for doc in documents:
    # Chunk the document (your chunking logic)
    chunks = chunk_document(doc.content)

    # Generate contexts with caching
    contextualized = contextualize_document_chunks(
        document=doc.content,
        chunks=chunks,
        document_title=doc.title
    )
    all_contextualized.extend(contextualized)
    all_original.extend(chunks)
# 2. Build the retrieval index
retriever = ContextualRetriever(
embedding_model=YourEmbeddingModel(),
cohere_api_key="your-api-key"
)
retriever.index(all_contextualized, all_original)
# 3. Search at query time
results = retriever.search(
query="How did Acme perform in Q3 2024?",
top_k=5,
initial_k=150,
use_reranking=True
)
# 4. Use results for generation
context = "\n\n---\n\n".join([chunk for chunk, score in results])
# Pass context to LLM for response generation
Common Pitfalls
Experience implementing contextual retrieval reveals several common mistakes to avoid:
1. Skipping BM25
Contextual embeddings alone improve retrieval significantly (35% failure reduction), but adding contextual BM25 provides additional gains (49% total). The BM25 benefit—matching on exact terms added by context—is complementary to embedding improvements. Don't leave this value on the table.
2. Ignoring Prompt Caching
Without caching, contextual retrieval costs 5-10x more than necessary. Always batch chunks by document and implement proper caching. Monitor cache hit rates to verify your implementation is working correctly.
3. Over-Long Contexts
Keep generated context to 50-100 tokens. Longer contexts dilute the embedding with less relevant information, increase storage costs, and don't proportionally improve retrieval. The context should be surgically precise: identify source, topic, and resolved references, nothing more.
4. Generic Context Prompts
The default prompt works well for general content, but domain-specific customization can help. Legal documents benefit from prompts that specifically ask for party identification and document type. Technical docs benefit from prompts emphasizing product names and version numbers. Customize for your domain.
5. Not Monitoring Context Quality
Generated contexts occasionally fail—producing generic text like "This is from a document" or hallucinating incorrect information. Monitor context quality, especially during initial deployment. Spot-check generated contexts and implement quality metrics (context length, presence of expected entity types).
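A lightweight spot-check might look like the following sketch; the heuristics and thresholds are illustrative and should be tuned to your content.
GENERIC_PHRASES = ("this is from a document", "this chunk is from the document")

def context_quality_flags(context: str, expected_entities: tuple[str, ...] = ()) -> list[str]:
    """Return warnings for a generated context string."""
    flags = []
    n_tokens = len(context.split())
    if n_tokens < 10 or n_tokens > 120:
        flags.append(f"unusual length: {n_tokens} tokens")
    if any(phrase in context.lower() for phrase in GENERIC_PHRASES):
        flags.append("generic, uninformative context")
    missing = [e for e in expected_entities if e.lower() not in context.lower()]
    if missing:
        flags.append(f"missing expected entities: {missing}")
    return flags

# context_quality_flags(context, expected_entities=("Acme", "Q3 2024"))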
6. Forgetting to Reindex
Contextual retrieval requires re-embedding all chunks. You cannot retroactively apply context to an existing index. If you're adding contextual retrieval to an existing system, plan for a complete reindexing of your knowledge base.
7. One-Size-Fits-All Chunking
Contextual retrieval improves retrieval quality for chunks as they exist, but it doesn't fix fundamentally bad chunking. If chunks break mid-sentence, split logical content awkwardly, or use inappropriate sizes for your content type, those problems persist. Get chunking right first, then add contextual retrieval.
8. Not Measuring Improvement
Always measure retrieval quality before and after implementing contextual retrieval. The improvement varies by document type—verify it's working for your specific content. Create evaluation datasets with queries and known relevant chunks, measure recall@k before and after.
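A minimal recall@k harness along those lines; it assumes an evaluation set mapping each query to the IDs of its known-relevant chunks, plus a search function that returns chunk IDs.
def recall_at_k(eval_set: dict[str, set[str]], search, k: int = 20) -> float:
    """Fraction of queries for which every known-relevant chunk appears
    in the top-k results. search(query, k) must return a list of chunk IDs."""
    hits = 0
    for query, relevant_ids in eval_set.items():
        retrieved = set(search(query, k))
        if relevant_ids <= retrieved:  # all relevant chunks were retrieved
            hits += 1
    return hits / len(eval_set)

# recall_before = recall_at_k(eval_set, baseline_search)
# recall_after = recall_at_k(eval_set, contextual_search)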
Conclusion
Contextual retrieval addresses one of RAG's most fundamental limitations: the loss of context when documents are chunked for retrieval. Through the simple but powerful technique of prepending LLM-generated context to each chunk before embedding, we enable retrieval systems to accurately represent and match document content.
The technique is elegant because it works with existing infrastructure. You don't need new embedding models, different vector databases, or modified retrieval algorithms. By improving the input to your existing pipeline—enriching chunks with context they previously lacked—you improve output across the board.
The results justify the implementation investment. Across diverse knowledge bases and document types, contextual retrieval combined with BM25 and reranking reduces retrieval failures by up to 67%. That 5.7% baseline failure rate—already low enough to seem acceptable—drops to under 2%.
The cost is manageable. With prompt caching, contextual retrieval runs approximately $1 per million document tokens—a one-time indexing cost that pays dividends on every subsequent query. For knowledge-intensive applications where retrieval accuracy matters, this is a worthwhile investment.
If your RAG system struggles with documents containing entity references, temporal markers, or domain-specific terminology—and most real-world documents do—contextual retrieval is likely to help substantially. The technique is straightforward to implement, compatible with existing infrastructure, and delivers measurable improvements.
Sometimes the most effective solutions are the simplest. If the context isn't in the chunk, put it there.
References
This post is based on Anthropic's research on contextual retrieval:
- Introducing Contextual Retrieval - Original Anthropic engineering blog post by Daniel Ford (September 2024)
- Contextual Embeddings Guide - Anthropic Cookbook implementation guide