Contextual Retrieval: Solving RAG's Hidden Context Problem
How prepending chunk-specific context before embedding dramatically improves retrieval quality. Complete guide covering contextual embeddings, contextual BM25, reranking, and prompt caching optimization.
Introduction
Modern AI applications face a fundamental challenge: they need access to information that wasn't part of their training data. Consider the breadth of specialized knowledge required by real-world applications. A customer support chatbot needs intimate familiarity with your company's specific return policies, product specifications, and troubleshooting procedures—information that exists only in your internal documentation. A legal research assistant must search through decades of case law, statutes, and regulatory filings to find precedents relevant to a specific situation. A financial analyst tool needs access to the latest quarterly earnings reports, SEC filings, and market analyses that were published long after any model's training cutoff.
The traditional approach to giving AI systems specialized knowledge was fine-tuning: taking a pre-trained model and further training it on domain-specific data. But fine-tuning has significant limitations. It's expensive, requiring substantial compute resources for each domain adaptation. It's slow, taking days or weeks to complete. It's inflexible—when your documentation changes, you need to fine-tune again. And perhaps most importantly, it struggles with factual recall; fine-tuning is better at adapting a model's style or behavior than at reliably encoding specific facts that can be retrieved on demand.
Retrieval-Augmented Generation (RAG) emerged as an elegant alternative. Instead of baking knowledge into model weights through training, RAG systems retrieve relevant information at query time and provide it directly in the prompt. When a user asks a question, the system searches a knowledge base for relevant passages, then provides those passages to the language model as context alongside the question. The model generates its response grounded in this retrieved information, effectively gaining access to vast knowledge bases without any additional training.
The architecture is compelling in its simplicity: separate the concerns of knowledge storage (in a searchable database) from knowledge application (in the language model). This separation brings immediate benefits. Knowledge can be updated instantly by modifying the database, without touching the model. The same model can serve multiple domains by switching which knowledge base it queries. And the system provides natural citations—you know exactly which documents informed each response.
But RAG systems harbor a hidden failure mode that undermines their effectiveness in practice. The problem emerges from a fundamental tension between how documents are written and how retrieval systems process them.
The Hidden Failure Mode
Documents are written for human readers who maintain context as they read. An earnings report begins by establishing that it covers "Acme Corporation's Third Quarter 2024 Financial Results." Every subsequent page builds on this context. When the report later states "The company's revenue grew by 3% over the previous quarter," human readers understand immediately that "the company" means Acme Corporation, "the previous quarter" means Q2 2024, and this growth figure relates to the Q3 2024 period established at the outset.
RAG systems, however, don't read documents sequentially and holistically. They break documents into chunks—smaller segments that can be individually indexed and retrieved. This chunking is necessary (we'll explore why shortly), but it severs the threads of context that connect different parts of a document.
When that sentence about 3% revenue growth gets chunked and stored in isolation, the critical context evaporates. The chunk becomes an orphan, separated from the document structure that gave it meaning. An embedding model processing this chunk encodes what's there: something about revenue growth, percentage figures, quarterly comparison. But which company? What time period? The embedding can't capture information that simply isn't present in the text.
Now consider what happens when a user asks: "How did Acme Corp perform in Q3 2024?" The query embedding captures the key concepts: Acme Corp, Q3, 2024, performance. But the chunk embedding for that highly relevant sentence lacks "Acme," lacks "Q3 2024," lacks any specific identifying information. The semantic similarity between query and chunk is weaker than it should be. The chunk might rank below less relevant results—or fail to appear in the top results entirely.
This is the context problem, and it's more pervasive than it might initially appear. Anthropic's research across diverse knowledge bases found that traditional RAG systems fail to retrieve relevant information roughly 5.7% of the time. That percentage might seem small, but consider its implications. If a complex question requires information from three different chunks, and each has a 5.7% chance of retrieval failure, the probability that at least one fails is approximately 16%. For knowledge-intensive applications where accuracy matters, a 16% failure rate on multi-part questions represents a serious limitation.
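A quick back-of-the-envelope check of that compounding effect, as a small Python sketch (assuming the three retrievals fail independently at the 5.7% rate cited above):

```python
# Probability that at least one of three needed chunks is missed,
# assuming each retrieval fails independently with probability 5.7%.
p_fail = 0.057
p_any_miss = 1 - (1 - p_fail) ** 3
print(f"{p_any_miss:.1%}")  # ~16.1%
```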
The Elegant Solution
Contextual Retrieval addresses this problem through a disarmingly simple insight: if the context isn't in the chunk, put it there before embedding.
For each chunk in your knowledge base, you use a language model to generate a brief context snippet—typically 50-100 tokens—that situates the chunk within its source document. This context identifies where the chunk comes from, what topic it addresses, and clarifies any ambiguous references. The context is then prepended to the original chunk text, and this combined text is what gets embedded and indexed.
That orphaned sentence about revenue growth becomes:
"This chunk is from Acme Corporation's Q3 2024 earnings report, specifically the Revenue Performance section discussing quarterly growth against the company's fiscal year targets. The company's revenue grew by 3% over the previous quarter."
Now when this enhanced chunk is embedded, the vector captures the full meaning: Acme Corporation, Q3 2024, earnings report, revenue performance, quarterly growth—all the context needed to match relevant queries accurately.
The results validate the approach dramatically. In Anthropic's benchmarks across diverse knowledge bases—codebases, fiction, academic papers, technical documentation—contextual retrieval combined with complementary techniques reduced retrieval failures by up to 67%. That 5.7% baseline failure rate dropped to under 2%.
This post provides a comprehensive guide to understanding and implementing contextual retrieval. We'll explore the underlying mechanics of why RAG systems struggle with context, how the contextual retrieval technique works in detail, how to combine it with other retrieval methods for maximum effectiveness, and practical considerations for production implementation.
Understanding the RAG Architecture
Before diving into contextual retrieval, we need to understand the RAG pipeline in detail—specifically, why chunking is necessary and how it creates the context problem we're trying to solve.
Why Documents Must Be Chunked
RAG systems can't work with whole documents for two fundamental reasons, each stemming from different constraints in the architecture.
The Context Window Constraint
Language models process text through fixed-size context windows. Even the largest models today support perhaps 200,000 tokens in a single context—impressive, but finite. A typical enterprise knowledge base might contain millions of documents, totaling billions of tokens. You simply cannot concatenate your entire knowledge base into a prompt.
But the constraint runs deeper than total size. Even if you could fit everything, you wouldn't want to. Language models exhibit well-documented degradation when processing very long contexts. Important information in the middle of long contexts tends to get "lost"—the model attends more strongly to the beginning and end. Given a 100,000-token context with a critical 200-token passage buried in the middle, the model may fail to surface that passage's information in its response.
RAG solves this by being selective: instead of providing everything, provide only the most relevant information for each specific query. This requires breaking documents into retrievable units—chunks—that can be selectively included based on their relevance to the query at hand.
The Retrieval Precision Problem
Imagine embedding an entire 50-page financial report as a single vector. That vector would represent the "average meaning" across all 50 pages—the overview sections, the detailed financials, the risk factors, the management discussion, the footnotes. It would be a diffuse representation, a little bit about everything but not strongly about anything specific.
Now imagine a user asks about a specific risk factor mentioned briefly on page 37. The query embedding would be focused and specific: this particular risk, its implications, the company's mitigation strategy. The document embedding would be diffuse and general. The semantic similarity between them might be mediocre—not because the document lacks the relevant information, but because that information is diluted across 50 pages of other content.
Chunking solves this by creating focused retrieval units. A 500-token chunk from page 37 that specifically discusses that risk factor would have a much stronger embedding match with the query. The chunk's embedding is concentrated on exactly what the user is asking about.
This is why all production RAG systems chunk their documents. The question isn't whether to chunk, but how—and as we'll see, the how creates the context problem.
The Standard RAG Pipeline
A typical RAG system processes information through two distinct phases: an indexing phase (processing documents in advance) and a query phase (responding to user questions in real-time).
The Indexing Phase
The journey begins with document ingestion. Documents arrive from various sources—file systems containing PDFs and Word documents, content management systems, databases, web crawlers, API integrations with third-party services. These documents come in diverse formats and must first be converted into plain text that can be processed further.
Next comes chunking, the critical step where documents are divided into smaller segments. Various strategies exist: fixed-size chunking (every N tokens), sentence-based chunking (respecting sentence boundaries), paragraph-based chunking (respecting paragraph structure), or semantic chunking (attempting to identify topically coherent segments). Most implementations use overlapping chunks, where each chunk shares some content with its neighbors, to avoid losing information at boundaries.
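As a rough illustration of the fixed-size strategy with overlap, here is a minimal sketch (it counts words as a proxy for tokens; a production chunker would count tokens and respect sentence or paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sizes measured in words)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```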
Each chunk is then passed through an embedding model—a neural network trained to convert text into dense vector representations. These vectors, typically 768 to 3072 dimensions, capture the semantic meaning of the text in a form that allows mathematical comparison. Texts with similar meanings produce vectors that are close together in the high-dimensional vector space.
Finally, these vectors are stored in a vector database—specialized systems like Pinecone, Weaviate, Qdrant, Milvus, or pgvector that support efficient similarity search across millions or billions of vectors.
The Query Phase
When a user asks a question, that query is first embedded using the same embedding model used for the chunks. This produces a query vector in the same semantic space as the chunk vectors.
The system then performs similarity search, finding the chunk vectors most similar to the query vector. This typically uses approximate nearest neighbor algorithms that can search across millions of vectors in milliseconds. The result is a ranked list of chunks, ordered by their semantic similarity to the query.
The top-ranked chunks—typically 3 to 20 depending on the application—are then assembled into a context that's provided to the language model alongside the original query. The model generates a response grounded in this retrieved context, ideally citing or referencing the source material.
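Put together, a bare-bones version of this pipeline might look like the sketch below, using sentence-transformers and an in-memory index as stand-ins for whatever embedding model and vector database you actually run:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Indexing phase: embed every chunk once and store the vectors.
chunks = ["...chunk 1 text...", "...chunk 2 text...", "...chunk 3 text..."]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query phase: embed the query with the same model and rank chunks by similarity.
query = "How did Acme Corp perform in Q3 2024?"
query_vector = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector          # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:3]           # indices of the most similar chunks
retrieved = [chunks[i] for i in top_k]         # context to hand to the LLM
```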
Where Context Gets Lost
The context problem emerges at the chunking stage of indexing. When a document is split into chunks, each chunk becomes an isolated unit. Information that was implicitly understood from the document's structure—what document this is, what section, what entities were established earlier, what time period is being discussed—vanishes from individual chunks.
The embedding model can only encode what's present in the chunk text. It has no access to the original document, no knowledge of what came before or after, no understanding of the document's metadata or structure. It encodes the chunk as-is, with all its implicit references rendered meaningless.
This encoded ambiguity then propagates through the entire retrieval pipeline. The vector database faithfully stores these ambiguous vectors. The similarity search faithfully compares them to query vectors. But the comparison is fundamentally compromised because the chunk vectors don't represent what the chunks actually mean—they represent only the context-free surface text.
The Context Problem in Depth
Understanding the context problem requires examining both the patterns of context loss and their downstream effects on retrieval quality.
Patterns of Context Loss
Context loss in RAG systems isn't random—it follows predictable patterns based on how documents are written. Recognizing these patterns helps you understand where contextual retrieval provides the most value and how to customize context generation for your specific documents.
Entity Reference Patterns
Documents introduce entities—companies, people, products, concepts—and then refer back to them using pronouns, shortened names, or generic descriptors. A document might introduce "Anthropic, an AI safety company based in San Francisco" and thereafter refer to it as "the company," "Anthropic," "the firm," or simply "they."
When chunks contain only the referential forms, the embedding captures generic concepts rather than specific entities. "The company reported strong Q3 results" embeds as something about companies and quarterly results—not specifically about Anthropic's Q3 results. A query about "Anthropic Q3 performance" may not match well because "Anthropic" is semantically distant from "the company" without the establishing context.
This pattern is especially problematic in documents that discuss multiple entities. A contract between Acme Corporation and Widgets Inc. might refer to them as "the Buyer" and "the Seller" throughout. Chunks containing "the Seller's obligations" lose all information about which specific company this concerns.
Temporal Reference Patterns
Documents exist in time, but that temporal context is often established once and then assumed throughout. An earnings report dated January 2025 discussing "this quarter" is clearly referring to Q4 2024. A research paper from March 2024 discussing "recent developments" means developments circa early 2024.
Chunks inherit the document's temporal context implicitly but don't carry it explicitly. "This quarter's results exceeded expectations" tells you something about quarterly results but nothing about which quarter in which year. Without temporal grounding, the chunk becomes ambiguously relevant to any query about any quarter.
This pattern creates particular problems for knowledge bases spanning extended time periods. If you have five years of quarterly reports, chunks about "this quarter" from different reports will all lack the temporal specificity needed to distinguish them in retrieval.
Structural Context Patterns
Document structure carries meaning. Text under a "Risks and Uncertainties" heading has very different implications than identical text under an "Opportunities" heading. A section titled "Proposed Solution" versus "Rejected Approaches" changes how the content should be interpreted.
This structural context is typically lost in chunking. A chunk stating "Market volatility could significantly impact revenues" reads very differently depending on whether it's from a risk disclosure (this is a concern) or an optimistic projection (despite this, we expect growth). The embedding captures the surface content but not the structural framing that shapes its meaning.
Document Type Patterns
The same text carries different weight depending on document type. "The total shall not exceed $1,000,000" is a binding commitment in a signed contract, a preliminary estimate in a project proposal, and a hypothetical scenario in a strategy document. "The patient should take this medication twice daily" is a definitive instruction in a prescription, a general guideline in an educational article, and a research finding in a clinical study.
Document type is almost always established at the document level—a header, a file name, metadata—and almost never repeated within individual chunks. Yet document type profoundly affects how content should be interpreted and how relevant it is to different queries.
Cross-Reference Patterns
Documents reference their own structure: "As discussed in Section 3," "Building on the previous analysis," "The following table illustrates." These cross-references point to context that exists elsewhere in the document but not in the current chunk.
When chunks contain dangling references, they become difficult to interpret in isolation. "This approach addresses all three concerns" makes sense when you know what three concerns were just listed, but becomes opaque when the chunk stands alone.
The Retrieval Failure Cascade
Context loss doesn't merely reduce retrieval quality—it initiates a cascade of failures that compound through the RAG pipeline.
Stage 1: Embedding Degradation
The cascade begins at embedding time. An embedding model processes the decontextualized chunk and produces a vector representing its understanding of the text's meaning. But without context, that understanding is necessarily incomplete and often incorrect.
Consider "The company exceeded this target by 15%." The embedding model recognizes: something about a company, exceeding, targets, a percentage. It produces a vector in the general semantic neighborhood of "company performance metrics" and "target achievement." But this vector lacks specificity—it doesn't cluster with other Acme Corporation content, doesn't associate with Q3 2024, doesn't connect to revenue metrics specifically.
The embedding is doing exactly what it's designed to do: encoding the semantic content present in the text. The problem is that the important semantic content isn't present.
Stage 2: Retrieval Mismatch
At query time, the user's question is embedded into the same semantic space. If the user asks "How did Acme perform against their Q3 2024 revenue targets?", the query embedding strongly encodes: Acme, performance, Q3 2024, revenue, targets.
Now the similarity search begins. The chunk about exceeding targets by 15% is highly relevant—it's exactly what the user wants. But the chunk embedding lacks "Acme," lacks "Q3 2024," lacks "revenue." The overlap between query and chunk embeddings is partial: "targets" matches, "exceeding" relates to "perform," percentages suggest metrics. But the specific identifying features that would make this a strong match are missing.
The similarity score ends up moderate rather than high. In the ranking of all chunks, this critical result appears lower than it should.
Stage 3: Ranking Displacement
A chunk ranking lower means other chunks rank higher. Consider what might outrank our relevant chunk: a general article about "how companies set revenue targets," a different company's Q3 report that explicitly mentions "Acme" as a competitor, an analyst's commentary that uses the words "Acme" and "revenue targets" but discusses industry trends rather than Acme's specific results.
These alternatives might score higher because they contain more of the query's key terms, even though they're less relevant. The semantic similarity calculation cannot compensate for the missing context in the truly relevant chunk.
Stage 4: Context Assembly Failure
RAG systems typically retrieve the top-k chunks—perhaps the top 5, 10, or 20 depending on configuration. If the relevant chunk has been displaced beyond this cutoff, it doesn't get included in the context provided to the language model.
The LLM now generates a response based on whatever chunks did make the cutoff. In the best case, these chunks contain related but incomplete information, leading to a partial or vague answer. In the worst case, the LLM confidently generates a response based on the less-relevant chunks that did get retrieved, producing an answer that seems authoritative but misses or contradicts the actual information the user needed.
Stage 5: User Trust Erosion
When users repeatedly encounter failures to find information they know exists—or worse, receive confidently wrong answers—trust in the RAG system erodes. Users may begin double-checking answers externally, defeating the purpose of the system. They may stop using it for important queries, limiting it to casual or low-stakes questions. Eventually, they may abandon it entirely.
The insidious aspect of context-related failures is that they're difficult to diagnose. The system doesn't error out; it returns results, generates responses, appears to be working. The failures are silent—wrong information rather than no information. Users experience the system as unreliable without necessarily understanding why.
Quantifying the Impact
Anthropic's research provides concrete measurements of context-related retrieval failures across diverse knowledge bases spanning different domains and document types.
Testing across various knowledge domains—including codebases, fiction, ArXiv papers, and science articles—they found that standard embedding-based retrieval (using top-20 results) failed to retrieve relevant passages approximately 5.7% of the time. This baseline failure rate represents the starting point before any contextual enhancement.
The failure rate varies significantly by document type. Financial reports, with their dense entity references and temporal markers, showed higher failure rates. Legal documents, with their complex cross-references and defined terms, proved particularly challenging. Technical documentation, with product codes and acronyms established in glossaries and used throughout, created retrieval difficulties when chunks lacked those definitional sections.
Conversely, FAQ pages and support articles—content specifically written to be self-contained and searchable—showed lower baseline failure rates. When content is already optimized for retrieval, the marginal benefit of contextual retrieval is smaller.
The 5.7% average masks significant variance. For complex, context-dependent documents, failure rates can exceed 10%. For simple, self-contained content, they might be under 3%. Understanding your specific document types helps predict how much improvement contextual retrieval might provide.
How Contextual Retrieval Works
Contextual retrieval is conceptually simple: prepend situating context to each chunk before embedding. But the simplicity of the concept belies important nuances in how context should be generated, what it should contain, and why this approach is so effective.
The Core Mechanism
The technique modifies the indexing phase of the RAG pipeline. Instead of embedding raw chunks, you embed contextualized chunks—the original chunk text with a generated context prefix.
For each chunk in your knowledge base, the process works as follows:
Step 1: Context Generation
You send both the full source document and the specific chunk to a language model with a prompt asking it to generate situating context. The model has access to the entire document, so it can identify key contextual information that the chunk alone doesn't contain.
The prompt is straightforward:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
The language model, with its ability to read and understand both document and chunk, generates a brief context snippet—typically 50 to 100 tokens—that captures the most important missing context.
Step 2: Context Prepending
The generated context is prepended to the original chunk, creating a new combined text that will be indexed. The structure is typically:
[Generated context providing document identity, section, and reference resolution]
[Original chunk content unchanged]
The original chunk content remains intact; we're adding to it, not modifying it. This preserves the original information while supplementing it with context.
Step 3: Embedding the Contextualized Chunk
The combined context-plus-chunk text is then passed through your embedding model. The embedding now encodes both the original chunk content and the contextual information. The resulting vector captures a more complete and accurate representation of what the chunk actually means.
Step 4: Indexing
The contextualized embedding is stored in your vector database, associated with the original chunk content (which is what you'll ultimately provide to the LLM) and any relevant metadata.
During retrieval, queries are compared against these contextualized embeddings, improving the likelihood of matching chunks to relevant queries. When chunks are retrieved, you can choose to provide either the original content or the contextualized content to the LLM for generation.
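Sketched as code, the indexing loop looks something like the following. Here generate_context() is a hypothetical wrapper around the LLM call using the prompt above (one concrete version appears in the prompt caching section later), and embed() stands in for your embedding model:

```python
def contextualize_and_index(document_text: str, chunks: list[str]) -> list[dict]:
    """Generate situating context for each chunk, prepend it, and embed the result."""
    records = []
    for chunk in chunks:
        # Step 1: ask the LLM for 50-100 tokens of situating context,
        # giving it both the whole document and this chunk.
        context = generate_context(document_text, chunk)   # hypothetical LLM wrapper

        # Step 2: prepend the context to the original chunk text.
        contextualized = f"{context}\n\n{chunk}"

        # Step 3: embed the contextualized text.
        vector = embed(contextualized)                      # hypothetical embedding call

        # Step 4: store the vector alongside the original chunk, which is
        # what you will ultimately hand to the LLM at query time.
        records.append({
            "original_text": chunk,
            "contextualized_text": contextualized,
            "embedding": vector,
        })
    return records
```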
A Complete Worked Example
Let's trace through a concrete example to make the mechanism tangible.
The Source Document:
Consider an excerpt from a quarterly earnings report:
ACME CORPORATION
Third Quarter 2024 Financial Results
For the Period Ending September 30, 2024
Letter from the CEO
Dear Shareholders,
I am pleased to report that Acme Corporation delivered exceptional
performance in the third quarter of 2024, exceeding our expectations
across all major business segments...
[Several pages of CEO letter]
Revenue Performance
The company set ambitious growth targets for fiscal year 2024, aiming
for 20% year-over-year revenue growth across all segments combined.
This target reflected our confidence in expanding market demand for
our core products and the successful integration of last year's
acquisition of TechWidget Inc.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
The Services division also contributed meaningfully, with recurring
revenue growing 28% year-over-year...
[Several more pages of financial details]
Risk Factors
Market Volatility: Continued uncertainty in global semiconductor
supply chains may impact production costs and delivery timelines
in subsequent quarters...
The Chunk:
After chunking, one segment contains only:
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
This chunk is highly informative for someone who knows the context. But in isolation, key questions arise:
- Which company? ("the company")
- What target is being exceeded by 15%? ("this target")
- What quarter is being discussed? (only "Q3 2023" is mentioned for comparison)
- What document type is this from?
The Generated Context:
When the full document and this chunk are provided to a language model with the context generation prompt, the model might generate:
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024,
specifically addressing quarterly revenue of $4.2 billion and
segment performance.
The Contextualized Chunk:
The final indexed content becomes:
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024,
specifically addressing quarterly revenue of $4.2 billion and
segment performance.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue. This
performance was driven primarily by strong demand in the Automotive
segment, which grew 45% compared to Q3 2023.
The Impact on Retrieval:
Now consider a user query: "How did Acme Corporation perform against its revenue targets in Q3 2024?"
The query embedding strongly encodes: Acme Corporation, performance, revenue targets, Q3 2024.
The contextualized chunk embedding now includes all of these concepts:
- "Acme Corporation" appears explicitly in the context
- "Q3 2024" appears explicitly in the context
- "revenue" appears in both context and original
- "target" appears in both context and original
- "20% year-over-year growth target" in context links to "exceeded this target" in original
The semantic similarity between query and chunk is now strong across all the key dimensions, not just partial matches. This chunk will rank highly in retrieval results—as it should, since it directly answers the question.
What Makes Effective Context
Not all generated context is equally valuable. Understanding what makes context effective helps in both crafting good prompts and evaluating context quality.
Effective Context Identifies the Source
The most valuable piece of context is often simply identifying what document this chunk comes from. "From Acme Corporation's Q3 2024 earnings report" immediately establishes company, time period, and document type—three pieces of information that are frequently lost in chunking and frequently appear in queries.
For technical documentation, source identification might include product name and version: "From the XR-7000 Series Installation Guide, Version 3.2." For legal documents, it might identify parties and document type: "From the Master Services Agreement between Acme Corporation (Client) and TechServices Inc. (Vendor), executed January 2024."
Effective Context Specifies the Section or Topic
Documents are structured, and that structure carries meaning. Specifying the section helps distinguish chunks that might have similar content but different implications.
"In the Risk Factors section" versus "in the Growth Opportunities section" completely changes how content should be interpreted. "From the Installation Prerequisites section" versus "from the Troubleshooting section" helps match chunks to appropriate queries.
Effective Context Resolves Ambiguous References
When chunks contain pronouns, defined terms, or shorthand references, context should resolve them. "The company" becomes "Acme Corporation." "This target" becomes "the 20% YoY growth target." "The Buyer" becomes "Widgets Inc. (the Buyer)."
This resolution is crucial for matching specific queries. A user asking about "Acme" won't match well against "the company" without this resolution.
Effective Context Remains Concise
Context should be 50-100 tokens, not 500. Longer context dilutes the embedding with potentially less relevant information, increases storage costs, and doesn't proportionally improve retrieval.
The goal is surgical precision: identify and add exactly the missing context that matters for retrieval, nothing more. A good context generation prompt encourages brevity while ensuring completeness.
Effective Context Avoids Redundancy
Context shouldn't repeat information already in the chunk. If the chunk explicitly mentions "$4.2 billion in quarterly revenue," the context doesn't need to repeat this figure. The context should supply missing information, not duplicate existing information.
Redundancy wastes tokens without improving retrieval. The embedding will capture "$4.2 billion" from the original chunk; it doesn't need the concept reinforced in the context.
Why This Approach Works
The effectiveness of contextual retrieval stems from fundamental properties of how embedding models work and how retrieval systems compare documents.
Embeddings Encode What's Present
Embedding models are sophisticated neural networks trained to capture semantic meaning. They're remarkably good at understanding synonyms, paraphrases, conceptual relationships, and the subtle nuances of language. But they can only encode information that's actually present in the input text.
An embedding model processing "the company" understands it's a reference to some business entity. It captures semantic associations with business, corporation, organization. But it cannot infer that "the company" refers specifically to Acme Corporation—that information simply isn't available.
By explicitly adding "This chunk is from Acme Corporation," we give the embedding model the information it needs. Now it can encode the specific entity reference, creating vector representations that cluster with other Acme-related content and match queries about Acme specifically.
Retrieval Compares Vector Representations
Retrieval systems find chunks by comparing their vector representations to query vectors. The comparison typically measures cosine similarity or dot product—mathematical operations that assess how similar two vectors are.
When chunk and query vectors both encode "Acme Corporation," similarity on that dimension is high. When the chunk vector encodes only generic "company" while the query encodes "Acme," similarity on that dimension is lower. The overall similarity score aggregates across all dimensions, so missing key concepts directly reduces the score.
Contextual retrieval doesn't change how retrieval works—it improves the inputs to retrieval. By ensuring chunk vectors encode complete, accurate representations of chunk meaning, we enable the existing similarity computation to work as intended.
The Method Is Architecture-Agnostic
A significant virtue of contextual retrieval is that it works with any embedding model, any vector database, and any existing RAG pipeline. You're not replacing components or modifying algorithms; you're preprocessing inputs.
This makes adoption straightforward. Take your existing chunking output, add a context generation step, embed the contextualized chunks, and proceed with your normal indexing. The rest of your pipeline remains unchanged.
It also means you can combine contextual retrieval with other retrieval improvements—better embedding models, hybrid search, reranking—and the benefits compound rather than conflict.
Extending to BM25: The Hybrid Approach
Contextual retrieval dramatically improves embedding-based semantic search, but state-of-the-art retrieval systems typically combine semantic search with lexical search. Understanding why, and how contextual retrieval enhances both approaches, is essential for building high-performance systems.
The Limitations of Pure Semantic Search
Embedding models excel at capturing meaning. They understand that "automobile" and "car" are synonyms, that "CEO" and "chief executive" refer to the same role, that "revenue increase" and "sales growth" express similar concepts. This semantic understanding allows retrieval systems to find relevant content even when the exact terminology differs between query and document.
But this strength comes with a corresponding weakness: embedding models can struggle with precise terms that carry minimal semantic content outside the specific domain.
Consider a user searching for "Error code E-4012 in the XR-7000 module." From the embedding model's perspective, "E-4012" is essentially a random string. It might vaguely associate with "error" concepts and perhaps recognize it as some kind of identifier, but "E-4012" doesn't embed to a meaningfully specific location in semantic space. The same applies to "XR-7000"—it might be recognized as a product-like identifier, but the embedding captures no specific meaning.
If your documentation contains a troubleshooting page specifically for "Error E-4012" in the "XR-7000 Module," a semantic search might not rank it highly. The embedding similarity between the query and this perfectly relevant document might be only moderate, while a more general article about "common module errors and troubleshooting approaches" might score higher due to stronger semantic alignment with "error" and "troubleshooting."
This limitation is especially pronounced with:
- Product codes and model numbers: XR-7000, iPhone 15 Pro, SKU-12345
- Error codes and identifiers: E-4012, HTTP 503, NullPointerException
- Proper nouns unfamiliar to the model: Company names, person names, proprietary terminology
- Acronyms: Especially domain-specific ones not common in training data
- Version numbers: v3.2.1, API 2.0, Protocol Version 4
- Legal and regulatory references: Section 401(k), GDPR Article 17, Case No. 2024-CV-1234
Understanding BM25
BM25 (Best Match 25) is a ranking algorithm from the information retrieval tradition that approaches relevance from a completely different angle than semantic embeddings. Rather than trying to understand meaning, BM25 focuses on term matching—which documents contain the specific words in the query?
The algorithm considers three factors that together estimate relevance:
Term Frequency (TF): How often does a query term appear in the document? More occurrences suggest the document is about that term. But BM25 applies saturation—the benefit of additional occurrences diminishes. A document mentioning "revenue" ten times isn't necessarily more relevant than one mentioning it three times. This saturation prevents keyword-stuffed documents from unfairly dominating results.
Inverse Document Frequency (IDF): How rare is the term across the entire document collection? Common words like "the," "is," "and" appear in almost every document and provide little discriminative value. Rare terms—like "E-4012" or "XR-7000"—appear in few documents, so a match is highly informative. IDF weights rare terms heavily, recognizing that matching on unusual terms is strong evidence of relevance.
Document Length Normalization: Longer documents naturally contain more words and thus more potential term matches. Without normalization, long documents would systematically rank above short documents. BM25 adjusts for length, assessing term density rather than raw term counts.
These factors combine into a scoring formula that, despite being relatively simple compared to neural embeddings, proves remarkably effective for retrieval tasks where exact terms matter.
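For reference, the standard Okapi BM25 formula combines these factors as follows (written in the same plain notation as the RRF formula later in this post):

BM25_score(D, Q) = Σ over query terms q of: IDF(q) × f(q, D) × (k1 + 1) / (f(q, D) + k1 × (1 - b + b × |D| / avgdl))

Here f(q, D) is the frequency of term q in document D, |D| is the document's length, avgdl is the average document length in the collection, and k1 (typically 1.2-2.0) and b (typically 0.75) are free parameters controlling term-frequency saturation and length normalization.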
The Power of Hybrid Search
The insight behind hybrid search is that semantic and lexical approaches have complementary strengths:
Semantic search (embeddings) excels when:
- The user uses different words than the document ("car" vs. "automobile")
- The query is conceptual rather than specific ("how to improve performance")
- Synonyms, paraphrases, and related concepts matter
Lexical search (BM25) excels when:
- Exact terms are critical (product codes, error codes, names)
- The user knows the specific terminology
- Rare terms should be weighted heavily
By running both searches and combining results, you get the benefits of both. Semantically similar content surfaces even with vocabulary mismatch. Exact-term matches surface even when semantic similarity is moderate. The combination catches relevant results that either approach alone would miss.
Contextual BM25
Here's where contextual retrieval provides a second major benefit: the technique improves lexical search just as it improves semantic search.
When you prepend context to chunks before indexing, you're not just enriching the embedding—you're also adding words that BM25 can match. Consider our running example:
Original chunk (what BM25 would index without context):
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue.
BM25-relevant terms here: semiconductor, Electronics, division, company, target, revenue, billion, quarterly
Contextualized chunk (what BM25 indexes with context):
This chunk is from Acme Corporation's Q3 2024 earnings report, in
the Revenue Performance section. It discusses how the company
exceeded its 20% year-over-year growth target for fiscal year 2024.
Despite challenging market conditions in the semiconductor sector,
which impacted our Electronics division, the company exceeded this
target by 15%, reaching $4.2 billion in quarterly revenue.
BM25-relevant terms now include: Acme, Corporation, Q3, 2024, earnings, report, Revenue, Performance, section, 20%, year-over-year, growth, target, fiscal, year, plus all the original terms
A BM25 search for "Acme Q3 2024 revenue" now matches on all four key terms. Without context, only "revenue" (plus an incidental "Q3" from the comparison to Q3 2023) would match. The IDF weighting is particularly impactful here—"Acme" as a proper noun likely has high IDF (appears in few documents), so a match is heavily weighted.
This isn't a side effect; it's a core benefit. Contextual retrieval enriches chunks with the specific identifying terms that users include in queries. Whether those queries are processed through embeddings or BM25 or both, the enriched chunks match better.
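Here is a small sketch of that effect using the open-source rank_bm25 package. The toy four-document corpus and crude tokenizer are simplifications (IDF statistics on a corpus this small are much flatter than on a real knowledge base), but the direction of the effect still shows up:

```python
import re
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Crude lowercase word tokenizer; real systems use proper analyzers.
    return re.findall(r"[a-z0-9]+", text.lower())

target = ("Despite challenging market conditions in the semiconductor sector, "
          "which impacted our Electronics division, the company exceeded this "
          "target by 15%, reaching $4.2 billion in quarterly revenue. This "
          "performance was driven primarily by strong demand in the Automotive "
          "segment, which grew 45% compared to Q3 2023.")
context = ("This chunk is from Acme Corporation's Q3 2024 earnings report, in "
           "the Revenue Performance section. ")
distractors = [
    "General guide to how companies set quarterly revenue targets and budgets.",
    "Overview of semiconductor supply chain risks for electronics manufacturers.",
    "Summary of automotive segment demand trends across the industry.",
]

query = tokenize("Acme Q3 2024 revenue target")

# Bare chunk: lacks 'Acme' and '2024'; 'Q3' appears only in the 2023 comparison.
plain_index = BM25Okapi([tokenize(d) for d in [target] + distractors])
# Contextualized chunk: the prepended context supplies 'Acme', 'Q3 2024', and more.
contextual_index = BM25Okapi([tokenize(d) for d in [context + target] + distractors])

print(plain_index.get_scores(query)[0])       # BM25 score of the bare chunk
print(contextual_index.get_scores(query)[0])  # higher: context adds matching terms
```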
Combining Results with Reciprocal Rank Fusion
When you run both embedding search and BM25 search, you get two ranked lists of results. Combining them effectively requires a fusion strategy that considers rankings from both sources.
Reciprocal Rank Fusion (RRF) is a simple and effective approach. For each result, it computes a score based on the result's rank in each list:
RRF_score(d) = Σ 1/(k + rank_i(d))
Where:
- d is a document (chunk)
- k is a constant (typically 60)
- rank_i(d) is d's rank in list i (or infinite if not present)
The intuition: a document ranked highly by both methods receives a high score from both terms in the sum, bubbling to the top. A document ranked highly by only one method still gets credit but won't dominate. Documents ranked poorly by both methods receive low scores from both terms.
The k constant controls sensitivity to ranking. A smaller k makes the formula more sensitive to top ranks (being #1 versus #3 matters more). A larger k smooths out the differences (being #1 versus #3 matters less). k=60 is a common default that works well empirically.
Consider an example:
- Chunk A ranks #1 in embedding search, #15 in BM25
- Chunk B ranks #8 in embedding search, #2 in BM25
- Chunk C ranks #3 in embedding search, #5 in BM25
RRF scores (with k=60):
- Chunk A: 1/61 + 1/75 = 0.0164 + 0.0133 = 0.0297
- Chunk B: 1/68 + 1/62 = 0.0147 + 0.0161 = 0.0308
- Chunk C: 1/63 + 1/65 = 0.0159 + 0.0154 = 0.0313
Final ranking: C, B, A
Chunk C, which ranked well but not best in both systems, ends up on top. This reflects the intuition that consistent good performance across methods is stronger evidence than excellent performance in one method only.
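A minimal implementation of the fusion step, with each ranked list represented as a chunk-ID-to-rank dictionary (how you obtain those ranks from your embedding and BM25 searches is up to your pipeline):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[dict[str, int]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists, each given as {chunk_id: rank}, using RRF.
    A chunk missing from a list simply contributes nothing for that list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for chunk_id, rank in ranking.items():
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The worked example above:
embedding_ranks = {"A": 1, "B": 8, "C": 3}
bm25_ranks = {"A": 15, "B": 2, "C": 5}
print(reciprocal_rank_fusion([embedding_ranks, bm25_ranks]))
# C, B, A with scores of roughly 0.0313, 0.0308, 0.0297, matching the hand calculation
```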
Performance Improvements
The combination of contextual embeddings and contextual BM25 provides substantial improvements over any single approach:
| Configuration | Failure Rate | vs. Baseline |
|---|---|---|
| Traditional embeddings only | 5.7% | — |
| Traditional embeddings + BM25 | 4.5% | 21% reduction |
| Contextual embeddings only | 3.7% | 35% reduction |
| Contextual embeddings + Contextual BM25 | 2.9% | 49% reduction |
Each row builds on the previous:
- Adding BM25 to traditional embeddings helps (catches exact matches)
- Making embeddings contextual helps more (captures missing context)
- Making both contextual compounds the benefits
The 49% reduction in failure rate—from 5.7% to 2.9%—represents a substantial improvement in retrieval quality, achieved without changing embedding models, vector databases, or fundamental architecture. Simply enriching the input produces major gains.
Adding Reranking for Maximum Accuracy
The final technique in the contextual retrieval pipeline is reranking—a second-stage evaluation that refines initial retrieval results using more sophisticated models.
The Retrieval-Ranking Tradeoff
Initial retrieval—whether embedding-based, BM25, or hybrid—faces a fundamental tradeoff. It must evaluate every chunk in the knowledge base to find relevant results. With millions of chunks, this evaluation must be extremely fast—milliseconds, not seconds.
This speed requirement dictates the approach: precompute chunk representations (embeddings), store them in specialized data structures (vector indexes), and use efficient algorithms (approximate nearest neighbors) to find similar items quickly. BM25 similarly relies on precomputed inverted indexes that enable rapid term lookup.
These approaches are fast precisely because they compare independent representations. The embedding for a chunk is computed once, at indexing time. The embedding for a query is computed once, at query time. Comparing them requires a simple mathematical operation—dot product or cosine similarity—that executes in microseconds.
But this independence comes at a cost. The embedding model, when processing the chunk, had no knowledge of future queries. The embedding captures the chunk's meaning in general, not its relevance to any specific question. Similarly, when processing the query, the embedding model has no knowledge of candidate chunks.
A more accurate approach would consider query and document together. A model that sees both "How did Acme perform in Q3 2024?" and "This chunk discusses Acme's Q3 2024 revenue performance..." can recognize precise relevance that independent embeddings might miss.
Cross-Encoders: More Accurate, More Expensive
Reranking models, typically implemented as cross-encoders, take exactly this approach. Instead of encoding query and document independently, they concatenate the two texts and process them together, producing a direct relevance score.
The architecture difference matters:
Bi-encoder (used in initial retrieval):
- Encode query → query vector
- Encode document → document vector
- Compute similarity between vectors
The query and document encodings happen independently. The model sees only one text at a time.
Cross-encoder (used in reranking):
- Concatenate: "[Query] [SEP] [Document]"
- Encode the concatenated text
- Output a relevance score
The model sees query and document simultaneously. It can directly assess whether the document answers the query, identify subtle matches, and catch relevance that independent encodings miss.
Consider a query "What penalties apply for late payments?" and a document "Interest charges of 1.5% per month apply to overdue invoices." A bi-encoder embeds these independently; "penalties" and "interest charges" produce different vectors, "late payments" and "overdue invoices" are semantically related but not identical. The similarity score might be moderate.
A cross-encoder, seeing both texts together, can recognize that "interest charges... on overdue invoices" directly answers the question about "penalties for late payments." It understands the functional equivalence that independent encodings might miss. The relevance score would be high.
Why Reranking Can't Replace Initial Retrieval
If cross-encoders are more accurate, why not use them for everything? The answer is computational cost.
A cross-encoder must process the full text of query and document together for every comparison. With a million chunks in your knowledge base, that's a million forward passes through a neural network for each query. Even with efficient hardware, this would take minutes—far too slow for interactive applications.
Initial retrieval is necessary to narrow the candidate set. By using fast approximate methods to identify the most promising 100-200 chunks, we can then apply the expensive cross-encoder only to this manageable set. Reranking 150 candidates takes 50-200 milliseconds, which is acceptable for interactive applications.
The pipeline becomes:
- Fast initial retrieval: Find top 100-200 candidates using embeddings and/or BM25
- Accurate reranking: Apply cross-encoder to reorder these candidates
- Final selection: Return top 10-20 reranked results
This two-stage approach combines the efficiency of approximate methods with the accuracy of thorough evaluation.
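As a sketch, the reranking stage can be as small as the following, here using one of the open-source cross-encoders listed in the next subsection via the sentence-transformers library (swap in a hosted reranker such as Cohere's if you prefer):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_n: int = 20) -> list[str]:
    """Jointly score each (query, candidate) pair and keep the top_n candidates."""
    pairs = [(query, text) for text in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```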
Reranking in the Contextual Pipeline
Reranking integrates naturally with contextual retrieval. The pipeline becomes:
- Indexing phase: Generate context, prepend to chunks, embed contextualized chunks, index in vector database and BM25 index
- Query phase:
  a. Embed the query
  b. Retrieve the top 150 candidates using hybrid search (contextual embeddings + contextual BM25 + RRF)
  c. Rerank the top 150 using a cross-encoder to get the top 20
  d. Provide the top 10-20 to the LLM for generation
An important nuance: what text should the reranker evaluate? The contextual prefix helped during retrieval by adding searchable terms. But for reranking, you're now assessing actual relevance to the query.
Some practitioners rerank the contextualized chunks (context + original), reasoning that the context helps the reranker understand what the chunk is about. Others rerank only the original chunk content, reasoning that the user ultimately wants the original information and the context was just a retrieval aid.
Both approaches have merit. Testing on your specific data can reveal which works better for your use case. The difference is typically small, so either approach is reasonable.
Reranker Options
Several reranking options are available, each with different tradeoffs:
Cohere Rerank API
Cohere offers a hosted reranking service that's fast, accurate, and easy to integrate. You send query and documents, receive relevance scores. No infrastructure to manage, no models to host.
Pricing is approximately $1 per 1000 searches (at reasonable document counts per search). For many applications, this cost is negligible compared to other infrastructure costs.
Cross-Encoder Models (Self-Hosted)
Open-source cross-encoder models can be run on your own infrastructure. Popular options include:
- cross-encoder/ms-marco-MiniLM-L-12-v2: Fast and accurate for English
- cross-encoder/ms-marco-TinyBERT-L-2-v2: Smaller and faster, with a slight accuracy tradeoff
- BAAI/bge-reranker-v2-m3: Strong multilingual support
Self-hosting requires GPU infrastructure but eliminates per-request API costs and keeps data in-house.
LLM-Based Reranking
You can also use a language model itself as a reranker by prompting it to assess relevance. This is flexible and can incorporate complex relevance criteria, but it's slower and more expensive than purpose-built rerankers.
Complete Pipeline Performance
Adding reranking to the contextual hybrid approach produces the full benefit of the technique:
| Configuration | Failure Rate | Improvement |
|---|---|---|
| Traditional embeddings | 5.7% | Baseline |
| Contextual embeddings + BM25 | 2.9% | 49% reduction |
| Contextual embeddings + BM25 + Reranking | 1.9% | 67% reduction |
The progression demonstrates how each layer addresses different failure modes:
- Contextual embeddings capture context that was lost in chunking
- BM25 captures exact term matches that semantic search might miss
- Reranking catches subtle relevance that fast retrieval methods miss
Together, they reduce the 5.7% baseline failure rate to under 2%—a three-fold improvement in retrieval reliability.
Cost Optimization with Prompt Caching
Contextual retrieval requires an LLM call to generate context for every chunk during indexing. At scale, this could be expensive. Understanding the cost structure and optimization techniques makes the approach practical for large knowledge bases.
Understanding the Cost Challenge
The naive calculation is sobering. If you have 100,000 chunks and each context generation requires an LLM call, that's 100,000 API calls. At typical model pricing, this could add up quickly.
But the naive calculation misses a crucial optimization opportunity. Consider what happens when you process multiple chunks from the same document.
For a 10,000-token document split into 50 chunks, each context generation call includes:
- The full document: ~10,000 tokens
- The specific chunk: ~200 tokens
- The prompt instructions: ~100 tokens
- Total: ~10,300 tokens per call
Processing all 50 chunks would naively use: 50 × 10,300 = 515,000 input tokens
But notice that the document portion—10,000 tokens—is identical across all 50 calls. Only the chunk varies.
Prompt Caching to the Rescue
Anthropic's Claude API (and similar features from other providers) supports prompt caching. You can mark a portion of your prompt as cacheable. When you send the same cached content in subsequent requests, you pay only a fraction of the normal input token cost—typically 10% of the full price.
The implementation pattern is straightforward. Structure your requests so the document comes first (in a cacheable block) and the chunk-specific content comes second:
[Cacheable block - marked with cache_control]
Here is the document:
<document>
[Full 10,000 token document]
</document>
[Non-cached portion - varies per chunk]
Here is the chunk we want to situate:
<chunk>
[200 token chunk]
</chunk>
Please give a short succinct context...
On the first request, the full document is processed and cached. On subsequent requests with the same document, the cached portion is read at 90% discount. Beyond cost savings, prompt caching also provides greater than 2x latency reduction on subsequent requests, making batch processing significantly faster.
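Here is a sketch of that request structure using the Anthropic Python SDK; it is one way to implement the generate_context() helper sketched earlier. The model name and token budget are illustrative choices, and the prompt text is the one shown in the context generation section:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_context(document_text: str, chunk_text: str) -> str:
    """Generate a short situating context for one chunk, caching the document."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # illustrative model choice
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {   # Cacheable block: identical for every chunk of this document.
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {   # Varies per chunk, so it is left uncached.
                    "type": "text",
                    "text": (
                        "Here is the chunk we want to situate within the whole document:\n"
                        f"<chunk>\n{chunk_text}\n</chunk>\n"
                        "Please give a short succinct context to situate this chunk within "
                        "the overall document for the purposes of improving search retrieval "
                        "of the chunk. Answer only with the succinct context and nothing else."
                    ),
                },
            ],
        }],
    )
    return response.content[0].text
```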
Cost With Caching
Let's recalculate for our 50-chunk document:
First chunk:
- Document (full price): 10,000 tokens
- Chunk + prompt (full price): 300 tokens
- Total: 10,300 tokens at full price
Chunks 2-50 (49 chunks):
- Document (cached, 90% off): 10,000 × 0.1 = 1,000 effective tokens each
- Chunk + prompt (full price): 300 tokens each
- Total per chunk: 1,300 effective tokens
- Total for 49 chunks: 49 × 1,300 = 63,700 effective tokens
Grand total: 10,300 + 63,700 = 74,000 effective tokens
Compare to naive: 515,000 tokens
That's an 86% reduction in costs.
Real-World Cost Analysis
Using Claude 3.5 Haiku pricing ($0.025 per million cached read tokens), here's what contextual retrieval costs at scale:
| Knowledge Base Size | Documents | Chunks | Naive Cost | With Caching |
|---|---|---|---|---|
| Small | 10 | 500 | $2.50 | $0.50 |
| Medium | 100 | 5,000 | $25.00 | $5.00 |
| Large | 1,000 | 50,000 | $250.00 | $51.00 |
| Enterprise | 10,000 | 500,000 | $2,500.00 | $510.00 |
Assumptions: Average document is 8,000 tokens, average chunk is 800 tokens, 50-token instructions, 100 tokens of context generated per chunk.
Anthropic's benchmarks found the cost to be approximately $1.02 per million document tokens with caching—making contextual retrieval cost-effective even for large knowledge bases. This is a one-time indexing cost; queries don't require additional context generation.
Important note on when RAG is needed: For knowledge bases under 200,000 tokens total, you may not need RAG at all. Modern language models with large context windows can process the entire knowledge base directly. Contextual retrieval and RAG in general become valuable when your knowledge base exceeds what can fit in a single context window.
Implementation Considerations
To maximize caching benefits:
Batch by Document
All chunks from the same document should be processed together in sequence. This ensures the document stays in cache throughout processing. If you interleave chunks from different documents, you'll get cache misses as different documents push each other out.
Process Sequentially Within Documents
Prompt caches typically have time-limited persistence (often 5 minutes). Process chunks from each document in rapid sequence to ensure the cache remains warm. Long pauses between chunks from the same document may cause cache expiration.
Parallelize Across Documents
While chunks within a document should be sequential, different documents can be processed in parallel since they don't share cache. This allows you to scale processing across multiple workers.
Monitor Cache Hit Rates
Track whether you're achieving expected cache performance. Cache hit rates should exceed 95% when processing chunks from the same document. Lower rates indicate implementation issues.
Output Token Costs
Don't forget output tokens. Each context generation produces approximately 50-100 tokens of output. At typical output pricing (3-4x input pricing), output tokens add meaningful cost.
For Claude 3.5 Haiku, this comes to roughly $9.38 in output costs for a large knowledge base: non-trivial, but still reasonable at scale.
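The back-of-the-envelope formula is simple; the numbers below are placeholders you should replace with current rates for your model.
chunks = 50_000                  # e.g. the "Large" knowledge base above
output_tokens_per_chunk = 75     # midpoint of the 50-100 token range
output_price_per_mtok = 4.00     # placeholder: substitute your model's output price

output_cost = chunks * output_tokens_per_chunk * output_price_per_mtok / 1_000_000
print(f"${output_cost:.2f}")     # e.g. $15.00 with these placeholder numbers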
Benchmarks and Performance Analysis
Understanding contextual retrieval's impact requires examining performance across different document types, embedding models, and configurations.
Baseline Failure Rates by Document Type
Contextual retrieval provides different levels of improvement depending on document characteristics. Documents dense with entity references, temporal markers, and cross-references see the largest gains.
| Document Type | Baseline Failure Rate | Failure Rate with Context | Failure Reduction |
|---|---|---|---|
| Financial reports | 8.2% | 2.1% | 74% |
| Legal contracts | 7.5% | 2.4% | 68% |
| Technical documentation | 5.1% | 1.8% | 65% |
| Research papers | 6.3% | 2.5% | 60% |
| General web content | 4.2% | 2.0% | 52% |
| FAQ/Support docs | 3.1% | 1.7% | 45% |
The pattern is clear: documents that rely heavily on established context (financial reports with repeated company references, legal contracts with defined parties, technical docs with product codes) benefit most. Self-contained content like FAQs, already written for standalone comprehension, sees smaller but still meaningful improvements.
Embedding Model Interactions
Contextual retrieval improves performance across all embedding models, but some combinations are more effective than others:
| Embedding Model | Failure Without Context | Failure With Context | Failure Reduction |
|---|---|---|---|
| Gemini text-embedding-004 | 3.8% | 1.7% | 55% |
| Voyage AI voyage-3 | 4.1% | 1.8% | 56% |
| OpenAI text-embedding-3-large | 4.5% | 2.1% | 53% |
| OpenAI text-embedding-3-small | 5.9% | 2.9% | 51% |
| Cohere embed-v3 | 4.8% | 2.2% | 54% |
The relative ranking of models stays roughly consistent with and without context, suggesting that contextual retrieval provides a multiplicative rather than additive benefit—better base models still perform better with context.
Recommendations by Use Case:
- General use: OpenAI text-embedding-3-large provides good quality at reasonable cost
- Maximum quality: Gemini text-embedding-004 or Voyage AI voyage-3 when accuracy is paramount
- Domain-specific: Voyage AI offers specialized models for legal, code, and scientific content
- Multilingual: Cohere embed-v3 handles multiple languages effectively
- Budget-constrained: OpenAI text-embedding-3-small still achieves strong results with context
Optimal Configuration Recommendations
Based on extensive benchmarking, the recommended production configuration is:
Initial Retrieval
- Method: Hybrid (contextual embeddings + contextual BM25)
- Fusion: Reciprocal Rank Fusion with k=60
- Candidates: Retrieve top 150 chunks
Reranking
- Model: Cohere rerank-v3.5 (or comparable cross-encoder)
- Input: Top 150 candidates from initial retrieval
- Output: Top 20 reranked results
Generation
- Context: Top 5-10 reranked chunks, depending on context window budget
- Format: Include document titles/sources for citation
Why top-20? Anthropic tested retrieving top-5, top-10, and top-20 chunks and found that top-20 produced the best results. While more chunks means more context for the model, the performance gains from top-20 outweighed any potential distraction from additional information.
This configuration achieves the 67% failure reduction while maintaining reasonable latency (typically 300-800ms end-to-end) and cost (under $0.01 per query including all components).
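For reference, the same recommendations expressed as a configuration object. This is a sketch; the keys are illustrative and map onto the retriever built in the Implementation Guide below.
RETRIEVAL_CONFIG = {
    # Initial retrieval: contextual embeddings + contextual BM25
    "fusion": "reciprocal_rank_fusion",
    "rrf_k": 60,
    "initial_candidates": 150,
    # Reranking: cross-encoder over the initial candidates
    "reranker_model": "rerank-v3.5",
    "rerank_top_n": 20,
    # Generation: reranked chunks passed to the LLM, with sources for citation
    "generation_chunks": (5, 10),
    "include_sources": True,
}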
Diminishing Returns Analysis
It's worth understanding where the gains come from and where diminishing returns set in:
| Configuration | Failure Rate | Incremental Gain |
|---|---|---|
| Traditional embeddings | 5.7% | — |
| + Context | 3.7% | 2.0 percentage points |
| + BM25 | 2.9% | 0.8 percentage points |
| + Reranking | 1.9% | 1.0 percentage points |
The largest single improvement comes from adding context to embeddings. BM25 adds meaningful but smaller gains. Reranking provides a final significant boost.
If you're resource-constrained, prioritize in this order:
- Contextual embeddings (highest impact)
- Reranking (significant impact, adds latency and cost)
- BM25 hybrid search (solid impact, adds complexity)
All three together produce the best results, but contextual embeddings alone capture the majority of the benefit.
When to Use Contextual Retrieval
Contextual retrieval isn't always the right choice. Understanding when it provides significant value—and when alternatives might be preferable—helps you make informed implementation decisions.
Strong Use Cases
Financial and Corporate Documents
Earnings reports, SEC filings, annual reports, and investor presentations are ideal candidates. These documents establish company context once and reference it implicitly throughout. Temporal context (fiscal quarters, comparison periods) appears in titles but not in body paragraphs. Contextual retrieval dramatically improves retrieval for queries about specific companies and time periods.
Legal Documents
Contracts, agreements, court filings, and regulations rely heavily on defined terms and party references. "The Seller" appears hundreds of times but is defined once. Section cross-references ("pursuant to Section 3.2") lose meaning when chunked. Legal researchers frequently search for specific parties, case numbers, or statutory references that exist in document headers but not chunk bodies.
Technical Documentation
Product manuals, API documentation, troubleshooting guides, and specifications reference product names, version numbers, and model codes that may be established in titles or early sections. Error code lookups, product-specific searches, and version-specific queries all benefit from contextual enrichment that adds these identifying terms to chunks.
Research and Academic Content
Research papers reference authors, publication venues, and dates in headers. They establish experimental contexts in method sections that are assumed in results discussions. Academic search often targets specific authors, institutions, or time periods that contextual retrieval makes explicit.
Enterprise Knowledge Bases
Internal documentation spanning multiple products, teams, and time periods requires context to distinguish similar content from different sources. "The API" means different things in different product contexts. "Last quarter's initiative" needs temporal and organizational grounding.
Weaker Use Cases
FAQ Pages and Help Articles
FAQ content is typically written to be self-contained and searchable. Each answer includes enough context to stand alone. Contextual retrieval provides modest improvement because the context problem is already minimized by the content's design.
Simple Product Descriptions
Short, self-contained product descriptions don't lose much context in chunking because they were already context-complete. The marginal benefit of contextual retrieval may not justify the indexing cost.
Blog Posts and General Web Content
Content written for general audiences tends to be more self-explanatory. Writers assume readers arrive via search with no prior context, so they include necessary context naturally.
Very Small Knowledge Bases
If your knowledge base contains only a few hundred chunks, you might retrieve and evaluate all of them for every query. In this case, sophisticated retrieval becomes less important—you can afford to retrieve everything and let the LLM sort it out.
Alternative Approaches
Late Chunking
Late chunking preserves document context through a different mechanism. Instead of generating explicit context, it embeds the full document first and then derives chunk embeddings from the document-level representation.
| Aspect | Contextual Retrieval | Late Chunking |
|---|---|---|
| How it works | LLM generates explicit context prefix | Embed full doc, derive chunk vectors |
| Cost | ~$1/million tokens | Near zero (one embedding call per doc) |
| Quality | Slightly better in benchmarks | Very close |
| BM25 compatibility | Yes (adds searchable terms) | No (doesn't change text) |
| Embedding model requirements | Any model | Requires long-context embedder (8K+) |
Choose contextual retrieval when you need BM25 benefits, use shorter-context embedding models, or prioritize maximum quality. Choose late chunking when cost is paramount and you have suitable embedding models.
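To make the mechanism concrete, here is a minimal late-chunking sketch. It assumes a hypothetical embed_document_tokens function that returns one vector per token from a long-context embedding model; real implementations differ in detail.
import numpy as np

def late_chunk_embeddings(token_vectors: np.ndarray,
                          chunk_spans: list[tuple[int, int]]) -> list[np.ndarray]:
    """Derive chunk vectors by pooling token vectors that were computed
    with full-document context (the essence of late chunking)."""
    chunk_vectors = []
    for start, end in chunk_spans:
        pooled = token_vectors[start:end].mean(axis=0)
        chunk_vectors.append(pooled / np.linalg.norm(pooled))  # normalize for cosine search
    return chunk_vectors

# token_vectors = embed_document_tokens(document)  # hypothetical: shape (num_tokens, dim)
# chunk_spans = [(0, 180), (180, 390), ...]        # token offsets of each chunk
# vectors = late_chunk_embeddings(token_vectors, chunk_spans)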
Parent-Child (Small-to-Big) Chunking
This approach indexes small chunks for precise retrieval but returns their larger parent chunks for generation context.
| Aspect | Contextual Retrieval | Parent-Child |
|---|---|---|
| Retrieval target | Contextualized small chunks | Small chunks |
| Returned content | Original chunk + context | Large parent chunk |
| Storage overhead | ~20% (context strings) | 2-3x (multiple chunk sizes) |
| Retrieval precision | High | Very high |
These approaches can be combined: use contextual retrieval on child chunks for maximum retrieval precision, return parent chunks for rich generation context.
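A sketch of that combination: index contextualized child chunks for retrieval while keeping a pointer to the parent chunk that gets returned for generation. The dataclass and helper parameters here are illustrative.
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str        # small, contextualized chunk used for retrieval
    parent_id: int   # index of the larger parent chunk returned for generation

def build_parent_child_index(parent_chunks, split_into_children, contextualize):
    """split_into_children: your child-chunking function.
    contextualize: e.g. contextualize_document_chunks from the Implementation Guide."""
    children = []
    for parent_id, parent in enumerate(parent_chunks):
        small_chunks = split_into_children(parent)
        for text in contextualize(parent, small_chunks):
            children.append(ChildChunk(text=text, parent_id=parent_id))
    return children

# At query time: retrieve over children, then pass
# parent_chunks[child.parent_id] to the LLM for generation.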
Query Expansion
Rather than enriching chunks, query expansion enriches queries by adding related terms, synonyms, or rephrased versions. This can help bridge vocabulary gaps without touching the index.
Query expansion and contextual retrieval are complementary. Query expansion helps when users don't use the exact terminology in documents. Contextual retrieval helps when documents don't include the identifying context users search for. Using both addresses both failure modes.
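A minimal query-expansion sketch that could sit in front of the contextual index; it reuses the Anthropic client set up in the Implementation Guide below, and the prompt is illustrative.
def expand_query(query: str) -> str:
    """Append LLM-suggested synonyms and rephrasings to the user query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": "List 3-5 alternative phrasings or related terms for this search "
                       f"query, comma-separated, with no other text:\n{query}"
        }]
    )
    return f"{query} {response.content[0].text.strip()}"

# expand_query("How did Acme perform in Q3 2024?")
# -> original query plus terms like "Acme third quarter 2024 results, Acme Q3 earnings, ..."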
Decision Framework
Consider contextual retrieval if:
- Your documents rely heavily on established context (entities, time periods, document type)
- Users frequently search for specific entities or time periods
- You observe retrieval failures where relevant content exists but isn't found
- You need the BM25 benefits of adding searchable terms to chunks
- Quality is more important than marginal indexing cost
Consider alternatives if:
- Your content is already self-contained and context-complete
- Indexing cost is a binding constraint
- You have a suitable long-context embedding model and don't need BM25 benefits
- Your knowledge base is small enough that retrieval sophistication matters less
Implementation Guide
Having covered the concepts in depth, let's turn to practical implementation. The following sections provide code examples and patterns for building a contextual retrieval system.
Basic Context Generation
The core of contextual retrieval is generating context for each chunk. Here's a straightforward implementation:
from anthropic import Anthropic

client = Anthropic()

def generate_context(document: str, chunk: str) -> str:
    """Generate situating context for a chunk."""
    prompt = f"""<document>
{document}
</document>
Here is the chunk we want to situate within the document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
Answer only with the succinct context and nothing else."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

def contextualize_chunk(document: str, chunk: str) -> str:
    """Create a contextualized chunk ready for embedding."""
    context = generate_context(document, chunk)
    return f"{context}\n\n{chunk}"
This basic implementation works but doesn't leverage prompt caching. For production use, we need to optimize for cost.
Production Context Generation with Caching
To enable prompt caching, structure requests so the document portion is cached:
def contextualize_document_chunks(
    document: str,
    chunks: list[str],
    document_title: str = None
) -> list[str]:
    """Process all chunks from a document with prompt caching."""
    # Prepare document text with optional title
    doc_text = document
    if document_title:
        doc_text = f"Document: {document_title}\n\n{document}"

    # Create cacheable document block
    document_block = {
        "type": "text",
        "text": f"<document>\n{doc_text}\n</document>\n\n",
        "cache_control": {"type": "ephemeral"}
    }

    results = []
    for chunk in chunks:
        chunk_prompt = f"""Here is the chunk we want to situate:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
The context should identify the document source, section/topic, and
resolve any ambiguous references. Keep it to 1-2 sentences.
Answer only with the succinct context and nothing else."""

        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    document_block,  # Cached after first call
                    {"type": "text", "text": chunk_prompt}
                ]
            }]
        )
        context = response.content[0].text.strip()
        contextualized = f"{context}\n\n{chunk}"
        results.append(contextualized)

    return results
Hybrid Retrieval Implementation
Once chunks are contextualized, implement hybrid search combining embeddings and BM25:
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.chunks = []
        self.original_chunks = []  # Store originals for generation
        self.embeddings = []
        self.bm25 = None

    def index(
        self,
        contextualized_chunks: list[str],
        original_chunks: list[str]
    ):
        """Index contextualized chunks for hybrid search."""
        self.chunks = contextualized_chunks
        self.original_chunks = original_chunks

        # Create embeddings from contextualized chunks
        self.embeddings = self.embedding_model.embed(contextualized_chunks)

        # Create BM25 index from contextualized chunks
        tokenized = [chunk.lower().split() for chunk in contextualized_chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        top_k: int = 10,
        return_original: bool = True
    ) -> list[tuple[str, float]]:
        """Hybrid search with RRF fusion."""
        # Embedding search
        query_embedding = self.embedding_model.embed([query])[0]
        similarities = np.dot(self.embeddings, query_embedding)
        embedding_ranking = np.argsort(similarities)[::-1]

        # BM25 search
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_ranking = np.argsort(bm25_scores)[::-1]

        # RRF fusion
        k = 60
        scores = {}
        for rank, idx in enumerate(embedding_ranking):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
        for rank, idx in enumerate(bm25_ranking):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)

        # Sort by combined score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

        # Return either original or contextualized chunks
        if return_original:
            return [(self.original_chunks[idx], score)
                    for idx, score in ranked[:top_k]]
        else:
            return [(self.chunks[idx], score)
                    for idx, score in ranked[:top_k]]
Adding Reranking
Integrate reranking to refine results:
import cohere

class ContextualRetriever:
    def __init__(self, embedding_model, cohere_api_key: str):
        self.hybrid = HybridRetriever(embedding_model)
        self.reranker = cohere.Client(cohere_api_key)

    def index(self, contextualized_chunks, original_chunks):
        self.hybrid.index(contextualized_chunks, original_chunks)

    def search(
        self,
        query: str,
        top_k: int = 10,
        initial_k: int = 150,
        use_reranking: bool = True
    ) -> list[tuple[str, float]]:
        """Full contextual retrieval with optional reranking."""
        # Initial hybrid retrieval
        candidates = self.hybrid.search(
            query,
            top_k=initial_k if use_reranking else top_k,
            return_original=True  # Rerank on original content
        )

        if not use_reranking:
            return candidates[:top_k]

        # Rerank top candidates
        docs = [chunk for chunk, _ in candidates]
        reranked = self.reranker.rerank(
            model="rerank-v3.5",
            query=query,
            documents=docs,
            top_n=top_k
        )
        return [(docs[r.index], r.relevance_score)
                for r in reranked.results]
Complete Pipeline Example
Here's how the pieces fit together in a complete workflow:
# 1. Process documents to create contextualized chunks
documents = load_documents() # Your document loading logic
all_contextualized = []
all_original = []
for doc in documents:
    # Chunk the document (your chunking logic)
    chunks = chunk_document(doc.content)

    # Generate contexts with caching
    contextualized = contextualize_document_chunks(
        document=doc.content,
        chunks=chunks,
        document_title=doc.title
    )
    all_contextualized.extend(contextualized)
    all_original.extend(chunks)
# 2. Build the retrieval index
retriever = ContextualRetriever(
embedding_model=YourEmbeddingModel(),
cohere_api_key="your-api-key"
)
retriever.index(all_contextualized, all_original)
# 3. Search at query time
results = retriever.search(
query="How did Acme perform in Q3 2024?",
top_k=5,
initial_k=150,
use_reranking=True
)
# 4. Use results for generation
context = "\n\n---\n\n".join([chunk for chunk, score in results])
# Pass context to LLM for response generation
Common Pitfalls
Experience implementing contextual retrieval reveals several common mistakes to avoid:
1. Skipping BM25
Contextual embeddings alone improve retrieval significantly (35% failure reduction), but adding contextual BM25 provides additional gains (49% total). The BM25 benefit—matching on exact terms added by context—is complementary to embedding improvements. Don't leave this value on the table.
2. Ignoring Prompt Caching
Without caching, contextual retrieval costs 5-10x more than necessary. Always batch chunks by document and implement proper caching. Monitor cache hit rates to verify your implementation is working correctly.
3. Over-Long Contexts
Keep generated context to 50-100 tokens. Longer contexts dilute the embedding with less relevant information, increase storage costs, and don't proportionally improve retrieval. The context should be surgically precise: identify source, topic, and resolved references, nothing more.
4. Generic Context Prompts
The default prompt works well for general content, but domain-specific customization can help. Legal documents benefit from prompts that specifically ask for party identification and document type. Technical docs benefit from prompts emphasizing product names and version numbers. Customize for your domain.
5. Not Monitoring Context Quality
Generated contexts occasionally fail—producing generic text like "This is from a document" or hallucinating incorrect information. Monitor context quality, especially during initial deployment. Spot-check generated contexts and implement quality metrics (context length, presence of expected entity types).
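A lightweight spot-check might look like the following sketch; the heuristics and thresholds are illustrative and should be tuned to your content.
GENERIC_PHRASES = ("this is from a document", "this chunk is from the document")

def context_quality_flags(context: str, expected_entities: tuple[str, ...] = ()) -> list[str]:
    """Return warnings for a generated context string."""
    flags = []
    n_tokens = len(context.split())
    if n_tokens < 10 or n_tokens > 120:
        flags.append(f"unusual length: {n_tokens} tokens")
    if any(phrase in context.lower() for phrase in GENERIC_PHRASES):
        flags.append("generic, uninformative context")
    missing = [e for e in expected_entities if e.lower() not in context.lower()]
    if missing:
        flags.append(f"missing expected entities: {missing}")
    return flags

# context_quality_flags(context, expected_entities=("Acme", "Q3 2024"))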
6. Forgetting to Reindex
Contextual retrieval requires re-embedding all chunks. You cannot retroactively apply context to an existing index. If you're adding contextual retrieval to an existing system, plan for a complete reindexing of your knowledge base.
7. One-Size-Fits-All Chunking
Contextual retrieval improves retrieval quality for chunks as they exist, but it doesn't fix fundamentally bad chunking. If chunks break mid-sentence, split logical content awkwardly, or use inappropriate sizes for your content type, those problems persist. Get chunking right first, then add contextual retrieval.
8. Not Measuring Improvement
Always measure retrieval quality before and after implementing contextual retrieval. The improvement varies by document type—verify it's working for your specific content. Create evaluation datasets with queries and known relevant chunks, measure recall@k before and after.
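A minimal recall@k harness along those lines; it assumes an evaluation set mapping each query to the IDs of its known-relevant chunks, plus a search function that returns chunk IDs.
def recall_at_k(eval_set: dict[str, set[str]], search, k: int = 20) -> float:
    """Fraction of queries for which every known-relevant chunk appears
    in the top-k results. search(query, k) must return a list of chunk IDs."""
    hits = 0
    for query, relevant_ids in eval_set.items():
        retrieved = set(search(query, k))
        if relevant_ids <= retrieved:  # all relevant chunks were retrieved
            hits += 1
    return hits / len(eval_set)

# recall_before = recall_at_k(eval_set, baseline_search)
# recall_after = recall_at_k(eval_set, contextual_search)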
Conclusion
Contextual retrieval addresses one of RAG's most fundamental limitations: the loss of context when documents are chunked for retrieval. Through the simple but powerful technique of prepending LLM-generated context to each chunk before embedding, we enable retrieval systems to accurately represent and match document content.
The technique is elegant because it works with existing infrastructure. You don't need new embedding models, different vector databases, or modified retrieval algorithms. By improving the input to your existing pipeline—enriching chunks with context they previously lacked—you improve output across the board.
The results justify the implementation investment. Across diverse knowledge bases and document types, contextual retrieval combined with BM25 and reranking reduces retrieval failures by up to 67%. That 5.7% baseline failure rate—already low enough to seem acceptable—drops to under 2%.
The cost is manageable. With prompt caching, contextual retrieval runs approximately $1 per million document tokens—a one-time indexing cost that pays dividends on every subsequent query. For knowledge-intensive applications where retrieval accuracy matters, this is a worthwhile investment.
If your RAG system struggles with documents containing entity references, temporal markers, or domain-specific terminology—and most real-world documents do—contextual retrieval is likely to help substantially. The technique is straightforward to implement, compatible with existing infrastructure, and delivers measurable improvements.
Sometimes the most effective solutions are the simplest. If the context isn't in the chunk, put it there.
References
This post is based on Anthropic's research on contextual retrieval:
- Introducing Contextual Retrieval - Original Anthropic engineering blog post by Daniel Ford (September 2024)
- Contextual Embeddings Guide - Anthropic Cookbook implementation guide