Document Processing Pipelines: From PDF to RAG-Ready Chunks
Build production document processing pipelines for LLM applications. PDF extraction, chunking strategies, embedding models, and retrieval optimization with 2025 best practices and tool comparisons.
The quality of your RAG system depends entirely on the quality of your document processing. A mediocre retrieval algorithm over well-processed documents beats a perfect retrieval algorithm over poorly processed documents. This guide walks through building production document processing pipelines that handle real-world documents — PDFs with tables, scanned images, mixed formats, and messy layouts.
Why Document Processing is the Foundation of RAG
Most teams building RAG systems spend 80% of their time on retrieval algorithms and prompts, when they should be spending 80% on document processing. Here's why:
Garbage in, garbage out. If your chunks contain broken sentences, missing context, or merged paragraphs, no amount of sophisticated retrieval will save you. The LLM receives fragmented information and produces fragmented answers.
The hidden cost of poor chunking. Bad chunks don't just produce wrong answers — they produce confidently wrong answers. The retriever finds "relevant" chunks that are actually noise, and the LLM synthesizes them into plausible-sounding nonsense.
Real-world documents are messy. Enterprise documents contain multi-column layouts, tables that span pages, headers and footers on every page, watermarks, scanned images with OCR artifacts, and inconsistent formatting. A naive text extraction approach mangles all of this.
The compounding effect. Every stage of the pipeline compounds errors. Extract 90% of text correctly, chunk 90% of that well, embed 90% accurately — you're already down to 73% quality. At each stage, you need near-perfect execution.
The goal of this guide is to help you build pipelines that achieve 95%+ quality at each stage, resulting in RAG systems that actually work.
The Document Processing Pipeline
A complete pipeline has six stages:
Raw Documents → Extraction → Cleaning → Chunking → Enrichment → Embedding → Storage
Stage 1: Extraction converts PDFs, Word docs, HTML, and other formats into raw text while preserving structure. This is where most pipelines fail — poor extraction cascades through everything downstream.
Stage 2: Cleaning normalizes the extracted text, fixing encoding issues, removing boilerplate, and standardizing formatting. Dirty text produces noisy embeddings.
Stage 3: Chunking splits cleaned text into retrieval units. This is the most consequential decision in your pipeline — chunk size, overlap, and boundary detection directly determine retrieval quality.
Stage 4: Enrichment adds metadata, summaries, and context to chunks. Rich metadata enables filtered retrieval and helps the LLM understand where information comes from.
Stage 5: Embedding converts chunks into vectors for similarity search. Model choice, dimensionality, and preprocessing all affect retrieval precision.
Stage 6: Storage persists embeddings in a vector database with appropriate indexing for fast retrieval at scale.
Each stage presents unique challenges. Let's tackle them systematically.
Stage 1: Document Extraction
Document extraction is the foundation of your pipeline. Get it wrong, and everything downstream suffers. The challenge is that different document types require fundamentally different extraction strategies.
The Extraction Strategy Landscape
Before diving into tools, understand the landscape of approaches:
| Approach | Best For | Accuracy | Speed | Cost |
|---|---|---|---|---|
| Native text extraction | Born-digital PDFs | High | Fast | Free |
| OCR | Scanned documents | Medium-High | Slow | Free (local) |
| Vision LLMs | Complex layouts, tables | Very High | Slow | Expensive |
| Hybrid services | Production systems | High | Medium | Per-page pricing |
Native text extraction (using libraries like PyMuPDF) works by reading the text layer embedded in PDFs. It's fast and accurate for documents created digitally — Word exports, LaTeX papers, web-to-PDF conversions. However, it completely fails on scanned documents (which have no text layer) and struggles with complex multi-column layouts.
OCR (Optical Character Recognition) converts images of text into actual text. Essential for scanned documents, but introduces errors — especially for low-quality scans, unusual fonts, or handwriting. Modern OCR engines like Tesseract achieve 95%+ accuracy on clean documents but can drop to 70% on challenging inputs.
Vision LLMs (like GPT-4o or Claude) can "see" document pages as images and extract text with understanding of layout, tables, and context. They handle complex documents that defeat traditional extraction but are slow (seconds per page) and expensive at scale.
Hybrid services like Unstructured, LlamaParse, and Docling combine multiple approaches intelligently — using native extraction where possible, OCR when needed, and vision models for complex elements.
Choosing Your Extraction Strategy
Use this decision tree:
- Is the PDF searchable? (Can you select text?) → Start with native extraction
- Is extraction quality poor or empty? → Fall back to OCR
- Are there complex tables or multi-column layouts? → Consider vision LLMs or hybrid services
- Processing thousands of documents? → Use Unstructured or LlamaParse for consistency
For most production systems, implement a hybrid approach: try native extraction first, detect quality issues (suspiciously low word count, garbled text), and escalate to more sophisticated methods as needed.
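As a concrete illustration, here is a minimal quality gate that routes a document between native extraction and OCR. The `extract_native` and `extract_ocr` callables are placeholders for whichever extractors you use (concrete versions appear later in this guide), and the thresholds are starting points to tune on your own corpus:

```python
from typing import Callable

def extraction_quality_ok(text: str, min_words: int = 50) -> bool:
    """Heuristic check for empty or garbled output from native extraction."""
    words = text.split()
    if len(words) < min_words:
        return False
    # A high share of non-alphanumeric characters often signals a mangled text layer
    clean_ratio = sum(c.isalnum() or c.isspace() for c in text) / max(len(text), 1)
    return clean_ratio > 0.8

def extract_with_fallback(pdf_path: str,
                          extract_native: Callable[[str], str],
                          extract_ocr: Callable[[str], str]) -> str:
    """Try native extraction first; escalate to OCR when quality looks poor."""
    text = extract_native(pdf_path)
    return text if extraction_quality_ok(text) else extract_ocr(pdf_path)
```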
PDF Extraction: The Hard Problem
PDFs are notoriously difficult because they're designed for display, not data extraction. A PDF doesn't store "paragraphs" or "tables" — it stores instructions for drawing characters at specific coordinates. What looks like a simple document might contain:
- Multiple columns that naive extraction merges into nonsense
- Headers and footers repeated on every page, polluting your content
- Tables where cells become disconnected text fragments
- Embedded images containing text that native extraction misses entirely
- Mathematical formulas rendered as images or special fonts
- Watermarks and annotations mixed with content
Understanding these challenges explains why extraction is hard and why different tools make different tradeoffs.
2025 PDF Extraction Tool Comparison
The PDF extraction landscape has evolved significantly. Here's how the major tools compare based on recent benchmarks:
Docling (IBM Research)
Docling is an open-source library focused on layout-aware document parsing. It excels at understanding document structure — identifying headings, lists, tables, and figures.
Strengths:
- 97.9% accuracy on complex table extraction in benchmark tests
- 94%+ accuracy on numerical and textual tables
- Excellent at preserving document hierarchy
- Open source and self-hostable
Weaknesses:
- Slower than lightweight parsing libraries
- Requires engineering knowledge to integrate
- Limited support for forms and handwriting
Best for: Technical documents, research papers, structured reports where layout matters.
LlamaParse
LlamaParse uses vision models for document understanding, making it particularly strong on complex layouts.
Strengths:
- Consistent ~6 second processing regardless of document size
- Excellent structure preservation
- Handles financial reports and contracts well
- Simple API
Weaknesses:
- Can struggle on extremely complex layouts
- Requires API calls (not self-hostable)
- Cost scales with pages processed
Best for: High-value documents where accuracy justifies cost — contracts, financial reports, compliance documents.
Unstructured
Unstructured provides a comprehensive pipeline for document processing with multiple extraction strategies.
Strengths:
- 100% accuracy on simple tables
- Strong OCR capabilities
- Good integration with LangChain and other frameworks
- Self-hostable option available
Weaknesses:
- 75% accuracy on complex table structures
- Struggles with complex layouts
- Performance has reportedly declined on challenging documents
- Often requires post-processing for formatting
Best for: Production pipelines processing diverse document types at scale.
Recommendation: Hybrid Strategy
Benchmark comparisons consistently show that no single tool is perfect. When one fails, another often succeeds. For production systems:
- Start with Docling for structured documents (reports, papers, manuals)
- Use LlamaParse for high-value complex documents (contracts, financial statements)
- Fall back to Unstructured for diverse document collections
- Implement quality checks to route documents to the appropriate tool
Native PDF Extraction with PyMuPDF
For born-digital PDFs, PyMuPDF (imported as fitz) provides fast, accurate extraction:
```python
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text_parts = []
    for page in doc:
        text = page.get_text("text")
        text_parts.append(text)
    doc.close()
    return "\n\n".join(text_parts)
```
This works well for simple documents. For better structure preservation, extract text blocks with position information and reconstruct reading order based on coordinates.
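A minimal sketch of that block-based approach follows. The simple sort (top-to-bottom, then left-to-right) works for single-column pages; real multi-column layouts need column detection on top of this:

```python
import fitz  # PyMuPDF

def extract_pdf_blocks(pdf_path: str) -> str:
    """Extract text blocks and order them top-to-bottom, left-to-right."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        blocks = page.get_text("blocks")  # (x0, y0, x1, y1, text, block_no, block_type)
        text_blocks = [b for b in blocks if b[6] == 0]       # keep text blocks, drop images
        text_blocks.sort(key=lambda b: (round(b[1]), b[0]))  # sort by y, then x
        pages.append("\n".join(b[4].strip() for b in text_blocks))
    doc.close()
    return "\n\n".join(pages)
```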
OCR for Scanned Documents
When native extraction fails (empty or garbled output), fall back to OCR. Tesseract is the standard open-source choice:
```python
from pdf2image import convert_from_path
import pytesseract

def extract_with_ocr(pdf_path: str, dpi: int = 300) -> str:
    images = convert_from_path(pdf_path, dpi=dpi)
    text_parts = [pytesseract.image_to_string(img, lang='eng') for img in images]
    return "\n\n".join(text_parts)
```
OCR tips:
- Use 300 DPI for optimal accuracy/speed tradeoff
- Preprocess images (deskew, denoise, binarize) for challenging scans; a minimal sketch follows this list
- Specify language for better accuracy on non-English text
- Consider commercial OCR (Google Cloud Vision, AWS Textract) for higher accuracy on difficult documents
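Preprocessing can be as simple as grayscale conversion, light denoising, and binarization with Pillow before handing pages to Tesseract. Deskewing needs extra machinery (OpenCV, for example) and is omitted here; the threshold is a value you would tune per corpus:

```python
from PIL import Image, ImageFilter

def preprocess_for_ocr(img: Image.Image, threshold: int = 180) -> Image.Image:
    """Grayscale, denoise, and binarize a page image before OCR."""
    gray = img.convert("L")                          # grayscale
    gray = gray.filter(ImageFilter.MedianFilter(3))  # light denoise
    return gray.point(lambda p: 255 if p > threshold else 0)  # binarize
```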
HTML and Office Document Extraction
HTML: Use BeautifulSoup or html2text. Remove navigation, footers, sidebars, and ads before extraction. The readability library can help identify main content automatically.
Word documents (.docx): Use python-docx. Extract paragraphs, tables, and headers separately to preserve structure.
Excel (.xlsx): Use openpyxl. Each sheet becomes a separate document or section. Consider how to represent tabular data in text form.
PowerPoint (.pptx): Use python-pptx. Extract slide titles, content, and speaker notes. Consider slide order and logical flow.
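For Word documents specifically, a minimal python-docx sketch that pulls paragraphs and flattens tables into pipe-separated rows (heading detection and richer table handling are left out):

```python
from docx import Document

def extract_docx(path: str) -> str:
    """Pull paragraphs and flatten tables from a .docx file."""
    doc = Document(path)
    parts = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            parts.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(parts)
```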
Stage 2: Text Cleaning and Normalization
Raw extracted text is messy. Before chunking, clean and normalize it to reduce noise in your embeddings.
Common Issues to Address
Encoding problems: Extracted text often contains mojibake artifacts — "â€™" where an apostrophe should be, "Ã©" instead of "é". These create noise in embeddings and confuse LLMs.
Excessive whitespace: Multiple spaces, tabs, and blank lines from layout extraction. Normalize to single spaces and paragraph breaks.
Headers and footers: Page numbers, document titles, and confidentiality notices repeated on every page. Remove or consolidate them.
Hyphenation: Words split across lines with hyphens ("recom-mend") should be rejoined.
Boilerplate: Copyright notices, legal disclaimers, and standard text that adds no value. Remove based on patterns.
Cleaning Pipeline
A robust cleaning pipeline addresses these issues systematically:
```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode (compatibility forms, full-width characters, ligatures)
    text = unicodedata.normalize("NFKC", text)
    # Fix common mojibake artifacts from mis-decoded UTF-8
    replacements = {
        "â€™": "'",     # right single quote
        "â€œ": '"',     # left double quote
        "â€\x9d": '"',  # right double quote
        "â€”": "—",     # em dash
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # Rejoin hyphenated words at line breaks
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Normalize whitespace
    text = re.sub(r' +', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```
Tip: Build your cleaning pipeline iteratively. Extract several documents, inspect the output, identify patterns, add rules. Repeat until output quality stabilizes.
Stage 3: Chunking — The Most Critical Decision
Chunking is where most RAG pipelines succeed or fail. Your chunking strategy directly determines retrieval quality — and there's no universal right answer.
Why Chunking Matters So Much
Chunking creates the fundamental units of retrieval. When a user asks a question, your system finds and returns chunks. If those chunks are:
- Too small: They lose context and become meaningless fragments
- Too large: They dilute relevance and waste context window
- Poorly bounded: They split mid-sentence or mid-thought, confusing the LLM
- Missing context: They contain pronouns and references that don't resolve
The right chunking strategy depends on your documents, your queries, and your quality requirements.
The 2025 Chunking Landscape
Recent research, including extensive testing by the community, has established clearer best practices:
Semantic chunking delivers the biggest gains. Testing across multiple benchmarks shows semantic chunking can improve RAG accuracy by up to 70% compared to naive fixed-size chunking. This is the single biggest lever for improving retrieval quality.
Optimal chunk size is 256-512 tokens. This range balances precision (finding exactly what's needed) with context (having enough surrounding information). Use smaller chunks for factual lookup, larger for explanatory content.
10-20% overlap is the sweet spot. For a 500-token chunk, use 50-100 tokens of overlap. This ensures concepts that span chunk boundaries aren't lost.
Chunking Strategies Compared
| Strategy | How It Works | Best For | Complexity | Cost |
|---|---|---|---|---|
| Fixed-size | Split every N tokens | Simple docs | Low | Free |
| Sentence-based | Split at sentence boundaries | General text | Low | Free |
| Recursive | Split by structure, then size | Structured docs | Medium | Free |
| Semantic | Split when topic changes | Long-form content | Medium | Embedding costs |
| Document-aware | Respect headings/sections | Reports, papers | Medium | Free |
| LLM-based | LLM decides boundaries | Complex docs | High | LLM costs |
| Parent-child | Small for retrieval, large for context | Q&A systems | Medium | 2x storage |
| Contextual | Add context to each chunk | High-value docs | High | LLM costs |
Fixed-Size Chunking
The simplest approach: split text into chunks of N tokens (or characters), with optional overlap.
When to use: Homogeneous content without clear structure — transcripts, logs, social media posts.
Limitations: Breaks mid-sentence, ignores document structure, treats all content equally. This should be your baseline, not your production strategy.
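A baseline fixed-size chunker with overlap, approximating tokens with whitespace-split words (a production version would count real tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into roughly chunk_size-word chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk_words = words[start:start + chunk_size]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
        if start + chunk_size >= len(words):
            break
    return chunks
```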
Sentence-Based Chunking
Split at sentence boundaries, accumulating sentences until reaching the size limit.
When to use: General prose where sentence integrity matters.
Limitations: Doesn't understand document structure, may create uneven chunk sizes.
This approach ensures chunks never break mid-sentence, significantly improving coherence for the LLM.
Semantic Chunking
The most sophisticated approach: use embeddings to detect topic shifts and split at semantic boundaries.
How it works:
- Split text into sentences
- Embed each sentence
- Calculate similarity between adjacent sentences
- Split where similarity drops below threshold (indicating topic change)
When to use: Long documents with multiple topics, where topic coherence within chunks matters.
Tradeoff: Requires embedding every sentence during preprocessing — adds cost and complexity. But the 70% accuracy improvement often justifies this investment.
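A compact sketch of the splitting logic described above, assuming you have already split the text into sentences and embedded them with any model; the 0.75 threshold is illustrative and should be tuned (production versions also cap chunk size):

```python
import numpy as np

def semantic_chunks(sentences: list[str], embeddings: np.ndarray,
                    threshold: float = 0.75) -> list[str]:
    """Group sentences into chunks, splitting where adjacent similarity drops."""
    if not sentences:
        return []
    # Normalize so the dot product equals cosine similarity
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(norms[i - 1], norms[i]))
        if similarity < threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```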
Document-Aware Chunking
Respect document structure: split at headings, keep sections together, preserve hierarchy.
When to use: Structured documents (reports, papers, documentation) with clear sections.
How it works: Parse markdown or HTML structure, chunk within sections, include heading context in each chunk.
This is particularly effective for technical documentation where users query specific sections.
Parent-Child Chunking
Create two levels: small chunks for precise retrieval, large chunks for context.
The problem it solves: Small chunks retrieve precisely but lose context. Large chunks preserve context but retrieve imprecisely.
How it works:
- Create parent chunks (1500-2000 tokens)
- Create child chunks (300-500 tokens) within each parent
- Embed and index child chunks
- When retrieving, return the parent chunk for context
When to use: Q&A systems, legal documents, technical specs — anywhere precise retrieval needs surrounding context.
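A simplified sketch of the two-level structure, again using whitespace words as a stand-in for tokens. Only the child chunks are embedded; `parent_id` is stored as metadata so retrieval can return the parent text for context:

```python
from dataclasses import dataclass
import uuid

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str | None = None

def build_parent_child(text: str, parent_size: int = 1500,
                       child_size: int = 400) -> tuple[list[Chunk], list[Chunk]]:
    """Split into large parent chunks, then small child chunks linked to their parent."""
    words = text.split()
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        p_words = words[p_start:p_start + parent_size]
        parent = Chunk(id=str(uuid.uuid4()), text=" ".join(p_words))
        parents.append(parent)
        for c_start in range(0, len(p_words), child_size):
            c_words = p_words[c_start:c_start + child_size]
            children.append(Chunk(id=str(uuid.uuid4()), text=" ".join(c_words),
                                  parent_id=parent.id))
    return parents, children
```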
Contextual Chunking (Anthropic's Approach)
Anthropic's contextual retrieval prepends context to each chunk before embedding.
The problem it solves: Chunks lose context. "The company achieved this goal" — what company? What goal? The references made sense in the full document but become ambiguous in isolation.
How it works:
- For each chunk, prompt an LLM with the full document
- Generate a brief context statement: "This chunk is from Acme Corp's Q3 2024 earnings report, discussing revenue targets"
- Prepend context to chunk before embedding
Results: Anthropic reports a 67% reduction in retrieval failures. This is one of the most impactful techniques for improving RAG quality.
Tradeoff: Requires one LLM call per chunk during indexing. For 1000 chunks, that's 1000 API calls (~$1-10 depending on model). Worth it for high-value document collections.
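A rough sketch of the indexing-time step, using the OpenAI Python SDK as a stand-in for whichever LLM you prefer. The prompt is a paraphrase of the idea above, not Anthropic's exact prompt; with Claude you would also enable prompt caching on the document to keep per-chunk costs down:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; swap in your preferred LLM client

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document, to improve
search retrieval of the chunk. Answer with only that sentence."""

def contextualize_chunk(document: str, chunk: str, model: str = "gpt-4o-mini") -> str:
    """Prepend an LLM-generated context sentence to a chunk before embedding."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"
```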
Deep dive: For complete implementation details including contextual BM25, reranking integration, and prompt caching for cost optimization, see Contextual Retrieval: Solving RAG's Hidden Context Problem.
Late Chunking (Alternative Approach)
Late chunking is a more computationally efficient alternative to contextual retrieval.
How it works:
- Embed the entire document at the token level using a long-context embedding model
- After embedding, segment into chunks
- Pool token embeddings within each chunk
Advantage: Preserves context without additional LLM calls — the embeddings already capture document-level context because the full document was processed together.
Research status: ECIR 2025 research shows it's competitive with contextual retrieval at lower computational cost. A promising technique for cost-sensitive applications.
LLM-Based Chunking
Use an LLM to analyze documents and determine optimal chunk boundaries.
How it works: Pass the document to an LLM with instructions to identify logical segments based on topic, argument structure, or other criteria.
When to use: High-value, complex documents where other methods fail — legal documents with complex clause structures, research papers with intricate arguments.
Tradeoff: Most expensive approach (LLM call for every document), but can handle documents that defeat rule-based methods.
Chunking Recommendations by Document Type
| Document Type | Recommended Strategy | Chunk Size | Notes |
|---|---|---|---|
| Technical docs | Document-aware | 512 tokens | Respect section boundaries |
| Contracts | Parent-child + Contextual | 300/1500 tokens | Precision + context crucial |
| Research papers | Semantic | 512 tokens | Topic coherence matters |
| Support tickets | Fixed-size | 256 tokens | Short, uniform content |
| Transcripts | Sentence-based | 512 tokens | Speaker turns as boundaries |
| Code | AST-aware | Function/class | Preserve semantic units |
| FAQs | Per-question | Variable | Natural document structure |
Stage 4: Metadata Enrichment
Raw chunks benefit from additional metadata that improves retrieval and helps the LLM understand context.
Essential Metadata Fields
Source information: Document title, filename, URL, author, publication date. Enables filtering ("only search recent documents") and citation in responses.
Structural position: Section title, heading hierarchy, page number. Helps the LLM understand where information comes from.
Content type: Is this a paragraph, table, list, code block, image description? Different content types may need different retrieval strategies.
Chunk relationships: Previous/next chunk IDs enable expanding context when needed. Parent document ID enables document-level filtering.
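One way to carry this metadata through the pipeline is a single record type per chunk. The field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    text: str
    doc_id: str
    doc_title: str
    source_url: str | None = None      # source information
    section: str | None = None         # structural position (heading path)
    page: int | None = None
    content_type: str = "paragraph"    # paragraph, table, list, code, ...
    prev_chunk_id: str | None = None   # chunk relationships
    next_chunk_id: str | None = None
    enrichment: dict = field(default_factory=dict)  # LLM summary, keywords, questions, entities
```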
LLM-Generated Enrichment
For high-value document collections, LLMs can generate additional metadata:
Summary: A 1-2 sentence summary of the chunk's content. Improves retrieval for paraphrased queries where the user's words don't match the document's words.
Keywords: Key terms and concepts in the chunk. Enables keyword-based filtering alongside semantic search.
Questions: What questions could this chunk answer? Dramatically improves retrieval by matching query patterns. If a chunk contains "Revenue grew 15% in Q3", generating "What was the revenue growth in Q3?" as metadata helps match that query.
Entities: People, organizations, products, dates mentioned. Enables entity-based filtering and improves retrieval for queries about specific entities.
This enrichment costs one LLM call per chunk (~$0.001-0.01 each) but can significantly improve retrieval quality for specialized domains.
Stage 5: Embedding Generation
Embeddings convert text into vectors for similarity search. Model choice significantly affects retrieval quality.
2025 Embedding Model Landscape
The embedding model market has matured significantly. Here's the current state based on comprehensive benchmarks:
OpenAI text-embedding-3
text-embedding-3-large ($0.13/1M tokens, 8K context)
- Strong general-purpose performance
- Wins head-to-head on many retrieval benchmarks
- 3072 dimensions (can reduce with matryoshka embeddings)
- Good choice for production when you want reliability
text-embedding-3-small ($0.02/1M tokens, 8K context)
- Excellent cost-performance ratio
- Good enough for many production use cases
- Start here, upgrade only if quality insufficient
Voyage AI
Voyage-3-large ranks #1 across eight domains and 100 datasets, outperforming OpenAI by 9.74% on average.
voyage-3.5 ($0.06/1M tokens, 32K context)
- Best accuracy-cost balance for production
- Massive 32K token context window
- Excellent for long documents
voyage-3.5-lite ($0.02/1M tokens, 32K context)
- Budget-friendly option
- Still strong performance (66.1%)
- Best choice for cost-conscious implementations
Cohere Embed v4
Cohere's latest model offers unique capabilities:
- 128K token context window — embed entire documents
- Multimodal — text, images, and mixed content in same embedding space
- 100+ language support — native multilingual without translation
- Quantization support — reduce storage costs by up to 83%
Best for: Multilingual applications, multimodal content, very long documents.
Open Source Options
BAAI BGE-M3: Strong multilingual performance, optimized for RAG, self-hostable. Good choice if you can't use external APIs.
E5-Mistral: Combines Mistral's language understanding with E5's retrieval optimization. Strong performance on specialized domains.
Nomic Embed: Good balance of quality and efficiency for self-hosting. Apache 2.0 license.
Embedding Model Selection Guide
| Use Case | Recommended Model | Why |
|---|---|---|
| General production | text-embedding-3-small | Reliable, cost-effective |
| Quality-critical | Voyage-3.5 or text-embedding-3-large | Higher accuracy |
| Budget-constrained | voyage-3.5-lite | Best accuracy per dollar |
| Multilingual | Cohere Embed v4 or BGE-M3 | Native multilingual |
| Long documents | Voyage-3.5 (32K) or Cohere (128K) | Context length |
| Self-hosted | BGE-M3 or Nomic | No API dependency |
| Multimodal | Cohere Embed v4 | Text + images |
Embedding Best Practices
Batch embedding: Embed multiple chunks per API call to reduce latency and cost. Most APIs support batching.
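A minimal batching sketch using the OpenAI Python SDK; any embeddings API that accepts a list input works the same way, and batch size limits vary by provider:

```python
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small",
                 batch_size: int = 100) -> list[list[float]]:
    """Embed chunks in batches to reduce request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)  # order is preserved
    return vectors
```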
Dimensionality: Higher isn't always better. 1536 dimensions work well for most use cases. Use matryoshka embeddings or quantization to reduce storage costs while maintaining quality.
Preprocessing: Clean text thoroughly before embedding. Noise in text becomes noise in embeddings, degrading retrieval quality.
Query embedding: Use the same model for queries and documents. Mixing models (e.g., embedding docs with OpenAI, queries with Cohere) degrades retrieval quality significantly.
Asymmetric embedding: Some models (like E5) support different prefixes for queries vs documents. Use them correctly for best results.
Stage 6: Vector Storage and Retrieval
The final pipeline stage stores embeddings for fast retrieval. Vector database choice affects performance, cost, and operational complexity.
Vector Database Options
Managed Services
Pinecone: Industry standard for production. Excellent performance, simple API, automatic scaling. Higher cost but minimal ops burden.
Weaviate Cloud: Strong hybrid search (vector + keyword), good for complex queries. GraphQL API.
Qdrant Cloud: Good price-performance ratio, advanced filtering, growing ecosystem.
Self-Hosted
Chroma: Simple, embedded, great for prototyping. Limited scale and features for production.
Qdrant: Excellent self-hosted option. Rust-based, performant, full-featured.
Milvus: Designed for massive scale. More complex to operate but handles billions of vectors.
pgvector: PostgreSQL extension. Good if you're already on Postgres and want simplicity.
Hybrid Search: Vector + Keyword
Pure vector search has a fundamental weakness: it can miss exact keyword matches. If someone searches for "error code XYZ-123", semantic search might rank it below conceptually similar but wrong results.
Hybrid search combines:
- BM25 (keyword): Excels at exact matches, rare terms, specific identifiers
- Vector: Excels at semantic similarity, synonyms, conceptual matches
Reciprocal Rank Fusion (RRF) is the standard combination method. Rather than trying to normalize scores (which is mathematically tricky), RRF combines rank positions from each system.
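The fusion itself fits in a few lines. Each document scores 1/(k + rank) per result list, with k = 60 as the conventional constant from the original RRF paper:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked ID lists (e.g., BM25 and vector results) with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = rrf_fuse([bm25_ids, vector_ids])
```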
Most production RAG systems should use hybrid search, especially for technical content with specific terminology, identifiers, or domain-specific vocabulary.
Reranking for Precision
Initial retrieval casts a wide net — finding candidates that might be relevant. Reranking applies a more accurate model to sort those candidates.
The two-stage pattern:
- Retrieve 50-100 candidates using fast vector search
- Rerank with a cross-encoder or LLM
- Return top 5-10 to the LLM
Reranking options (cheapest to most expensive):
- Cross-encoder models (ms-marco-MiniLM): Fast, free (local), good quality
- Cohere Rerank: Excellent quality, easy API, moderate cost
- LLM-based reranking: Highest quality for complex relevance, expensive
Reranking adds latency (100-500ms) but can dramatically improve precision for complex queries where initial retrieval returns "almost right" results.
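A minimal cross-encoder reranker using sentence-transformers and the ms-marco-MiniLM model mentioned above; the model name and top_k are reasonable defaults, not requirements:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score (query, passage) pairs with a cross-encoder and keep the highest-scoring ones."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```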
Production Considerations
Cost Estimation
Before processing a document collection, estimate costs:
| Operation | Cost Driver | Typical Cost |
|---|---|---|
| Vision extraction | Per page | $0.01-0.03/page |
| OCR | Compute time | ~Free (local) |
| Embedding | Per token | $0.02-0.13/1M tokens |
| Enrichment (LLM) | Per chunk | $0.001-0.01/chunk |
| Contextual retrieval | Per chunk | $0.001-0.01/chunk |
| Storage | Per vector | Varies by provider |
Example: For a 1000-page document collection with ~5000 chunks:
- Embedding only: $0.10-0.65
- With enrichment: $5-50
- With contextual retrieval: $5-50 additional
- With vision extraction: $10-30 additional
Quality Monitoring
Implement quality checks throughout your pipeline:
Extraction quality: Word count per page (flag suspiciously low), character encoding validation, structure preservation checks.
Chunk quality: Average chunk size and variance, boundary quality (do chunks end at natural breaks?), coherence scores (semantic similarity within chunks).
Retrieval quality: Create a test set of 50-100 questions with known answers. Measure precision@k — does the correct chunk appear in top results? If precision@5 is below 80%, your pipeline needs work.
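The check described above (does the gold chunk appear in the top k?) can be scored with a few lines; `retrieve` is a placeholder for your own retrieval function:

```python
from typing import Callable

def hit_at_k(retrieved_ids: list[str], gold_id: str, k: int = 5) -> bool:
    """Did the known-correct chunk appear in the top-k results?"""
    return gold_id in retrieved_ids[:k]

def evaluate_retrieval(test_set: list[tuple[str, str]],
                       retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """test_set pairs each question with its gold chunk ID; retrieve returns ranked chunk IDs."""
    hits = sum(hit_at_k(retrieve(question), gold_id, k) for question, gold_id in test_set)
    return hits / len(test_set)
```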
Error Handling and Recovery
Production pipelines fail. Build resilience:
Checkpoint processing: Save state after each stage. If embedding fails, don't re-extract and re-chunk.
Document-level isolation: One bad document shouldn't fail the entire batch. Log errors and continue.
Retry with fallback: If native extraction fails, try OCR. If OCR fails, flag for manual review.
Idempotency: Running the pipeline twice should produce the same result. Use content hashes to detect already-processed documents.
Incremental Updates
Document collections change. Handle updates efficiently:
Content hashing: Hash each chunk. On re-processing, compare hashes to identify changes. Only re-embed changed chunks.
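A sketch of the hashing check, with hypothetical chunk-ID-to-text mappings standing in for your own storage layer:

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable content hash used to detect already-indexed chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_chunk_ids(new_chunks: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of chunks whose content differs from what is already embedded."""
    return [chunk_id for chunk_id, text in new_chunks.items()
            if stored_hashes.get(chunk_id) != chunk_hash(text)]
```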
Soft deletes: Mark old chunks as inactive rather than deleting. Enables rollback if updates introduce problems.
Version tracking: Maintain version metadata for documents. Enable temporal queries ("what did the policy say last quarter?").