Document Processing Pipelines: From PDF to RAG-Ready Chunks
Build production document processing pipelines for LLM applications. PDF extraction, chunking strategies, embedding models, and retrieval optimization with 2025 best practices and tool comparisons.
The quality of your RAG system depends entirely on the quality of your document processing. A mediocre retrieval algorithm over well-processed documents beats a perfect retrieval algorithm over poorly processed documents. This guide walks through building production document processing pipelines that handle real-world documents — PDFs with tables, scanned images, mixed formats, and messy layouts.
Why Document Processing is the Foundation of RAG
Most teams building RAG systems spend 80% of their time on retrieval algorithms and prompts, when they should be spending 80% on document processing. Here's why:
Garbage in, garbage out. If your chunks contain broken sentences, missing context, or merged paragraphs, no amount of sophisticated retrieval will save you. The LLM receives fragmented information and produces fragmented answers.
The hidden cost of poor chunking. Bad chunks don't just produce wrong answers — they produce confidently wrong answers. The retriever finds "relevant" chunks that are actually noise, and the LLM synthesizes them into plausible-sounding nonsense.
Real-world documents are messy. Enterprise documents contain multi-column layouts, tables that span pages, headers and footers on every page, watermarks, scanned images with OCR artifacts, and inconsistent formatting. A naive text extraction approach mangles all of this.
The compounding effect. Every stage of the pipeline compounds errors. Extract 90% of text correctly, chunk 90% of that well, embed 90% accurately — you're already down to 73% quality. At each stage, you need near-perfect execution.
The goal of this guide is to help you build pipelines that achieve 95%+ quality at each stage, resulting in RAG systems that actually work.
The Document Processing Pipeline
A complete pipeline has six stages:
Raw Documents → Extraction → Cleaning → Chunking → Enrichment → Embedding → Storage
Stage 1: Extraction converts PDFs, Word docs, HTML, and other formats into raw text while preserving structure. This is where most pipelines fail — poor extraction cascades through everything downstream.
Stage 2: Cleaning normalizes the extracted text, fixing encoding issues, removing boilerplate, and standardizing formatting. Dirty text produces noisy embeddings.
Stage 3: Chunking splits cleaned text into retrieval units. This is the most consequential decision in your pipeline — chunk size, overlap, and boundary detection directly determine retrieval quality.
Stage 4: Enrichment adds metadata, summaries, and context to chunks. Rich metadata enables filtered retrieval and helps the LLM understand where information comes from.
Stage 5: Embedding converts chunks into vectors for similarity search. Model choice, dimensionality, and preprocessing all affect retrieval precision.
Stage 6: Storage persists embeddings in a vector database with appropriate indexing for fast retrieval at scale.
Each stage presents unique challenges. Let's tackle them systematically.
Stage 1: Document Extraction
Document extraction is the foundation of your pipeline. Get it wrong, and everything downstream suffers. The challenge is that different document types require fundamentally different extraction strategies.
The Extraction Strategy Landscape
Before diving into tools, understand the landscape of approaches:
| Approach | Best For | Accuracy | Speed | Cost |
|---|---|---|---|---|
| Native text extraction | Born-digital PDFs | High | Fast | Free |
| OCR | Scanned documents | Medium-High | Slow | Free (local) |
| Vision LLMs | Complex layouts, tables | Very High | Slow | Expensive |
| Hybrid services | Production systems | High | Medium | Per-page pricing |
Native text extraction (using libraries like PyMuPDF) works by reading the text layer embedded in PDFs. It's fast and accurate for documents created digitally — Word exports, LaTeX papers, web-to-PDF conversions. However, it completely fails on scanned documents (which have no text layer) and struggles with complex multi-column layouts.
OCR (Optical Character Recognition) converts images of text into actual text. Essential for scanned documents, but introduces errors — especially for low-quality scans, unusual fonts, or handwriting. Modern OCR engines like Tesseract achieve 95%+ accuracy on clean documents but can drop to 70% on challenging inputs.
Vision LLMs (like GPT-4o or Claude) can "see" document pages as images and extract text with understanding of layout, tables, and context. They handle complex documents that defeat traditional extraction but are slow (seconds per page) and expensive at scale.
Hybrid services like Unstructured, LlamaParse, and Docling combine multiple approaches intelligently — using native extraction where possible, OCR when needed, and vision models for complex elements.
Choosing Your Extraction Strategy
Use this decision tree:
- Is the PDF searchable? (Can you select text?) → Start with native extraction
- Is extraction quality poor or empty? → Fall back to OCR
- Are there complex tables or multi-column layouts? → Consider vision LLMs or hybrid services
- Processing thousands of documents? → Use Unstructured or LlamaParse for consistency
For most production systems, implement a hybrid approach: try native extraction first, detect quality issues (suspiciously low word count, garbled text), and escalate to more sophisticated methods as needed.
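As a concrete illustration, here is a minimal quality gate that routes a document between native extraction and OCR. The `extract_native` and `extract_ocr` callables are placeholders for whichever extractors you use (concrete versions appear later in this guide), and the thresholds are starting points to tune on your own corpus:

```python
from typing import Callable

def extraction_quality_ok(text: str, min_words: int = 50) -> bool:
    """Heuristic check for empty or garbled output from native extraction."""
    words = text.split()
    if len(words) < min_words:
        return False
    # A high share of non-alphanumeric characters often signals a mangled text layer
    clean_ratio = sum(c.isalnum() or c.isspace() for c in text) / max(len(text), 1)
    return clean_ratio > 0.8

def extract_with_fallback(pdf_path: str,
                          extract_native: Callable[[str], str],
                          extract_ocr: Callable[[str], str]) -> str:
    """Try native extraction first; escalate to OCR when quality looks poor."""
    text = extract_native(pdf_path)
    return text if extraction_quality_ok(text) else extract_ocr(pdf_path)
```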
PDF Extraction: The Hard Problem
PDFs are notoriously difficult because they're designed for display, not data extraction. A PDF doesn't store "paragraphs" or "tables" — it stores instructions for drawing characters at specific coordinates. What looks like a simple document might contain:
- Multiple columns that naive extraction merges into nonsense
- Headers and footers repeated on every page, polluting your content
- Tables where cells become disconnected text fragments
- Embedded images containing text that native extraction misses entirely
- Mathematical formulas rendered as images or special fonts
- Watermarks and annotations mixed with content
Understanding these challenges explains why extraction is hard and why different tools make different tradeoffs.
2025 PDF Extraction Tool Comparison
The PDF extraction landscape has evolved significantly. Here's how the major tools compare based on recent benchmarks:
Docling (IBM Research)
Docling is an open-source library focused on layout-aware document parsing. It excels at understanding document structure — identifying headings, lists, tables, and figures.
Strengths:
- 97.9% accuracy on complex table extraction in benchmark tests
- 94%+ accuracy on numerical and textual tables
- Excellent at preserving document hierarchy
- Open source and self-hostable
Weaknesses:
- Slower than lightweight parsing libraries
- Requires engineering knowledge to integrate
- Limited support for forms and handwriting
Best for: Technical documents, research papers, structured reports where layout matters.
LlamaParse
LlamaParse uses vision models for document understanding, making it particularly strong on complex layouts.
Strengths:
- Consistent ~6 second processing regardless of document size
- Excellent structure preservation
- Handles financial reports and contracts well
- Simple API
Weaknesses:
- Can struggle on extremely complex layouts
- Requires API calls (not self-hostable)
- Cost scales with pages processed
Best for: High-value documents where accuracy justifies cost — contracts, financial reports, compliance documents.
Unstructured
Unstructured provides a comprehensive pipeline for document processing with multiple extraction strategies.
Strengths:
- 100% accuracy on simple tables
- Strong OCR capabilities
- Good integration with LangChain and other frameworks
- Self-hostable option available
Weaknesses:
- 75% accuracy on complex table structures
- Struggles with complex layouts
- Performance has reportedly declined on challenging documents
- Often requires post-processing for formatting
Best for: Production pipelines processing diverse document types at scale.
Recommendation: Hybrid Strategy
Benchmark comparisons consistently show that no single tool is perfect. When one fails, another often succeeds. For production systems:
- Start with Docling for structured documents (reports, papers, manuals)
- Use LlamaParse for high-value complex documents (contracts, financial statements)
- Fall back to Unstructured for diverse document collections
- Implement quality checks to route documents to the appropriate tool
Native PDF Extraction with PyMuPDF
For born-digital PDFs, PyMuPDF (imported as fitz) provides fast, accurate extraction:
```python
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text_parts = []
    for page in doc:
        text = page.get_text("text")
        text_parts.append(text)
    doc.close()
    return "\n\n".join(text_parts)
```
This works well for simple documents. For better structure preservation, extract text blocks with position information and reconstruct reading order based on coordinates.
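A minimal sketch of that block-based approach follows. The simple sort (top-to-bottom, then left-to-right) works for single-column pages; real multi-column layouts need column detection on top of this:

```python
import fitz  # PyMuPDF

def extract_pdf_blocks(pdf_path: str) -> str:
    """Extract text blocks and order them top-to-bottom, left-to-right."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        blocks = page.get_text("blocks")  # (x0, y0, x1, y1, text, block_no, block_type)
        text_blocks = [b for b in blocks if b[6] == 0]       # keep text blocks, drop images
        text_blocks.sort(key=lambda b: (round(b[1]), b[0]))  # sort by y, then x
        pages.append("\n".join(b[4].strip() for b in text_blocks))
    doc.close()
    return "\n\n".join(pages)
```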
OCR for Scanned Documents
When native extraction fails (empty or garbled output), fall back to OCR. Tesseract is the standard open-source choice:
```python
from pdf2image import convert_from_path
import pytesseract

def extract_with_ocr(pdf_path: str, dpi: int = 300) -> str:
    images = convert_from_path(pdf_path, dpi=dpi)
    text_parts = [pytesseract.image_to_string(img, lang='eng') for img in images]
    return "\n\n".join(text_parts)
```
OCR tips:
- Use 300 DPI for optimal accuracy/speed tradeoff
- Preprocess images (deskew, denoise, binarize) for challenging scans; a minimal sketch follows this list
- Specify language for better accuracy on non-English text
- Consider commercial OCR (Google Cloud Vision, AWS Textract) for higher accuracy on difficult documents
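Preprocessing can be as simple as grayscale conversion, light denoising, and binarization with Pillow before handing pages to Tesseract. Deskewing needs extra machinery (OpenCV, for example) and is omitted here; the threshold is a value you would tune per corpus:

```python
from PIL import Image, ImageFilter

def preprocess_for_ocr(img: Image.Image, threshold: int = 180) -> Image.Image:
    """Grayscale, denoise, and binarize a page image before OCR."""
    gray = img.convert("L")                          # grayscale
    gray = gray.filter(ImageFilter.MedianFilter(3))  # light denoise
    return gray.point(lambda p: 255 if p > threshold else 0)  # binarize
```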
HTML and Office Document Extraction
HTML: Use BeautifulSoup or html2text. Remove navigation, footers, sidebars, and ads before extraction. The readability library can help identify main content automatically.
Word documents (.docx): Use python-docx. Extract paragraphs, tables, and headers separately to preserve structure.
Excel (.xlsx): Use openpyxl. Each sheet becomes a separate document or section. Consider how to represent tabular data in text form.
PowerPoint (.pptx): Use python-pptx. Extract slide titles, content, and speaker notes. Consider slide order and logical flow.
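For Word documents specifically, a minimal python-docx sketch that pulls paragraphs and flattens tables into pipe-separated rows (heading detection and richer table handling are left out):

```python
from docx import Document

def extract_docx(path: str) -> str:
    """Pull paragraphs and flatten tables from a .docx file."""
    doc = Document(path)
    parts = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            parts.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(parts)
```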
Stage 2: Text Cleaning and Normalization
Raw extracted text is messy. Before chunking, clean and normalize it to reduce noise in your embeddings.
Common Issues to Address
Encoding problems: Extracted text often contains mojibake artifacts — "â€™" where an apostrophe should be, "Ã©" instead of "é". These create noise in embeddings and confuse LLMs.
Excessive whitespace: Multiple spaces, tabs, and blank lines from layout extraction. Normalize to single spaces and paragraph breaks.
Headers and footers: Page numbers, document titles, and confidentiality notices repeated on every page. Remove or consolidate them.
Hyphenation: Words split across lines with hyphens ("recom-mend") should be rejoined.
Boilerplate: Copyright notices, legal disclaimers, and standard text that adds no value. Remove based on patterns.
Cleaning Pipeline
A robust cleaning pipeline addresses these issues systematically:
```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode (compatibility forms, full-width characters, ligatures)
    text = unicodedata.normalize("NFKC", text)
    # Fix common mojibake artifacts from mis-decoded UTF-8
    replacements = {
        "â€™": "'",     # right single quote
        "â€œ": '"',     # left double quote
        "â€\x9d": '"',  # right double quote
        "â€”": "—",     # em dash
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # Rejoin hyphenated words at line breaks
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Normalize whitespace
    text = re.sub(r' +', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```
Tip: Build your cleaning pipeline iteratively. Extract several documents, inspect the output, identify patterns, add rules. Repeat until output quality stabilizes.
Stage 3: Chunking — The Most Critical Decision
Chunking is where most RAG pipelines succeed or fail. Your chunking strategy directly determines retrieval quality — and there's no universal right answer.
Why Chunking Matters So Much
Chunking creates the fundamental units of retrieval. When a user asks a question, your system finds and returns chunks. If those chunks are:
- Too small: They lose context and become meaningless fragments
- Too large: They dilute relevance and waste context window
- Poorly bounded: They split mid-sentence or mid-thought, confusing the LLM
- Missing context: They contain pronouns and references that don't resolve
The right chunking strategy depends on your documents, your queries, and your quality requirements.
The 2025 Chunking Landscape
Recent research, including extensive testing by the community, has established clearer best practices:
Semantic chunking delivers the biggest gains. Testing across multiple benchmarks shows semantic chunking can improve RAG accuracy by up to 70% compared to naive fixed-size chunking. This is the single biggest lever for improving retrieval quality.
Optimal chunk size is 256-512 tokens. This range balances precision (finding exactly what's needed) with context (having enough surrounding information). Use smaller chunks for factual lookup, larger for explanatory content.
10-20% overlap is the sweet spot. For a 500-token chunk, use 50-100 tokens of overlap. This ensures concepts that span chunk boundaries aren't lost.
Chunking Strategies Compared
| Strategy | How It Works | Best For | Complexity | Cost |
|---|---|---|---|---|
| Fixed-size | Split every N tokens | Simple docs | Low | Free |
| Sentence-based | Split at sentence boundaries | General text | Low | Free |
| Recursive | Split by structure, then size | Structured docs | Medium | Free |
| Semantic | Split when topic changes | Long-form content | Medium | Embedding costs |
| Document-aware | Respect headings/sections | Reports, papers | Medium | Free |
| LLM-based | LLM decides boundaries | Complex docs | High | LLM costs |
| Parent-child | Small for retrieval, large for context | Q&A systems | Medium | 2x storage |
| Contextual | Add context to each chunk | High-value docs | High | LLM costs |
Fixed-Size Chunking
The simplest approach: split text into chunks of N tokens (or characters), with optional overlap.
When to use: Homogeneous content without clear structure — transcripts, logs, social media posts.
Limitations: Breaks mid-sentence, ignores document structure, treats all content equally. This should be your baseline, not your production strategy.
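A baseline fixed-size chunker with overlap, approximating tokens with whitespace-split words (a production version would count real tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into roughly chunk_size-word chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk_words = words[start:start + chunk_size]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
        if start + chunk_size >= len(words):
            break
    return chunks
```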
Sentence-Based Chunking
Split at sentence boundaries, accumulating sentences until reaching the size limit.
When to use: General prose where sentence integrity matters.
Limitations: Doesn't understand document structure, may create uneven chunk sizes.
This approach ensures chunks never break mid-sentence, significantly improving coherence for the LLM.
Semantic Chunking
The most sophisticated approach: use embeddings to detect topic shifts and split at semantic boundaries.
How it works:
- Split text into sentences
- Embed each sentence
- Calculate similarity between adjacent sentences
- Split where similarity drops below threshold (indicating topic change)
When to use: Long documents with multiple topics, where topic coherence within chunks matters.
Tradeoff: Requires embedding every sentence during preprocessing — adds cost and complexity. But the 70% accuracy improvement often justifies this investment.
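A compact sketch of the splitting logic described above, assuming you have already split the text into sentences and embedded them with any model; the 0.75 threshold is illustrative and should be tuned (production versions also cap chunk size):

```python
import numpy as np

def semantic_chunks(sentences: list[str], embeddings: np.ndarray,
                    threshold: float = 0.75) -> list[str]:
    """Group sentences into chunks, splitting where adjacent similarity drops."""
    if not sentences:
        return []
    # Normalize so the dot product equals cosine similarity
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(norms[i - 1], norms[i]))
        if similarity < threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```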
Document-Aware Chunking
Respect document structure: split at headings, keep sections together, preserve hierarchy.
When to use: Structured documents (reports, papers, documentation) with clear sections.
How it works: Parse markdown or HTML structure, chunk within sections, include heading context in each chunk.
This is particularly effective for technical documentation where users query specific sections.
Parent-Child Chunking
Create two levels: small chunks for precise retrieval, large chunks for context.
The problem it solves: Small chunks retrieve precisely but lose context. Large chunks preserve context but retrieve imprecisely.
How it works:
- Create parent chunks (1500-2000 tokens)
- Create child chunks (300-500 tokens) within each parent
- Embed and index child chunks
- When retrieving, return the parent chunk for context
When to use: Q&A systems, legal documents, technical specs — anywhere precise retrieval needs surrounding context.
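A simplified sketch of the two-level structure, again using whitespace words as a stand-in for tokens. Only the child chunks are embedded; `parent_id` is stored as metadata so retrieval can return the parent text for context:

```python
from dataclasses import dataclass
import uuid

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str | None = None

def build_parent_child(text: str, parent_size: int = 1500,
                       child_size: int = 400) -> tuple[list[Chunk], list[Chunk]]:
    """Split into large parent chunks, then small child chunks linked to their parent."""
    words = text.split()
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        p_words = words[p_start:p_start + parent_size]
        parent = Chunk(id=str(uuid.uuid4()), text=" ".join(p_words))
        parents.append(parent)
        for c_start in range(0, len(p_words), child_size):
            c_words = p_words[c_start:c_start + child_size]
            children.append(Chunk(id=str(uuid.uuid4()), text=" ".join(c_words),
                                  parent_id=parent.id))
    return parents, children
```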
Contextual Chunking (Anthropic's Approach)
Anthropic's contextual retrieval prepends context to each chunk before embedding.
The problem it solves: Chunks lose context. "The company achieved this goal" — what company? What goal? The references made sense in the full document but become ambiguous in isolation.
How it works:
- For each chunk, prompt an LLM with the full document
- Generate a brief context statement: "This chunk is from Acme Corp's Q3 2024 earnings report, discussing revenue targets"
- Prepend context to chunk before embedding
Results: Anthropic reports a 67% reduction in retrieval failures. This is one of the most impactful techniques for improving RAG quality.
Tradeoff: Requires one LLM call per chunk during indexing. For 1000 chunks, that's 1000 API calls (~$1-10 depending on model). Worth it for high-value document collections.
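A rough sketch of the indexing-time step, using the OpenAI Python SDK as a stand-in for whichever LLM you prefer. The prompt is a paraphrase of the idea above, not Anthropic's exact prompt; with Claude you would also enable prompt caching on the document to keep per-chunk costs down:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; swap in your preferred LLM client

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document, to improve
search retrieval of the chunk. Answer with only that sentence."""

def contextualize_chunk(document: str, chunk: str, model: str = "gpt-4o-mini") -> str:
    """Prepend an LLM-generated context sentence to a chunk before embedding."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"
```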
Deep dive: For complete implementation details including contextual BM25, reranking integration, and prompt caching for cost optimization, see Contextual Retrieval: Solving RAG's Hidden Context Problem.
Late Chunking (Alternative Approach)
Late chunking is a more computationally efficient alternative to contextual retrieval.
How it works:
- Embed the entire document at the token level using a long-context embedding model
- After embedding, segment into chunks
- Pool token embeddings within each chunk
Advantage: Preserves context without additional LLM calls — the embeddings already capture document-level context because the full document was processed together.
Research status: ECIR 2025 research shows it's competitive with contextual retrieval at lower computational cost. A promising technique for cost-sensitive applications.
LLM-Based Chunking
Use an LLM to analyze documents and determine optimal chunk boundaries.
How it works: Pass the document to an LLM with instructions to identify logical segments based on topic, argument structure, or other criteria.
When to use: High-value, complex documents where other methods fail — legal documents with complex clause structures, research papers with intricate arguments.
Tradeoff: Most expensive approach (LLM call for every document), but can handle documents that defeat rule-based methods.
Chunking Recommendations by Document Type
| Document Type | Recommended Strategy | Chunk Size | Notes |
|---|---|---|---|
| Technical docs | Document-aware | 512 tokens | Respect section boundaries |
| Contracts | Parent-child + Contextual | 300/1500 tokens | Precision + context crucial |
| Research papers | Semantic | 512 tokens | Topic coherence matters |
| Support tickets | Fixed-size | 256 tokens | Short, uniform content |
| Transcripts | Sentence-based | 512 tokens | Speaker turns as boundaries |
| Code | AST-aware | Function/class | Preserve semantic units |
| FAQs | Per-question | Variable | Natural document structure |
Stage 4: Metadata Enrichment
Raw chunks benefit from additional metadata that improves retrieval and helps the LLM understand context.
Essential Metadata Fields
Source information: Document title, filename, URL, author, publication date. Enables filtering ("only search recent documents") and citation in responses.
Structural position: Section title, heading hierarchy, page number. Helps the LLM understand where information comes from.
Content type: Is this a paragraph, table, list, code block, image description? Different content types may need different retrieval strategies.
Chunk relationships: Previous/next chunk IDs enable expanding context when needed. Parent document ID enables document-level filtering.
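One way to carry this metadata through the pipeline is a single record type per chunk. The field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    text: str
    doc_id: str
    doc_title: str
    source_url: str | None = None      # source information
    section: str | None = None         # structural position (heading path)
    page: int | None = None
    content_type: str = "paragraph"    # paragraph, table, list, code, ...
    prev_chunk_id: str | None = None   # chunk relationships
    next_chunk_id: str | None = None
    enrichment: dict = field(default_factory=dict)  # LLM summary, keywords, questions, entities
```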
LLM-Generated Enrichment
For high-value document collections, LLMs can generate additional metadata:
Summary: A 1-2 sentence summary of the chunk's content. Improves retrieval for paraphrased queries where the user's words don't match the document's words.
Keywords: Key terms and concepts in the chunk. Enables keyword-based filtering alongside semantic search.
Questions: What questions could this chunk answer? Dramatically improves retrieval by matching query patterns. If a chunk contains "Revenue grew 15% in Q3", generating "What was the revenue growth in Q3?" as metadata helps match that query.
Entities: People, organizations, products, dates mentioned. Enables entity-based filtering and improves retrieval for queries about specific entities.
This enrichment costs one LLM call per chunk (~$0.001-0.01 each) but can significantly improve retrieval quality for specialized domains.
Stage 5: Embedding Generation
Embeddings convert text into vectors for similarity search. Model choice significantly affects retrieval quality.
2025 Embedding Model Landscape
The embedding model market has matured significantly. Here's the current state based on comprehensive benchmarks:
OpenAI text-embedding-3
text-embedding-3-large ($0.13/1M tokens, 8K context)
- Strong general-purpose performance
- Wins head-to-head on many retrieval benchmarks
- 3072 dimensions (can reduce with matryoshka embeddings)
- Good choice for production when you want reliability
text-embedding-3-small ($0.02/1M tokens, 8K context)
- Excellent cost-performance ratio
- Good enough for many production use cases
- Start here, upgrade only if quality insufficient
Voyage AI
Voyage-3-large ranks #1 across eight domains and 100 datasets, outperforming OpenAI by 9.74% on average.
voyage-3.5 ($0.06/1M tokens, 32K context)
- Best accuracy-cost balance for production
- Massive 32K token context window
- Excellent for long documents
voyage-3.5-lite ($0.02/1M tokens, 32K context)
- Budget-friendly option
- Still strong performance (66.1%)
- Best choice for cost-conscious implementations
Cohere Embed v4
Cohere's latest model offers unique capabilities:
- 128K token context window — embed entire documents
- Multimodal — text, images, and mixed content in same embedding space
- 100+ language support — native multilingual without translation
- Quantization support — reduce storage costs by up to 83%
Best for: Multilingual applications, multimodal content, very long documents.
Open Source Options
BAAI BGE-M3: Strong multilingual performance, optimized for RAG, self-hostable. Good choice if you can't use external APIs.
E5-Mistral: Combines Mistral's language understanding with E5's retrieval optimization. Strong performance on specialized domains.
Nomic Embed: Good balance of quality and efficiency for self-hosting. Apache 2.0 license.
Embedding Model Selection Guide
| Use Case | Recommended Model | Why |
|---|---|---|
| General production | text-embedding-3-small | Reliable, cost-effective |
| Quality-critical | Voyage-3.5 or text-embedding-3-large | Higher accuracy |
| Budget-constrained | voyage-3.5-lite | Best accuracy per dollar |
| Multilingual | Cohere Embed v4 or BGE-M3 | Native multilingual |
| Long documents | Voyage-3.5 (32K) or Cohere (128K) | Context length |
| Self-hosted | BGE-M3 or Nomic | No API dependency |
| Multimodal | Cohere Embed v4 | Text + images |
Embedding Best Practices
Batch embedding: Embed multiple chunks per API call to reduce latency and cost. Most APIs support batching.
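A minimal batching sketch using the OpenAI Python SDK; any embeddings API that accepts a list input works the same way, and batch size limits vary by provider:

```python
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small",
                 batch_size: int = 100) -> list[list[float]]:
    """Embed chunks in batches to reduce request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)  # order is preserved
    return vectors
```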
Dimensionality: Higher isn't always better. 1536 dimensions work well for most use cases. Use matryoshka embeddings or quantization to reduce storage costs while maintaining quality.
Preprocessing: Clean text thoroughly before embedding. Noise in text becomes noise in embeddings, degrading retrieval quality.
Query embedding: Use the same model for queries and documents. Mixing models (e.g., embedding docs with OpenAI, queries with Cohere) degrades retrieval quality significantly.
Asymmetric embedding: Some models (like E5) support different prefixes for queries vs documents. Use them correctly for best results.
Stage 6: Vector Storage and Retrieval
The final pipeline stage stores embeddings for fast retrieval. Vector database choice affects performance, cost, and operational complexity.
Vector Database Options
Managed Services
Pinecone: Industry standard for production. Excellent performance, simple API, automatic scaling. Higher cost but minimal ops burden.
Weaviate Cloud: Strong hybrid search (vector + keyword), good for complex queries. GraphQL API.
Qdrant Cloud: Good price-performance ratio, advanced filtering, growing ecosystem.
Self-Hosted
Chroma: Simple, embedded, great for prototyping. Limited scale and features for production.
Qdrant: Excellent self-hosted option. Rust-based, performant, full-featured.
Milvus: Designed for massive scale. More complex to operate but handles billions of vectors.
pgvector: PostgreSQL extension. Good if you're already on Postgres and want simplicity.
Hybrid Search: Vector + Keyword
Pure vector search has a fundamental weakness: it can miss exact keyword matches. If someone searches for "error code XYZ-123", semantic search might rank it below conceptually similar but wrong results.
Hybrid search combines:
- BM25 (keyword): Excels at exact matches, rare terms, specific identifiers
- Vector: Excels at semantic similarity, synonyms, conceptual matches
Reciprocal Rank Fusion (RRF) is the standard combination method. Rather than trying to normalize scores (which is mathematically tricky), RRF combines rank positions from each system.
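The fusion itself fits in a few lines. Each document scores 1/(k + rank) per result list, with k = 60 as the conventional constant from the original RRF paper:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked ID lists (e.g., BM25 and vector results) with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = rrf_fuse([bm25_ids, vector_ids])
```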
Most production RAG systems should use hybrid search, especially for technical content with specific terminology, identifiers, or domain-specific vocabulary.
Reranking for Precision
Initial retrieval casts a wide net — finding candidates that might be relevant. Reranking applies a more accurate model to sort those candidates.
The two-stage pattern:
- Retrieve 50-100 candidates using fast vector search
- Rerank with a cross-encoder or LLM
- Return top 5-10 to the LLM
Reranking options (cheapest to most expensive):
- Cross-encoder models (ms-marco-MiniLM): Fast, free (local), good quality
- Cohere Rerank: Excellent quality, easy API, moderate cost
- LLM-based reranking: Highest quality for complex relevance, expensive
Reranking adds latency (100-500ms) but can dramatically improve precision for complex queries where initial retrieval returns "almost right" results.
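A minimal cross-encoder reranker using sentence-transformers and the ms-marco-MiniLM model mentioned above; the model name and top_k are reasonable defaults, not requirements:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score (query, passage) pairs with a cross-encoder and keep the highest-scoring ones."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```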
Production Considerations
Cost Estimation
Before processing a document collection, estimate costs:
| Operation | Cost Driver | Typical Cost |
|---|---|---|
| Vision extraction | Per page | $0.01-0.03/page |
| OCR | Compute time | ~Free (local) |
| Embedding | Per token | $0.02-0.13/1M tokens |
| Enrichment (LLM) | Per chunk | $0.001-0.01/chunk |
| Contextual retrieval | Per chunk | $0.001-0.01/chunk |
| Storage | Per vector | Varies by provider |
Example: For a 1000-page document collection with ~5000 chunks:
- Embedding only: $0.10-0.65
- With enrichment: $5-50
- With contextual retrieval: $5-50 additional
- With vision extraction: $10-30 additional
Quality Monitoring
Implement quality checks throughout your pipeline:
Extraction quality: Word count per page (flag suspiciously low), character encoding validation, structure preservation checks.
Chunk quality: Average chunk size and variance, boundary quality (do chunks end at natural breaks?), coherence scores (semantic similarity within chunks).
Retrieval quality: Create a test set of 50-100 questions with known answers. Measure precision@k — does the correct chunk appear in top results? If precision@5 is below 80%, your pipeline needs work.
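The check described above (does the gold chunk appear in the top k?) can be scored with a few lines; `retrieve` is a placeholder for your own retrieval function:

```python
from typing import Callable

def hit_at_k(retrieved_ids: list[str], gold_id: str, k: int = 5) -> bool:
    """Did the known-correct chunk appear in the top-k results?"""
    return gold_id in retrieved_ids[:k]

def evaluate_retrieval(test_set: list[tuple[str, str]],
                       retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """test_set pairs each question with its gold chunk ID; retrieve returns ranked chunk IDs."""
    hits = sum(hit_at_k(retrieve(question), gold_id, k) for question, gold_id in test_set)
    return hits / len(test_set)
```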
Error Handling and Recovery
Production pipelines fail. Build resilience:
Checkpoint processing: Save state after each stage. If embedding fails, don't re-extract and re-chunk.
Document-level isolation: One bad document shouldn't fail the entire batch. Log errors and continue.
Retry with fallback: If native extraction fails, try OCR. If OCR fails, flag for manual review.
Idempotency: Running the pipeline twice should produce the same result. Use content hashes to detect already-processed documents.
Incremental Updates
Document collections change. Handle updates efficiently:
Content hashing: Hash each chunk. On re-processing, compare hashes to identify changes. Only re-embed changed chunks.
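A sketch of the hashing check, with hypothetical chunk-ID-to-text mappings standing in for your own storage layer:

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable content hash used to detect already-indexed chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_chunk_ids(new_chunks: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of chunks whose content differs from what is already embedded."""
    return [chunk_id for chunk_id, text in new_chunks.items()
            if stored_hashes.get(chunk_id) != chunk_hash(text)]
```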
Soft deletes: Mark old chunks as inactive rather than deleting. Enables rollback if updates introduce problems.
Version tracking: Maintain version metadata for documents. Enable temporal queries ("what did the policy say last quarter?").