
Long Document Processing: Strategies for LLM Applications Beyond Context Windows

Comprehensive guide to processing documents that exceed LLM context windows. Covers chunking strategies, map-reduce summarization, hierarchical processing, iterative refinement, and the 2025 landscape of extended context models.



LLM context windows have grown dramatically—from 4K tokens to millions in some models. Yet real-world documents still regularly exceed these limits. A single legal contract can span 100,000 tokens. A codebase contains millions. Even with expanded context, processing strategies matter: cramming everything into context isn't always the best approach for quality or cost.

This guide covers strategies for processing long documents: chunking approaches, map-reduce patterns, hierarchical summarization, iterative refinement, and practical guidance on when to use each technique.


The Long Document Challenge

Why Context Windows Aren't Enough

Even with models supporting 100K+ tokens, naive full-context processing has problems:

Lost in the middle: Models struggle to attend equally to all parts of long contexts. Information in the middle of long prompts receives less attention than information at the beginning or end.

Cost: Longer contexts cost more. Processing a 100K token document costs 25x more than processing 4K tokens.

Latency: Time-to-first-token scales with context length. Long contexts mean slower responses.

Quality degradation: Even when models technically handle long contexts, quality often degrades compared to processing smaller, relevant portions.

The Processing Spectrum

Long document processing exists on a spectrum:

Retrieval-based: Don't process the whole document. Retrieve relevant portions and process only those. Fast and cheap but may miss context.

Chunk-and-process: Divide the document into chunks, process each independently, then combine results. Scalable but loses cross-chunk context.

Hierarchical: Build summaries at multiple levels of abstraction. Preserves structure but adds complexity.

Full-context: Process the entire document at once. Simple but expensive and quality-limited.

The right approach depends on the task, document type, quality requirements, and budget.


Chunking Strategies

Chunking divides documents into smaller pieces suitable for LLM processing. The art is chunking in ways that preserve meaning and minimize information loss at boundaries.

Fixed-Size Chunking

The simplest approach: divide text into chunks of fixed token count.

Implementation: Split every N tokens, with M tokens of overlap between chunks.

Typical values: 400-800 tokens per chunk, 50-100 tokens overlap.

Advantages: Simple, predictable, works for any text.

Disadvantages: Ignores document structure. Chunks may cut mid-sentence, mid-paragraph, or mid-section, losing context.
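
As a concrete reference, here's a minimal sketch of fixed-size chunking with overlap, using tiktoken for tokenization; the tokenizer choice and the size/overlap values are illustrative, not requirements:

```python
# Minimal sketch: fixed-size token chunking with overlap.
# Assumes tiktoken is installed; chunk_size/overlap are illustrative values.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 600, overlap: int = 75) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # advance by chunk size minus the overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```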

Semantic Chunking

Split at natural boundaries where meaning changes:

Sentence-level: Split between sentences. Preserves sentence integrity but chunks vary in size.

Paragraph-level: Split between paragraphs. Preserves topical coherence within chunks.

Section-level: Split at section headings. Preserves document structure for structured documents.

Semantic similarity: Detect topic shifts by measuring embedding similarity between adjacent sentences. Split when similarity drops below threshold.

Advantages: Preserves semantic coherence. Chunks contain complete thoughts.

Disadvantages: Variable chunk sizes complicate batch processing. Requires detecting boundaries.
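
A sketch of similarity-based splitting follows; the regex sentence splitter, the `embed` function, and the 0.75 threshold are all stand-ins to tune for your embedding model:

```python
# Sketch: split where embedding similarity between adjacent sentences drops.
# `embed` is a placeholder for any sentence-embedding function you already use.
import re
import numpy as np

def semantic_chunks(text: str, embed, threshold: float = 0.75) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vectors = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        sim = prev @ vec / (np.linalg.norm(prev) * np.linalg.norm(vec))
        if sim < threshold:  # topic shift detected: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```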

Recursive Chunking

Recursively split documents using multiple strategies:

  1. Try splitting by sections
  2. If the sections are too large, split by paragraphs
  3. If the paragraphs are too large, split by sentences
  4. If the sentences are too large, split at a fixed size

This produces chunks that respect structure when possible and fall back to fixed-size when necessary.
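
A minimal sketch of the cascade, using character counts as a rough proxy for tokens and an illustrative separator list (note this simple version drops the separators themselves):

```python
# Sketch: recursive splitting with a fallback cascade of separators.
# Separator order (sections -> paragraphs -> sentences) is illustrative.
def recursive_split(text: str, max_chars: int = 2000,
                    separators: tuple = ("\n\n\n", "\n\n", ". ")) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_chars, separators))
            return chunks
    # Last resort: hard split at fixed size.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```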

Document-Aware Chunking

Leverage document format for intelligent splitting:

Code: Split at function/class boundaries, preserving complete units.

Markdown: Split at headers, preserving section structure.

HTML: Split at semantic tags (article, section, div), respecting DOM structure.

PDF: Split at page boundaries or detected section breaks.

Tables: Keep tables together; don't split rows across chunks.
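
As one example, a sketch of Markdown-aware chunking that treats each heading as a chunk boundary (handling only H1-H3 here is an assumption):

```python
# Sketch: Markdown-aware chunking that splits at headers so each chunk
# is a complete section.
import re

def markdown_sections(text: str) -> list[str]:
    sections, current = [], []
    for line in text.splitlines():
        if re.match(r"^#{1,3} ", line) and current:  # a new H1-H3 starts a chunk
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```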

Overlap Strategies

Overlap between chunks ensures context at boundaries:

Fixed overlap: Include the same number of tokens from the previous chunk.

Sentence overlap: Include complete sentences from the previous chunk.

Summary overlap: Instead of raw text, include a summary of the previous chunk.

Overlap helps but adds redundancy. Balance context preservation against increased total tokens.


Map-Reduce Processing

Map-reduce is a foundational pattern for long document processing: process chunks independently, then combine results.

The Basic Pattern

Map phase: Apply the same operation to each chunk independently. For summarization, summarize each chunk. For extraction, extract from each chunk.

Reduce phase: Combine chunk results into a final result. For summarization, summarize the summaries. For extraction, deduplicate and merge extractions.

Summarization Example

For a 50,000-token document:

  1. Split: Divide into 50 chunks of 1,000 tokens each
  2. Map: Summarize each chunk into ~100 tokens (can parallelize)
  3. Combine: You now have ~5,000 tokens of summaries
  4. Reduce: Summarize the combined summaries into final output
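
A minimal sketch of this pipeline, where `llm(prompt) -> str` stands in for whatever completion call you use and the prompts are illustrative:

```python
# Sketch: map-reduce summarization with a parallel map phase.
from concurrent.futures import ThreadPoolExecutor

def map_reduce_summarize(chunks: list[str], llm) -> str:
    def summarize(chunk: str) -> str:
        return llm(f"Summarize the following in about 100 tokens:\n\n{chunk}")

    # Map: summarize every chunk in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(summarize, chunks))

    # Reduce: summarize the concatenated partial summaries.
    joined = "\n\n".join(partials)
    return llm(f"Combine these section summaries into one coherent summary:\n\n{joined}")
```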

Advantages

Parallelization: Map operations are embarrassingly parallel. Process all chunks simultaneously for speed.

Scalability: Document length doesn't limit processing—just add more chunks.

Simplicity: Each operation is a straightforward LLM call.

Disadvantages

Information loss: Cumulative error risk—errors in early summaries propagate through layers.

Cross-chunk context: Independent chunk processing can't capture relationships spanning chunks.

Cost consideration: Map-reduce isn't necessarily cheaper than long-context processing. You still pay to process every original token, plus the intermediate summaries it generates.

When to Use

Map-reduce excels when:

  • Chunks can be processed independently (each contains complete information units)
  • The reduce operation naturally combines partial results
  • Parallelization is valuable for latency
  • Document structure aligns with chunking

Hierarchical Processing

Hierarchical summarization builds multiple abstraction levels, creating a pyramid from detailed chunks to high-level summary.

The Hierarchical Structure

Level 0: Original chunks (most detailed)

Level 1: Summaries of chunks (less detailed)

Level 2: Summaries of level 1 summaries (higher abstraction)

Level N: Final summary (most abstract)

Each level provides a view at different granularity. Users or downstream processes can access the appropriate level.

Building the Hierarchy

For a 300-page document:

  1. Chunk: Divide into chapters or sections
  2. Level 1: Summarize each chapter
  3. Level 2: Group chapter summaries, summarize groups
  4. Level 3: Summarize all level 2 summaries into final summary

This approach retains narrative flow better than summarizing random fixed-size chunks.
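
A sketch of hierarchy construction, again with `llm` as a placeholder completion function and an assumed group size of five:

```python
# Sketch: build summary levels until a single summary remains.
def build_hierarchy(chunks: list[str], llm, group_size: int = 5) -> list[list[str]]:
    levels = [chunks]  # level 0: original chunks
    while len(levels[-1]) > 1:
        prev = levels[-1]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        levels.append([
            llm("Summarize, preserving key facts:\n\n" + "\n\n".join(g))
            for g in groups
        ])
    return levels  # levels[-1][0] is the final, most abstract summary
```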

Recent Research (2025)

CoTHSSum: Integrates hierarchical segmentation with chain-of-thought prompting. The model reasons through structured segments, producing more coherent summaries.

NEXUSSUM: Uses hierarchical LLM agents for long-form summarization. Achieves up to 30% improvement over baselines, particularly for narratives where hierarchical processing mitigates context truncation.

Advantages

Preserves structure: Document organization is reflected in the hierarchy.

Multi-granularity access: Different use cases can access appropriate abstraction levels.

Scalability: Handles arbitrarily long documents by adding hierarchy levels.

Disadvantages

Complexity: More complex to implement and manage than flat approaches.

Latency: Building the full hierarchy takes time.

Error propagation: Early errors compound through layers.


Iterative Refinement

Iterative refinement processes documents sequentially, building up understanding over multiple passes.

Refine Pattern

Process the first chunk, then iteratively refine with each subsequent chunk:

  1. Initialize: Summarize first chunk
  2. Iterate: For each subsequent chunk, combine existing summary with new chunk, produce refined summary
  3. Complete: Final iteration produces the finished summary

Each step has access to both the running summary and new content.
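
A minimal sketch of the refine loop, with `llm` as a placeholder completion function and illustrative prompt wording:

```python
# Sketch: the refine pattern. The running summary carries context forward.
def refine_summarize(chunks: list[str], llm) -> str:
    summary = llm(f"Summarize:\n\n{chunks[0]}")
    for chunk in chunks[1:]:
        summary = llm(
            "Refine the existing summary with the new content. "
            "Keep earlier details that remain relevant.\n\n"
            f"Existing summary:\n{summary}\n\nNew content:\n{chunk}"
        )
    return summary
```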

Advantages Over Map-Reduce

Context preservation: The running summary carries context across chunks. Later chunks benefit from understanding built from earlier chunks.

Incremental improvement: Each iteration can refine and correct earlier processing.

Disadvantages

Sequential: Can't parallelize—each step depends on the previous.

Order sensitivity: Results depend on chunk ordering.

Compression artifacts: Running summary may over-compress important details as document length increases.

When to Use

Iterative refinement excels when:

  • Document has narrative or logical flow that builds
  • Cross-chunk context is important
  • Parallelization isn't a priority
  • Quality matters more than speed

Extended Context Models

The 2025 landscape includes models with dramatically extended context windows.

The Extended Context Landscape

| Model | Context Window |
|---|---|
| Gemini 1.5 Pro | 2M tokens |
| Claude 3.5 Sonnet | 200K tokens |
| GPT-4.5 | 128K tokens |
| Llama 3.1 | 128K tokens |
| Claude 3 Opus | 200K tokens |

When to Use Extended Context

Full document fits: If your document fits within context, full-context processing is often simpler and higher quality than chunking.

Cross-document reasoning: Tasks requiring reasoning across the entire document benefit from full context.

Quality priority: When quality matters more than cost, full context avoids chunking artifacts.

When to Still Use Chunking

Cost sensitivity: Processing 100K tokens costs significantly more than processing 10K tokens.

Retrieval tasks: For question-answering, retrieving relevant chunks is faster and often better than searching full context.

Recurring processing: If you'll process the same document many times, pre-chunking and indexing is more efficient.

Lost-in-the-middle concerns: For extraction tasks where precision matters, focused retrieval beats long context.

Hybrid Approaches

Combine extended context with retrieval:

  1. Chunk and index: Create searchable chunks for the document
  2. Retrieve broadly: For a query, retrieve more chunks than strictly necessary
  3. Process in context: Use extended context to process retrieved chunks together
  4. Benefit from both: Retrieval focuses attention; extended context provides more surrounding context
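
A sketch of this hybrid flow, with `embed` and `llm` as placeholders and a deliberately generous `top_k`:

```python
# Sketch: retrieve broadly, then answer over the retrieved chunks in one
# extended-context call. Cosine similarity over precomputed embeddings.
import numpy as np

def retrieve_then_read(query: str, chunks: list[str], embed, llm, top_k: int = 20) -> str:
    qv = np.asarray(embed(query))
    scored = []
    for chunk in chunks:
        cv = np.asarray(embed(chunk))
        scored.append((float(qv @ cv / (np.linalg.norm(qv) * np.linalg.norm(cv))), chunk))
    top = [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
    context = "\n\n---\n\n".join(top)
    return llm(f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}")
```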

Task-Specific Strategies

Summarization

Short documents (< 10K tokens): Full context, single prompt.

Medium documents (10K-50K tokens): Full context if available, otherwise map-reduce with careful reduce prompting.

Long documents (50K+ tokens): Hierarchical summarization. Build structure-aware summaries at multiple levels.

Very long documents (500K+ tokens): Multi-stage hierarchical processing with chunking aligned to document structure.

Question Answering

Known question at index time: Chunk documents, embed chunks, retrieve relevant chunks for each question.

Unknown questions: Index comprehensively, retrieve relevant chunks, use extended context for multi-chunk reasoning.

Complex questions: Iterative retrieval—answer partial questions, use answers to inform further retrieval.

Information Extraction

Structured documents: Use document-aware chunking to keep related content together.

Extraction across document: Process chunks, deduplicate extractions, validate consistency.

Relationship extraction: May need extended context or iterative processing to capture cross-chunk relationships.

Analysis and Reasoning

Full-document reasoning: Tasks requiring holistic understanding need either full context or very carefully designed hierarchical processing.

Section-by-section analysis: Map-reduce works when sections can be analyzed independently.

Comparative analysis: May need to process related sections together using retrieval to identify relationships.


Production Considerations

Cost Management

Long document processing can be expensive. Strategies:

Cache intermediate results: Store chunk summaries for reuse.

Tiered processing: Use cheaper models for initial processing, expensive models for final refinement.

Adaptive chunking: Use larger chunks with cheaper models when quality requirements are lower.

Skip irrelevant sections: For focused tasks, identify and skip irrelevant document sections.

Latency Management

Long document processing is slow. Strategies:

Parallel processing: Maximize parallelization in map phases.

Streaming results: Stream partial results as they become available.

Precomputation: Pre-process documents when they're uploaded, not when they're queried.

Progressive disclosure: Show high-level results quickly, refine detail asynchronously.

Quality Assurance

Validation: Verify outputs against source documents for factual accuracy.

Human review: For high-stakes applications, human review of processed outputs.

Confidence signals: Track model confidence across processing stages.

Comparison: Test processing strategies on representative documents before production deployment.


The LLM×MapReduce Framework

LLM×MapReduce is a recent training-free framework specifically designed for long text processing using divide-and-conquer.

Key Challenges Addressed

The main challenge for divide-and-conquer frameworks is losing essential long-range information when splitting documents. Disrupted information falls into two categories:

Inter-chunk dependency: Information in one chunk depends on context from another chunk.

Inter-chunk conflict: Different chunks provide conflicting information that must be reconciled.

The Solution

LLM×MapReduce introduces a structured information protocol that:

  • Identifies dependencies between chunks
  • Passes relevant context across chunk boundaries
  • Resolves conflicts in the reduce phase
  • Maintains coherence across the full document

This enables short-context models to effectively handle long contexts without losing critical information.


Agentic Document Workflows

Traditional document processing follows predetermined paths: chunk, process, combine. Agentic approaches let an LLM-powered agent dynamically decide how to navigate and process documents based on the task at hand. This proves especially powerful for complex documents where the optimal processing strategy depends on content discovered during processing.

The Agentic Paradigm Shift

Static pipelines assume you know the processing strategy upfront. But consider a legal contract analysis task: should you summarize the entire document, extract specific clauses, compare terms across sections, or identify risks? The answer depends on what the document contains—which you don't know until you start reading.

Agentic document processing flips this model. Instead of prescribing a fixed pipeline, you give an agent tools for document navigation and let it decide how to explore and extract based on the task and what it discovers.

Agent Tools for Document Processing

Effective document agents need tools that mirror how humans read documents:

Navigation tools:

  • Get table of contents / section headings
  • Jump to specific section by name or number
  • Move to next/previous section
  • Search for keywords within document

Reading tools:

  • Read current section (with token budget)
  • Get section summary (pre-computed or on-demand)
  • Read specific page range
  • Extract tables or figures

Processing tools:

  • Summarize current content
  • Extract structured data from current section
  • Compare two sections
  • Add to working memory / notes

Output tools:

  • Compile findings into report
  • Generate structured output
  • Flag sections for human review
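
As an illustration, tool definitions for such an agent might look like the following, written in the generic JSON-schema style most tool-calling APIs accept; the tool names and parameters are hypothetical:

```python
# Sketch: hypothetical tool definitions for a document-navigation agent.
DOCUMENT_TOOLS = [
    {
        "name": "get_outline",
        "description": "Return the table of contents with section ids.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "read_section",
        "description": "Read a section's text, truncated to a token budget.",
        "parameters": {
            "type": "object",
            "properties": {
                "section_id": {"type": "string"},
                "max_tokens": {"type": "integer", "default": 1500},
            },
            "required": ["section_id"],
        },
    },
    {
        "name": "search_document",
        "description": "Keyword search; returns matching sections and snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```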

Multi-Agent Document Processing

Complex documents benefit from specialized agents working together:

Coordinator agent: Receives the user query, develops a processing plan, delegates to specialists, synthesizes final response.

Section specialists: Agents fine-tuned or prompted for specific content types—financial data extraction, legal clause analysis, technical specification parsing.

Verification agent: Reviews extracted information against source content, flags inconsistencies, requests re-processing when confidence is low.

This architecture handles documents where different sections require radically different processing approaches—a corporate filing with financial tables, legal disclaimers, and narrative sections.

Adaptive Chunking via Agents

Rather than pre-determining chunk boundaries, agents can chunk adaptively:

  1. Initial scan: Agent skims document structure (TOC, headings, length)
  2. Strategy selection: Based on document type and task, agent decides chunking approach
  3. Dynamic adjustment: As processing proceeds, agent adjusts chunk sizes based on content density and relevance
  4. Context management: Agent maintains a working memory of key findings, using it to inform subsequent processing

This outperforms static chunking when documents have heterogeneous structure—dense technical sections interspersed with boilerplate.

When Agentic Approaches Excel

Agentic document processing adds complexity and cost. It's most valuable when:

  • Document structure is unknown: You can't design a static pipeline without knowing the document format
  • Task requires exploration: Finding specific information that could be anywhere in the document
  • Processing strategy depends on content: Different sections need different treatment
  • Quality matters more than cost: The agent's adaptive decisions improve output quality

For routine processing of known document formats, static pipelines remain simpler and more predictable.


Library Comparison for Document Processing

The ecosystem offers multiple libraries for document processing, each with different strengths. Choosing the right library depends on your document types, processing needs, and existing infrastructure.

LangChain Document Loaders

LangChain provides extensive document loading capabilities through its document_loaders module.

Strengths:

  • Broad format support (100+ loaders): PDF, HTML, Markdown, Office documents, email, databases, APIs
  • Consistent Document interface across all loaders
  • Integration with LangChain's text splitters for chunking
  • Active development and community support

Document loading patterns:

  • Directory loaders for batch processing
  • Lazy loading for memory efficiency with large documents
  • Metadata extraction and propagation
  • Custom loader extension via base classes

Text splitting options:

  • RecursiveCharacterTextSplitter (recommended default)
  • TokenTextSplitter for precise token counts
  • MarkdownTextSplitter for markdown structure
  • Language-specific splitters for code
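
A typical load-and-split flow looks like this; import paths vary across LangChain versions (this reflects the post-0.1 package split, and PyPDFLoader additionally requires pypdf):

```python
# Typical LangChain flow: load a document, then split it into chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("contract.pdf")      # one Document per page, with metadata
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)   # metadata propagates to every chunk
```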

Best for: Projects already using LangChain, broad format requirements, rapid prototyping.

LlamaIndex Data Connectors

LlamaIndex (formerly GPT Index) focuses on data indexing and retrieval with strong document processing support.

Strengths:

  • Deep integration with indexing and retrieval
  • Hierarchical index structures for multi-level access
  • Node parsers that create rich node relationships
  • Advanced sentence window and auto-merging retrievers

Distinctive features:

  • Sentence window indexing: Each node includes surrounding sentences for context
  • Hierarchical parsers: Automatically extract document hierarchy
  • Metadata extractors: LLM-based metadata generation during ingestion
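
For example, sentence-window parsing looks roughly like this (import paths reflect the llama-index-core layout and may differ across versions):

```python
# Sentence-window indexing: each node stores its sentence plus the
# surrounding sentences in metadata for context at retrieval time.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser

documents = SimpleDirectoryReader("data/").load_data()
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences of context on each side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)
```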

Best for: RAG-focused applications, complex retrieval requirements, applications needing document-level and chunk-level access.

Unstructured.io

Unstructured focuses specifically on document parsing and element extraction.

Strengths:

  • Superior handling of complex layouts (tables, figures, multi-column)
  • Element-type detection (title, narrative text, list item, table)
  • Partitioning strategies (fast, hi_res, ocr_only)
  • API service for scalable processing

Partitioning approach:

  • Detects document elements rather than raw text extraction
  • Preserves table structure as structured data
  • Extracts images and figures separately
  • Maintains reading order across complex layouts
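
A minimal partitioning example; element categories and the table HTML in metadata are part of Unstructured's standard output, though the hi_res strategy requires its layout-model dependencies:

```python
# Unstructured's element-based partitioning: elements carry a detected
# category (Title, NarrativeText, Table, ...) rather than raw text.
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf", strategy="hi_res")
for el in elements:
    print(el.category, el.text[:60])

tables = [el for el in elements if el.category == "Table"]
html = tables[0].metadata.text_as_html if tables else None  # table structure as HTML
```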

Best for: Documents with complex layouts, scanned documents requiring OCR, enterprises needing managed service.

DocETL

DocETL is a declarative framework specifically for LLM-powered document processing pipelines.

Strengths:

  • Declarative YAML pipeline definition
  • Built-in optimization for LLM operations
  • Designed for complex multi-step document analysis
  • Automatic retry and error handling

Pipeline approach: Define operations (map, reduce, filter, resolve) declaratively; the framework handles execution, parallelization, and optimization.
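
The sketch below is illustrative of that declarative shape rather than a verified pipeline; the field names follow DocETL's documented map-operation pattern, but check the current docs for the exact schema:

```yaml
# Illustrative DocETL-style pipeline: one map operation over documents.
default_model: gpt-4o-mini
datasets:
  documents:
    type: file
    path: docs.json
operations:
  - name: extract_obligations
    type: map
    prompt: |
      List the key obligations in this contract section: {{ input.text }}
    output:
      schema:
        obligations: "list[str]"
pipeline:
  steps:
    - name: analyze
      input: documents
      operations:
        - extract_obligations
  output:
    type: file
    path: results.json
```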

Best for: Complex document analysis workflows, teams preferring declarative over imperative approaches.

PyMuPDF and pdfplumber

For PDF-specific workloads, specialized libraries offer fine-grained control.

PyMuPDF (fitz):

  • Fast PDF parsing with low-level access
  • Text, image, and annotation extraction
  • Page-level manipulation
  • Efficient for large PDF collections

pdfplumber:

  • Excellent table extraction
  • Visual debugging with page images
  • Character-level positioning data
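
Quick examples of both, using each library's standard calls:

```python
# PyMuPDF for fast text extraction; pdfplumber for table extraction.
import fitz  # PyMuPDF
import pdfplumber

# PyMuPDF: page-by-page text extraction.
doc = fitz.open("report.pdf")
pages = [page.get_text() for page in doc]

# pdfplumber: tables with cell structure preserved.
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()  # list of rows per detected table
```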

Best for: PDF-heavy workloads requiring fine control, table extraction, custom parsing logic.

Comparison Table

| Capability | LangChain | LlamaIndex | Unstructured | DocETL |
|---|---|---|---|---|
| Format breadth | Excellent | Good | Good | Limited |
| PDF tables | Basic | Basic | Excellent | Depends on loader |
| OCR support | Via integrations | Via integrations | Built-in | Via integrations |
| Hierarchical processing | Via custom code | Built-in | Via elements | Built-in |
| LLM integration | Native | Native | Separate step | Native |
| Async processing | Supported | Supported | API service | Supported |
| Self-hosted | Yes | Yes | Yes + API | Yes |

Choosing a Library

Start with LangChain if you're building a general LLM application and need broad format support with a straightforward API.

Choose LlamaIndex if retrieval quality is paramount and you want sophisticated indexing structures like sentence windows or auto-merging.

Use Unstructured if you're processing complex documents with tables, figures, or layouts that simpler parsers struggle with.

Consider DocETL if you have complex multi-step document analysis that benefits from declarative pipeline definition.

Use specialized PDF libraries when you need fine-grained control over PDF parsing or are processing large PDF collections where efficiency matters.

In practice, many production systems combine libraries: Unstructured for parsing complex documents, LangChain or LlamaIndex for chunking and indexing, specialized libraries for particular document types.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
