
Long Document Processing: Strategies for LLM Applications Beyond Context Windows

Comprehensive guide to processing documents that exceed LLM context windows. Covers chunking strategies, map-reduce summarization, hierarchical processing, iterative refinement, and the 2025 landscape of extended context models.



LLM context windows have grown dramatically—from 4K tokens to millions in some models. Yet real-world documents still regularly exceed these limits. A single legal contract can span 100,000 tokens. A codebase contains millions. Even with expanded context, processing strategies matter: cramming everything into context isn't always the best approach for quality or cost.

This guide covers strategies for processing long documents: chunking approaches, map-reduce patterns, hierarchical summarization, iterative refinement, and practical guidance on when to use each technique.


The Long Document Challenge

Why Context Windows Aren't Enough

Even with models supporting 100K+ tokens, naive full-context processing has problems:

Lost in the middle: Models struggle to attend equally to all parts of long contexts. Information in the middle of long prompts receives less attention than information at the beginning or end.

Cost: Longer contexts cost more. Processing a 100K token document costs 25x more than processing 4K tokens.

Latency: Time-to-first-token scales with context length. Long contexts mean slower responses.

Quality degradation: Even when models technically handle long contexts, quality often degrades compared to processing smaller, relevant portions.

The Processing Spectrum

Long document processing exists on a spectrum:

Retrieval-based: Don't process the whole document. Retrieve relevant portions and process only those. Fast and cheap but may miss context.

Chunk-and-process: Divide the document into chunks, process each independently, then combine results. Scalable but loses cross-chunk context.

Hierarchical: Build summaries at multiple levels of abstraction. Preserves structure but adds complexity.

Full-context: Process the entire document at once. Simple but expensive and quality-limited.

The right approach depends on the task, document type, quality requirements, and budget.


Chunking Strategies

Chunking divides documents into smaller pieces suitable for LLM processing. The art is chunking in ways that preserve meaning and minimize information loss at boundaries.

Fixed-Size Chunking

The simplest approach: divide text into chunks of fixed token count.

Implementation: Split every N tokens, with M tokens of overlap between chunks.

Typical values: 400-800 tokens per chunk, 50-100 tokens overlap.

Advantages: Simple, predictable, works for any text.

Disadvantages: Ignores document structure. Chunks may cut mid-sentence, mid-paragraph, or mid-section, losing context.
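
As a concrete reference, here's a minimal sketch of fixed-size chunking with overlap, using tiktoken for tokenization; the tokenizer choice and the size/overlap values are illustrative, not requirements:

```python
# Minimal sketch: fixed-size token chunking with overlap.
# Assumes tiktoken is installed; chunk_size/overlap are illustrative values.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 600, overlap: int = 75) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # advance by chunk size minus the overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```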

Semantic Chunking

Split at natural boundaries where meaning changes:

Sentence-level: Split between sentences. Preserves sentence integrity but chunks vary in size.

Paragraph-level: Split between paragraphs. Preserves topical coherence within chunks.

Section-level: Split at section headings. Preserves document structure for structured documents.

Semantic similarity: Detect topic shifts by measuring embedding similarity between adjacent sentences. Split when similarity drops below threshold.

Advantages: Preserves semantic coherence. Chunks contain complete thoughts.

Disadvantages: Variable chunk sizes complicate batch processing. Requires detecting boundaries.
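
A sketch of similarity-based splitting follows; the regex sentence splitter, the `embed` function, and the 0.75 threshold are all stand-ins to tune for your embedding model:

```python
# Sketch: split where embedding similarity between adjacent sentences drops.
# `embed` is a placeholder for any sentence-embedding function you already use.
import re
import numpy as np

def semantic_chunks(text: str, embed, threshold: float = 0.75) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vectors = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        sim = prev @ vec / (np.linalg.norm(prev) * np.linalg.norm(vec))
        if sim < threshold:  # topic shift detected: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```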

Recursive Chunking

Recursively split documents using multiple strategies:

  1. Try splitting by sections
  2. If the sections are too large, split by paragraphs
  3. If the paragraphs are too large, split by sentences
  4. If the sentences are too large, split at a fixed size

This produces chunks that respect structure when possible and fall back to fixed-size when necessary.
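
A minimal sketch of the cascade, using character counts as a rough proxy for tokens and an illustrative separator list (note this simple version drops the separators themselves):

```python
# Sketch: recursive splitting with a fallback cascade of separators.
# Separator order (sections -> paragraphs -> sentences) is illustrative.
def recursive_split(text: str, max_chars: int = 2000,
                    separators: tuple = ("\n\n\n", "\n\n", ". ")) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_chars, separators))
            return chunks
    # Last resort: hard split at fixed size.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```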

Document-Aware Chunking

Leverage document format for intelligent splitting:

Code: Split at function/class boundaries, preserving complete units.

Markdown: Split at headers, preserving section structure.

HTML: Split at semantic tags (article, section, div), respecting DOM structure.

PDF: Split at page boundaries or detected section breaks.

Tables: Keep tables together; don't split rows across chunks.
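
As one example, a sketch of Markdown-aware chunking that treats each heading as a chunk boundary (handling only H1-H3 here is an assumption):

```python
# Sketch: Markdown-aware chunking that splits at headers so each chunk
# is a complete section.
import re

def markdown_sections(text: str) -> list[str]:
    sections, current = [], []
    for line in text.splitlines():
        if re.match(r"^#{1,3} ", line) and current:  # a new H1-H3 starts a chunk
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```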

Overlap Strategies

Overlap between chunks ensures context at boundaries:

Fixed overlap: Include the same number of tokens from the previous chunk.

Sentence overlap: Include complete sentences from the previous chunk.

Summary overlap: Instead of raw text, include a summary of the previous chunk.

Overlap helps but adds redundancy. Balance context preservation against increased total tokens.


Map-Reduce Processing

Map-reduce is a foundational pattern for long document processing: process chunks independently, then combine results.

The Basic Pattern

Map phase: Apply the same operation to each chunk independently. For summarization, summarize each chunk. For extraction, extract from each chunk.

Reduce phase: Combine chunk results into a final result. For summarization, summarize the summaries. For extraction, deduplicate and merge extractions.

Summarization Example

For a 50,000-token document:

  1. Split: Divide into 50 chunks of 1,000 tokens each
  2. Map: Summarize each chunk into ~100 tokens (can parallelize)
  3. Combine: You now have ~5,000 tokens of summaries
  4. Reduce: Summarize the combined summaries into final output
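
A minimal sketch of this pipeline, where `llm(prompt) -> str` stands in for whatever completion call you use and the prompts are illustrative:

```python
# Sketch: map-reduce summarization with a parallel map phase.
from concurrent.futures import ThreadPoolExecutor

def map_reduce_summarize(chunks: list[str], llm) -> str:
    def summarize(chunk: str) -> str:
        return llm(f"Summarize the following in about 100 tokens:\n\n{chunk}")

    # Map: summarize every chunk in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(summarize, chunks))

    # Reduce: summarize the concatenated partial summaries.
    joined = "\n\n".join(partials)
    return llm(f"Combine these section summaries into one coherent summary:\n\n{joined}")
```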

Advantages

Parallelization: Map operations are embarrassingly parallel. Process all chunks simultaneously for speed.

Scalability: Document length doesn't limit processing—just add more chunks.

Simplicity: Each operation is a straightforward LLM call.

Disadvantages

Information loss: Cumulative error risk—errors in early summaries propagate through layers.

Cross-chunk context: Independent chunk processing can't capture relationships spanning chunks.

Cost consideration: Map-reduce isn't necessarily cheaper than long-context processing. You still pay to process every original token, plus the intermediate summaries it generates.

When to Use

Map-reduce excels when:

  • Chunks can be processed independently (each contains complete information units)
  • The reduce operation naturally combines partial results
  • Parallelization is valuable for latency
  • Document structure aligns with chunking

Hierarchical Processing

Hierarchical summarization builds multiple abstraction levels, creating a pyramid from detailed chunks to high-level summary.

The Hierarchical Structure

Level 0: Original chunks (most detailed)

Level 1: Summaries of chunks (less detailed)

Level 2: Summaries of level 1 summaries (higher abstraction)

Level N: Final summary (most abstract)

Each level provides a view at different granularity. Users or downstream processes can access the appropriate level.

Building the Hierarchy

For a 300-page document:

  1. Chunk: Divide into chapters or sections
  2. Level 1: Summarize each chapter
  3. Level 2: Group chapter summaries, summarize groups
  4. Level 3: Summarize all level 2 summaries into final summary

This approach retains narrative flow better than summarizing random fixed-size chunks.
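
A sketch of hierarchy construction, again with `llm` as a placeholder completion function and an assumed group size of five:

```python
# Sketch: build summary levels until a single summary remains.
def build_hierarchy(chunks: list[str], llm, group_size: int = 5) -> list[list[str]]:
    levels = [chunks]  # level 0: original chunks
    while len(levels[-1]) > 1:
        prev = levels[-1]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        levels.append([
            llm("Summarize, preserving key facts:\n\n" + "\n\n".join(g))
            for g in groups
        ])
    return levels  # levels[-1][0] is the final, most abstract summary
```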

Recent Research (2025)

CoTHSSum: Integrates hierarchical segmentation with chain-of-thought prompting. The model reasons through structured segments, producing more coherent summaries.

NEXUSSUM: Uses hierarchical LLM agents for long-form summarization. Achieves up to 30% improvement over baselines, particularly for narratives where hierarchical processing mitigates context truncation.

Advantages

Preserves structure: Document organization is reflected in the hierarchy.

Multi-granularity access: Different use cases can access appropriate abstraction levels.

Scalability: Handles arbitrarily long documents by adding hierarchy levels.

Disadvantages

Complexity: More complex to implement and manage than flat approaches.

Latency: Building the full hierarchy takes time.

Error propagation: Early errors compound through layers.


Iterative Refinement

Iterative refinement processes documents sequentially, building up understanding over multiple passes.

Refine Pattern

Process the first chunk, then iteratively refine with each subsequent chunk:

  1. Initialize: Summarize first chunk
  2. Iterate: For each subsequent chunk, combine existing summary with new chunk, produce refined summary
  3. Complete: Final iteration produces the finished summary

Each step has access to both the running summary and new content.
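
A minimal sketch of the refine loop, with `llm` as a placeholder completion function and illustrative prompt wording:

```python
# Sketch: the refine pattern. The running summary carries context forward.
def refine_summarize(chunks: list[str], llm) -> str:
    summary = llm(f"Summarize:\n\n{chunks[0]}")
    for chunk in chunks[1:]:
        summary = llm(
            "Refine the existing summary with the new content. "
            "Keep earlier details that remain relevant.\n\n"
            f"Existing summary:\n{summary}\n\nNew content:\n{chunk}"
        )
    return summary
```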

Advantages Over Map-Reduce

Context preservation: The running summary carries context across chunks. Later chunks benefit from understanding built from earlier chunks.

Incremental improvement: Each iteration can refine and correct earlier processing.

Disadvantages

Sequential: Can't parallelize—each step depends on the previous.

Order sensitivity: Results depend on chunk ordering.

Compression artifacts: Running summary may over-compress important details as document length increases.

When to Use

Iterative refinement excels when:

  • Document has narrative or logical flow that builds
  • Cross-chunk context is important
  • Parallelization isn't a priority
  • Quality matters more than speed

Extended Context Models

The 2025 landscape includes models with dramatically extended context windows.

The Extended Context Landscape

| Model | Context Window |
|---|---|
| Gemini 1.5 Pro | 2M tokens |
| Claude 3.5 Sonnet | 200K tokens |
| GPT-4.5 | 128K tokens |
| Llama 3.1 | 128K tokens |
| Claude 3 Opus | 200K tokens |

When to Use Extended Context

Full document fits: If your document fits within context, full-context processing is often simpler and higher quality than chunking.

Cross-document reasoning: Tasks requiring reasoning across the entire document benefit from full context.

Quality priority: When quality matters more than cost, full context avoids chunking artifacts.

When to Still Use Chunking

Cost sensitivity: Processing 100K tokens costs significantly more than processing 10K tokens.

Retrieval tasks: For question-answering, retrieving relevant chunks is faster and often better than searching full context.

Recurring processing: If you'll process the same document many times, pre-chunking and indexing is more efficient.

Lost-in-the-middle concerns: For extraction tasks where precision matters, focused retrieval beats long context.

Hybrid Approaches

Combine extended context with retrieval:

  1. Chunk and index: Create searchable chunks for the document
  2. Retrieve broadly: For a query, retrieve more chunks than strictly necessary
  3. Process in context: Use extended context to process retrieved chunks together
  4. Benefit from both: Retrieval focuses attention; extended context provides more surrounding context
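
A sketch of this hybrid flow, with `embed` and `llm` as placeholders and a deliberately generous `top_k`:

```python
# Sketch: retrieve broadly, then answer over the retrieved chunks in one
# extended-context call. Cosine similarity over precomputed embeddings.
import numpy as np

def retrieve_then_read(query: str, chunks: list[str], embed, llm, top_k: int = 20) -> str:
    qv = np.asarray(embed(query))
    scored = []
    for chunk in chunks:
        cv = np.asarray(embed(chunk))
        scored.append((float(qv @ cv / (np.linalg.norm(qv) * np.linalg.norm(cv))), chunk))
    top = [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
    context = "\n\n---\n\n".join(top)
    return llm(f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}")
```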

Task-Specific Strategies

Summarization

Short documents (< 10K tokens): Full context, single prompt.

Medium documents (10K-50K tokens): Full context if available, otherwise map-reduce with careful reduce prompting.

Long documents (50K+ tokens): Hierarchical summarization. Build structure-aware summaries at multiple levels.

Very long documents (500K+ tokens): Multi-stage hierarchical processing with chunking aligned to document structure.

Question Answering

Known question at index time: Chunk documents, embed chunks, retrieve relevant chunks for each question.

Unknown questions: Index comprehensively, retrieve relevant chunks, use extended context for multi-chunk reasoning.

Complex questions: Iterative retrieval—answer partial questions, use answers to inform further retrieval.

Information Extraction

Structured documents: Use document-aware chunking to keep related content together.

Extraction across document: Process chunks, deduplicate extractions, validate consistency.

Relationship extraction: May need extended context or iterative processing to capture cross-chunk relationships.

Analysis and Reasoning

Full-document reasoning: Tasks requiring holistic understanding need either full context or very carefully designed hierarchical processing.

Section-by-section analysis: Map-reduce works when sections can be analyzed independently.

Comparative analysis: May need to process related sections together using retrieval to identify relationships.


Production Considerations

Cost Management

Long document processing can be expensive. Strategies:

Cache intermediate results: Store chunk summaries for reuse.

Tiered processing: Use cheaper models for initial processing, expensive models for final refinement.

Adaptive chunking: Use larger chunks with cheaper models when quality requirements are lower.

Skip irrelevant sections: For focused tasks, identify and skip irrelevant document sections.

Latency Management

Long document processing is slow. Strategies:

Parallel processing: Maximize parallelization in map phases.

Streaming results: Stream partial results as they become available.

Precomputation: Pre-process documents when they're uploaded, not when they're queried.

Progressive disclosure: Show high-level results quickly, refine detail asynchronously.

Quality Assurance

Validation: Verify outputs against source documents for factual accuracy.

Human review: For high-stakes applications, human review of processed outputs.

Confidence signals: Track model confidence across processing stages.

Comparison: Test processing strategies on representative documents before production deployment.


The LLM×MapReduce Framework

LLM×MapReduce is a recent training-free framework specifically designed for long text processing using divide-and-conquer.

Key Challenges Addressed

The main challenge for divide-and-conquer frameworks is losing essential long-range information when splitting documents. Disrupted information falls into two categories:

Inter-chunk dependency: Information in one chunk depends on context from another chunk.

Inter-chunk conflict: Different chunks provide conflicting information that must be reconciled.

The Solution

LLM×MapReduce introduces a structured information protocol that:

  • Identifies dependencies between chunks
  • Passes relevant context across chunk boundaries
  • Resolves conflicts in the reduce phase
  • Maintains coherence across the full document

This enables short-context models to effectively handle long contexts without losing critical information.


Agentic Document Workflows

Traditional document processing follows predetermined paths: chunk, process, combine. Agentic approaches let an LLM-powered agent dynamically decide how to navigate and process documents based on the task at hand. This proves especially powerful for complex documents where the optimal processing strategy depends on content discovered during processing.

The Agentic Paradigm Shift

Static pipelines assume you know the processing strategy upfront. But consider a legal contract analysis task: should you summarize the entire document, extract specific clauses, compare terms across sections, or identify risks? The answer depends on what the document contains—which you don't know until you start reading.

Agentic document processing flips this model. Instead of prescribing a fixed pipeline, you give an agent tools for document navigation and let it decide how to explore and extract based on the task and what it discovers.

Agent Tools for Document Processing

Effective document agents need tools that mirror how humans read documents:

Navigation tools:

  • Get table of contents / section headings
  • Jump to specific section by name or number
  • Move to next/previous section
  • Search for keywords within document

Reading tools:

  • Read current section (with token budget)
  • Get section summary (pre-computed or on-demand)
  • Read specific page range
  • Extract tables or figures

Processing tools:

  • Summarize current content
  • Extract structured data from current section
  • Compare two sections
  • Add to working memory / notes

Output tools:

  • Compile findings into report
  • Generate structured output
  • Flag sections for human review
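
As an illustration, tool definitions for such an agent might look like the following, written in the generic JSON-schema style most tool-calling APIs accept; the tool names and parameters are hypothetical:

```python
# Sketch: hypothetical tool definitions for a document-navigation agent.
DOCUMENT_TOOLS = [
    {
        "name": "get_outline",
        "description": "Return the table of contents with section ids.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "read_section",
        "description": "Read a section's text, truncated to a token budget.",
        "parameters": {
            "type": "object",
            "properties": {
                "section_id": {"type": "string"},
                "max_tokens": {"type": "integer", "default": 1500},
            },
            "required": ["section_id"],
        },
    },
    {
        "name": "search_document",
        "description": "Keyword search; returns matching sections and snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```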

Multi-Agent Document Processing

Complex documents benefit from specialized agents working together:

Coordinator agent: Receives the user query, develops a processing plan, delegates to specialists, synthesizes final response.

Section specialists: Agents fine-tuned or prompted for specific content types—financial data extraction, legal clause analysis, technical specification parsing.

Verification agent: Reviews extracted information against source content, flags inconsistencies, requests re-processing when confidence is low.

This architecture handles documents where different sections require radically different processing approaches—a corporate filing with financial tables, legal disclaimers, and narrative sections.

Adaptive Chunking via Agents

Rather than pre-determining chunk boundaries, agents can chunk adaptively:

  1. Initial scan: Agent skims document structure (TOC, headings, length)
  2. Strategy selection: Based on document type and task, agent decides chunking approach
  3. Dynamic adjustment: As processing proceeds, agent adjusts chunk sizes based on content density and relevance
  4. Context management: Agent maintains a working memory of key findings, using it to inform subsequent processing

This outperforms static chunking when documents have heterogeneous structure—dense technical sections interspersed with boilerplate.

When Agentic Approaches Excel

Agentic document processing adds complexity and cost. It's most valuable when:

  • Document structure is unknown: You can't design a static pipeline without knowing the document format
  • Task requires exploration: Finding specific information that could be anywhere in the document
  • Processing strategy depends on content: Different sections need different treatment
  • Quality matters more than cost: The agent's adaptive decisions improve output quality

For routine processing of known document formats, static pipelines remain simpler and more predictable.


Library Comparison for Document Processing

The ecosystem offers multiple libraries for document processing, each with different strengths. Choosing the right library depends on your document types, processing needs, and existing infrastructure.

LangChain Document Loaders

LangChain provides extensive document loading capabilities through its document_loaders module.

Strengths:

  • Broad format support (100+ loaders): PDF, HTML, Markdown, Office documents, email, databases, APIs
  • Consistent Document interface across all loaders
  • Integration with LangChain's text splitters for chunking
  • Active development and community support

Document loading patterns:

  • Directory loaders for batch processing
  • Lazy loading for memory efficiency with large documents
  • Metadata extraction and propagation
  • Custom loader extension via base classes

Text splitting options:

  • RecursiveCharacterTextSplitter (recommended default)
  • TokenTextSplitter for precise token counts
  • MarkdownTextSplitter for markdown structure
  • Language-specific splitters for code
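
A typical load-and-split flow looks like this; import paths vary across LangChain versions (this reflects the post-0.1 package split, and PyPDFLoader additionally requires pypdf):

```python
# Typical LangChain flow: load a document, then split it into chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("contract.pdf")      # one Document per page, with metadata
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)   # metadata propagates to every chunk
```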

Best for: Projects already using LangChain, broad format requirements, rapid prototyping.

LlamaIndex Data Connectors

LlamaIndex (formerly GPT Index) focuses on data indexing and retrieval with strong document processing support.

Strengths:

  • Deep integration with indexing and retrieval
  • Hierarchical index structures for multi-level access
  • Node parsers that create rich node relationships
  • Advanced sentence window and auto-merging retrievers

Distinctive features:

  • Sentence window indexing: Each node includes surrounding sentences for context
  • Hierarchical parsers: Automatically extract document hierarchy
  • Metadata extractors: LLM-based metadata generation during ingestion
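
For example, sentence-window parsing looks roughly like this (import paths reflect the llama-index-core layout and may differ across versions):

```python
# Sentence-window indexing: each node stores its sentence plus the
# surrounding sentences in metadata for context at retrieval time.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser

documents = SimpleDirectoryReader("data/").load_data()
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences of context on each side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)
```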

Best for: RAG-focused applications, complex retrieval requirements, applications needing document-level and chunk-level access.

Unstructured.io

Unstructured focuses specifically on document parsing and element extraction.

Strengths:

  • Superior handling of complex layouts (tables, figures, multi-column)
  • Element-type detection (title, narrative text, list item, table)
  • Partitioning strategies (fast, hi_res, ocr_only)
  • API service for scalable processing

Partitioning approach:

  • Detects document elements rather than raw text extraction
  • Preserves table structure as structured data
  • Extracts images and figures separately
  • Maintains reading order across complex layouts
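
A minimal partitioning example; element categories and the table HTML in metadata are part of Unstructured's standard output, though the hi_res strategy requires its layout-model dependencies:

```python
# Unstructured's element-based partitioning: elements carry a detected
# category (Title, NarrativeText, Table, ...) rather than raw text.
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf", strategy="hi_res")
for el in elements:
    print(el.category, el.text[:60])

tables = [el for el in elements if el.category == "Table"]
html = tables[0].metadata.text_as_html if tables else None  # table structure as HTML
```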

Best for: Documents with complex layouts, scanned documents requiring OCR, enterprises needing managed service.

DocETL

DocETL is a declarative framework specifically for LLM-powered document processing pipelines.

Strengths:

  • Declarative YAML pipeline definition
  • Built-in optimization for LLM operations
  • Designed for complex multi-step document analysis
  • Automatic retry and error handling

Pipeline approach: Define operations (map, reduce, filter, resolve) declaratively; the framework handles execution, parallelization, and optimization.
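
The sketch below is illustrative of that declarative shape rather than a verified pipeline; the field names follow DocETL's documented map-operation pattern, but check the current docs for the exact schema:

```yaml
# Illustrative DocETL-style pipeline: one map operation over documents.
default_model: gpt-4o-mini
datasets:
  documents:
    type: file
    path: docs.json
operations:
  - name: extract_obligations
    type: map
    prompt: |
      List the key obligations in this contract section: {{ input.text }}
    output:
      schema:
        obligations: "list[str]"
pipeline:
  steps:
    - name: analyze
      input: documents
      operations:
        - extract_obligations
  output:
    type: file
    path: results.json
```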

Best for: Complex document analysis workflows, teams preferring declarative over imperative approaches.

PyMuPDF and pdfplumber

For PDF-specific workloads, specialized libraries offer fine-grained control.

PyMuPDF (fitz):

  • Fast PDF parsing with low-level access
  • Text, image, and annotation extraction
  • Page-level manipulation
  • Efficient for large PDF collections

pdfplumber:

  • Excellent table extraction
  • Visual debugging with page images
  • Character-level positioning data
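
Quick examples of both, using each library's standard calls:

```python
# PyMuPDF for fast text extraction; pdfplumber for table extraction.
import fitz  # PyMuPDF
import pdfplumber

# PyMuPDF: page-by-page text extraction.
doc = fitz.open("report.pdf")
pages = [page.get_text() for page in doc]

# pdfplumber: tables with cell structure preserved.
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()  # list of rows per detected table
```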

Best for: PDF-heavy workloads requiring fine control, table extraction, custom parsing logic.

Comparison Table

| Capability | LangChain | LlamaIndex | Unstructured | DocETL |
|---|---|---|---|---|
| Format breadth | Excellent | Good | Good | Limited |
| PDF tables | Basic | Basic | Excellent | Depends on loader |
| OCR support | Via integrations | Via integrations | Built-in | Via integrations |
| Hierarchical processing | Via custom code | Built-in | Via elements | Built-in |
| LLM integration | Native | Native | Separate step | Native |
| Async processing | Supported | Supported | API service | Supported |
| Self-hosted | Yes | Yes | Yes + API | Yes |

Choosing a Library

Start with LangChain if you're building a general LLM application and need broad format support with a straightforward API.

Choose LlamaIndex if retrieval quality is paramount and you want sophisticated indexing structures like sentence windows or auto-merging.

Use Unstructured if you're processing complex documents with tables, figures, or layouts that simpler parsers struggle with.

Consider DocETL if you have complex multi-step document analysis that benefits from declarative pipeline definition.

Use specialized PDF libraries when you need fine-grained control over PDF parsing or are processing large PDF collections where efficiency matters.

In practice, many production systems combine libraries: Unstructured for parsing complex documents, LangChain or LlamaIndex for chunking and indexing, specialized libraries for particular document types.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
