
Data Curation for LLM Training: The Hidden Foundation of Model Quality

A comprehensive guide to curating training data for large language models—from web crawl filtering and deduplication to quality classifiers and data mixing strategies. The unglamorous work that determines model quality.


Data Quality Determines Model Quality

The AI industry obsesses over model architectures, training techniques, and parameter counts. Yet the single biggest determinant of model quality is often the least discussed: training data.

The evidence is overwhelming:

Scale alone isn't enough: A 7B model trained on carefully filtered data can outperform models trained on 40% more tokens of unfiltered data. Quality beats quantity.

Data curation yields compounding returns: NVIDIA reports that proper data curation can improve downstream task performance by up to 7%—without changing the model architecture or training procedure.

Garbage in, garbage out: Models trained on noisy, duplicated, or low-quality data exhibit worse generalization, increased memorization, distorted perplexity metrics, and reduced sample efficiency.

The trend in 2025 is clear: prefer fewer, diverse, high-quality tokens over sheer volume. Including too many duplicates or boilerplate pages yields diminishing returns. The teams building the best models invest heavily in data curation—often more than in model architecture.

This guide covers the complete data curation pipeline: collection, filtering, deduplication, quality assessment, and mixing strategies.

The Data Curation Pipeline

A modern LLM data curation pipeline looks like this:

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                      DATA CURATION PIPELINE OVERVIEW                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  RAW DATA SOURCES                                                           │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐           │
│  │  Web    │  │  Books  │  │  Code   │  │  Papers │  │Conversa-│           │
│  │ Crawls  │  │         │  │  Repos  │  │         │  │  tions  │           │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘           │
│       └────────────┴────────────┴────────────┴────────────┘                 │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 1: TEXT EXTRACTION & NORMALIZATION                            │   │
│  │                                                                       │   │
│  │  - HTML parsing and boilerplate removal                              │   │
│  │  - PDF/document extraction                                            │   │
│  │  - Unicode normalization                                              │   │
│  │  - Language detection                                                 │   │
│  │  - Character encoding fixes                                           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 2: HEURISTIC FILTERING                                        │   │
│  │                                                                       │   │
│  │  Remove documents that fail rule-based checks:                       │   │
│  │  - Too short (< N characters)                                        │   │
│  │  - Too many special characters                                        │   │
│  │  - Excessive repetition (n-gram frequency)                           │   │
│  │  - Low word count / high symbol ratio                                │   │
│  │  - Blocklisted URLs/domains                                          │   │
│  │  - Detected as machine-generated spam                                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 3: DEDUPLICATION                                              │   │
│  │                                                                       │   │
│  │  - Exact deduplication (hash matching)                               │   │
│  │  - Near-duplicate removal (MinHash / SimHash)                        │   │
│  │  - URL-based deduplication                                            │   │
│  │  - Paragraph-level deduplication                                      │   │
│  │  - Semantic deduplication (optional, expensive)                      │   │
│  │                                                                       │   │
│  │  Typically removes 20-30% of raw web data                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 4: QUALITY FILTERING                                          │   │
│  │                                                                       │   │
│  │  Model-based quality assessment:                                      │   │
│  │  - Perplexity filtering (against reference LM)                       │   │
│  │  - Educational value classifier (FineWeb-Edu style)                  │   │
│  │  - Domain quality classifiers                                         │   │
│  │  - NSFW/toxicity filtering                                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 5: PII & SAFETY FILTERING                                     │   │
│  │                                                                       │   │
│  │  - PII detection and redaction (emails, phones, SSNs)               │   │
│  │  - Toxic content removal                                              │   │
│  │  - Copyright/license filtering (for code)                            │   │
│  │  - Sensitive information removal                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 6: DATA MIXING & SAMPLING                                     │   │
│  │                                                                       │   │
│  │  - Domain weighting (more code? more books?)                         │   │
│  │  - Language balancing                                                 │   │
│  │  - Upsampling high-quality sources                                   │   │
│  │  - Curriculum design                                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  FINAL TRAINING DATASET                                                     │
│  (Tokenized, shuffled, ready for training)                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Let's examine each stage in detail.

Stage 1: Data Collection and Sources

Web Crawls

The largest source of text data is web crawls, primarily Common Crawl—a nonprofit that crawls the web monthly and makes the data freely available.

Common Crawl statistics:

  • 250+ billion pages crawled
  • Petabytes of raw data
  • Monthly snapshots since 2008
  • Contains the full spectrum of web content quality

Working with Common Crawl requires significant processing:

WARC parsing: Raw crawl data is stored in WARC (Web ARChive) format. Extracting text requires parsing HTML, handling various encodings, and removing boilerplate.

Boilerplate removal: Web pages contain navigation, ads, footers, and other non-content elements. Tools like trafilatura, newspaper3k, or custom extractors isolate the main content.

Language detection: Common Crawl contains content in hundreds of languages. Language detection (using fastText, langdetect, or similar) enables language-specific filtering.
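
To make Stage 1 concrete, here is a minimal sketch of the extraction path, using warcio for WARC parsing together with the trafilatura and fastText tools mentioned above. The package choices, model path, and thresholds are illustrative assumptions rather than a prescribed stack.

Code
# Minimal sketch of Stage 1: WARC record -> main text -> language tag.
# Assumes the warcio, trafilatura, and fasttext packages plus a downloaded
# lid.176.bin fastText language-ID model; adapt paths and thresholds as needed.
from warcio.archiveiterator import ArchiveIterator
import trafilatura
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def extract_documents(warc_path, min_chars=200):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            # Boilerplate removal: keep only the main article/body text.
            text = trafilatura.extract(html)
            if not text or len(text) < min_chars:
                continue
            # Language detection (fastText expects a single line of text).
            labels, probs = lang_model.predict(text.replace("\n", " "))
            yield {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "text": text,
                "lang": labels[0].replace("__label__", ""),
                "lang_prob": float(probs[0]),
            }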

Books and Long-Form Content

Books provide high-quality, long-form text that's crucial for training models on extended reasoning and narrative coherence.

Sources:

  • Project Gutenberg (public domain books)
  • Internet Archive (digitized books)
  • Academic publications
  • Legal documents and court cases

Challenges:

  • OCR errors in scanned books
  • Formatting artifacts
  • Duplicate editions
  • Copyright considerations

Code Repositories

Code is essential for training models with programming capabilities.

Sources:

  • GitHub (via GHArchive, BigQuery, or direct API)
  • GitLab, Bitbucket
  • Package repositories (PyPI, npm, crates.io)
  • Stack Overflow

Considerations:

  • License filtering (many only include permissively licensed code)
  • Language distribution (balance Python-heavy datasets)
  • Documentation vs. code separation
  • Test code vs. production code

Scientific Papers

Academic papers provide high-quality technical content.

Sources:

  • arXiv (preprints)
  • Semantic Scholar
  • PubMed (biomedical)
  • ACL Anthology (NLP)

Challenges:

  • PDF extraction (LaTeX formatting, equations, tables)
  • Citation parsing
  • Section segmentation

Curated Datasets

Beyond raw sources, teams often include curated datasets:

Instruction datasets: FLAN, Natural Instructions, Dolly

Conversation datasets: ShareGPT, WildChat

High-quality web: Wikipedia, StackExchange, selected subreddits

Stage 2: Heuristic Filtering

Heuristic filters use rule-based methods to remove obviously low-quality content. They're fast, interpretable, and catch the worst offenders.

Common Heuristic Filters

Length filters:

  • Minimum character count (e.g., 200+ characters)
  • Minimum word count (e.g., 50+ words)
  • Maximum length (extremely long documents may be generated spam)

Character composition:

  • Alphabetic character ratio (mostly letters vs. symbols)
  • Digit ratio (too many numbers suggests tables/data dumps)
  • Special character ratio
  • Whitespace ratio

Word-level heuristics:

  • Mean word length (too short = abbreviations/noise; too long = URLs/code)
  • Stop word ratio (very low = unnatural text)
  • Unique word ratio (too low = repetitive content)

Repetition detection:

  • N-gram repetition (repeated phrases suggest generated spam)
  • Line repetition (same line appearing multiple times)
  • Paragraph repetition

Structural filters:

  • Bullet point/list ratio
  • Header count
  • Sentence length variance

Example Heuristics from Major Datasets

C4 (Colossal Clean Crawled Corpus):

  • Remove pages with fewer than 3 sentences
  • Remove sentences without terminal punctuation
  • Remove pages containing "lorem ipsum"
  • Remove pages with bad words (basic profanity filter)
  • Remove lines containing the word "JavaScript" (typically "please enable JavaScript" warnings)

RefinedWeb:

  • Minimum 100 characters
  • Remove pages where >30% of lines end with an ellipsis
  • Remove pages with excessive hashtags or @mentions
  • URL-based deduplication

FineWeb:

  • More sophisticated repetition detection
  • Stricter length requirements
  • Domain-specific rules (remove known spam domains)

Implementing Heuristic Filters

The key principle: be aggressive but monitor what you're removing. Sample removed documents to ensure you're not discarding valuable content.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEURISTIC FILTER EXAMPLES                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  GOOD DOCUMENT (PASSES ALL FILTERS):                                        │
│  ──────────────────────────────────                                         │
│  "Large language models have revolutionized natural language processing.    │
│   These models, trained on vast amounts of text data, can generate         │
│   coherent and contextually relevant text across many domains."            │
│                                                                             │
│  ✓ Length: 247 characters                                                   │
│  ✓ Word count: 35 words                                                     │
│  ✓ Alphabetic ratio: 0.85                                                   │
│  ✓ Stop word ratio: 0.31                                                    │
│  ✓ No excessive repetition                                                  │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 1 (FAILS LENGTH):                                             │
│  ─────────────────────────────────                                          │
│  "Click here!!!"                                                            │
│                                                                             │
│  ✗ Length: 14 characters (below threshold)                                  │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 2 (FAILS REPETITION):                                         │
│  ──────────────────────────────────                                         │
│  "Buy now! Best prices! Buy now! Best prices! Buy now! Best prices!        │
│   Buy now! Best prices! Buy now! Best prices! Buy now! Best prices!"       │
│                                                                             │
│  ✗ 3-gram repetition ratio: 0.95 (above threshold of 0.3)                   │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 3 (FAILS COMPOSITION):                                        │
│  ───────────────────────────────────                                        │
│  "$$$ ### *** !!! @@@ %%% ^^^ &&& ((( ))) +++ === --- ___"                 │
│                                                                             │
│  ✗ Alphabetic ratio: 0.0 (below threshold of 0.7)                           │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 4 (FAILS STOP WORD RATIO):                                    │
│  ──────────────────────────────────────                                     │
│  "Synergistic paradigmatic methodologies systematically operationalize      │
│   transformative infrastructural implementations strategically."            │
│                                                                             │
│  ✗ Stop word ratio: 0.05 (below threshold of 0.15)                          │
│  This suggests artificially generated or heavily jargon-laden text         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
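
As a rough illustration of how these rules combine in code, the sketch below implements a few of the filters above. The thresholds and the small stop-word list are illustrative placeholders, not the exact values used by C4, RefinedWeb, or FineWeb.

Code
# Illustrative heuristic filters; thresholds are examples only.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "was", "at", "by", "be",
              "this", "these", "have", "has", "can", "are", "from"}

def passes_heuristics(text, min_chars=200, min_words=30,
                      min_alpha_ratio=0.7, min_stopword_ratio=0.15,
                      max_ngram_repetition=0.3, n=3):
    # Length filters.
    if len(text) < min_chars:
        return False
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return False
    # Character composition: mostly letters, not symbols or digits.
    if sum(c.isalpha() for c in text) / len(text) < min_alpha_ratio:
        return False
    # Stop word ratio: natural prose contains common function words.
    if sum(w in STOP_WORDS for w in words) / len(words) < min_stopword_ratio:
        return False
    # N-gram repetition: share of n-gram occurrences that are repeats.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    if ngrams:
        repeated = sum(c for c in ngrams.values() if c > 1)
        if repeated / sum(ngrams.values()) > max_ngram_repetition:
            return False
    return True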

Stage 3: Deduplication

Web-scraped datasets contain massive redundancy. The same content appears across many websites—syndicated articles, scraped content, boilerplate text. Training on duplicated data is wasteful and harmful:

Reduced efficiency: Learning the same content multiple times wastes compute.

Memorization: Models trained on duplicated content are more likely to memorize and regurgitate exact training sequences.

Inflated metrics: Perplexity scores become artificially low when test data overlaps with training duplicates.

Benchmark contamination: Duplicates increase the chance of test data leaking into training.

Deduplication typically removes 20-30% of raw web data—a massive but necessary cut.

Exact Deduplication

The simplest approach: compute a hash of each document (or normalized document) and remove exact matches.

Hash functions: MD5, SHA-256, or xxHash for speed

Normalization: Before hashing, normalize whitespace, case, and punctuation to catch trivial variations.

Granularity: Document-level, paragraph-level, or line-level depending on needs.

Exact deduplication is fast and catches verbatim copies but misses near-duplicates.
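
A minimal sketch of normalized exact-hash deduplication (SHA-256 here; xxHash is a common drop-in replacement when speed matters):

Code
import hashlib
import re

def normalized_hash(text):
    # Normalize whitespace, case, and punctuation so trivial variations collide.
    norm = re.sub(r"\s+", " ", text.lower())
    norm = re.sub(r"[^\w\s]", "", norm).strip()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def exact_dedup(docs):
    seen = set()
    for doc in docs:
        h = normalized_hash(doc["text"])
        if h not in seen:
            seen.add(h)
            yield doc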

Fuzzy Deduplication (MinHash / SimHash)

Near-duplicates—documents that differ slightly (different ads, minor edits, formatting changes)—require fuzzy matching.

MinHash with Locality-Sensitive Hashing (LSH):

  1. Convert document to set of n-grams (shingles)
  2. Apply multiple hash functions to the shingle set
  3. Keep minimum hash value from each function (the "MinHash signature")
  4. Use LSH to efficiently find documents with similar signatures
  5. Verify candidates with actual Jaccard similarity

MinHash with LSH can efficiently surface near-duplicate pairs whose shingle overlap exceeds a chosen threshold (commonly around 0.8 Jaccard similarity), even when wording and formatting differ.
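
A sketch of this procedure using the datasketch package (one possible implementation among several); the shingle size, permutation count, and 0.8 threshold are illustrative:

Code
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_dedup(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(doc["text"]):
            m.update(sh.encode("utf-8"))
        # If any kept document has an estimated Jaccard similarity above the
        # threshold, treat this one as a near-duplicate and drop it.
        if lsh.query(m):
            continue
        lsh.insert(str(i), m)
        kept.append(doc)
    return kept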

SimHash:

  1. Convert document to weighted feature vector
  2. Compute a single fingerprint hash
  3. Documents with similar fingerprints (low Hamming distance) are duplicates

SimHash is faster but less accurate than MinHash.

URL-Based Deduplication

Many duplicates are the same page crawled at different times or with different URL parameters.

URL normalization:

  • Remove query parameters (?)
  • Remove fragments (#)
  • Normalize www vs. non-www
  • Handle HTTP vs. HTTPS

Keep only the most recent or highest-quality version of each URL.
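
A minimal sketch of URL normalization and deduplication, assuming each document record carries its source URL:

Code
from urllib.parse import urlsplit

def normalize_url(url):
    parts = urlsplit(url.strip().lower())
    host = parts.netloc.removeprefix("www.")  # www vs. non-www
    path = parts.path.rstrip("/") or "/"
    # Rebuilding from host + path drops the scheme, query (?), and fragment (#).
    return f"{host}{path}"

def url_dedup(docs):
    seen = set()
    for doc in docs:
        key = normalize_url(doc["url"])
        if key not in seen:
            seen.add(key)
            yield doc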

Paragraph-Level Deduplication

Beyond document-level, remove duplicated paragraphs that appear across many documents:

Boilerplate detection: Legal disclaimers, cookie notices, and navigation text appear on millions of pages.

Syndicated content: News articles syndicated across many sites.

Scraped content: Content farms that copy from original sources.

Semantic Deduplication

The most sophisticated (and expensive) approach: detect semantically similar content even with different wording.

Method:

  1. Embed documents using a sentence embedding model
  2. Cluster embeddings or use approximate nearest neighbor search
  3. Within clusters, keep only representative documents

Semantic deduplication catches paraphrased duplicates but requires significant compute (embedding millions/billions of documents).
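
A sketch of the idea using the sentence-transformers package; the brute-force similarity loop below is only workable for small corpora, and at scale it would be replaced by clustering or approximate nearest-neighbor search. The model name and threshold are illustrative:

Code
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_dedup(docs, threshold=0.9):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([d["text"] for d in docs], normalize_embeddings=True)
    kept, kept_emb = [], []
    for doc, e in zip(docs, emb):
        # Cosine similarity against every document kept so far (unit vectors).
        if kept_emb and np.max(np.stack(kept_emb) @ e) > threshold:
            continue
        kept.append(doc)
        kept_emb.append(e)
    return kept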

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                     DEDUPLICATION METHODS COMPARED                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  METHOD              │ CATCHES                │ SPEED    │ COST            │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Exact Hash          │ Verbatim copies        │ Very fast│ Very low        │
│  (MD5/SHA256)        │                        │ O(n)     │                 │
│                      │                        │          │                 │
│  MinHash + LSH       │ Near-duplicates        │ Fast     │ Low             │
│                      │ (80%+ similar)         │ O(n)     │                 │
│                      │                        │          │                 │
│  SimHash             │ Near-duplicates        │ Very fast│ Very low        │
│                      │ (less accurate)        │ O(n)     │                 │
│                      │                        │          │                 │
│  URL Normalization   │ Same page, diff crawl  │ Very fast│ Very low        │
│                      │                        │ O(n)     │                 │
│                      │                        │          │                 │
│  Paragraph-level     │ Shared boilerplate     │ Fast     │ Low             │
│                      │ Syndicated paragraphs  │ O(n)     │                 │
│                      │                        │          │                 │
│  Semantic            │ Paraphrased content    │ Slow     │ High            │
│  (Embedding-based)   │ Same info, diff words  │ O(n²)    │ Requires GPU    │
│                      │                        │          │                 │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  TYPICAL PIPELINE:                                                          │
│                                                                             │
│  Raw Data                                                                   │
│     │                                                                       │
│     ├── URL deduplication ────────────► Removes ~10%                        │
│     │                                                                       │
│     ├── Exact hash deduplication ─────► Removes ~5%                         │
│     │                                                                       │
│     ├── MinHash near-deduplication ───► Removes ~10%                        │
│     │                                                                       │
│     ├── Paragraph deduplication ──────► Removes ~5%                         │
│     │                                                                       │
│     └── (Optional) Semantic dedup ────► Removes ~5%                         │
│                                                                             │
│  Total: 20-35% reduction from deduplication alone                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Stage 4: Quality Filtering

Heuristic filters catch obviously bad content, but much of the web is mediocre—grammatical, non-duplicated, but not particularly useful for training capable models. Quality filtering uses ML models to identify high-quality content.

Perplexity Filtering

Train a small language model on known high-quality text (e.g., Wikipedia) and use it to score documents. Low perplexity = similar to high-quality text; high perplexity = dissimilar.

Intuition: A model trained on clean English will be "surprised" (high perplexity) by garbled text, foreign languages mixed with English, or unusual content.

Caveats:

  • May filter out rare but valuable content (technical jargon, specialized domains)
  • Wikipedia-based models bias toward encyclopedia-style writing
  • Threshold tuning is critical
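
The sketch below illustrates the idea with a small pretrained causal LM from Hugging Face transformers standing in for the reference model; production pipelines often use much faster n-gram models, and the perplexity cutoff here is an arbitrary placeholder:

Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text, max_tokens=512):
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_tokens).input_ids
    # Cross-entropy of the text under the reference LM, exponentiated.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def perplexity_filter(docs, max_ppl=1000.0):
    # Keep documents that look "unsurprising" to the reference model.
    return [d for d in docs if perplexity(d["text"]) < max_ppl]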

Classifier-Based Quality Filtering

Train a classifier to directly predict quality:

FineWeb-Edu approach:

  1. Use a strong LLM (Llama-3-70B) to annotate documents on a quality scale (e.g., educational value 0-5)
  2. Train a lightweight classifier (or use embeddings + regression) to predict these scores
  3. Filter documents below a threshold

This approach was used to create FineWeb-Edu, achieving strong results by selecting documents with high educational value.
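
A sketch of the scoring step, assuming LLM- or human-annotated scores already exist: it fits a lightweight regressor over sentence embeddings and filters by predicted score. The embedding model, regressor choice, and threshold are illustrative stand-ins for whatever classifier you train.

Code
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def train_quality_scorer(annotated_docs):
    # annotated_docs: [{"text": ..., "score": 0-5}, ...] where the scores came
    # from human annotators or an LLM-as-judge.
    X = embedder.encode([d["text"] for d in annotated_docs])
    y = [d["score"] for d in annotated_docs]
    return Ridge(alpha=1.0).fit(X, y)

def quality_filter(docs, scorer, min_score=3.0):
    X = embedder.encode([d["text"] for d in docs])
    return [d for d, s in zip(docs, scorer.predict(X)) if s >= min_score]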

Training quality classifiers:

  • Define your quality criteria (educational value, writing quality, factual density)
  • Generate labels using human annotation or LLM-as-judge
  • Train an efficient classifier (DistilBERT-sized models work well)
  • Apply at scale

Domain-Specific Quality Filters

Different content types need different quality measures:

Code quality:

  • Syntactic validity (does it parse?)
  • Presence of comments/documentation
  • Function length and complexity
  • Test coverage indicators

Scientific text:

  • Citation density
  • Presence of methodology sections
  • Technical vocabulary density

Conversational text:

  • Coherence across turns
  • Appropriate length of responses
  • Absence of toxic content

NSFW and Safety Filtering

Remove harmful content:

Toxicity classifiers: Detect hate speech, harassment, threats

NSFW detection: Identify adult content

Personally identifiable information (PII): Detect and redact (a regex-based sketch follows this list):

  • Email addresses (regex + classifier)
  • Phone numbers (regex)
  • Social security numbers (regex)
  • Physical addresses (NER)
  • Names (NER, with context)
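
A minimal regex-based redaction sketch; patterns like these catch only the simplest formats, and production systems pair them with NER models and validation:

Code
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    # Replace each detected span with a typed placeholder token.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# redact_pii("Reach me at jane@example.com or 555-123-4567")
# -> "Reach me at [EMAIL] or [PHONE]"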

Safety considerations:

  • Instructions for illegal activities
  • Malware and exploits
  • Detailed personal information

Stage 5: Data Mixing

A final training dataset isn't just filtered web data—it's a carefully designed mixture of multiple sources with specific proportions.

Why Mixing Matters

Different data sources provide different capabilities:

  • Web text: General knowledge, writing styles, current events
  • Books: Long-form coherence, narrative, deep knowledge
  • Code: Programming capabilities, logical reasoning
  • Math: Mathematical reasoning (when included as text)
  • Scientific papers: Technical knowledge, formal reasoning
  • Conversations: Dialog capabilities, instruction following

The mixture determines the model's strengths and weaknesses.

Mixing Strategies

Proportional mixing: Include sources in proportion to their natural size (heavily favors web).

Domain upsampling: Artificially increase representation of valuable but smaller sources (more code, more books).

Quality-weighted mixing: Higher-quality sources get more representation.

Curriculum learning: Change the mixture during training (e.g., easier content early, harder content later).

Language balancing: For multilingual models, balance language representation (don't let English dominate completely).
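
A sketch of weight-driven mixing: each training example is drawn from a domain chosen according to target weights rather than the domains' natural sizes, which implicitly upsamples small but valuable sources. The weights and helper names are illustrative, not a real recipe.

Code
import random

def mix_domains(domain_docs, weights, num_samples, seed=0):
    # domain_docs: {"web": [...], "code": [...], "books": [...], ...}
    # weights:     {"web": 0.5, "code": 0.2, "books": 0.1, ...}
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    for _ in range(num_samples):
        d = rng.choices(domains, weights=probs, k=1)[0]
        # Sampling with replacement repeats documents from small domains
        # more often, i.e. upsampling.
        yield rng.choice(domain_docs[d])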

The Llama Approach

Meta's Llama models use sophisticated data mixing:

Llama 3 recipe:

  • Upsampled high-quality sources (selected websites, books)
  • Code heavily represented (35% of tokens in some stages)
  • Non-English languages included but English-dominant
  • Different mixtures for different training stages
  • Annealing phase with highest-quality data at the end

Determining Optimal Mixtures

Finding the right mixture is empirical:

Scaling laws: Small-scale experiments can predict large-scale performance, but mixture effects don't always transfer.

Ablation studies: Train models with different mixtures, evaluate on target tasks.

Domain probing: Measure performance on domain-specific benchmarks as a function of domain representation.

The bitter lesson: More data from more domains generally helps, up to a point.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                      DATA MIXING STRATEGIES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  EXAMPLE MIXTURE (Conceptual, not actual Llama recipe):                     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                       │   │
│  │  Web (filtered)    ████████████████████████████████░░░░░░░░  50%     │   │
│  │                                                                       │   │
│  │  Code              ████████████████░░░░░░░░░░░░░░░░░░░░░░░░  20%     │   │
│  │                                                                       │   │
│  │  Books             ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  10%     │   │
│  │                                                                       │   │
│  │  Wikipedia         ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  8%      │   │
│  │                                                                       │   │
│  │  Scientific        ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  5%      │   │
│  │                                                                       │   │
│  │  Conversations     ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  4%      │   │
│  │                                                                       │   │
│  │  Math              ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  3%      │   │
│  │                                                                       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  MIXING CONSIDERATIONS:                                                     │
│                                                                             │
│  Domain upsampling:                                                         │
│  - Code is ~5% of natural web distribution but often upsampled to 20-35%   │
│  - Books provide long-form coherence worth overrepresenting                │
│  - Math data is scarce but valuable for reasoning                          │
│                                                                             │
│  Quality tiers within domains:                                              │
│  - Web: Tier 1 (curated sites) vs Tier 2 (filtered CC)                     │
│  - Code: Popular repos vs random repositories                              │
│  - The Pile strategy: Specific high-quality sources identified             │
│                                                                             │
│  Curriculum considerations:                                                 │
│  - Early training: More diverse, slightly noisier data                     │
│  - Late training: Higher quality, more curated data                        │
│  - Annealing: Final phase on highest-quality subset                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Contamination and Benchmark Integrity

A critical concern in data curation: benchmark contamination. If test data appears in training data, evaluation metrics are meaningless.

Types of Contamination

Direct contamination: The exact test examples appear in training data (e.g., benchmark questions with answers).

Indirect contamination: Content highly similar to test data (same questions with minor rephrasing).

Source contamination: The original sources of benchmark data (e.g., the books used to create reading comprehension tests).

Detection Methods

N-gram overlap: Check for significant n-gram overlap between training and test data. 8-13 gram matches are suspicious.

Embedding similarity: Flag training documents highly similar to test examples.

Exact match on canonicalized text: Normalize and compare.
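
A minimal sketch of the n-gram overlap check (13-grams here; the exact n and the matching policy vary between teams):

Code
def ngram_set(text, n=13):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(train_docs, test_texts, n=13):
    # Flag training documents that share any long n-gram with the test set.
    test_ngrams = set()
    for t in test_texts:
        test_ngrams |= ngram_set(t, n)
    return [doc for doc in train_docs
            if ngram_set(doc["text"], n) & test_ngrams]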

Mitigation

Blocklisting: Identify and remove known benchmark sources from training data.

Temporal cutoffs: Use training data from before benchmarks were created (limited applicability for older benchmarks).

Held-out evaluation: Create private evaluation sets not released publicly.

Canary strings: Include unique strings in evaluation data to detect their presence in model outputs.

Tools and Infrastructure

Processing web-scale data requires serious infrastructure. Several tools and frameworks help:

NVIDIA NeMo Curator

A GPU-accelerated data curation framework:

  • Modular pipeline components
  • GPU-accelerated deduplication using RAPIDS
  • Quality classifiers
  • PII detection and redaction
  • Multi-node, multi-GPU scaling

NeMo Curator can process terabytes of text data efficiently, integrating semantic deduplication, heuristic filtering, classification, and PII redaction.

Datatrove

Hugging Face's data processing library:

  • Designed for FineWeb and similar datasets
  • Support for Common Crawl processing
  • Efficient distributed processing
  • Built-in quality filters

LP Data Pipeline

A lightweight framework optimized for CPU processing:

  • Lower infrastructure requirements
  • Focus on streamlined extraction and filtering
  • Reduced preparation time and cost

Custom Solutions

Many teams build custom pipelines using:

  • Apache Spark for distributed processing
  • Ray for parallel Python workloads
  • PostgreSQL/DuckDB for metadata management
  • MinIO/S3 for data storage

Practical Recommendations

Based on best practices from major dataset efforts:

Start with Established Datasets

Don't reinvent the wheel. Use filtered datasets as starting points:

  • FineWeb / FineWeb-Edu: State-of-the-art filtered Common Crawl
  • RedPajama v2: Open reproduction of Llama training data
  • RefinedWeb: Carefully curated web data
  • The Pile: Diverse, well-documented dataset

Invest in Deduplication

Deduplication provides the highest ROI for data quality. Use at minimum:

  • URL-based deduplication
  • Exact hash deduplication
  • MinHash near-deduplication

Build Quality Classifiers for Your Use Case

Generic quality filters are a good start, but domain-specific classifiers can significantly improve results for specialized applications.

Monitor What You're Removing

Always sample removed documents. You'll find:

  • False positives (good content incorrectly filtered)
  • Patterns to improve filters
  • Edge cases you didn't anticipate

Document Everything

Data curation decisions are hard to reverse. Document:

  • Filter thresholds and rationale
  • Data sources and versions
  • Processing pipeline code
  • Quality metrics at each stage

Think About Data Contamination Early

Once training starts, contamination is baked in. Check for benchmark contamination before training.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
