
Data Curation for LLM Training: The Hidden Foundation of Model Quality

A comprehensive guide to curating training data for large language models—from web crawl filtering and deduplication to quality classifiers and data mixing strategies. The unglamorous work that determines model quality.


Data Quality Determines Model Quality

The AI industry obsesses over model architectures, training techniques, and parameter counts. Yet the single biggest determinant of model quality is often the least discussed: training data.

The evidence is overwhelming:

Scale alone isn't enough: A 7B model trained on carefully filtered data can outperform models trained on 40% more tokens of unfiltered data. Quality beats quantity.

Data curation yields compounding returns: NVIDIA reports that proper data curation can improve downstream task performance by up to 7%—without changing the model architecture or training procedure.

Garbage in, garbage out: Models trained on noisy, duplicated, or low-quality data exhibit worse generalization, increased memorization, distorted perplexity metrics, and reduced sample efficiency.

The trend in 2025 is clear: prefer fewer, diverse, high-quality tokens over sheer volume. Including too many duplicates or boilerplate pages yields diminishing returns. The teams building the best models invest heavily in data curation—often more than in model architecture.

This guide covers the complete data curation pipeline: collection, filtering, deduplication, quality assessment, and mixing strategies.

The Data Curation Pipeline

A modern LLM data curation pipeline looks like this:

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                      DATA CURATION PIPELINE OVERVIEW                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  RAW DATA SOURCES                                                           │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐           │
│  │  Web    │  │  Books  │  │  Code   │  │  Papers │  │Conversa-│           │
│  │ Crawls  │  │         │  │  Repos  │  │         │  │  tions  │           │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘           │
│       └────────────┴────────────┴────────────┴────────────┘                 │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 1: TEXT EXTRACTION & NORMALIZATION                            │   │
│  │                                                                       │   │
│  │  - HTML parsing and boilerplate removal                              │   │
│  │  - PDF/document extraction                                            │   │
│  │  - Unicode normalization                                              │   │
│  │  - Language detection                                                 │   │
│  │  - Character encoding fixes                                           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 2: HEURISTIC FILTERING                                        │   │
│  │                                                                       │   │
│  │  Remove documents that fail rule-based checks:                       │   │
│  │  - Too short (< N characters)                                        │   │
│  │  - Too many special characters                                        │   │
│  │  - Excessive repetition (n-gram frequency)                           │   │
│  │  - Low word count / high symbol ratio                                │   │
│  │  - Blocklisted URLs/domains                                          │   │
│  │  - Detected as machine-generated spam                                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 3: DEDUPLICATION                                              │   │
│  │                                                                       │   │
│  │  - Exact deduplication (hash matching)                               │   │
│  │  - Near-duplicate removal (MinHash / SimHash)                        │   │
│  │  - URL-based deduplication                                            │   │
│  │  - Paragraph-level deduplication                                      │   │
│  │  - Semantic deduplication (optional, expensive)                      │   │
│  │                                                                       │   │
│  │  Typically removes 20-30% of raw web data                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 4: QUALITY FILTERING                                          │   │
│  │                                                                       │   │
│  │  Model-based quality assessment:                                      │   │
│  │  - Perplexity filtering (against reference LM)                       │   │
│  │  - Educational value classifier (FineWeb-Edu style)                  │   │
│  │  - Domain quality classifiers                                         │   │
│  │  - NSFW/toxicity filtering                                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 5: PII & SAFETY FILTERING                                     │   │
│  │                                                                       │   │
│  │  - PII detection and redaction (emails, phones, SSNs)               │   │
│  │  - Toxic content removal                                              │   │
│  │  - Copyright/license filtering (for code)                            │   │
│  │  - Sensitive information removal                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  STAGE 6: DATA MIXING & SAMPLING                                     │   │
│  │                                                                       │   │
│  │  - Domain weighting (more code? more books?)                         │   │
│  │  - Language balancing                                                 │   │
│  │  - Upsampling high-quality sources                                   │   │
│  │  - Curriculum design                                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                   │                                         │
│                                   ▼                                         │
│  FINAL TRAINING DATASET                                                     │
│  (Tokenized, shuffled, ready for training)                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Let's examine each stage in detail.

Stage 1: Data Collection and Sources

Web Crawls

The largest source of text data is web crawls, primarily Common Crawl—a nonprofit that crawls the web monthly and makes the data freely available.

Common Crawl statistics:

  • 250+ billion pages crawled
  • Petabytes of raw data
  • Monthly snapshots since 2008
  • Contains the full spectrum of web content quality

Working with Common Crawl requires significant processing:

WARC parsing: Raw crawl data is stored in WARC (Web ARChive) format. Extracting text requires parsing HTML, handling various encodings, and removing boilerplate.

Boilerplate removal: Web pages contain navigation, ads, footers, and other non-content elements. Tools like trafilatura, newspaper3k, or custom extractors isolate the main content.

Language detection: Common Crawl contains content in hundreds of languages. Language detection (using fastText, langdetect, or similar) enables language-specific filtering.
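
To make Stage 1 concrete, here is a minimal sketch of the extraction path, using warcio for WARC parsing together with the trafilatura and fastText tools mentioned above. The package choices, model path, and thresholds are illustrative assumptions rather than a prescribed stack.

Code
# Minimal sketch of Stage 1: WARC record -> main text -> language tag.
# Assumes the warcio, trafilatura, and fasttext packages plus a downloaded
# lid.176.bin fastText language-ID model; adapt paths and thresholds as needed.
from warcio.archiveiterator import ArchiveIterator
import trafilatura
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def extract_documents(warc_path, min_chars=200):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            # Boilerplate removal: keep only the main article/body text.
            text = trafilatura.extract(html)
            if not text or len(text) < min_chars:
                continue
            # Language detection (fastText expects a single line of text).
            labels, probs = lang_model.predict(text.replace("\n", " "))
            yield {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "text": text,
                "lang": labels[0].replace("__label__", ""),
                "lang_prob": float(probs[0]),
            }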

Books and Long-Form Content

Books provide high-quality, long-form text that's crucial for training models on extended reasoning and narrative coherence.

Sources:

  • Project Gutenberg (public domain books)
  • Internet Archive (digitized books)
  • Academic publications
  • Legal documents and court cases

Challenges:

  • OCR errors in scanned books
  • Formatting artifacts
  • Duplicate editions
  • Copyright considerations

Code Repositories

Code is essential for training models with programming capabilities.

Sources:

  • GitHub (via GHArchive, BigQuery, or direct API)
  • GitLab, Bitbucket
  • Package repositories (PyPI, npm, crates.io)
  • Stack Overflow

Considerations:

  • License filtering (many only include permissively licensed code)
  • Language distribution (balance Python-heavy datasets)
  • Documentation vs. code separation
  • Test code vs. production code

Scientific Papers

Academic papers provide high-quality technical content.

Sources:

  • arXiv (preprints)
  • Semantic Scholar
  • PubMed (biomedical)
  • ACL Anthology (NLP)

Challenges:

  • PDF extraction (LaTeX formatting, equations, tables)
  • Citation parsing
  • Section segmentation

Curated Datasets

Beyond raw sources, teams often include curated datasets:

Instruction datasets: FLAN, Natural Instructions, Dolly

Conversation datasets: ShareGPT, WildChat

High-quality web: Wikipedia, StackExchange, selected subreddits

Stage 2: Heuristic Filtering

Heuristic filters use rule-based methods to remove obviously low-quality content. They're fast, interpretable, and catch the worst offenders.

Common Heuristic Filters

Length filters:

  • Minimum character count (e.g., 200+ characters)
  • Minimum word count (e.g., 50+ words)
  • Maximum length (extremely long documents may be generated spam)

Character composition:

  • Alphabetic character ratio (mostly letters vs. symbols)
  • Digit ratio (too many numbers suggests tables/data dumps)
  • Special character ratio
  • Whitespace ratio

Word-level heuristics:

  • Mean word length (too short = abbreviations/noise; too long = URLs/code)
  • Stop word ratio (very low = unnatural text)
  • Unique word ratio (too low = repetitive content)

Repetition detection:

  • N-gram repetition (repeated phrases suggest generated spam)
  • Line repetition (same line appearing multiple times)
  • Paragraph repetition

Structural filters:

  • Bullet point/list ratio
  • Header count
  • Sentence length variance

Example Heuristics from Major Datasets

C4 (Colossal Clean Crawled Corpus):

  • Remove pages with fewer than 3 sentences
  • Remove sentences without terminal punctuation
  • Remove pages containing "lorem ipsum"
  • Remove pages with bad words (basic profanity filter)
  • Remove lines containing the word "JavaScript" (typically "please enable JavaScript" warnings)

RefinedWeb:

  • Minimum 100 characters
  • Remove pages where >30% of lines end with an ellipsis
  • Remove pages with excessive hashtags or @mentions
  • URL-based deduplication

FineWeb:

  • More sophisticated repetition detection
  • Stricter length requirements
  • Domain-specific rules (remove known spam domains)

Implementing Heuristic Filters

The key principle: be aggressive but monitor what you're removing. Sample removed documents to ensure you're not discarding valuable content.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEURISTIC FILTER EXAMPLES                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  GOOD DOCUMENT (PASSES ALL FILTERS):                                        │
│  ──────────────────────────────────                                         │
│  "Large language models have revolutionized natural language processing.    │
│   These models, trained on vast amounts of text data, can generate         │
│   coherent and contextually relevant text across many domains."            │
│                                                                             │
│  ✓ Length: 247 characters                                                   │
│  ✓ Word count: 35 words                                                     │
│  ✓ Alphabetic ratio: 0.85                                                   │
│  ✓ Stop word ratio: 0.31                                                    │
│  ✓ No excessive repetition                                                  │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 1 (FAILS LENGTH):                                             │
│  ─────────────────────────────────                                          │
│  "Click here!!!"                                                            │
│                                                                             │
│  ✗ Length: 14 characters (below threshold)                                  │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 2 (FAILS REPETITION):                                         │
│  ──────────────────────────────────                                         │
│  "Buy now! Best prices! Buy now! Best prices! Buy now! Best prices!        │
│   Buy now! Best prices! Buy now! Best prices! Buy now! Best prices!"       │
│                                                                             │
│  ✗ 3-gram repetition ratio: 0.95 (above threshold of 0.3)                   │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 3 (FAILS COMPOSITION):                                        │
│  ───────────────────────────────────                                        │
│  "$$$ ### *** !!! @@@ %%% ^^^ &&& ((( ))) +++ === --- ___"                 │
│                                                                             │
│  ✗ Alphabetic ratio: 0.0 (below threshold of 0.7)                           │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  BAD DOCUMENT 4 (FAILS STOP WORD RATIO):                                    │
│  ──────────────────────────────────────                                     │
│  "Synergistic paradigmatic methodologies systematically operationalize      │
│   transformative infrastructural implementations strategically."            │
│                                                                             │
│  ✗ Stop word ratio: 0.05 (below threshold of 0.15)                          │
│  This suggests artificially generated or heavily jargon-laden text         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
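
As a rough illustration of how these rules combine in code, the sketch below implements a few of the filters above. The thresholds and the small stop-word list are illustrative placeholders, not the exact values used by C4, RefinedWeb, or FineWeb.

Code
# Illustrative heuristic filters; thresholds are examples only.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "was", "at", "by", "be",
              "this", "these", "have", "has", "can", "are", "from"}

def passes_heuristics(text, min_chars=200, min_words=30,
                      min_alpha_ratio=0.7, min_stopword_ratio=0.15,
                      max_ngram_repetition=0.3, n=3):
    # Length filters.
    if len(text) < min_chars:
        return False
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return False
    # Character composition: mostly letters, not symbols or digits.
    if sum(c.isalpha() for c in text) / len(text) < min_alpha_ratio:
        return False
    # Stop word ratio: natural prose contains common function words.
    if sum(w in STOP_WORDS for w in words) / len(words) < min_stopword_ratio:
        return False
    # N-gram repetition: share of n-gram occurrences that are repeats.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    if ngrams:
        repeated = sum(c for c in ngrams.values() if c > 1)
        if repeated / sum(ngrams.values()) > max_ngram_repetition:
            return False
    return True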

Stage 3: Deduplication

Web-scraped datasets contain massive redundancy. The same content appears across many websites—syndicated articles, scraped content, boilerplate text. Training on duplicated data is wasteful and harmful:

Reduced efficiency: Learning the same content multiple times wastes compute.

Memorization: Models trained on duplicated content are more likely to memorize and regurgitate exact training sequences.

Inflated metrics: Perplexity scores become artificially low when test data overlaps with training duplicates.

Benchmark contamination: Duplicates increase the chance of test data leaking into training.

Deduplication typically removes 20-30% of raw web data—a massive but necessary cut.

Exact Deduplication

The simplest approach: compute a hash of each document (or normalized document) and remove exact matches.

Hash functions: MD5, SHA-256, or xxHash for speed

Normalization: Before hashing, normalize whitespace, case, and punctuation to catch trivial variations.

Granularity: Document-level, paragraph-level, or line-level depending on needs.

Exact deduplication is fast and catches verbatim copies but misses near-duplicates.
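
A minimal sketch of normalized exact-hash deduplication (SHA-256 here; xxHash is a common drop-in replacement when speed matters):

Code
import hashlib
import re

def normalized_hash(text):
    # Normalize whitespace, case, and punctuation so trivial variations collide.
    norm = re.sub(r"\s+", " ", text.lower())
    norm = re.sub(r"[^\w\s]", "", norm).strip()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def exact_dedup(docs):
    seen = set()
    for doc in docs:
        h = normalized_hash(doc["text"])
        if h not in seen:
            seen.add(h)
            yield doc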

Fuzzy Deduplication (MinHash / SimHash)

Near-duplicates—documents that differ slightly (different ads, minor edits, formatting changes)—require fuzzy matching.

MinHash with Locality-Sensitive Hashing (LSH):

  1. Convert document to set of n-grams (shingles)
  2. Apply multiple hash functions to the shingle set
  3. Keep minimum hash value from each function (the "MinHash signature")
  4. Use LSH to efficiently find documents with similar signatures
  5. Verify candidates with actual Jaccard similarity

MinHash with LSH can efficiently surface near-duplicate pairs whose shingle overlap exceeds a chosen threshold (commonly around 0.8 Jaccard similarity), even when wording and formatting differ.
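
A sketch of this procedure using the datasketch package (one possible implementation among several); the shingle size, permutation count, and 0.8 threshold are illustrative:

Code
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_dedup(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(doc["text"]):
            m.update(sh.encode("utf-8"))
        # If any kept document has an estimated Jaccard similarity above the
        # threshold, treat this one as a near-duplicate and drop it.
        if lsh.query(m):
            continue
        lsh.insert(str(i), m)
        kept.append(doc)
    return kept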

SimHash:

  1. Convert document to weighted feature vector
  2. Compute a single fingerprint hash
  3. Documents with similar fingerprints (low Hamming distance) are duplicates

SimHash is faster but less accurate than MinHash.

URL-Based Deduplication

Many duplicates are the same page crawled at different times or with different URL parameters.

URL normalization:

  • Remove query parameters (?)
  • Remove fragments (#)
  • Normalize www vs. non-www
  • Handle HTTP vs. HTTPS

Keep only the most recent or highest-quality version of each URL.
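
A minimal sketch of URL normalization and deduplication, assuming each document record carries its source URL:

Code
from urllib.parse import urlsplit

def normalize_url(url):
    parts = urlsplit(url.strip().lower())
    host = parts.netloc.removeprefix("www.")  # www vs. non-www
    path = parts.path.rstrip("/") or "/"
    # Rebuilding from host + path drops the scheme, query (?), and fragment (#).
    return f"{host}{path}"

def url_dedup(docs):
    seen = set()
    for doc in docs:
        key = normalize_url(doc["url"])
        if key not in seen:
            seen.add(key)
            yield doc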

Paragraph-Level Deduplication

Beyond document-level, remove duplicated paragraphs that appear across many documents:

Boilerplate detection: Legal disclaimers, cookie notices, and navigation text appear on millions of pages.

Syndicated content: News articles syndicated across many sites.

Scraped content: Content farms that copy from original sources.

Semantic Deduplication

The most sophisticated (and expensive) approach: detect semantically similar content even with different wording.

Method:

  1. Embed documents using a sentence embedding model
  2. Cluster embeddings or use approximate nearest neighbor search
  3. Within clusters, keep only representative documents

Semantic deduplication catches paraphrased duplicates but requires significant compute (embedding millions/billions of documents).
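
A sketch of the idea using the sentence-transformers package; the brute-force similarity loop below is only workable for small corpora, and at scale it would be replaced by clustering or approximate nearest-neighbor search. The model name and threshold are illustrative:

Code
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_dedup(docs, threshold=0.9):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([d["text"] for d in docs], normalize_embeddings=True)
    kept, kept_emb = [], []
    for doc, e in zip(docs, emb):
        # Cosine similarity against every document kept so far (unit vectors).
        if kept_emb and np.max(np.stack(kept_emb) @ e) > threshold:
            continue
        kept.append(doc)
        kept_emb.append(e)
    return kept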

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                     DEDUPLICATION METHODS COMPARED                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  METHOD              │ CATCHES                │ SPEED    │ COST            │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Exact Hash          │ Verbatim copies        │ Very fast│ Very low        │
│  (MD5/SHA256)        │                        │ O(n)     │                 │
│                      │                        │          │                 │
│  MinHash + LSH       │ Near-duplicates        │ Fast     │ Low             │
│                      │ (80%+ similar)         │ O(n)     │                 │
│                      │                        │          │                 │
│  SimHash             │ Near-duplicates        │ Very fast│ Very low        │
│                      │ (less accurate)        │ O(n)     │                 │
│                      │                        │          │                 │
│  URL Normalization   │ Same page, diff crawl  │ Very fast│ Very low        │
│                      │                        │ O(n)     │                 │
│                      │                        │          │                 │
│  Paragraph-level     │ Shared boilerplate     │ Fast     │ Low             │
│                      │ Syndicated paragraphs  │ O(n)     │                 │
│                      │                        │          │                 │
│  Semantic            │ Paraphrased content    │ Slow     │ High            │
│  (Embedding-based)   │ Same info, diff words  │ O(n²)    │ Requires GPU    │
│                      │                        │          │                 │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  TYPICAL PIPELINE:                                                          │
│                                                                             │
│  Raw Data                                                                   │
│     │                                                                       │
│     ├── URL deduplication ────────────► Removes ~10%                        │
│     │                                                                       │
│     ├── Exact hash deduplication ─────► Removes ~5%                         │
│     │                                                                       │
│     ├── MinHash near-deduplication ───► Removes ~10%                        │
│     │                                                                       │
│     ├── Paragraph deduplication ──────► Removes ~5%                         │
│     │                                                                       │
│     └── (Optional) Semantic dedup ────► Removes ~5%                         │
│                                                                             │
│  Total: 20-35% reduction from deduplication alone                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Stage 4: Quality Filtering

Heuristic filters catch obviously bad content, but much of the web is mediocre—grammatical, non-duplicated, but not particularly useful for training capable models. Quality filtering uses ML models to identify high-quality content.

Perplexity Filtering

Train a small language model on known high-quality text (e.g., Wikipedia) and use it to score documents. Low perplexity = similar to high-quality text; high perplexity = dissimilar.

Intuition: A model trained on clean English will be "surprised" (high perplexity) by garbled text, foreign languages mixed with English, or unusual content.

Caveats:

  • May filter out rare but valuable content (technical jargon, specialized domains)
  • Wikipedia-based models bias toward encyclopedia-style writing
  • Threshold tuning is critical
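
The sketch below illustrates the idea with a small pretrained causal LM from Hugging Face transformers standing in for the reference model; production pipelines often use much faster n-gram models, and the perplexity cutoff here is an arbitrary placeholder:

Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text, max_tokens=512):
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_tokens).input_ids
    # Cross-entropy of the text under the reference LM, exponentiated.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def perplexity_filter(docs, max_ppl=1000.0):
    # Keep documents that look "unsurprising" to the reference model.
    return [d for d in docs if perplexity(d["text"]) < max_ppl]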

Classifier-Based Quality Filtering

Train a classifier to directly predict quality:

FineWeb-Edu approach:

  1. Use a strong LLM (Llama-3-70B) to annotate documents on a quality scale (e.g., educational value 0-5)
  2. Train a lightweight classifier (or use embeddings + regression) to predict these scores
  3. Filter documents below a threshold

This approach was used to create FineWeb-Edu, achieving strong results by selecting documents with high educational value.
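
A sketch of the scoring step, assuming LLM- or human-annotated scores already exist: it fits a lightweight regressor over sentence embeddings and filters by predicted score. The embedding model, regressor choice, and threshold are illustrative stand-ins for whatever classifier you train.

Code
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def train_quality_scorer(annotated_docs):
    # annotated_docs: [{"text": ..., "score": 0-5}, ...] where the scores came
    # from human annotators or an LLM-as-judge.
    X = embedder.encode([d["text"] for d in annotated_docs])
    y = [d["score"] for d in annotated_docs]
    return Ridge(alpha=1.0).fit(X, y)

def quality_filter(docs, scorer, min_score=3.0):
    X = embedder.encode([d["text"] for d in docs])
    return [d for d, s in zip(docs, scorer.predict(X)) if s >= min_score]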

Training quality classifiers:

  • Define your quality criteria (educational value, writing quality, factual density)
  • Generate labels using human annotation or LLM-as-judge
  • Train an efficient classifier (DistilBERT-sized models work well)
  • Apply at scale

Domain-Specific Quality Filters

Different content types need different quality measures:

Code quality:

  • Syntactic validity (does it parse?)
  • Presence of comments/documentation
  • Function length and complexity
  • Test coverage indicators

Scientific text:

  • Citation density
  • Presence of methodology sections
  • Technical vocabulary density

Conversational text:

  • Coherence across turns
  • Appropriate length of responses
  • Absence of toxic content

NSFW and Safety Filtering

Remove harmful content:

Toxicity classifiers: Detect hate speech, harassment, threats

NSFW detection: Identify adult content

Personally identifiable information (PII): Detect and redact (a regex-based sketch follows this list):

  • Email addresses (regex + classifier)
  • Phone numbers (regex)
  • Social security numbers (regex)
  • Physical addresses (NER)
  • Names (NER, with context)
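
A minimal regex-based redaction sketch; patterns like these catch only the simplest formats, and production systems pair them with NER models and validation:

Code
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    # Replace each detected span with a typed placeholder token.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# redact_pii("Reach me at jane@example.com or 555-123-4567")
# -> "Reach me at [EMAIL] or [PHONE]"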

Safety considerations:

  • Instructions for illegal activities
  • Malware and exploits
  • Detailed personal information

Stage 5: Data Mixing

A final training dataset isn't just filtered web data—it's a carefully designed mixture of multiple sources with specific proportions.

Why Mixing Matters

Different data sources provide different capabilities:

  • Web text: General knowledge, writing styles, current events
  • Books: Long-form coherence, narrative, deep knowledge
  • Code: Programming capabilities, logical reasoning
  • Math: Mathematical reasoning (when included as text)
  • Scientific papers: Technical knowledge, formal reasoning
  • Conversations: Dialog capabilities, instruction following

The mixture determines the model's strengths and weaknesses.

Mixing Strategies

Proportional mixing: Include sources in proportion to their natural size (heavily favors web).

Domain upsampling: Artificially increase representation of valuable but smaller sources (more code, more books).

Quality-weighted mixing: Higher-quality sources get more representation.

Curriculum learning: Change the mixture during training (e.g., easier content early, harder content later).

Language balancing: For multilingual models, balance language representation (don't let English dominate completely).
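
A sketch of weight-driven mixing: each training example is drawn from a domain chosen according to target weights rather than the domains' natural sizes, which implicitly upsamples small but valuable sources. The weights and helper names are illustrative, not a real recipe.

Code
import random

def mix_domains(domain_docs, weights, num_samples, seed=0):
    # domain_docs: {"web": [...], "code": [...], "books": [...], ...}
    # weights:     {"web": 0.5, "code": 0.2, "books": 0.1, ...}
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    for _ in range(num_samples):
        d = rng.choices(domains, weights=probs, k=1)[0]
        # Sampling with replacement repeats documents from small domains
        # more often, i.e. upsampling.
        yield rng.choice(domain_docs[d])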

The Llama Approach

Meta's Llama models use sophisticated data mixing:

Llama 3 recipe:

  • Upsampled high-quality sources (selected websites, books)
  • Code heavily represented (35% of tokens in some stages)
  • Non-English languages included but English-dominant
  • Different mixtures for different training stages
  • Annealing phase with highest-quality data at the end

Determining Optimal Mixtures

Finding the right mixture is empirical:

Scaling laws: Small-scale experiments can predict large-scale performance, but mixture effects don't always transfer.

Ablation studies: Train models with different mixtures, evaluate on target tasks.

Domain probing: Measure performance on domain-specific benchmarks as a function of domain representation.

The bitter lesson: More data from more domains generally helps, up to a point.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                      DATA MIXING STRATEGIES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  EXAMPLE MIXTURE (Conceptual, not actual Llama recipe):                     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                       │   │
│  │  Web (filtered)    ████████████████████████████████░░░░░░░░  50%     │   │
│  │                                                                       │   │
│  │  Code              ████████████████░░░░░░░░░░░░░░░░░░░░░░░░  20%     │   │
│  │                                                                       │   │
│  │  Books             ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  10%     │   │
│  │                                                                       │   │
│  │  Wikipedia         ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  8%      │   │
│  │                                                                       │   │
│  │  Scientific        ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  5%      │   │
│  │                                                                       │   │
│  │  Conversations     ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  4%      │   │
│  │                                                                       │   │
│  │  Math              ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  3%      │   │
│  │                                                                       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  MIXING CONSIDERATIONS:                                                     │
│                                                                             │
│  Domain upsampling:                                                         │
│  - Code is ~5% of natural web distribution but often upsampled to 20-35%   │
│  - Books provide long-form coherence worth overrepresenting                │
│  - Math data is scarce but valuable for reasoning                          │
│                                                                             │
│  Quality tiers within domains:                                              │
│  - Web: Tier 1 (curated sites) vs Tier 2 (filtered CC)                     │
│  - Code: Popular repos vs random repositories                              │
│  - The Pile strategy: Specific high-quality sources identified             │
│                                                                             │
│  Curriculum considerations:                                                 │
│  - Early training: More diverse, slightly noisier data                     │
│  - Late training: Higher quality, more curated data                        │
│  - Annealing: Final phase on highest-quality subset                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Contamination and Benchmark Integrity

A critical concern in data curation: benchmark contamination. If test data appears in training data, evaluation metrics are meaningless.

Types of Contamination

Direct contamination: The exact test examples appear in training data (e.g., benchmark questions with answers).

Indirect contamination: Content highly similar to test data (same questions with minor rephrasing).

Source contamination: The original sources of benchmark data (e.g., the books used to create reading comprehension tests).

Detection Methods

N-gram overlap: Check for significant n-gram overlap between training and test data. 8-13 gram matches are suspicious.

Embedding similarity: Flag training documents highly similar to test examples.

Exact match on canonicalized text: Normalize and compare.
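
A minimal sketch of the n-gram overlap check (13-grams here; the exact n and the matching policy vary between teams):

Code
def ngram_set(text, n=13):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(train_docs, test_texts, n=13):
    # Flag training documents that share any long n-gram with the test set.
    test_ngrams = set()
    for t in test_texts:
        test_ngrams |= ngram_set(t, n)
    return [doc for doc in train_docs
            if ngram_set(doc["text"], n) & test_ngrams]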

Mitigation

Blocklisting: Identify and remove known benchmark sources from training data.

Temporal cutoffs: Use training data from before benchmarks were created (limited applicability for older benchmarks).

Held-out evaluation: Create private evaluation sets not released publicly.

Canary strings: Include unique strings in evaluation data to detect their presence in model outputs.

Tools and Infrastructure

Processing web-scale data requires serious infrastructure. Several tools and frameworks help:

NVIDIA NeMo Curator

A GPU-accelerated data curation framework:

  • Modular pipeline components
  • GPU-accelerated deduplication using RAPIDS
  • Quality classifiers
  • PII detection and redaction
  • Multi-node, multi-GPU scaling

NeMo Curator can process terabytes of text data efficiently, integrating semantic deduplication, heuristic filtering, classification, and PII redaction.

Datatrove

Hugging Face's data processing library:

  • Designed for FineWeb and similar datasets
  • Support for Common Crawl processing
  • Efficient distributed processing
  • Built-in quality filters

LP Data Pipeline

A lightweight framework optimized for CPU processing:

  • Lower infrastructure requirements
  • Focus on streamlined extraction and filtering
  • Reduced preparation time and cost

Custom Solutions

Many teams build custom pipelines using:

  • Apache Spark for distributed processing
  • Ray for parallel Python workloads
  • PostgreSQL/DuckDB for metadata management
  • MinIO/S3 for data storage

Practical Recommendations

Based on best practices from major dataset efforts:

Start with Established Datasets

Don't reinvent the wheel. Use filtered datasets as starting points:

  • FineWeb / FineWeb-Edu: State-of-the-art filtered Common Crawl
  • RedPajama v2: Open reproduction of Llama training data
  • RefinedWeb: Carefully curated web data
  • The Pile: Diverse, well-documented dataset

Invest in Deduplication

Deduplication provides the highest ROI for data quality. Use at minimum:

  • URL-based deduplication
  • Exact hash deduplication
  • MinHash near-deduplication

Build Quality Classifiers for Your Use Case

Generic quality filters are a good start, but domain-specific classifiers can significantly improve results for specialized applications.

Monitor What You're Removing

Always sample removed documents. You'll find:

  • False positives (good content incorrectly filtered)
  • Patterns to improve filters
  • Edge cases you didn't anticipate

Document Everything

Data curation decisions are hard to reverse. Document:

  • Filter thresholds and rationale
  • Data sources and versions
  • Processing pipeline code
  • Quality metrics at each stage

Think About Data Contamination Early

Once training starts, contamination is baked in. Check for benchmark contamination before training.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
