LLM Pre-training: Building Foundation Models from Scratch
A comprehensive guide to pre-training large language models—from data curation and architecture decisions to scaling laws and distributed training infrastructure. Understanding how GPT, Llama, and other foundation models are built.
What is Pre-training?
Pre-training is the foundational phase of building a large language model. It's where a model learns language itself—grammar, facts, reasoning patterns, and the statistical structure of human text—by processing massive amounts of data.
Think of pre-training as teaching a child to read and understand language by exposing them to millions of books, articles, conversations, and documents. The child doesn't memorize specific facts (though some stick); they develop an intuition for how language works, what words mean, how ideas connect, and how to reason about the world.
Pre-training is distinct from later training phases:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE LLM TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PRE-TRAINING (this post) │
│ ───────────────────────── │
│ • Train on trillions of tokens from the internet │
│ • Self-supervised learning (predict next token) │
│ • Result: Base model that can complete text │
│ • Cost: $1M - $100M+ compute │
│ • Time: Weeks to months │
│ │
│ ↓ │
│ │
│ SUPERVISED FINE-TUNING (SFT) │
│ ─────────────────────────── │
│ • Train on instruction-response pairs │
│ • Teaches model to follow instructions │
│ • Result: Model that responds helpfully │
│ • Cost: $1K - $100K compute │
│ • Time: Hours to days │
│ │
│ ↓ │
│ │
│ REINFORCEMENT LEARNING (RLHF/DPO) │
│ ───────────────────────────────── │
│ • Train on human preferences │
│ • Aligns model with human values │
│ • Result: Model that's helpful, harmless, honest │
│ • Cost: $10K - $1M compute │
│ • Time: Days to weeks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pre-training is by far the most expensive and foundational phase. Everything that follows—SFT, RLHF, fine-tuning for specific tasks—builds on the capabilities established during pre-training. A model can only be as good as its pre-training allows.
The Pre-training Objective: Learning to Predict
Next Token Prediction (Autoregressive LMs)
The dominant pre-training objective for modern LLMs is deceptively simple: predict the next token.
Given a sequence of tokens, predict what comes next:
Input: "The capital of France is"
Target: "Paris"
Input: "def fibonacci(n):\n if n <= 1:\n return"
Target: "n"
This objective is called autoregressive language modeling or causal language modeling. Models like GPT-4, Claude, Llama, and Mistral all use this approach.
Why Next Token Prediction Works So Well
At first glance, predicting the next word seems too simple to produce intelligent behavior. But consider what the model must learn to predict well:
To predict the next word in a sentence about physics:
- The model must understand physics concepts
- It must know how these concepts relate
- It must follow logical reasoning chains
To predict the next token in code:
- The model must understand syntax
- It must track variable types and scopes
- It must follow programming logic
To predict the next word in a dialogue:
- The model must understand context and intent
- It must model different perspectives
- It must follow conversational norms
The next token prediction objective is a "universal task" that requires mastering language at every level—from grammar and spelling to reasoning and world knowledge. The model isn't explicitly taught any of these skills; they emerge from the pressure to predict accurately.
The Mathematical Formulation
The pre-training objective minimizes cross-entropy loss over the training corpus:
┌─────────────────────────────────────────────────────────────────────────┐
│ NEXT TOKEN PREDICTION LOSS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For a sequence of tokens x₁, x₂, ..., xₙ: │
│ │
│ Loss = -∑ log P(xᵢ | x₁, x₂, ..., xᵢ₋₁) │
│ │
│ In words: For each position, how surprised is the model │
│ by the actual token given everything that came before? │
│ │
│ Lower loss = Better predictions = Better understanding │
│ │
│ Example: │
│ "The cat sat on the [mat]" │
│ │
│ If model predicts: │
│ • P("mat") = 0.3 → Loss contribution = -log(0.3) = 1.2 │
│ • P("mat") = 0.01 → Loss contribution = -log(0.01) = 4.6 │
│ • P("mat") = 0.9 → Loss contribution = -log(0.9) = 0.1 │
│ │
│ The model is trained to maximize probability of correct tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Perplexity: The Standard Metric
Perplexity is the standard metric for evaluating pre-trained language models. It's the exponential of the average loss:
Perplexity = exp(Loss / N)
Intuitively, perplexity represents how "confused" the model is. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options.
Typical perplexity ranges:
- Random guessing (50K vocab): ~50,000
- Bad language model: ~100-500
- Good language model: ~10-30
- State-of-the-art (on common benchmarks): ~5-15
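As a concrete illustration, here is a minimal PyTorch sketch of how the next-token cross-entropy loss and perplexity defined above are computed from a model's logits; the tensor names and shapes are illustrative, not any specific framework's API.

```python
import torch
import torch.nn.functional as F

def next_token_loss_and_perplexity(logits, token_ids):
    """logits: (batch, seq_len, vocab_size) from the model.
    token_ids: (batch, seq_len) the input tokens themselves."""
    # Position i predicts token i+1: drop the last logit and the first target.
    shift_logits = logits[:, :-1, :]
    shift_targets = token_ids[:, 1:]
    # Cross-entropy = -log P(correct next token), averaged over all positions.
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
    perplexity = torch.exp(loss)  # exponential of the average loss
    return loss, perplexity
```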
Masked Language Modeling (BERT-style)
An alternative pre-training objective is masked language modeling (MLM), used by BERT and its variants:
Input: "The [MASK] of France is Paris"
Target: "capital"
Instead of predicting the next token, the model predicts randomly masked tokens. This creates a bidirectional model—it can look both forward and backward when making predictions.
Comparison:
| Aspect | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Direction | Left-to-right only | Bidirectional |
| Use case | Text generation | Understanding/classification |
| Generation | Natural (token by token) | Unnatural (must iterate) |
| Context | Only past tokens | Full context |
| Modern preference | ✅ Dominant for LLMs | Used for embeddings |
Modern LLMs almost universally use autoregressive pre-training because generation is natural and the same model works for both understanding and generation.
Data: The Fuel for Pre-training
Pre-training data quality and quantity are arguably more important than architecture or training techniques. A model trained on high-quality data will outperform a larger model trained on low-quality data.
Scale: How Much Data?
Modern LLMs are trained on staggering amounts of text:
┌─────────────────────────────────────────────────────────────────────────┐
│ PRE-TRAINING DATA SCALE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Training Tokens Approximate Size │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 (2019) 40 billion ~40 GB text │
│ GPT-3 (2020) 300 billion ~570 GB text │
│ Chinchilla (2022) 1.4 trillion ~1.4 TB text │
│ Llama 2 (2023) 2 trillion ~2 TB text │
│ Llama 3 (2024) 15 trillion ~15 TB text │
│ GPT-4 (estimated) ~10-13 trillion ~10+ TB text │
│ │
│ For reference: │
│ • Wikipedia: ~4 billion tokens (~20 GB) │
│ • All books ever written: ~500 billion tokens (estimate) │
│ • Common Crawl (filtered): ~1-3 trillion tokens │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The trend is clear: more data leads to better models, but with diminishing returns. The question becomes: where does all this data come from?
Data Sources
1. Web Crawls (Common Crawl)
The foundation of most pre-training datasets is Common Crawl, a non-profit that has been crawling and archiving the web since 2008.
- Contains petabytes of raw web data
- New crawls released monthly
- Covers billions of web pages
But raw web data is mostly garbage. Common Crawl contains:
- Spam and SEO content
- Duplicated pages
- Boilerplate (navigation, ads, footers)
- Low-quality machine-generated text
- Malicious content
- Personally identifiable information
The art of using web data is in filtering. Models like Llama use only a small fraction of Common Crawl after aggressive filtering.
2. Books
Books provide high-quality, long-form, well-edited text:
- Books1/Books2: Datasets of digitized books (used by GPT-3)
- Project Gutenberg: Public domain books
- Internet Archive: Digital library
Books are valuable because they contain:
- Sustained reasoning and arguments
- Diverse writing styles
- Edited, high-quality prose
- Long-range dependencies
3. Code
Code has become increasingly important for LLM capabilities:
- GitHub: Public repositories
- Stack Overflow: Q&A with code
- The Stack: Curated code dataset
Code training improves:
- Reasoning ability (code requires logical thinking)
- Structured output (JSON, XML, etc.)
- Instruction following (code is precise)
- General capability (surprisingly broad transfer)
4. Scientific Papers
Academic content provides high-quality technical knowledge:
- ArXiv: Pre-prints across sciences
- PubMed: Medical literature
- Semantic Scholar: Academic papers
5. Curated/Synthetic Data
Increasingly, pre-training includes curated or synthetic data:
- Wikipedia: High-quality encyclopedic content (often upweighted)
- Textbooks: Educational content (Phi models heavily use this)
- Synthetic data: Generated by other LLMs for specific capabilities
Data Curation Pipeline
Raw data must be processed through an extensive pipeline before training:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA CURATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RAW WEB CRAWL │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ URL FILTERING │ Remove known bad domains, adult content, │
│ │ │ spam domains, etc. │
│ └────────┬────────┘ │
│ │ ~50% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ TEXT EXTRACTION │ Extract text from HTML, remove boilerplate, │
│ │ │ navigation, ads, scripts │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ LANGUAGE ID │ Keep only target languages │
│ │ │ (usually English + selected others) │
│ └────────┬────────┘ │
│ │ ~30% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ QUALITY FILTER │ Remove low-quality text using classifiers │
│ │ │ (trained on Wikipedia vs web text) │
│ └────────┬────────┘ │
│ │ ~60-80% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ DEDUPLICATION │ Remove duplicate documents (exact + fuzzy) │
│ │ │ Critical for training stability │
│ └────────┬────────┘ │
│ │ ~30-50% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ PII REMOVAL │ Remove emails, phone numbers, addresses, │
│ │ │ social security numbers, etc. │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ TOXICITY FILTER │ Remove hate speech, extreme content │
│ │ │ (classifiers or keyword lists) │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ CLEAN TRAINING DATA │
│ (typically 1-5% of raw crawl) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Quality Filtering in Detail
Quality filtering is perhaps the most impactful step. The goal is to keep text that looks like "good" writing and remove text that looks like spam, machine-generated content, or low-effort writing.
Common approaches:
Classifier-based: Train a classifier to distinguish Wikipedia/books (positive) from random web text (negative). Apply to all web data, keep high-scoring documents.
Heuristic-based: Apply rules like:
- Minimum/maximum document length
- Ratio of alphabetic characters to total
- Presence of stop words (real text has "the", "and", "is")
- Average word length (spam often has unusual patterns)
- Repetition detection (spam repeats phrases)
- Symbol ratio (too many special characters = bad)
Perplexity-based: Use a small pre-trained model to score text. Very high perplexity = unusual/bad text.
Human evaluation: Sample and manually rate documents to calibrate filters.
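To make the heuristic rules above concrete, here is a minimal sketch of a document-level filter. The thresholds are illustrative assumptions, not values from any published pipeline.

```python
def passes_heuristic_filter(doc: str) -> bool:
    """Illustrative quality heuristics; every threshold here is made up for the example."""
    words = doc.split()
    if not (50 <= len(words) <= 100_000):            # min/max document length
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.7:                            # too many symbols/digits
        return False
    stop_words = {"the", "and", "is", "of", "to", "in"}
    if sum(w.lower() in stop_words for w in words) / len(words) < 0.02:
        return False                                 # real prose contains stop words
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 10):                # spam has unusual word lengths
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.6: # crude repetition check
        return False
    return True
```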
Deduplication: More Important Than You'd Think
Duplicate and near-duplicate documents cause serious problems:
- Wasted compute: Training on the same content twice doesn't help
- Memorization: Exact duplicates encourage memorization over generalization
- Evaluation contamination: If test data appears in training, benchmarks are invalid
- Privacy risks: Duplicated private information is more likely to be memorized
Deduplication methods:
Exact deduplication: Hash documents, remove duplicates. Fast but misses near-duplicates.
MinHash/LSH: Create document fingerprints, find similar documents efficiently. Can catch documents that differ by a few words.
Suffix array: Find repeated substrings across the corpus. Can remove duplicated paragraphs even if documents differ overall.
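Here is a toy sketch of MinHash-based near-duplicate detection. Production pipelines use optimized libraries and LSH banding instead of pairwise comparison; the shingle size, number of hashes, and similarity cutoff below are assumptions.

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 64, shingle_size: int = 5) -> list[int]:
    """Toy MinHash: hash word 5-grams, keep the minimum hash per seed."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(len(words) - shingle_size + 1, 1))}
    signature = []
    for seed in range(num_hashes):
        min_h = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        signature.append(min_h)
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots approximates Jaccard similarity of the documents."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents with estimated_jaccard above ~0.8 would be treated as near-duplicates;
# real pipelines bucket signatures with locality-sensitive hashing rather than
# comparing every pair.
```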
Data Mixing: The Art of Proportions
Pre-training data comes from multiple sources. The mixture proportions significantly impact model capabilities:
┌─────────────────────────────────────────────────────────────────────────┐
│ EXAMPLE DATA MIXTURES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LLAMA 2 (reported): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Web data: ~89% │ │
│ │ Code: ~4% │ │
│ │ Wikipedia: ~4% │ │
│ │ Books: ~3% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ GPT-3 (reported): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Common Crawl: 60% │ │
│ │ WebText2: 22% │ │
│ │ Books: 8% │ │
│ │ Wikipedia: 3% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ PHI (Microsoft): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Textbooks: Very high (exact % unknown) │ │
│ │ Synthetic: Very high │ │
│ │ Code: Significant │ │
│ │ Web data: Lower than typical │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Key insight: More high-quality data is often better than more │
│ low-quality data, even if total tokens is lower. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Upweighting: High-quality sources like Wikipedia are often shown to the model multiple times (2-10x). This is called "upweighting" or "oversampling."
Curriculum: Some approaches vary the mixture during training—starting with easier/cleaner data and adding harder/noisier data later.
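In practice, the mixture is usually implemented as weighted sampling over sources when assembling training batches. A minimal sketch, with made-up proportions rather than any model's actual recipe:

```python
import random

# Illustrative mixture weights (not taken from any particular model)
MIXTURE = {
    "web":       0.82,
    "code":      0.08,
    "wikipedia": 0.05,   # small source, so this effectively upweights it several times
    "books":     0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training document."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly matches the target proportions
```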
Architecture: The Transformer and Its Variants
The Transformer Foundation
All modern LLMs are based on the Transformer architecture, introduced in "Attention Is All You Need" (2017). The key components:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRANSFORMER DECODER BLOCK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Input (sequence of tokens) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN EMBEDDING │ │
│ │ Convert discrete tokens to continuous vectors │ │
│ │ vocab_size × hidden_dim matrix lookup │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ POSITIONAL ENCODING │ │
│ │ Add position information (since attention is permutation- │ │
│ │ invariant without it) │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ╔═══════════════════════════════════════════╗ │
│ ║ TRANSFORMER BLOCK (×N) ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ MASKED SELF-ATTENTION │ ║ │
│ ║ │ Each position attends to previous │ ║ │
│ ║ │ positions only (causal mask) │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ ADD & LAYER NORM │ ║ │
│ ║ │ Residual connection + normalize │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ FEED-FORWARD NETWORK │ ║ │
│ ║ │ Two linear layers with activation │ ║ │
│ ║ │ (typically 4× hidden_dim) │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ ADD & LAYER NORM │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ╚══════════════════════╪════════════════════╝ │
│ │ │
│ ▼ (repeat N times) │
│ │ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT PROJECTION │ │
│ │ Project back to vocabulary size for next token prediction │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key Architecture Decisions
1. Model Dimensions
The "size" of a model is determined by several hyperparameters:
| Parameter | Description | Typical Values |
|---|---|---|
| hidden_dim | Width of representations | 768 - 8192 |
| num_layers | Depth of transformer stack | 12 - 96 |
| num_heads | Parallel attention heads | 12 - 64 |
| vocab_size | Number of tokens | 32K - 128K |
| context_length | Maximum sequence length | 2K - 128K |
Parameter count formula:
params ≈ 12 × num_layers × hidden_dim²
This is approximate; the exact count depends on vocabulary size and other factors.
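A quick sketch of this estimate in Python, using a Llama-2-7B-like shape as the example; the formula ignores norms, biases, and the exact FFN variant.

```python
def approx_params(num_layers: int, hidden_dim: int, vocab_size: int) -> int:
    """Rough decoder-only parameter count."""
    per_layer = 12 * hidden_dim ** 2          # ~4·d² for attention + ~8·d² for the FFN
    embeddings = vocab_size * hidden_dim      # token embedding (assuming a tied output head)
    return num_layers * per_layer + embeddings

# Example: a Llama-2-7B-like shape
print(approx_params(num_layers=32, hidden_dim=4096, vocab_size=32_000))
# ≈ 6.6B, close to the nominal 7B
```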
2. Attention Variants
Standard self-attention is O(n²) in sequence length, which becomes prohibitive for long contexts. Many variants address this:
Multi-Head Attention (standard):
- Multiple parallel attention "heads"
- Each head can focus on different relationship types
- Outputs concatenated and projected
Grouped Query Attention (GQA):
- Used in Llama 2/3, Mistral
- Multiple query heads share fewer key-value heads
- Reduces memory usage during inference
- Similar quality to full multi-head
Multi-Query Attention (MQA):
- Extreme version: all queries share one KV head
- Maximum memory efficiency
- Slight quality trade-off
Sliding Window Attention:
- Used in Mistral, Mixtral
- Each position only attends to nearby positions
- Enables very long sequences efficiently
Ring Attention / Sequence Parallelism:
- Distribute attention across devices
- Enables training on very long sequences
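As an illustration of GQA specifically, here is a minimal PyTorch sketch. The projections, RoPE, and KV caching are omitted, and the head counts in the example are just one common configuration; the essential idea is that several query heads share each key-value head.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads: int):
    """q: (batch, num_q_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim) with num_kv_heads < num_q_heads."""
    num_q_heads = q.shape[1]
    group_size = num_q_heads // num_kv_heads
    # Each KV head serves `group_size` query heads: expand KV to match.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example shapes: 32 query heads sharing 8 KV heads
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = grouped_query_attention(q, k, v, num_kv_heads=8)  # (1, 32, 128, 64)
```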
3. Positional Encoding
Transformers need position information added explicitly. Evolution of approaches:
Absolute Positional Embeddings (original):
- Learned embedding for each position
- Limited to trained context length
- Can't extrapolate to longer sequences
Rotary Position Embedding (RoPE):
- Used in Llama, Mistral, most modern LLMs
- Encodes relative positions through rotation
- Better length generalization
- Can be extended with interpolation
ALiBi (Attention with Linear Biases):
- Used in BLOOM, MPT
- Adds linear bias based on position distance
- No learned parameters for position
- Good length generalization
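A compact sketch of RoPE in the "rotate half the dimensions" form used by Llama-style implementations; production code precomputes and caches the cos/sin tables instead of recomputing them per call.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, heads, seq, head_dim). Pairs dimension i with i + head_dim/2
    and rotates each pair by a position- and frequency-dependent angle."""
    _, _, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)       # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs    # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys before attention; because the rotation depends only
# on position, the dot product q·k ends up depending on relative distance.
```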
4. Normalization
Where and how to normalize activations:
Post-Norm (original Transformer):
- Normalize after attention/FFN
- Can be unstable for deep networks
Pre-Norm (GPT-2 onwards):
- Normalize before attention/FFN
- More stable training
- Now standard for LLMs
RMSNorm (Llama):
- Simplified LayerNorm (removes mean centering)
- Slightly faster, similar quality
- Increasingly standard
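RMSNorm is simple enough to show in full; a minimal PyTorch version:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: scale by the RMS of the vector."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```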
5. Feed-Forward Network Variants
Standard FFN: two linear layers with a nonlinearity in between, FFN(x) = W₂ · GELU(W₁x), with the hidden width typically 4× hidden_dim.
SwiGLU (Llama, most modern LLMs): a gated variant, FFN(x) = W₃ · (SiLU(W₁x) ⊙ W₂x).
SwiGLU adds a gating mechanism that empirically improves quality; because it uses three weight matrices instead of two, the hidden width is usually shrunk (e.g. to roughly 2/3 of 4× hidden_dim) to keep the parameter count comparable.
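A minimal PyTorch sketch of the SwiGLU feed-forward block described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down( SiLU(gate(x)) * up(x) )."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up   = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```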
Model Scale Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ MODEL ARCHITECTURE COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Layers Hidden Heads Context Year │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 1.5B 48 1600 25 1K 2019 │
│ GPT-3 175B 96 12288 96 2K 2020 │
│ Llama 2 7B 7B 32 4096 32 4K 2023 │
│ Llama 2 70B 70B 80 8192 64 4K 2023 │
│ Llama 3 8B 8B 32 4096 32 8K 2024 │
│ Llama 3 70B 70B 80 8192 64 8K 2024 │
│ Llama 3 405B 405B 126 16384 128 128K 2024 │
│ Mistral 7B 7B 32 4096 32 32K 2023 │
│ Mixtral 8×7B 47B* 32 4096 32 32K 2024 │
│ Qwen2.5 72B 72B 80 8192 64 128K 2024 │
│ │
│ *Mixtral uses Mixture-of-Experts; 47B total but 13B active per token │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mixture of Experts (MoE)
A major architectural trend is Mixture of Experts, where the model has multiple "expert" feed-forward networks and a router selects which experts to use for each token:
┌─────────────────────────────────────────────────────────────────────────┐
│ MIXTURE OF EXPERTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Standard Transformer: │
│ ┌────────────┐ │
│ │ FFN │ Same FFN for every token │
│ └────────────┘ │
│ │
│ Mixture of Experts: │
│ ┌─────────────┐ │
│ │ Router │ Decides which experts to use │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Expert 1 │ │Expert 2 │ ... │Expert N │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Typically: 8 experts, top-2 selected per token │
│ │
│ Benefits: │
│ • More parameters without more compute (sparse activation) │
│ • Experts can specialize (one for code, one for math, etc.) │
│ • Better scaling properties │
│ │
│ Examples: │
│ • Mixtral 8×7B: 47B total params, 13B active │
│ • Grok-1: 314B total, mixture of experts │
│ • GPT-4 (rumored): MoE architecture │
│ │
└─────────────────────────────────────────────────────────────────────────┘
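A bare-bones sketch of a top-2 MoE layer. Real implementations add a load-balancing auxiliary loss and per-expert capacity limits, and dispatch tokens far more efficiently; both are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sketch of a top-2 mixture-of-experts FFN layer (no balancing loss, no capacity limits)."""
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Router scores every expert, keeps the best two per token.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, indices = scores.topk(2, dim=-1)        # (tokens, 2)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```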
Scaling Laws: The Science of Bigger Models
The Scaling Laws Discovery
One of the most important discoveries in modern AI is that LLM performance follows predictable scaling laws. Given more compute, data, or parameters, we can predict how much better the model will be.
The foundational paper, "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), showed that loss decreases as a power law with scale:
L(N) = (Nc / N)^αN   (model size)
L(D) = (Dc / D)^αD   (data size)
L(C) = (Cc / C)^αC   (compute)
Where:
- L = loss (lower is better)
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
- α = fitted exponents, roughly 0.05-0.1 (Nc, Dc, Cc are fitted constants)
Key insight: Performance improves smoothly and predictably with scale. There's no magic threshold; just keep scaling.
Chinchilla Scaling Laws
DeepMind's "Chinchilla" paper (2022) refined the scaling laws and revealed that most models were undertrained:
┌─────────────────────────────────────────────────────────────────────────┐
│ CHINCHILLA OPTIMAL SCALING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Previous wisdom (GPT-3 era): │
│ "Make models as big as possible, train on fixed data" │
│ │
│ Chinchilla discovery: │
│ "Optimal allocation: scale data and parameters equally" │
│ │
│ OPTIMAL RATIO: ~20 tokens per parameter │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Model Size │ Optimal Training Tokens │ Actual (GPT-3 era) │ │
│ ├────────────────┼─────────────────────────┼─────────────────────│ │
│ │ 1B params │ 20B tokens │ ~300B (overtrained)│ │
│ │ 10B params │ 200B tokens │ ~300B │ │
│ │ 70B params │ 1.4T tokens │ ~300B (undertrained)│ │
│ │ 175B params │ 3.5T tokens │ ~300B (very under) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Implication: GPT-3 (175B, 300B tokens) was massively undertrained. │
│ Chinchilla (70B, 1.4T tokens) outperformed the much larger Gopher (280B) │
│ trained with the same compute budget, and also outperformed GPT-3.       │
│ │
└─────────────────────────────────────────────────────────────────────────┘
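The Chinchilla rule of thumb is easy to turn into arithmetic. A small helper, assuming the standard C ≈ 6·N·D FLOPs approximation and the ~20 tokens-per-parameter ratio:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Given a budget C ≈ 6·N·D and the ~20 tokens/param rule (D = 20N),
    solve for the compute-optimal parameter and token counts."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: roughly the compute used for Chinchilla itself (~5.9e23 FLOPs)
n, d = chinchilla_optimal(5.9e23)
print(f"{n/1e9:.0f}B params, {d/1e12:.1f}T tokens")  # ≈ 70B params, 1.4T tokens
```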
Beyond Chinchilla: Modern Scaling
Post-Chinchilla, the field has moved toward even longer training:
Llama approach: Train smaller models for much longer than Chinchilla-optimal:
- Llama 2 7B: Trained on 2T tokens (~285 tokens per parameter, ~14× the Chinchilla-optimal ratio)
- Llama 3 8B: Trained on 15T tokens (~1,875 tokens per parameter, ~94× Chinchilla-optimal)
Why overtrain? Inference cost matters. A smaller model trained longer:
- Costs less to run in production
- Is faster for users
- Uses less memory
- Can run on smaller GPUs/edge devices
The trade-off: You spend more compute during training (once) to save compute during inference (millions of times).
Emergent Abilities
An important phenomenon in scaling: emergent abilities appear suddenly at certain scales.
┌─────────────────────────────────────────────────────────────────────────┐
│ EMERGENT ABILITIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Performance │
│ ▲ │
│ │ ┌────── Emergent ability appears │
│ │ │ (sudden jump) │
│ │ ▼ │
│ │ ╭───────────── │
│ │ ╭────╯ │
│ │ ╭────╯ │
│ │ ╭────╯ │
│ │╭────╯ │
│ └──────────────────────────────────────▶ Scale │
│ Small Medium Large Very Large │
│ │
│ Examples of emergent abilities: │
│ • Chain-of-thought reasoning (~100B params) │
│ • Multi-step arithmetic (~10B params) │
│ • Word unscrambling (~10B params) │
│ • Instruction following (~1B params) │
│ │
│ Note: Recent research suggests emergence may be a measurement │
│ artifact—abilities may improve smoothly but our metrics have │
│ thresholds that create apparent discontinuities. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Training Infrastructure: Making It Work at Scale
Pre-training a large language model requires massive distributed computing infrastructure. The challenges are immense:
The Scale of the Problem
Consider training a 70B parameter model on 2T tokens:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING COMPUTE REQUIREMENTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model: 70B parameters, 2T tokens │
│ │
│ FLOPs required: ~6 × N × D = 6 × 70B × 2T = 8.4 × 10²³ FLOPs │
│ │
│ Single A100 (80GB): │
│ • Peak: 312 TFLOPS (bfloat16) │
│ • Realistic: ~150 TFLOPS (50% utilization) │
│ • Time: 8.4×10²³ / 150×10¹² = 5.6×10⁹ seconds = 177 years │
│ │
│ 1024 A100s (typical large cluster): │
│ • Time: 177 years / 1024 ≈ 63 days │
│ • Cost: ~$5-10M (cloud pricing) │
│ │
│ Memory requirement: │
│ • Model parameters: 70B × 2 bytes = 140 GB │
│ • Optimizer states: 70B × 12 bytes = 840 GB (Adam) │
│ • Gradients: 70B × 2 bytes = 140 GB │
│ • Activations: Variable, but large │
│ • Total: >1 TB just for model state │
│ │
│ This is why distributed training is essential. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
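The arithmetic in the box above can be packaged as a back-of-the-envelope estimator; the GPU count and effective throughput are assumptions you would adjust for your own hardware.

```python
def estimate_training(params: float, tokens: float,
                      num_gpus: int = 1024,
                      tflops_per_gpu: float = 150.0) -> None:
    """Rough training cost using FLOPs ≈ 6·N·D and Adam model-state memory."""
    flops = 6 * params * tokens
    seconds = flops / (num_gpus * tflops_per_gpu * 1e12)
    print(f"Total compute:   {flops:.2e} FLOPs")
    print(f"Wall-clock time: {seconds / 86400:.0f} days on {num_gpus} GPUs")
    # Model state only (no activations): BF16 weights + BF16 grads + Adam in FP32
    state_bytes = params * (2 + 2 + 12)
    print(f"Model state:     {state_bytes / 1e12:.2f} TB across the cluster")

estimate_training(params=70e9, tokens=2e12)
# ≈ 8.4e23 FLOPs, ~63 days on 1024 GPUs at 150 TFLOPS, ~1.1 TB of model state
```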
Distributed Training Strategies
Data Parallelism
The simplest form: replicate the model on each GPU, split data across GPUs.
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA PARALLELISM │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Model │ │ Model │ │ Model │ │ Model │ │
│ │ (copy) │ │ (copy) │ │ (copy) │ │ (copy) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Batch 0 Batch 1 Batch 2 Batch 3 │
│ │ │ │ │ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ Gradient AllReduce │
│ (average gradients) │
│ │
│ Limitation: Model must fit on single GPU │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Model Parallelism (Tensor Parallelism)
Split individual layers across GPUs: each weight matrix is partitioned, so every GPU holds a slice of every layer.
┌─────────────────────────────────────────────────────────────────────────┐
│ TENSOR PARALLELISM │
│ │
│ Linear layer: Y = XW │
│ │
│ Split W across GPUs: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ W₀ │ │ W₁ │ │ W₂ │ │ W₃ │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ X·W₀ X·W₁ X·W₂ X·W₃ │
│ │ │ │ │ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ AllGather [Y₀, Y₁, Y₂, Y₃] │
│ │
│ Benefit: Can train models larger than single GPU memory │
│ Cost: Communication overhead between GPUs │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pipeline Parallelism
Split the model's layers across GPUs sequentially. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.
┌─────────────────────────────────────────────────────────────────────────┐
│ PIPELINE PARALLELISM │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ Layers Layers Layers Layers │
│ 1-8 9-16 17-24 25-32 │
│ │ │ │ │ │
│ ▼ │ │ │ │
│ [===]──────────▼ │ │ │
│ │ [===]───────────▼ │ │
│ │ │ [===]───────────▼ │
│ │ │ │ [===] │
│ │ │ │ │ │
│ ◄────────────◄────────────◄────────────┘ (backward pass) │
│ │
│ Problem: "Pipeline bubble" - GPUs idle waiting for others │
│ │
│ Solution: Micro-batching (split batch into micro-batches) │
│ │
│ GPU 0: [=1=][=2=][=3=][=4=][ ][ ][=4=][=3=][=2=][=1=] │
│ GPU 1: [ ][=1=][=2=][=3=][=4=][=4=][=3=][=2=][=1=][ ] │
│ GPU 2: [ ][ ][=1=][=2=][=3=][=3=][=2=][=1=][ ][ ] │
│ GPU 3: [ ][ ][ ][=1=][=2=][=2=][=1=][ ][ ][ ] │
│ └── forward ──┘ └── backward ──┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Fully Sharded Data Parallelism (FSDP) / ZeRO
The modern standard: shard model parameters, gradients, and optimizer states across GPUs.
┌─────────────────────────────────────────────────────────────────────────┐
│ FSDP / ZeRO │
│ │
│ Key insight: Each GPU only needs all parameters during forward/back │
│ pass. Rest of the time, keep only a shard. │
│ │
│ Standard Data Parallel: │
│ GPU 0: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 1: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 2: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 3: [Full Model] [Full Optimizer] [Full Gradients] │
│ Memory: 4× redundant │
│ │
│ FSDP / ZeRO Stage 3: │
│ GPU 0: [Model¼] [Optim¼] [Grad¼] │
│ GPU 1: [Model¼] [Optim¼] [Grad¼] │
│ GPU 2: [Model¼] [Optim¼] [Grad¼] │
│ GPU 3: [Model¼] [Optim¼] [Grad¼] │
│ Memory: ~4× reduction │
│ │
│ During forward: AllGather to reconstruct full layer, compute, discard │
│ During backward: AllGather weights, compute gradients, ReduceScatter │
│ │
│ Trade-off: More communication, but can train much larger models │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mixed Precision Training
Modern training uses multiple numeric precisions to balance speed and stability:
- FP32 (32-bit float): Full precision, 4 bytes per parameter
- FP16 (16-bit float): Half precision, 2 bytes, but limited range
- BF16 (bfloat16): Half precision with FP32's dynamic range, 2 bytes
- FP8: Emerging, even smaller, requires careful handling
┌─────────────────────────────────────────────────────────────────────────┐
│ MIXED PRECISION TRAINING │
│ │
│ Typical setup: │
│ • Master weights: FP32 (for accuracy in optimizer) │
│ • Forward pass: BF16 (speed) │
│ • Gradients: BF16 (speed) │
│ • Optimizer states: FP32 (stability) │
│ │
│ Why BF16 over FP16? │
│ • FP16 range: ±65,504 (overflow risk) │
│ • BF16 range: ±3.4×10³⁸ (same as FP32) │
│ • BF16 has less precision but rarely causes issues │
│ │
│ Memory effects:                                                         │
│ • Model state is ~16 bytes/param either way: the FP32 master weights    │
│   and Adam moments dominate (ZeRO/FSDP is what shards this down)        │
│ • The real savings are in activations and communication, which are      │
│   roughly halved by keeping them in BF16                                │
│ │
│ Speed improvement: │
│ • Modern GPUs have tensor cores optimized for FP16/BF16 │
│ • ~2× throughput for matrix operations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
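A minimal sketch of a BF16 mixed-precision training step in PyTorch, assuming the model's forward returns next-token logits. Note that BF16, unlike FP16, does not need a loss scaler.

```python
import torch
import torch.nn.functional as F

def bf16_train_step(model, input_ids, optimizer, max_grad_norm: float = 1.0) -> float:
    """One BF16 autocast step; the optimizer still keeps parameters and states in FP32."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(input_ids)                         # (batch, seq, vocab), assumed
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # position i predicts token i+1
            input_ids[:, 1:].reshape(-1),
        )
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```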
Checkpointing and Fault Tolerance
Training runs for weeks or months. Hardware fails. Checkpointing is essential:
Regular checkpoints: Save model state every N steps (e.g., every 1000 steps)
Includes:
- Model parameters
- Optimizer states
- Learning rate scheduler state
- Random number generator states
- Current step/epoch
- Data loader state (which samples have been seen)
Checkpoint size: For a 70B model, each checkpoint is ~1-2 TB
Fault tolerance strategies:
- Redundant storage (multiple copies)
- Automatic restart from latest checkpoint
- Elastic training (can add/remove GPUs)
- Preemption handling (for cloud spot instances)
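A minimal single-process sketch of what gets saved and restored. Real large-scale runs use sharded, distributed checkpoint formats rather than a single torch.save file, but the contents are essentially the same.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
        "torch_rng": torch.get_rng_state(),
        # in practice also: CUDA/NumPy/Python RNG states and the data-loader position
    }, path)

def load_checkpoint(path, model, optimizer, scheduler) -> int:
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["torch_rng"])
    return ckpt["step"]
```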
Training Dynamics: What Happens During Pre-training
The Loss Curve
Pre-training produces a characteristic loss curve:
┌─────────────────────────────────────────────────────────────────────────┐
│ TYPICAL TRAINING LOSS CURVE │
│ │
│ Loss │
│ 12 │╲ │
│ │ ╲ │
│ 10 │ ╲ │
│ │ ╲ │
│ 8 │ ╲ │
│ │ ╲ │
│ 6 │ ╲ │
│ │ ╲ │
│ 4 │ ╲____ │
│ │ ╲____ │
│ 2 │ ╲________________________________________ │
│ │ │
│ 0 └────────────────────────────────────────────────────────────── │
│ 0 100K 200K 300K 400K 500K Steps │
│ │
│ Phases: │
│ 1. Rapid initial decrease (learning basic patterns) │
│ 2. Steady improvement (learning more complex patterns) │
│ 3. Diminishing returns (approaching data/model limits) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Learning Rate Schedule
The learning rate is the most important hyperparameter. Modern LLMs use a warmup + cosine decay schedule:
┌─────────────────────────────────────────────────────────────────────────┐
│ LEARNING RATE SCHEDULE │
│ │
│ LR │
│ 3e-4 │ ╭────────╮ │
│ │ ╱ ╲ │
│ 2e-4 │ ╱ ╲ │
│ │ ╱ ╲ │
│ 1e-4 │ ╱ ╲_____ │
│ │ ╱ ╲____ │
│ 0 │──╱ ╲_________________________ │
│ └────────────────────────────────────────────────────────── │
│ │←─ Warmup ─→│←──────── Cosine Decay ─────────────→│ │
│ (~2K steps) (rest of training) │
│ │
│ Warmup: Gradually increase LR to avoid instability at start │
│ Peak: Maximum learning rate (model-size dependent) │
│ Decay: Smoothly decrease to ~10% of peak │
│ │
│ Typical peak LR by model size: │
│ • 1B params: 3e-4 │
│ • 7B params: 3e-4 │
│ • 70B params: 1.5e-4 │
│ • 175B+ params: 1e-4 or lower │
│ │
│ Larger models need lower learning rates for stability. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
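The warmup + cosine schedule in the box is only a few lines of code. A sketch, with illustrative defaults for the peak LR, warmup length, and decay floor:

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float = 3e-4,
               warmup_steps: int = 2000, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)

# e.g. for a 500K-step run:
for s in (0, 1000, 2000, 250_000, 500_000):
    print(s, f"{lr_at_step(s, max_steps=500_000):.2e}")
```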
Training Instabilities
Large-scale training is notoriously unstable. Common issues:
Loss spikes: Sudden increases in loss, often due to bad data batches or numeric issues.
Divergence: Loss increases and never recovers. Usually requires restarting from earlier checkpoint with lower LR.
Gradient explosion: Gradients become NaN/Inf. Requires gradient clipping.
Gradient vanishing: Gradients become zero, training stalls. Often an architecture issue.
Mitigation strategies:
- Gradient clipping (max norm, typically 1.0)
- Learning rate warmup
- Careful initialization
- Monitoring gradient norms, activation statistics
- Data quality checks
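One common mitigation pattern is to skip an update entirely when the loss or gradient norm looks pathological, rather than let a single bad batch derail the run. A sketch with illustrative thresholds:

```python
import torch

def guarded_step(model, loss, optimizer, max_grad_norm: float = 1.0,
                 spike_factor: float = 10.0, running_loss: float | None = None) -> bool:
    """Skip the update when loss or gradients look pathological (thresholds illustrative)."""
    if not torch.isfinite(loss) or (running_loss and loss.item() > spike_factor * running_loss):
        optimizer.zero_grad(set_to_none=True)      # drop this batch entirely
        return False
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):              # NaN/Inf gradients: skip
        optimizer.zero_grad(set_to_none=True)
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```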
What the Model Learns (Progression)
Research suggests models learn different capabilities at different stages:
Early training (0-10% of tokens):
- Basic syntax and grammar
- Common word associations
- Simple patterns
Mid training (10-50% of tokens):
- Factual knowledge
- More complex reasoning
- Domain knowledge
Late training (50-100% of tokens):
- Subtle linguistic patterns
- Complex reasoning chains
- Edge cases and rare patterns
Implication: You can probe model capabilities throughout training. Some capabilities emerge suddenly; others improve gradually.
Continued Pre-training vs. Pre-training from Scratch
What is Continued Pre-training?
Instead of training a model from random initialization, continued pre-training takes an existing pre-trained model and trains it further on new data.
┌─────────────────────────────────────────────────────────────────────────┐
│ CONTINUED PRE-TRAINING │
│ │
│ FROM SCRATCH: │
│ Random Init ──────────────────────────────────────────→ Final Model │
│ Train on 15T tokens │
│ (months of compute) │
│ │
│ CONTINUED PRE-TRAINING: │
│ Llama 3 Base ─────────────────────→ Domain-Specific Model │
│ Train on 100B domain tokens │
│ (days of compute) │
│ │
│ Use cases: │
│ • Domain adaptation (medical, legal, financial) │
│ • Language adaptation (add new language) │
│ • Knowledge updating (more recent data) │
│ • Context length extension │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When to Use Each Approach
Pre-train from scratch when:
- You have massive compute budget
- Existing models don't fit your needs
- You need full control over data and capabilities
- You're pushing the frontier
Continue pre-training when:
- You need domain specialization
- You want to add capabilities to existing model
- Budget is limited (10-100× cheaper)
- Base model is close to what you need
Continued Pre-training Considerations
Catastrophic forgetting: The model may "forget" general capabilities while learning domain-specific ones. Mitigation: mix domain data with general data (e.g., 50/50).
Learning rate: Use a lower learning rate than original pre-training (typically 10-30% of original peak).
Data quality: Domain data must be high quality; the model has strong priors from original training that bad data will conflict with.
Practical Considerations
Compute Requirements by Model Size
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPUTE REQUIREMENTS GUIDE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Size │ Min GPUs │ Typical GPUs │ Training Time │ Est. Cost │
│ ─────────────────────────────────────────────────────────────────── │
│ 1B │ 1-4 │ 8 A100s │ 1-2 days │ $5K-20K │
│ 7B │ 8 │ 32-64 A100s │ 1-2 weeks │ $50K-200K │
│ 13B │ 16 │ 64-128 A100s │ 2-4 weeks │ $100K-500K │
│ 70B │ 64 │ 256-512 A100s│ 1-2 months │ $1M-5M │
│ 175B+ │ 256+ │ 1000+ A100s │ 2-4 months │ $5M-50M+ │
│ │
│ Notes: │
│ • Costs assume cloud pricing; owned hardware is cheaper long-term │
│ • Times assume Chinchilla-optimal; overtraining takes longer │
│ • H100s are ~2× faster than A100s │
│ • Costs dropping ~50% per year │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Data Requirements
Rough guidelines for tokens needed:
- Chinchilla-optimal: 20× parameter count
- Modern practice: roughly 100-2,000× parameter count (Llama 3 8B was trained at ~1,875×)
- Maximum useful: unknown; returns diminish, but small models have kept improving well past the Chinchilla ratio
Data collection is expensive:
- Licensing costs for books, papers
- Compute for web crawling and filtering
- Human annotation for quality assessment
- Legal review for compliance
Common Failure Modes
- Data contamination: Test set data appears in training, invalidating benchmarks
- Under-filtering: Low-quality data hurts model quality
- Over-filtering: Too aggressive filtering loses valuable content
- Imbalanced mixture: Wrong proportions hurt specific capabilities
- Training instability: Loss spikes, divergence requiring restarts
- Infrastructure failures: Hardware failures, networking issues
Summary: The Pre-training Recipe
┌─────────────────────────────────────────────────────────────────────────┐
│ PRE-TRAINING RECIPE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA │
│ ├── Collect diverse sources (web, books, code, papers) │
│ ├── Filter aggressively for quality │
│ ├── Deduplicate thoroughly │
│ ├── Remove PII and toxic content │
│ └── Mix in appropriate proportions │
│ │
│ 2. ARCHITECTURE │
│ ├── Transformer decoder (standard) │
│ ├── Modern improvements (RoPE, SwiGLU, RMSNorm, GQA) │
│ ├── Size based on compute budget │
│ └── Consider MoE for efficiency │
│ │
│ 3. TRAINING │
│ ├── Distributed training (FSDP/ZeRO + tensor/pipeline parallel) │
│ ├── Mixed precision (BF16) │
│ ├── Adam optimizer (or variants) │
│ ├── Warmup + cosine LR schedule │
│ ├── Gradient clipping │
│ └── Regular checkpointing │
│ │
│ 4. MONITORING │
│ ├── Loss curves │
│ ├── Gradient norms │
│ ├── Periodic evaluation on benchmarks │
│ └── Hardware utilization │
│ │
│ 5. OUTPUT │
│ └── Base model ready for SFT and RLHF │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Tokenization: The Foundation of Everything
Before any training can happen, text must be converted into tokens—discrete units the model can process. Tokenization decisions have profound effects on model capabilities.
Why Tokenization Matters
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION IMPACT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE SAME TEXT, DIFFERENT TOKENIZATIONS: │
│ │
│ Text: "tokenization is important" │
│ │
│ Word-level: ["tokenization", "is", "important"] → 3 tokens │
│ Character: ["t","o","k","e","n","i","z",...] → 23 tokens │
│ BPE (typical):["token", "ization", "is", "important"] → 4 tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS MATTERS: │
│ │
│ 1. EFFECTIVE CONTEXT LENGTH │
│ A 4K context window holds: │
│ • ~3,200 words with efficient tokenizer │
│ • ~2,000 words with inefficient tokenizer │
│ • The difference is massive for long documents │
│ │
│ 2. TRAINING EFFICIENCY │
│ More tokens per word = more compute per concept learned │
│ Efficient tokenization = faster, cheaper training │
│ │
│ 3. MULTILINGUAL CAPABILITY │
│ Poor tokenizers fragment non-English text: │
│ • "Hello" → 1 token │
│ • "你好" → 3-4 tokens (same meaning!) │
│ This makes non-English slower and uses more context │
│ │
│ 4. CAPABILITY BOUNDARIES │
│ Tokenizers affect what the model "sees": │
│ • "1234" as one token vs "12" "34" → different arithmetic │
│ • Code symbols tokenized together vs split → affects coding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Modern Tokenization: BPE and Beyond
Byte Pair Encoding (BPE) is the dominant tokenization algorithm. It learns a vocabulary from data by iteratively merging frequent character pairs:
┌─────────────────────────────────────────────────────────────────────────┐
│ BPE ALGORITHM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING (learning the vocabulary): │
│ │
│ Start: Each character is a token │
│ Corpus: "low lower lowest" │
│ Initial: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't'] │
│ │
│ Iteration 1: Most common pair = ('l', 'o') → merge to 'lo' │
│ Iteration 2: Most common pair = ('lo', 'w') → merge to 'low' │
│ Iteration 3: Most common pair = ('e', 'r') → merge to 'er' │
│ Iteration 4: Most common pair = ('low', 'er') → merge to 'lower' │
│ ...continue until vocabulary size reached │
│ │
│ Final vocabulary includes: 'low', 'lower', 'lowest', 'er', 'est', ... │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ENCODING (using the vocabulary): │
│ │
│ Greedy longest match from learned vocabulary: │
│ "lowest prices" → ["lowest", " ", "prices"] │
│ │
│ Unknown characters fall back to byte representation │
│ (handles any Unicode, even emoji) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
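The training loop described in the box can be written as a short toy implementation. Real tokenizers (SentencePiece, tiktoken) are far more optimized and typically operate on bytes with explicit tie-breaking rules; this sketch just shows the merge-learning idea.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    words = Counter(tuple(word) for word in corpus)   # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():           # apply the new merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

print(learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=4))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...] depending on tie-breaking
```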
SentencePiece is a popular implementation that operates on raw text with no separate pre-tokenization step (with optional byte fallback for unknown characters), making it largely language-agnostic. It is used by Llama 2, Mistral, and many other models; Llama 3 switched to a tiktoken-style BPE tokenizer with a 128K vocabulary.
Tiktoken (used by OpenAI) is a fast BPE implementation with optimizations for efficiency.
Vocabulary Size Trade-offs
┌─────────────────────────────────────────────────────────────────────────┐
│ VOCABULARY SIZE DECISIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Smaller Vocabulary (e.g., 32K tokens): │
│ ├── ✓ Smaller embedding matrix (less memory) │
│ ├── ✓ Each token seen more often (better learning) │
│ ├── ✗ More tokens per word (longer sequences) │
│ └── ✗ Worse at rare words (more fragmentation) │
│ │
│ Larger Vocabulary (e.g., 128K tokens): │
│ ├── ✓ Fewer tokens per word (shorter sequences) │
│ ├── ✓ Better coverage of words and phrases │
│ ├── ✗ Larger embedding matrix (more memory) │
│ └── ✗ Rare tokens undertrained │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TYPICAL VOCABULARY SIZES: │
│ │
│ GPT-2: 50,257 tokens │
│ Llama 2: 32,000 tokens │
│ Llama 3: 128,000 tokens │
│ GPT-4: 100,000+ tokens (estimated) │
│ Mistral: 32,000 tokens │
│ │
│ Trend: Larger vocabularies as models scale (memory less constrained) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Special Tokens
Models use special tokens for control and structure:
- <|begin_of_text|>: Start of sequence
- <|end_of_text|>: End of sequence (important for knowing when to stop)
- <|pad|>: Padding for batching variable-length sequences
- <|user|>, <|assistant|>: Role markers (added during SFT, not pre-training)
Getting special tokens right is crucial—mistakes here cause mysterious failures.
Evaluation During Pre-training
Pre-training runs for months. How do you know if things are going well?
Continuous Monitoring
┌─────────────────────────────────────────────────────────────────────────┐
│ MONITORING METRICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LOSS METRICS (every step): │
│ ├── Training loss (main signal) │
│ ├── Validation loss (generalization check) │
│ ├── Per-domain loss (track different data sources) │
│ └── Loss variance (stability indicator) │
│ │
│ GRADIENT METRICS (every step): │
│ ├── Gradient norm (explosion/vanishing detection) │
│ ├── Per-layer gradient norms (identify problematic layers) │
│ └── Gradient noise scale (batch size adequacy) │
│ │
│ THROUGHPUT METRICS (every step): │
│ ├── Tokens per second │
│ ├── GPU utilization │
│ ├── Memory usage │
│ └── Communication overhead │
│ │
│ CAPABILITY METRICS (periodic): │
│ ├── Benchmark evaluations (every few thousand steps) │
│ ├── Perplexity on held-out sets │
│ ├── Few-shot performance on key tasks │
│ └── Generation quality samples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Early Warning Signs
Loss spikes: Sudden increases in loss. Often caused by:
- Bad data batches (quality filter failures)
- Numeric instability (need gradient clipping)
- Learning rate too high
- Hardware issues (bit flips, memory errors)
Response: Check recent data, reduce LR, increase gradient clipping, rollback if needed.
Gradient norm spikes: Large gradient norms without loss spikes. Often precedes instability. Consider preemptive LR reduction.
Validation/training gap growing: Model is overfitting. Could indicate:
- Need more data
- Need more regularization
- Training too long
Benchmark Evaluation During Training
Running full evaluations is expensive, but periodic checks are valuable:
┌─────────────────────────────────────────────────────────────────────────┐
│ EVALUATION STRATEGY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FREQUENCY vs THOROUGHNESS TRADE-OFF: │
│ │
│ Every 1K steps: Quick proxy metrics (5-10 minutes) │
│ ├── Perplexity on validation set │
│ ├── 5-shot accuracy on 2-3 key benchmarks │
│ └── Generation samples (qualitative check) │
│ │
│ Every 10K steps: Moderate evaluation (1-2 hours) │
│ ├── Full perplexity evaluation │
│ ├── Common benchmarks (MMLU, HellaSwag, ARC, etc.) │
│ └── Longer generation samples │
│ │
│ Every 50K steps: Comprehensive evaluation (4-8 hours) │
│ ├── Full benchmark suite │
│ ├── Coding benchmarks (HumanEval) │
│ ├── Math benchmarks (GSM8K) │
│ └── Human evaluation samples │
│ │
│ Note: Evaluations use separate GPU allocation to not slow training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Data Quality vs Quantity Debate
A fundamental question in pre-training: Is it better to have more data or better data?
The Phi Model Insight
Microsoft's Phi models demonstrated that data quality can substitute for scale. Phi-1.5 (1.3B parameters) achieved GPT-3.5-level performance on some benchmarks by:
- Using heavily filtered, high-quality web data
- Creating synthetic "textbook-quality" data
- Focusing on educational, well-structured content
This suggests the scaling laws may overstate data quantity needs if quality is high enough.
Quality Indicators
What makes data "high quality" for pre-training?
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY DIMENSIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LINGUISTIC QUALITY: │
│ ├── Grammatical correctness │
│ ├── Coherent structure │
│ ├── Rich vocabulary │
│ └── Clear expression │
│ │
│ INFORMATIONAL QUALITY: │
│ ├── Factual accuracy │
│ ├── Depth of explanation │
│ ├── Logical reasoning present │
│ └── Novel information (not repetitive) │
│ │
│ STRUCTURAL QUALITY: │
│ ├── Well-organized content │
│ ├── Clear paragraph structure │
│ ├── Appropriate use of formatting │
│ └── Good signal-to-noise ratio │
│ │
│ DIVERSITY QUALITY: │
│ ├── Covers many topics │
│ ├── Multiple perspectives represented │
│ ├── Different writing styles │
│ └── Range of complexity levels │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Diminishing Returns of Data Scale
Research increasingly shows diminishing returns from raw data scale:
- First 1T tokens: Massive capability gains
- 1T-5T tokens: Significant but smaller gains
- 5T-15T tokens: Incremental improvements
- Beyond 15T: Unclear if more helps much
This is why data quality and diversity become more important than raw scale at frontier levels. Everyone has access to similar web crawls; differentiation comes from curation.
Context Length: Training for Long Documents
Modern models need to handle long contexts (documents, codebases, conversations). This creates unique training challenges.
The Challenge of Long Contexts
┌─────────────────────────────────────────────────────────────────────────┐
│ LONG CONTEXT CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MEMORY SCALING: │
│ Standard attention: O(n²) memory in sequence length │
│ │
│ Sequence Length │ Attention Memory (per layer, per batch) │
│ ───────────────────┼──────────────────────────────────────── │
│ 2K tokens │ 16 MB │
│ 8K tokens │ 256 MB │
│ 32K tokens │ 4 GB │
│ 128K tokens │ 64 GB │
│ 1M tokens │ 4 TB (!) │
│ │
│ This is per layer, per sample. 32 layers × 8 batch = impossible │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING CHALLENGES: │
│ │
│ 1. Memory: Can't fit long sequences without special techniques │
│ 2. Data: Need documents actually that long (most aren't) │
│ 3. Learning: Model must learn to use distant context │
│ 4. Cost: Long sequences = expensive training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Techniques for Long Context Training
Staged context extension: Train at shorter context first, then extend:
- Pre-train at 4K context
- Continue training at 32K with adjusted RoPE scaling
- Final phase at 128K with ring attention
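One widely used version of the "adjusted RoPE scaling" step above is linear position interpolation: positions beyond the original training length are compressed back into the trained range before the rotary angles are computed, and a short continued-training phase at the longer length then adapts the model. A sketch, not any particular model's exact recipe:

```python
import torch

def interpolated_positions(seq_len: int, trained_len: int) -> torch.Tensor:
    """Linear position interpolation: squeeze a longer sequence's positions
    back into the position range the model was pre-trained on."""
    scale = max(1.0, seq_len / trained_len)
    return torch.arange(seq_len, dtype=torch.float32) / scale

# These scaled positions replace 0..seq_len-1 when computing the RoPE angles.
```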
Memory-efficient attention:
- Flash Attention: Fused kernels, never materializes full attention matrix
- Ring Attention: Distribute attention computation across devices
- Sliding window: Only attend to nearby tokens (Mistral)
Data for long context:
- Concatenate related documents
- Use naturally long documents (books, papers, code repos)
- Synthetic long-range dependency tasks
Reproducibility and Open Science
The Reproducibility Challenge
Pre-training is expensive and difficult to reproduce:
┌─────────────────────────────────────────────────────────────────────────┐
│ REPRODUCIBILITY CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WHAT'S TYPICALLY DISCLOSED: │
│ ├── Model architecture │
│ ├── Parameter count │
│ ├── Token count (approximate) │
│ └── General data sources │
│ │
│ WHAT'S TYPICALLY NOT DISCLOSED: │
│ ├── Exact data mixture proportions │
│ ├── Filtering pipeline details │
│ ├── Hyperparameters (learning rate schedules, etc.) │
│ ├── Training instabilities and how they were resolved │
│ ├── Checkpoint selection criteria │
│ └── Compute infrastructure details │
│ │
│ RESULT: │
│ Even with model weights released, training process can't be │
│ reproduced. Published scaling laws may not transfer to your setup. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Open Efforts
Several initiatives aim to improve pre-training reproducibility:
Open datasets:
- RedPajama: Attempt to replicate Llama training data
- The Stack: Permissively licensed code
- Dolma: Open pre-training dataset with documentation
Open training:
- BLOOM: Fully documented training process
- OLMo: Open language model with training code and data
- Pythia: Suite of models with training data and checkpoints released
Related Articles
SFT and RLHF: The Complete Guide to Post-Training LLMs
A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.
SFT Deep Dive: Instruction Tuning Techniques and Best Practices
A comprehensive guide to Supervised Fine-Tuning (SFT) for LLMs—covering full fine-tuning vs LoRA vs QLoRA vs DoRA, data curation strategies, instruction formats, multi-task learning, and avoiding catastrophic forgetting.
RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
Knowledge Distillation for LLMs: Compressing Intelligence
A comprehensive guide to knowledge distillation—transferring capabilities from large teacher models to smaller, faster student models. From theory to implementation, including chain-of-thought distillation and synthetic data generation.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.