LLM Pre-training: Building Foundation Models from Scratch
A comprehensive guide to pre-training large language models—from data curation and architecture decisions to scaling laws and distributed training infrastructure. Understanding how GPT, Llama, and other foundation models are built.
What is Pre-training?
Pre-training is the foundational phase of building a large language model. It's where a model learns language itself—grammar, facts, reasoning patterns, and the statistical structure of human text—by processing massive amounts of data.
Think of pre-training as teaching a child to read and understand language by exposing them to millions of books, articles, conversations, and documents. The child doesn't memorize specific facts (though some stick); they develop an intuition for how language works, what words mean, how ideas connect, and how to reason about the world.
Pre-training is distinct from later training phases:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE LLM TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PRE-TRAINING (this post) │
│ ───────────────────────── │
│ • Train on trillions of tokens from the internet │
│ • Self-supervised learning (predict next token) │
│ • Result: Base model that can complete text │
│ • Cost: $1M - $100M+ compute │
│ • Time: Weeks to months │
│ │
│ ↓ │
│ │
│ SUPERVISED FINE-TUNING (SFT) │
│ ─────────────────────────── │
│ • Train on instruction-response pairs │
│ • Teaches model to follow instructions │
│ • Result: Model that responds helpfully │
│ • Cost: $1K - $100K compute │
│ • Time: Hours to days │
│ │
│ ↓ │
│ │
│ REINFORCEMENT LEARNING (RLHF/DPO) │
│ ───────────────────────────────── │
│ • Train on human preferences │
│ • Aligns model with human values │
│ • Result: Model that's helpful, harmless, honest │
│ • Cost: $10K - $1M compute │
│ • Time: Days to weeks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pre-training is by far the most expensive and foundational phase. Everything that follows—SFT, RLHF, fine-tuning for specific tasks—builds on the capabilities established during pre-training. A model can only be as good as its pre-training allows.
The Pre-training Objective: Learning to Predict
Next Token Prediction (Autoregressive LMs)
The dominant pre-training objective for modern LLMs is deceptively simple: predict the next token.
Given a sequence of tokens, predict what comes next:
Input: "The capital of France is"
Target: "Paris"
Input: "def fibonacci(n):\n if n <= 1:\n return"
Target: "n"
This objective is called autoregressive language modeling or causal language modeling. Models like GPT-4, Claude, Llama, and Mistral all use this approach.
Why Next Token Prediction Works So Well
At first glance, predicting the next word seems too simple to produce intelligent behavior. But consider what the model must learn to predict well:
To predict the next word in a sentence about physics:
- The model must understand physics concepts
- It must know how these concepts relate
- It must follow logical reasoning chains
To predict the next token in code:
- The model must understand syntax
- It must track variable types and scopes
- It must follow programming logic
To predict the next word in a dialogue:
- The model must understand context and intent
- It must model different perspectives
- It must follow conversational norms
The next token prediction objective is a "universal task" that requires mastering language at every level—from grammar and spelling to reasoning and world knowledge. The model isn't explicitly taught any of these skills; they emerge from the pressure to predict accurately.
The Mathematical Formulation
The pre-training objective minimizes cross-entropy loss over the training corpus:
┌─────────────────────────────────────────────────────────────────────────┐
│ NEXT TOKEN PREDICTION LOSS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For a sequence of tokens x₁, x₂, ..., xₙ: │
│ │
│ Loss = -∑ log P(xᵢ | x₁, x₂, ..., xᵢ₋₁) │
│ │
│ In words: For each position, how surprised is the model │
│ by the actual token given everything that came before? │
│ │
│ Lower loss = Better predictions = Better understanding │
│ │
│ Example: │
│ "The cat sat on the [mat]" │
│ │
│ If model predicts: │
│ • P("mat") = 0.3 → Loss contribution = -log(0.3) = 1.2 │
│ • P("mat") = 0.01 → Loss contribution = -log(0.01) = 4.6 │
│ • P("mat") = 0.9 → Loss contribution = -log(0.9) = 0.1 │
│ │
│ The model is trained to maximize probability of correct tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Perplexity: The Standard Metric
Perplexity is the standard metric for evaluating pre-trained language models. It's the exponential of the average loss:
Perplexity = exp(Loss / N)
Intuitively, perplexity represents how "confused" the model is. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options.
Typical perplexity ranges:
- Random guessing (50K vocab): ~50,000
- Bad language model: ~100-500
- Good language model: ~10-30
- State-of-the-art (on common benchmarks): ~5-15
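As a concrete illustration, here is a minimal PyTorch sketch of how the next-token cross-entropy loss and perplexity defined above are computed from a model's logits; the tensor names and shapes are illustrative, not any specific framework's API.

```python
import torch
import torch.nn.functional as F

def next_token_loss_and_perplexity(logits, token_ids):
    """logits: (batch, seq_len, vocab_size) from the model.
    token_ids: (batch, seq_len) the input tokens themselves."""
    # Position i predicts token i+1: drop the last logit and the first target.
    shift_logits = logits[:, :-1, :]
    shift_targets = token_ids[:, 1:]
    # Cross-entropy = -log P(correct next token), averaged over all positions.
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
    perplexity = torch.exp(loss)  # exponential of the average loss
    return loss, perplexity
```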
Masked Language Modeling (BERT-style)
An alternative pre-training objective is masked language modeling (MLM), used by BERT and its variants:
Input: "The [MASK] of France is Paris"
Target: "capital"
Instead of predicting the next token, the model predicts randomly masked tokens. This creates a bidirectional model—it can look both forward and backward when making predictions.
Comparison:
| Aspect | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Direction | Left-to-right only | Bidirectional |
| Use case | Text generation | Understanding/classification |
| Generation | Natural (token by token) | Unnatural (must iterate) |
| Context | Only past tokens | Full context |
| Modern preference | ✅ Dominant for LLMs | Used for embeddings |
Modern LLMs almost universally use autoregressive pre-training because generation is natural and the same model works for both understanding and generation.
Data: The Fuel for Pre-training
Pre-training data quality and quantity are arguably more important than architecture or training techniques. A model trained on high-quality data will outperform a larger model trained on low-quality data.
Scale: How Much Data?
Modern LLMs are trained on staggering amounts of text:
┌─────────────────────────────────────────────────────────────────────────┐
│ PRE-TRAINING DATA SCALE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Training Tokens Approximate Size │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 (2019) 40 billion ~40 GB text │
│ GPT-3 (2020) 300 billion ~570 GB text │
│ Chinchilla (2022) 1.4 trillion ~1.4 TB text │
│ Llama 2 (2023) 2 trillion ~2 TB text │
│ Llama 3 (2024) 15 trillion ~15 TB text │
│ GPT-4 (estimated) ~10-13 trillion ~10+ TB text │
│ │
│ For reference: │
│ • Wikipedia: ~4 billion tokens (~20 GB) │
│ • All books ever written: ~500 billion tokens (estimate) │
│ • Common Crawl (filtered): ~1-3 trillion tokens │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The trend is clear: more data leads to better models, but with diminishing returns. The question becomes: where does all this data come from?
Data Sources
1. Web Crawls (Common Crawl)
The foundation of most pre-training datasets is Common Crawl, a non-profit that has been crawling and archiving the web since 2008.
- Contains petabytes of raw web data
- New crawls released monthly
- Covers billions of web pages
But raw web data is mostly garbage. Common Crawl contains:
- Spam and SEO content
- Duplicated pages
- Boilerplate (navigation, ads, footers)
- Low-quality machine-generated text
- Malicious content
- Personally identifiable information
The art of using web data is in filtering. Models like Llama use only a small fraction of Common Crawl after aggressive filtering.
2. Books
Books provide high-quality, long-form, well-edited text:
- Books1/Books2: Datasets of digitized books (used by GPT-3)
- Project Gutenberg: Public domain books
- Internet Archive: Digital library
Books are valuable because they contain:
- Sustained reasoning and arguments
- Diverse writing styles
- Edited, high-quality prose
- Long-range dependencies
3. Code
Code has become increasingly important for LLM capabilities:
- GitHub: Public repositories
- Stack Overflow: Q&A with code
- The Stack: Curated code dataset
Code training improves:
- Reasoning ability (code requires logical thinking)
- Structured output (JSON, XML, etc.)
- Instruction following (code is precise)
- General capability (surprisingly broad transfer)
4. Scientific Papers
Academic content provides high-quality technical knowledge:
- ArXiv: Pre-prints across sciences
- PubMed: Medical literature
- Semantic Scholar: Academic papers
5. Curated/Synthetic Data
Increasingly, pre-training includes curated or synthetic data:
- Wikipedia: High-quality encyclopedic content (often upweighted)
- Textbooks: Educational content (Phi models heavily use this)
- Synthetic data: Generated by other LLMs for specific capabilities
Data Curation Pipeline
Raw data must be processed through an extensive pipeline before training:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA CURATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RAW WEB CRAWL │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ URL FILTERING │ Remove known bad domains, adult content, │
│ │ │ spam domains, etc. │
│ └────────┬────────┘ │
│ │ ~50% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ TEXT EXTRACTION │ Extract text from HTML, remove boilerplate, │
│ │ │ navigation, ads, scripts │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ LANGUAGE ID │ Keep only target languages │
│ │ │ (usually English + selected others) │
│ └────────┬────────┘ │
│ │ ~30% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ QUALITY FILTER │ Remove low-quality text using classifiers │
│ │ │ (trained on Wikipedia vs web text) │
│ └────────┬────────┘ │
│ │ ~60-80% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ DEDUPLICATION │ Remove duplicate documents (exact + fuzzy) │
│ │ │ Critical for training stability │
│ └────────┬────────┘ │
│ │ ~30-50% removed │
│ ▼ │
│ ┌─────────────────┐ │
│ │ PII REMOVAL │ Remove emails, phone numbers, addresses, │
│ │ │ social security numbers, etc. │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ TOXICITY FILTER │ Remove hate speech, extreme content │
│ │ │ (classifiers or keyword lists) │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ CLEAN TRAINING DATA │
│ (typically 1-5% of raw crawl) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Quality Filtering in Detail
Quality filtering is perhaps the most impactful step. The goal is to keep text that looks like "good" writing and remove text that looks like spam, machine-generated content, or low-effort writing.
Common approaches:
Classifier-based: Train a classifier to distinguish Wikipedia/books (positive) from random web text (negative). Apply to all web data, keep high-scoring documents.
Heuristic-based: Apply rules like:
- Minimum/maximum document length
- Ratio of alphabetic characters to total
- Presence of stop words (real text has "the", "and", "is")
- Average word length (spam often has unusual patterns)
- Repetition detection (spam repeats phrases)
- Symbol ratio (too many special characters = bad)
Perplexity-based: Use a small pre-trained model to score text. Very high perplexity = unusual/bad text.
Human evaluation: Sample and manually rate documents to calibrate filters.
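To make the heuristic rules above concrete, here is a minimal sketch of a document-level filter. The thresholds are illustrative assumptions, not values from any published pipeline.

```python
def passes_heuristic_filter(doc: str) -> bool:
    """Illustrative quality heuristics; every threshold here is made up for the example."""
    words = doc.split()
    if not (50 <= len(words) <= 100_000):            # min/max document length
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.7:                            # too many symbols/digits
        return False
    stop_words = {"the", "and", "is", "of", "to", "in"}
    if sum(w.lower() in stop_words for w in words) / len(words) < 0.02:
        return False                                 # real prose contains stop words
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 10):                # spam has unusual word lengths
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.6: # crude repetition check
        return False
    return True
```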
Deduplication: More Important Than You'd Think
Duplicate and near-duplicate documents cause serious problems:
- Wasted compute: Training on the same content twice doesn't help
- Memorization: Exact duplicates encourage memorization over generalization
- Evaluation contamination: If test data appears in training, benchmarks are invalid
- Privacy risks: Duplicated private information is more likely to be memorized
Deduplication methods:
Exact deduplication: Hash documents, remove duplicates. Fast but misses near-duplicates.
MinHash/LSH: Create document fingerprints, find similar documents efficiently. Can catch documents that differ by a few words.
Suffix array: Find repeated substrings across the corpus. Can remove duplicated paragraphs even if documents differ overall.
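Here is a toy sketch of MinHash-based near-duplicate detection. Production pipelines use optimized libraries and LSH banding instead of pairwise comparison; the shingle size, number of hashes, and similarity cutoff below are assumptions.

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 64, shingle_size: int = 5) -> list[int]:
    """Toy MinHash: hash word 5-grams, keep the minimum hash per seed."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(len(words) - shingle_size + 1, 1))}
    signature = []
    for seed in range(num_hashes):
        min_h = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        signature.append(min_h)
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots approximates Jaccard similarity of the documents."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents with estimated_jaccard above ~0.8 would be treated as near-duplicates;
# real pipelines bucket signatures with locality-sensitive hashing rather than
# comparing every pair.
```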
Data Mixing: The Art of Proportions
Pre-training data comes from multiple sources. The mixture proportions significantly impact model capabilities:
┌─────────────────────────────────────────────────────────────────────────┐
│ EXAMPLE DATA MIXTURES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LLAMA 2 (reported): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Web data: ~89% │ │
│ │ Code: ~4% │ │
│ │ Wikipedia: ~4% │ │
│ │ Books: ~3% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ GPT-3 (reported): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Common Crawl: 60% │ │
│ │ WebText2: 22% │ │
│ │ Books: 8% │ │
│ │ Wikipedia: 3% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ PHI (Microsoft): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Textbooks: Very high (exact % unknown) │ │
│ │ Synthetic: Very high │ │
│ │ Code: Significant │ │
│ │ Web data: Lower than typical │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Key insight: More high-quality data is often better than more │
│ low-quality data, even if total tokens is lower. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Upweighting: High-quality sources like Wikipedia are often shown to the model multiple times (2-10x). This is called "upweighting" or "oversampling."
Curriculum: Some approaches vary the mixture during training—starting with easier/cleaner data and adding harder/noisier data later.
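In practice, the mixture is usually implemented as weighted sampling over sources when assembling training batches. A minimal sketch, with made-up proportions rather than any model's actual recipe:

```python
import random

# Illustrative mixture weights (not taken from any particular model)
MIXTURE = {
    "web":       0.82,
    "code":      0.08,
    "wikipedia": 0.05,   # small source, so this effectively upweights it several times
    "books":     0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training document."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly matches the target proportions
```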
Architecture: The Transformer and Its Variants
The Transformer Foundation
All modern LLMs are based on the Transformer architecture, introduced in "Attention Is All You Need" (2017). The key components:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRANSFORMER DECODER BLOCK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Input (sequence of tokens) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN EMBEDDING │ │
│ │ Convert discrete tokens to continuous vectors │ │
│ │ vocab_size × hidden_dim matrix lookup │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ POSITIONAL ENCODING │ │
│ │ Add position information (since attention is permutation- │ │
│ │ invariant without it) │ │
│ └──────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ╔═══════════════════════════════════════════╗ │
│ ║ TRANSFORMER BLOCK (×N) ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ MASKED SELF-ATTENTION │ ║ │
│ ║ │ Each position attends to previous │ ║ │
│ ║ │ positions only (causal mask) │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ ADD & LAYER NORM │ ║ │
│ ║ │ Residual connection + normalize │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ FEED-FORWARD NETWORK │ ║ │
│ ║ │ Two linear layers with activation │ ║ │
│ ║ │ (typically 4× hidden_dim) │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────┐ ║ │
│ ║ │ ADD & LAYER NORM │ ║ │
│ ║ └──────────────────┬──────────────────┘ ║ │
│ ╚══════════════════════╪════════════════════╝ │
│ │ │
│ ▼ (repeat N times) │
│ │ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT PROJECTION │ │
│ │ Project back to vocabulary size for next token prediction │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key Architecture Decisions
1. Model Dimensions
The "size" of a model is determined by several hyperparameters:
| Parameter | Description | Typical Values |
|---|---|---|
| hidden_dim | Width of representations | 768 - 8192 |
| num_layers | Depth of transformer stack | 12 - 96 |
| num_heads | Parallel attention heads | 12 - 64 |
| vocab_size | Number of tokens | 32K - 128K |
| context_length | Maximum sequence length | 2K - 128K |
Parameter count formula:
params ≈ 12 × num_layers × hidden_dim²
This is approximate; the exact count depends on vocabulary size and other factors.
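A quick sketch of this estimate in Python, using a Llama-2-7B-like shape as the example; the formula ignores norms, biases, and the exact FFN variant.

```python
def approx_params(num_layers: int, hidden_dim: int, vocab_size: int) -> int:
    """Rough decoder-only parameter count."""
    per_layer = 12 * hidden_dim ** 2          # ~4·d² for attention + ~8·d² for the FFN
    embeddings = vocab_size * hidden_dim      # token embedding (assuming a tied output head)
    return num_layers * per_layer + embeddings

# Example: a Llama-2-7B-like shape
print(approx_params(num_layers=32, hidden_dim=4096, vocab_size=32_000))
# ≈ 6.6B, close to the nominal 7B
```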
2. Attention Variants
Standard self-attention is O(n²) in sequence length, which becomes prohibitive for long contexts. Many variants address this:
Multi-Head Attention (standard):
- Multiple parallel attention "heads"
- Each head can focus on different relationship types
- Outputs concatenated and projected
Grouped Query Attention (GQA):
- Used in Llama 2/3, Mistral
- Multiple query heads share fewer key-value heads
- Reduces memory usage during inference
- Similar quality to full multi-head
Multi-Query Attention (MQA):
- Extreme version: all queries share one KV head
- Maximum memory efficiency
- Slight quality trade-off
Sliding Window Attention:
- Used in Mistral, Mixtral
- Each position only attends to nearby positions
- Enables very long sequences efficiently
Ring Attention / Sequence Parallelism:
- Distribute attention across devices
- Enables training on very long sequences
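As an illustration of GQA specifically, here is a minimal PyTorch sketch. The projections, RoPE, and KV caching are omitted, and the head counts in the example are just one common configuration; the essential idea is that several query heads share each key-value head.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads: int):
    """q: (batch, num_q_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim) with num_kv_heads < num_q_heads."""
    num_q_heads = q.shape[1]
    group_size = num_q_heads // num_kv_heads
    # Each KV head serves `group_size` query heads: expand KV to match.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example shapes: 32 query heads sharing 8 KV heads
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = grouped_query_attention(q, k, v, num_kv_heads=8)  # (1, 32, 128, 64)
```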
3. Positional Encoding
Transformers need position information added explicitly. Evolution of approaches:
Absolute Positional Embeddings (original):
- Learned embedding for each position
- Limited to trained context length
- Can't extrapolate to longer sequences
Rotary Position Embedding (RoPE):
- Used in Llama, Mistral, most modern LLMs
- Encodes relative positions through rotation
- Better length generalization
- Can be extended with interpolation
ALiBi (Attention with Linear Biases):
- Used in BLOOM, MPT
- Adds linear bias based on position distance
- No learned parameters for position
- Good length generalization
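A compact sketch of RoPE in the "rotate half the dimensions" form used by Llama-style implementations; production code precomputes and caches the cos/sin tables instead of recomputing them per call.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, heads, seq, head_dim). Pairs dimension i with i + head_dim/2
    and rotates each pair by a position- and frequency-dependent angle."""
    _, _, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)       # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs    # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys before attention; because the rotation depends only
# on position, the dot product q·k ends up depending on relative distance.
```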
4. Normalization
Where and how to normalize activations:
Post-Norm (original Transformer):
- Normalize after attention/FFN
- Can be unstable for deep networks
Pre-Norm (GPT-2 onwards):
- Normalize before attention/FFN
- More stable training
- Now standard for LLMs
RMSNorm (Llama):
- Simplified LayerNorm (removes mean centering)
- Slightly faster, similar quality
- Increasingly standard
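RMSNorm is simple enough to show in full; a minimal PyTorch version:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: scale by the RMS of the vector."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```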
5. Feed-Forward Network Variants
Standard FFN: two linear layers with a nonlinearity in between, FFN(x) = W₂ · GELU(W₁x), with the hidden width typically 4× hidden_dim.
SwiGLU (Llama, most modern LLMs): a gated variant, FFN(x) = W₃ · (SiLU(W₁x) ⊙ W₂x).
SwiGLU adds a gating mechanism that empirically improves quality; because it uses three weight matrices instead of two, the hidden width is usually shrunk (e.g. to roughly 2/3 of 4× hidden_dim) to keep the parameter count comparable.
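A minimal PyTorch sketch of the SwiGLU feed-forward block described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down( SiLU(gate(x)) * up(x) )."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up   = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```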
Model Scale Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ MODEL ARCHITECTURE COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Layers Hidden Heads Context Year │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 1.5B 48 1600 25 1K 2019 │
│ GPT-3 175B 96 12288 96 2K 2020 │
│ Llama 2 7B 7B 32 4096 32 4K 2023 │
│ Llama 2 70B 70B 80 8192 64 4K 2023 │
│ Llama 3 8B 8B 32 4096 32 8K 2024 │
│ Llama 3 70B 70B 80 8192 64 8K 2024 │
│ Llama 3 405B 405B 126 16384 128 128K 2024 │
│ Mistral 7B 7B 32 4096 32 32K 2023 │
│ Mixtral 8×7B 47B* 32 4096 32 32K 2024 │
│ Qwen2.5 72B 72B 80 8192 64 128K 2024 │
│ │
│ *Mixtral uses Mixture-of-Experts; 47B total but 13B active per token │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mixture of Experts (MoE)
A major architectural trend is Mixture of Experts, where the model has multiple "expert" feed-forward networks and a router selects which experts to use for each token:
┌─────────────────────────────────────────────────────────────────────────┐
│ MIXTURE OF EXPERTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Standard Transformer: │
│ ┌────────────┐ │
│ │ FFN │ Same FFN for every token │
│ └────────────┘ │
│ │
│ Mixture of Experts: │
│ ┌─────────────┐ │
│ │ Router │ Decides which experts to use │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Expert 1 │ │Expert 2 │ ... │Expert N │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Typically: 8 experts, top-2 selected per token │
│ │
│ Benefits: │
│ • More parameters without more compute (sparse activation) │
│ • Experts can specialize (one for code, one for math, etc.) │
│ • Better scaling properties │
│ │
│ Examples: │
│ • Mixtral 8×7B: 47B total params, 13B active │
│ • Grok-1: 314B total, mixture of experts │
│ • GPT-4 (rumored): MoE architecture │
│ │
└─────────────────────────────────────────────────────────────────────────┘
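A bare-bones sketch of a top-2 MoE layer. Real implementations add a load-balancing auxiliary loss and per-expert capacity limits, and dispatch tokens far more efficiently; both are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sketch of a top-2 mixture-of-experts FFN layer (no balancing loss, no capacity limits)."""
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Router scores every expert, keeps the best two per token.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, indices = scores.topk(2, dim=-1)        # (tokens, 2)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```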
Scaling Laws: The Science of Bigger Models
The Scaling Laws Discovery
One of the most important discoveries in modern AI is that LLM performance follows predictable scaling laws. Given more compute, data, or parameters, we can predict how much better the model will be.
The foundational paper, "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), showed that loss decreases as a power law with scale:
L(N) = (Nc / N)^αN   (model size)
L(D) = (Dc / D)^αD   (data size)
L(C) = (Cc / C)^αC   (compute)
Where:
- L = loss (lower is better)
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
- α = fitted exponents, roughly 0.05-0.1 (Nc, Dc, Cc are fitted constants)
Key insight: Performance improves smoothly and predictably with scale. There's no magic threshold; just keep scaling.
Chinchilla Scaling Laws
DeepMind's "Chinchilla" paper (2022) refined the scaling laws and revealed that most models were undertrained:
┌─────────────────────────────────────────────────────────────────────────┐
│ CHINCHILLA OPTIMAL SCALING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Previous wisdom (GPT-3 era): │
│ "Make models as big as possible, train on fixed data" │
│ │
│ Chinchilla discovery: │
│ "Optimal allocation: scale data and parameters equally" │
│ │
│ OPTIMAL RATIO: ~20 tokens per parameter │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Model Size │ Optimal Training Tokens │ Actual (GPT-3 era) │ │
│ ├────────────────┼─────────────────────────┼─────────────────────│ │
│ │ 1B params │ 20B tokens │ ~300B (overtrained)│ │
│ │ 10B params │ 200B tokens │ ~300B │ │
│ │ 70B params │ 1.4T tokens │ ~300B (undertrained)│ │
│ │ 175B params │ 3.5T tokens │ ~300B (very under) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Implication: GPT-3 (175B, 300B tokens) was massively undertrained. │
│ Chinchilla (70B, 1.4T tokens) outperformed the much larger Gopher (280B) │
│ trained with the same compute budget, and also outperformed GPT-3.       │
│ │
└─────────────────────────────────────────────────────────────────────────┘
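The Chinchilla rule of thumb is easy to turn into arithmetic. A small helper, assuming the standard C ≈ 6·N·D FLOPs approximation and the ~20 tokens-per-parameter ratio:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Given a budget C ≈ 6·N·D and the ~20 tokens/param rule (D = 20N),
    solve for the compute-optimal parameter and token counts."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: roughly the compute used for Chinchilla itself (~5.9e23 FLOPs)
n, d = chinchilla_optimal(5.9e23)
print(f"{n/1e9:.0f}B params, {d/1e12:.1f}T tokens")  # ≈ 70B params, 1.4T tokens
```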
Beyond Chinchilla: Modern Scaling
Post-Chinchilla, the field has moved toward even longer training:
Llama approach: Train smaller models for much longer than Chinchilla-optimal:
- Llama 2 7B: Trained on 2T tokens (~285 tokens per parameter, ~14× the Chinchilla-optimal ratio)
- Llama 3 8B: Trained on 15T tokens (~1,875 tokens per parameter, ~94× Chinchilla-optimal)
Why overtrain? Inference cost matters. A smaller model trained longer:
- Costs less to run in production
- Is faster for users
- Uses less memory
- Can run on smaller GPUs/edge devices
The trade-off: You spend more compute during training (once) to save compute during inference (millions of times).
Emergent Abilities
An important phenomenon in scaling: emergent abilities appear suddenly at certain scales.
┌─────────────────────────────────────────────────────────────────────────┐
│ EMERGENT ABILITIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Performance │
│ ▲ │
│ │ ┌────── Emergent ability appears │
│ │ │ (sudden jump) │
│ │ ▼ │
│ │ ╭───────────── │
│ │ ╭────╯ │
│ │ ╭────╯ │
│ │ ╭────╯ │
│ │╭────╯ │
│ └──────────────────────────────────────▶ Scale │
│ Small Medium Large Very Large │
│ │
│ Examples of emergent abilities: │
│ • Chain-of-thought reasoning (~100B params) │
│ • Multi-step arithmetic (~10B params) │
│ • Word unscrambling (~10B params) │
│ • Instruction following (~1B params) │
│ │
│ Note: Recent research suggests emergence may be a measurement │
│ artifact—abilities may improve smoothly but our metrics have │
│ thresholds that create apparent discontinuities. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Training Infrastructure: Making It Work at Scale
Pre-training a large language model requires massive distributed computing infrastructure. The challenges are immense:
The Scale of the Problem
Consider training a 70B parameter model on 2T tokens:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING COMPUTE REQUIREMENTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model: 70B parameters, 2T tokens │
│ │
│ FLOPs required: ~6 × N × D = 6 × 70B × 2T = 8.4 × 10²³ FLOPs │
│ │
│ Single A100 (80GB): │
│ • Peak: 312 TFLOPS (bfloat16) │
│ • Realistic: ~150 TFLOPS (50% utilization) │
│ • Time: 8.4×10²³ / 150×10¹² = 5.6×10⁹ seconds = 177 years │
│ │
│ 1024 A100s (typical large cluster): │
│ • Time: 177 years / 1024 ≈ 63 days │
│ • Cost: ~$5-10M (cloud pricing) │
│ │
│ Memory requirement: │
│ • Model parameters: 70B × 2 bytes = 140 GB │
│ • Optimizer states: 70B × 12 bytes = 840 GB (Adam) │
│ • Gradients: 70B × 2 bytes = 140 GB │
│ • Activations: Variable, but large │
│ • Total: >1 TB just for model state │
│ │
│ This is why distributed training is essential. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
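The arithmetic in the box above can be packaged as a back-of-the-envelope estimator; the GPU count and effective throughput are assumptions you would adjust for your own hardware.

```python
def estimate_training(params: float, tokens: float,
                      num_gpus: int = 1024,
                      tflops_per_gpu: float = 150.0) -> None:
    """Rough training cost using FLOPs ≈ 6·N·D and Adam model-state memory."""
    flops = 6 * params * tokens
    seconds = flops / (num_gpus * tflops_per_gpu * 1e12)
    print(f"Total compute:   {flops:.2e} FLOPs")
    print(f"Wall-clock time: {seconds / 86400:.0f} days on {num_gpus} GPUs")
    # Model state only (no activations): BF16 weights + BF16 grads + Adam in FP32
    state_bytes = params * (2 + 2 + 12)
    print(f"Model state:     {state_bytes / 1e12:.2f} TB across the cluster")

estimate_training(params=70e9, tokens=2e12)
# ≈ 8.4e23 FLOPs, ~63 days on 1024 GPUs at 150 TFLOPS, ~1.1 TB of model state
```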
Distributed Training Strategies
Data Parallelism
The simplest form: replicate the model on each GPU, split data across GPUs.
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA PARALLELISM │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Model │ │ Model │ │ Model │ │ Model │ │
│ │ (copy) │ │ (copy) │ │ (copy) │ │ (copy) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Batch 0 Batch 1 Batch 2 Batch 3 │
│ │ │ │ │ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ Gradient AllReduce │
│ (average gradients) │
│ │
│ Limitation: Model must fit on single GPU │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Model Parallelism (Tensor Parallelism)
Split individual layers across GPUs: each weight matrix is partitioned, so every GPU holds a slice of every layer.
┌─────────────────────────────────────────────────────────────────────────┐
│ TENSOR PARALLELISM │
│ │
│ Linear layer: Y = XW │
│ │
│ Split W across GPUs: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ W₀ │ │ W₁ │ │ W₂ │ │ W₃ │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ X·W₀ X·W₁ X·W₂ X·W₃ │
│ │ │ │ │ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ AllGather [Y₀, Y₁, Y₂, Y₃] │
│ │
│ Benefit: Can train models larger than single GPU memory │
│ Cost: Communication overhead between GPUs │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pipeline Parallelism
Split the model's layers across GPUs sequentially. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.
┌─────────────────────────────────────────────────────────────────────────┐
│ PIPELINE PARALLELISM │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ Layers Layers Layers Layers │
│ 1-8 9-16 17-24 25-32 │
│ │ │ │ │ │
│ ▼ │ │ │ │
│ [===]──────────▼ │ │ │
│ │ [===]───────────▼ │ │
│ │ │ [===]───────────▼ │
│ │ │ │ [===] │
│ │ │ │ │ │
│ ◄────────────◄────────────◄────────────┘ (backward pass) │
│ │
│ Problem: "Pipeline bubble" - GPUs idle waiting for others │
│ │
│ Solution: Micro-batching (split batch into micro-batches) │
│ │
│ GPU 0: [=1=][=2=][=3=][=4=][ ][ ][=4=][=3=][=2=][=1=] │
│ GPU 1: [ ][=1=][=2=][=3=][=4=][=4=][=3=][=2=][=1=][ ] │
│ GPU 2: [ ][ ][=1=][=2=][=3=][=3=][=2=][=1=][ ][ ] │
│ GPU 3: [ ][ ][ ][=1=][=2=][=2=][=1=][ ][ ][ ] │
│ └── forward ──┘ └── backward ──┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Fully Sharded Data Parallelism (FSDP) / ZeRO
The modern standard: shard model parameters, gradients, and optimizer states across GPUs.
┌─────────────────────────────────────────────────────────────────────────┐
│ FSDP / ZeRO │
│ │
│ Key insight: Each GPU only needs all parameters during forward/back │
│ pass. Rest of the time, keep only a shard. │
│ │
│ Standard Data Parallel: │
│ GPU 0: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 1: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 2: [Full Model] [Full Optimizer] [Full Gradients] │
│ GPU 3: [Full Model] [Full Optimizer] [Full Gradients] │
│ Memory: 4× redundant │
│ │
│ FSDP / ZeRO Stage 3: │
│ GPU 0: [Model¼] [Optim¼] [Grad¼] │
│ GPU 1: [Model¼] [Optim¼] [Grad¼] │
│ GPU 2: [Model¼] [Optim¼] [Grad¼] │
│ GPU 3: [Model¼] [Optim¼] [Grad¼] │
│ Memory: ~4× reduction │
│ │
│ During forward: AllGather to reconstruct full layer, compute, discard │
│ During backward: AllGather weights, compute gradients, ReduceScatter │
│ │
│ Trade-off: More communication, but can train much larger models │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mixed Precision Training
Modern training uses multiple numeric precisions to balance speed and stability:
- FP32 (32-bit float): Full precision, 4 bytes per parameter
- FP16 (16-bit float): Half precision, 2 bytes, but limited range
- BF16 (bfloat16): Half precision with FP32's dynamic range, 2 bytes
- FP8: Emerging, even smaller, requires careful handling
┌─────────────────────────────────────────────────────────────────────────┐
│ MIXED PRECISION TRAINING │
│ │
│ Typical setup: │
│ • Master weights: FP32 (for accuracy in optimizer) │
│ • Forward pass: BF16 (speed) │
│ • Gradients: BF16 (speed) │
│ • Optimizer states: FP32 (stability) │
│ │
│ Why BF16 over FP16? │
│ • FP16 range: ±65,504 (overflow risk) │
│ • BF16 range: ±3.4×10³⁸ (same as FP32) │
│ • BF16 has less precision but rarely causes issues │
│ │
│ Memory effects:                                                         │
│ • Model state is ~16 bytes/param either way: the FP32 master weights    │
│   and Adam moments dominate (ZeRO/FSDP is what shards this down)        │
│ • The real savings are in activations and communication, which are      │
│   roughly halved by keeping them in BF16                                │
│ │
│ Speed improvement: │
│ • Modern GPUs have tensor cores optimized for FP16/BF16 │
│ • ~2× throughput for matrix operations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
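A minimal sketch of a BF16 mixed-precision training step in PyTorch, assuming the model's forward returns next-token logits. Note that BF16, unlike FP16, does not need a loss scaler.

```python
import torch
import torch.nn.functional as F

def bf16_train_step(model, input_ids, optimizer, max_grad_norm: float = 1.0) -> float:
    """One BF16 autocast step; the optimizer still keeps parameters and states in FP32."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(input_ids)                         # (batch, seq, vocab), assumed
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # position i predicts token i+1
            input_ids[:, 1:].reshape(-1),
        )
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```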
Checkpointing and Fault Tolerance
Training runs for weeks or months. Hardware fails. Checkpointing is essential:
Regular checkpoints: Save model state every N steps (e.g., every 1000 steps)
Includes:
- Model parameters
- Optimizer states
- Learning rate scheduler state
- Random number generator states
- Current step/epoch
- Data loader state (which samples have been seen)
Checkpoint size: For a 70B model, each checkpoint is ~1-2 TB
Fault tolerance strategies:
- Redundant storage (multiple copies)
- Automatic restart from latest checkpoint
- Elastic training (can add/remove GPUs)
- Preemption handling (for cloud spot instances)
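A minimal single-process sketch of what gets saved and restored. Real large-scale runs use sharded, distributed checkpoint formats rather than a single torch.save file, but the contents are essentially the same.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
        "torch_rng": torch.get_rng_state(),
        # in practice also: CUDA/NumPy/Python RNG states and the data-loader position
    }, path)

def load_checkpoint(path, model, optimizer, scheduler) -> int:
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["torch_rng"])
    return ckpt["step"]
```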
Training Dynamics: What Happens During Pre-training
The Loss Curve
Pre-training produces a characteristic loss curve:
┌─────────────────────────────────────────────────────────────────────────┐
│ TYPICAL TRAINING LOSS CURVE │
│ │
│ Loss │
│ 12 │╲ │
│ │ ╲ │
│ 10 │ ╲ │
│ │ ╲ │
│ 8 │ ╲ │
│ │ ╲ │
│ 6 │ ╲ │
│ │ ╲ │
│ 4 │ ╲____ │
│ │ ╲____ │
│ 2 │ ╲________________________________________ │
│ │ │
│ 0 └────────────────────────────────────────────────────────────── │
│ 0 100K 200K 300K 400K 500K Steps │
│ │
│ Phases: │
│ 1. Rapid initial decrease (learning basic patterns) │
│ 2. Steady improvement (learning more complex patterns) │
│ 3. Diminishing returns (approaching data/model limits) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Learning Rate Schedule
The learning rate is the most important hyperparameter. Modern LLMs use a warmup + cosine decay schedule:
┌─────────────────────────────────────────────────────────────────────────┐
│ LEARNING RATE SCHEDULE │
│ │
│ LR │
│ 3e-4 │ ╭────────╮ │
│ │ ╱ ╲ │
│ 2e-4 │ ╱ ╲ │
│ │ ╱ ╲ │
│ 1e-4 │ ╱ ╲_____ │
│ │ ╱ ╲____ │
│ 0 │──╱ ╲_________________________ │
│ └────────────────────────────────────────────────────────── │
│ │←─ Warmup ─→│←──────── Cosine Decay ─────────────→│ │
│ (~2K steps) (rest of training) │
│ │
│ Warmup: Gradually increase LR to avoid instability at start │
│ Peak: Maximum learning rate (model-size dependent) │
│ Decay: Smoothly decrease to ~10% of peak │
│ │
│ Typical peak LR by model size: │
│ • 1B params: 3e-4 │
│ • 7B params: 3e-4 │
│ • 70B params: 1.5e-4 │
│ • 175B+ params: 1e-4 or lower │
│ │
│ Larger models need lower learning rates for stability. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
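The warmup + cosine schedule in the box is only a few lines of code. A sketch, with illustrative defaults for the peak LR, warmup length, and decay floor:

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float = 3e-4,
               warmup_steps: int = 2000, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)

# e.g. for a 500K-step run:
for s in (0, 1000, 2000, 250_000, 500_000):
    print(s, f"{lr_at_step(s, max_steps=500_000):.2e}")
```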
Training Instabilities
Large-scale training is notoriously unstable. Common issues:
Loss spikes: Sudden increases in loss, often due to bad data batches or numeric issues.
Divergence: Loss increases and never recovers. Usually requires restarting from earlier checkpoint with lower LR.
Gradient explosion: Gradients become NaN/Inf. Requires gradient clipping.
Gradient vanishing: Gradients become zero, training stalls. Often an architecture issue.
Mitigation strategies:
- Gradient clipping (max norm, typically 1.0)
- Learning rate warmup
- Careful initialization
- Monitoring gradient norms, activation statistics
- Data quality checks
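One common mitigation pattern is to skip an update entirely when the loss or gradient norm looks pathological, rather than let a single bad batch derail the run. A sketch with illustrative thresholds:

```python
import torch

def guarded_step(model, loss, optimizer, max_grad_norm: float = 1.0,
                 spike_factor: float = 10.0, running_loss: float | None = None) -> bool:
    """Skip the update when loss or gradients look pathological (thresholds illustrative)."""
    if not torch.isfinite(loss) or (running_loss and loss.item() > spike_factor * running_loss):
        optimizer.zero_grad(set_to_none=True)      # drop this batch entirely
        return False
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):              # NaN/Inf gradients: skip
        optimizer.zero_grad(set_to_none=True)
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```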
What the Model Learns (Progression)
Research suggests models learn different capabilities at different stages:
Early training (0-10% of tokens):
- Basic syntax and grammar
- Common word associations
- Simple patterns
Mid training (10-50% of tokens):
- Factual knowledge
- More complex reasoning
- Domain knowledge
Late training (50-100% of tokens):
- Subtle linguistic patterns
- Complex reasoning chains
- Edge cases and rare patterns
Implication: You can probe model capabilities throughout training. Some capabilities emerge suddenly; others improve gradually.
Continued Pre-training vs. Pre-training from Scratch
What is Continued Pre-training?
Instead of training a model from random initialization, continued pre-training takes an existing pre-trained model and trains it further on new data.
┌─────────────────────────────────────────────────────────────────────────┐
│ CONTINUED PRE-TRAINING │
│ │
│ FROM SCRATCH: │
│ Random Init ──────────────────────────────────────────→ Final Model │
│ Train on 15T tokens │
│ (months of compute) │
│ │
│ CONTINUED PRE-TRAINING: │
│ Llama 3 Base ─────────────────────→ Domain-Specific Model │
│ Train on 100B domain tokens │
│ (days of compute) │
│ │
│ Use cases: │
│ • Domain adaptation (medical, legal, financial) │
│ • Language adaptation (add new language) │
│ • Knowledge updating (more recent data) │
│ • Context length extension │
│ │
└─────────────────────────────────────────────────────────────────────────┘
When to Use Each Approach
Pre-train from scratch when:
- You have massive compute budget
- Existing models don't fit your needs
- You need full control over data and capabilities
- You're pushing the frontier
Continue pre-training when:
- You need domain specialization
- You want to add capabilities to existing model
- Budget is limited (10-100× cheaper)
- Base model is close to what you need
Continued Pre-training Considerations
Catastrophic forgetting: The model may "forget" general capabilities while learning domain-specific ones. Mitigation: mix domain data with general data (e.g., 50/50).
Learning rate: Use a lower learning rate than original pre-training (typically 10-30% of original peak).
Data quality: Domain data must be high quality; the model has strong priors from original training that bad data will conflict with.
Practical Considerations
Compute Requirements by Model Size
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPUTE REQUIREMENTS GUIDE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Size │ Min GPUs │ Typical GPUs │ Training Time │ Est. Cost │
│ ─────────────────────────────────────────────────────────────────── │
│ 1B │ 1-4 │ 8 A100s │ 1-2 days │ $5K-20K │
│ 7B │ 8 │ 32-64 A100s │ 1-2 weeks │ $50K-200K │
│ 13B │ 16 │ 64-128 A100s │ 2-4 weeks │ $100K-500K │
│ 70B │ 64 │ 256-512 A100s│ 1-2 months │ $1M-5M │
│ 175B+ │ 256+ │ 1000+ A100s │ 2-4 months │ $5M-50M+ │
│ │
│ Notes: │
│ • Costs assume cloud pricing; owned hardware is cheaper long-term │
│ • Times assume Chinchilla-optimal; overtraining takes longer │
│ • H100s are ~2× faster than A100s │
│ • Costs dropping ~50% per year │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Data Requirements
Rough guidelines for tokens needed:
- Chinchilla-optimal: 20× parameter count
- Modern practice: roughly 100-2,000× parameter count (Llama 3 8B was trained at ~1,875×)
- Maximum useful: unknown; returns diminish, but small models have kept improving well past the Chinchilla ratio
Data collection is expensive:
- Licensing costs for books, papers
- Compute for web crawling and filtering
- Human annotation for quality assessment
- Legal review for compliance
Common Failure Modes
- Data contamination: Test set data appears in training, invalidating benchmarks
- Under-filtering: Low-quality data hurts model quality
- Over-filtering: Too aggressive filtering loses valuable content
- Imbalanced mixture: Wrong proportions hurt specific capabilities
- Training instability: Loss spikes, divergence requiring restarts
- Infrastructure failures: Hardware failures, networking issues
Summary: The Pre-training Recipe
┌─────────────────────────────────────────────────────────────────────────┐
│ PRE-TRAINING RECIPE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA │
│ ├── Collect diverse sources (web, books, code, papers) │
│ ├── Filter aggressively for quality │
│ ├── Deduplicate thoroughly │
│ ├── Remove PII and toxic content │
│ └── Mix in appropriate proportions │
│ │
│ 2. ARCHITECTURE │
│ ├── Transformer decoder (standard) │
│ ├── Modern improvements (RoPE, SwiGLU, RMSNorm, GQA) │
│ ├── Size based on compute budget │
│ └── Consider MoE for efficiency │
│ │
│ 3. TRAINING │
│ ├── Distributed training (FSDP/ZeRO + tensor/pipeline parallel) │
│ ├── Mixed precision (BF16) │
│ ├── Adam optimizer (or variants) │
│ ├── Warmup + cosine LR schedule │
│ ├── Gradient clipping │
│ └── Regular checkpointing │
│ │
│ 4. MONITORING │
│ ├── Loss curves │
│ ├── Gradient norms │
│ ├── Periodic evaluation on benchmarks │
│ └── Hardware utilization │
│ │
│ 5. OUTPUT │
│ └── Base model ready for SFT and RLHF │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Tokenization: The Foundation of Everything
Before any training can happen, text must be converted into tokens—discrete units the model can process. Tokenization decisions have profound effects on model capabilities.
Why Tokenization Matters
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION IMPACT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE SAME TEXT, DIFFERENT TOKENIZATIONS: │
│ │
│ Text: "tokenization is important" │
│ │
│ Word-level: ["tokenization", "is", "important"] → 3 tokens │
│ Character: ["t","o","k","e","n","i","z",...] → 23 tokens │
│ BPE (typical):["token", "ization", "is", "important"] → 4 tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS MATTERS: │
│ │
│ 1. EFFECTIVE CONTEXT LENGTH │
│ A 4K context window holds: │
│ • ~3,200 words with efficient tokenizer │
│ • ~2,000 words with inefficient tokenizer │
│ • The difference is massive for long documents │
│ │
│ 2. TRAINING EFFICIENCY │
│ More tokens per word = more compute per concept learned │
│ Efficient tokenization = faster, cheaper training │
│ │
│ 3. MULTILINGUAL CAPABILITY │
│ Poor tokenizers fragment non-English text: │
│ • "Hello" → 1 token │
│ • "你好" → 3-4 tokens (same meaning!) │
│ This makes non-English slower and uses more context │
│ │
│ 4. CAPABILITY BOUNDARIES │
│ Tokenizers affect what the model "sees": │
│ • "1234" as one token vs "12" "34" → different arithmetic │
│ • Code symbols tokenized together vs split → affects coding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Modern Tokenization: BPE and Beyond
Byte Pair Encoding (BPE) is the dominant tokenization algorithm. It learns a vocabulary from data by iteratively merging frequent character pairs:
┌─────────────────────────────────────────────────────────────────────────┐
│ BPE ALGORITHM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING (learning the vocabulary): │
│ │
│ Start: Each character is a token │
│ Corpus: "low lower lowest" │
│ Initial: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't'] │
│ │
│ Iteration 1: Most common pair = ('l', 'o') → merge to 'lo' │
│ Iteration 2: Most common pair = ('lo', 'w') → merge to 'low' │
│ Iteration 3: Most common pair = ('e', 'r') → merge to 'er' │
│ Iteration 4: Most common pair = ('low', 'er') → merge to 'lower' │
│ ...continue until vocabulary size reached │
│ │
│ Final vocabulary includes: 'low', 'lower', 'lowest', 'er', 'est', ... │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ENCODING (using the vocabulary): │
│ │
│ Greedy longest match from learned vocabulary: │
│ "lowest prices" → ["lowest", " ", "prices"] │
│ │
│ Unknown characters fall back to byte representation │
│ (handles any Unicode, even emoji) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
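The training loop described in the box can be written as a short toy implementation. Real tokenizers (SentencePiece, tiktoken) are far more optimized and typically operate on bytes with explicit tie-breaking rules; this sketch just shows the merge-learning idea.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    words = Counter(tuple(word) for word in corpus)   # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():           # apply the new merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

print(learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=4))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...] depending on tie-breaking
```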
SentencePiece is a popular implementation that operates on raw text with no separate pre-tokenization step (with optional byte fallback for unknown characters), making it largely language-agnostic. It is used by Llama 2, Mistral, and many other models; Llama 3 switched to a tiktoken-style BPE tokenizer with a 128K vocabulary.
Tiktoken (used by OpenAI) is a fast BPE implementation with optimizations for efficiency.
Vocabulary Size Trade-offs
┌─────────────────────────────────────────────────────────────────────────┐
│ VOCABULARY SIZE DECISIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Smaller Vocabulary (e.g., 32K tokens): │
│ ├── ✓ Smaller embedding matrix (less memory) │
│ ├── ✓ Each token seen more often (better learning) │
│ ├── ✗ More tokens per word (longer sequences) │
│ └── ✗ Worse at rare words (more fragmentation) │
│ │
│ Larger Vocabulary (e.g., 128K tokens): │
│ ├── ✓ Fewer tokens per word (shorter sequences) │
│ ├── ✓ Better coverage of words and phrases │
│ ├── ✗ Larger embedding matrix (more memory) │
│ └── ✗ Rare tokens undertrained │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TYPICAL VOCABULARY SIZES: │
│ │
│ GPT-2: 50,257 tokens │
│ Llama 2: 32,000 tokens │
│ Llama 3: 128,000 tokens │
│ GPT-4: 100,000+ tokens (estimated) │
│ Mistral: 32,000 tokens │
│ │
│ Trend: Larger vocabularies as models scale (memory less constrained) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Special Tokens
Models use special tokens for control and structure:
- <|begin_of_text|>: Start of sequence
- <|end_of_text|>: End of sequence (important for knowing when to stop)
- <|pad|>: Padding for batching variable-length sequences
- <|user|>, <|assistant|>: Role markers (added during SFT, not pre-training)
Getting special tokens right is crucial—mistakes here cause mysterious failures.
Evaluation During Pre-training
Pre-training runs for months. How do you know if things are going well?
Continuous Monitoring
┌─────────────────────────────────────────────────────────────────────────┐
│ MONITORING METRICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LOSS METRICS (every step): │
│ ├── Training loss (main signal) │
│ ├── Validation loss (generalization check) │
│ ├── Per-domain loss (track different data sources) │
│ └── Loss variance (stability indicator) │
│ │
│ GRADIENT METRICS (every step): │
│ ├── Gradient norm (explosion/vanishing detection) │
│ ├── Per-layer gradient norms (identify problematic layers) │
│ └── Gradient noise scale (batch size adequacy) │
│ │
│ THROUGHPUT METRICS (every step): │
│ ├── Tokens per second │
│ ├── GPU utilization │
│ ├── Memory usage │
│ └── Communication overhead │
│ │
│ CAPABILITY METRICS (periodic): │
│ ├── Benchmark evaluations (every few thousand steps) │
│ ├── Perplexity on held-out sets │
│ ├── Few-shot performance on key tasks │
│ └── Generation quality samples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Early Warning Signs
Loss spikes: Sudden increases in loss. Often caused by:
- Bad data batches (quality filter failures)
- Numeric instability (need gradient clipping)
- Learning rate too high
- Hardware issues (bit flips, memory errors)
Response: Check recent data, reduce LR, increase gradient clipping, rollback if needed.
Gradient norm spikes: Large gradient norms without loss spikes. Often precedes instability. Consider preemptive LR reduction.
Validation/training gap growing: Model is overfitting. Could indicate:
- Need more data
- Need more regularization
- Training too long
Benchmark Evaluation During Training
Running full evaluations is expensive, but periodic checks are valuable:
┌─────────────────────────────────────────────────────────────────────────┐
│ EVALUATION STRATEGY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FREQUENCY vs THOROUGHNESS TRADE-OFF: │
│ │
│ Every 1K steps: Quick proxy metrics (5-10 minutes) │
│ ├── Perplexity on validation set │
│ ├── 5-shot accuracy on 2-3 key benchmarks │
│ └── Generation samples (qualitative check) │
│ │
│ Every 10K steps: Moderate evaluation (1-2 hours) │
│ ├── Full perplexity evaluation │
│ ├── Common benchmarks (MMLU, HellaSwag, ARC, etc.) │
│ └── Longer generation samples │
│ │
│ Every 50K steps: Comprehensive evaluation (4-8 hours) │
│ ├── Full benchmark suite │
│ ├── Coding benchmarks (HumanEval) │
│ ├── Math benchmarks (GSM8K) │
│ └── Human evaluation samples │
│ │
│ Note: Evaluations use separate GPU allocation to not slow training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Data Quality vs Quantity Debate
A fundamental question in pre-training: Is it better to have more data or better data?
The Phi Model Insight
Microsoft's Phi models demonstrated that data quality can substitute for scale. Phi-1.5 (1.3B parameters) achieved GPT-3.5-level performance on some benchmarks by:
- Using heavily filtered, high-quality web data
- Creating synthetic "textbook-quality" data
- Focusing on educational, well-structured content
This suggests the scaling laws may overstate data quantity needs if quality is high enough.
Quality Indicators
What makes data "high quality" for pre-training?
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY DIMENSIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LINGUISTIC QUALITY: │
│ ├── Grammatical correctness │
│ ├── Coherent structure │
│ ├── Rich vocabulary │
│ └── Clear expression │
│ │
│ INFORMATIONAL QUALITY: │
│ ├── Factual accuracy │
│ ├── Depth of explanation │
│ ├── Logical reasoning present │
│ └── Novel information (not repetitive) │
│ │
│ STRUCTURAL QUALITY: │
│ ├── Well-organized content │
│ ├── Clear paragraph structure │
│ ├── Appropriate use of formatting │
│ └── Good signal-to-noise ratio │
│ │
│ DIVERSITY QUALITY: │
│ ├── Covers many topics │
│ ├── Multiple perspectives represented │
│ ├── Different writing styles │
│ └── Range of complexity levels │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Diminishing Returns of Data Scale
Research increasingly shows diminishing returns from raw data scale:
- First 1T tokens: Massive capability gains
- 1T-5T tokens: Significant but smaller gains
- 5T-15T tokens: Incremental improvements
- Beyond 15T: Unclear if more helps much
This is why data quality and diversity become more important than raw scale at frontier levels. Everyone has access to similar web crawls; differentiation comes from curation.
Context Length: Training for Long Documents
Modern models need to handle long contexts (documents, codebases, conversations). This creates unique training challenges.
The Challenge of Long Contexts
┌─────────────────────────────────────────────────────────────────────────┐
│ LONG CONTEXT CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MEMORY SCALING: │
│ Standard attention: O(n²) memory in sequence length │
│ │
│ Sequence Length │ Attention Memory (per layer, per batch) │
│ ───────────────────┼──────────────────────────────────────── │
│ 2K tokens │ 16 MB │
│ 8K tokens │ 256 MB │
│ 32K tokens │ 4 GB │
│ 128K tokens │ 64 GB │
│ 1M tokens │ 4 TB (!) │
│ │
│ This is per layer, per sample. 32 layers × 8 batch = impossible │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING CHALLENGES: │
│ │
│ 1. Memory: Can't fit long sequences without special techniques │
│ 2. Data: Need documents actually that long (most aren't) │
│ 3. Learning: Model must learn to use distant context │
│ 4. Cost: Long sequences = expensive training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Techniques for Long Context Training
Staged context extension: Train at shorter context first, then extend:
- Pre-train at 4K context
- Continue training at 32K with adjusted RoPE scaling
- Final phase at 128K with ring attention
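One widely used version of the "adjusted RoPE scaling" step above is linear position interpolation: positions beyond the original training length are compressed back into the trained range before the rotary angles are computed, and a short continued-training phase at the longer length then adapts the model. A sketch, not any particular model's exact recipe:

```python
import torch

def interpolated_positions(seq_len: int, trained_len: int) -> torch.Tensor:
    """Linear position interpolation: squeeze a longer sequence's positions
    back into the position range the model was pre-trained on."""
    scale = max(1.0, seq_len / trained_len)
    return torch.arange(seq_len, dtype=torch.float32) / scale

# These scaled positions replace 0..seq_len-1 when computing the RoPE angles.
```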
Memory-efficient attention:
- Flash Attention: Fused kernels, never materializes full attention matrix
- Ring Attention: Distribute attention computation across devices
- Sliding window: Only attend to nearby tokens (Mistral)
Data for long context:
- Concatenate related documents
- Use naturally long documents (books, papers, code repos)
- Synthetic long-range dependency tasks
Reproducibility and Open Science
The Reproducibility Challenge
Pre-training is expensive and difficult to reproduce:
┌─────────────────────────────────────────────────────────────────────────┐
│ REPRODUCIBILITY CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WHAT'S TYPICALLY DISCLOSED: │
│ ├── Model architecture │
│ ├── Parameter count │
│ ├── Token count (approximate) │
│ └── General data sources │
│ │
│ WHAT'S TYPICALLY NOT DISCLOSED: │
│ ├── Exact data mixture proportions │
│ ├── Filtering pipeline details │
│ ├── Hyperparameters (learning rate schedules, etc.) │
│ ├── Training instabilities and how they were resolved │
│ ├── Checkpoint selection criteria │
│ └── Compute infrastructure details │
│ │
│ RESULT: │
│ Even with model weights released, training process can't be │
│ reproduced. Published scaling laws may not transfer to your setup. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Open Efforts
Several initiatives aim to improve pre-training reproducibility:
Open datasets:
- RedPajama: Attempt to replicate Llama training data
- The Stack: Permissively licensed code
- Dolma: Open pre-training dataset with documentation
Open training:
- BLOOM: Fully documented training process
- OLMo: Open language model with training code and data
- Pythia: Suite of models with training data and checkpoints released
Related Articles
SFT and RLHF: The Complete Guide to Post-Training LLMs
A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.
SFT Deep Dive: Instruction Tuning Techniques and Best Practices
A comprehensive guide to Supervised Fine-Tuning (SFT) for LLMs—covering full fine-tuning vs LoRA vs QLoRA vs DoRA, data curation strategies, instruction formats, multi-task learning, and avoiding catastrophic forgetting.
RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
Knowledge Distillation for LLMs: Compressing Intelligence
A comprehensive guide to knowledge distillation—transferring capabilities from large teacher models to smaller, faster student models. From theory to implementation, including chain-of-thought distillation and synthetic data generation.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.