
LLM Pre-training: Building Foundation Models from Scratch

A comprehensive guide to pre-training large language models—from data curation and architecture decisions to scaling laws and distributed training infrastructure. Understanding how GPT, Llama, and other foundation models are built.


What is Pre-training?

Pre-training is the foundational phase of building a large language model. It's where a model learns language itself—grammar, facts, reasoning patterns, and the statistical structure of human text—by processing massive amounts of data.

Think of pre-training as teaching a child to read and understand language by exposing them to millions of books, articles, conversations, and documents. The child doesn't memorize specific facts (though some stick); they develop an intuition for how language works, what words mean, how ideas connect, and how to reason about the world.

Pre-training is distinct from later training phases:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE LLM TRAINING PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PRE-TRAINING (this post)                                                │
│  ─────────────────────────                                               │
│  • Train on trillions of tokens from the internet                       │
│  • Self-supervised learning (predict next token)                         │
│  • Result: Base model that can complete text                            │
│  • Cost: $1M - $100M+ compute                                           │
│  • Time: Weeks to months                                                 │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  SUPERVISED FINE-TUNING (SFT)                                           │
│  ───────────────────────────                                             │
│  • Train on instruction-response pairs                                   │
│  • Teaches model to follow instructions                                  │
│  • Result: Model that responds helpfully                                │
│  • Cost: $1K - $100K compute                                            │
│  • Time: Hours to days                                                   │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  REINFORCEMENT LEARNING (RLHF/DPO)                                      │
│  ─────────────────────────────────                                       │
│  • Train on human preferences                                            │
│  • Aligns model with human values                                        │
│  • Result: Model that's helpful, harmless, honest                       │
│  • Cost: $10K - $1M compute                                             │
│  • Time: Days to weeks                                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Pre-training is by far the most expensive and foundational phase. Everything that follows—SFT, RLHF, fine-tuning for specific tasks—builds on the capabilities established during pre-training. A model can only be as good as its pre-training allows.


The Pre-training Objective: Learning to Predict

Next Token Prediction (Autoregressive LMs)

The dominant pre-training objective for modern LLMs is deceptively simple: predict the next token.

Given a sequence of tokens, predict what comes next:

Code
Input:  "The capital of France is"
Target: "Paris"

Input:  "def fibonacci(n):\n    if n <= 1:\n        return"
Target: "n"

This objective is called autoregressive language modeling or causal language modeling. Models like GPT-4, Claude, Llama, and Mistral all use this approach.
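
To make this concrete, here is a minimal sketch of how a tokenized sequence becomes (input, target) pairs; the token IDs are invented for illustration and do not come from a real tokenizer:

Code
# Minimal sketch: turning a token sequence into next-token training pairs.
# The token IDs below are illustrative, not from any real tokenizer.
tokens = [464, 3139, 286, 4881, 318, 6342]   # "The capital of France is Paris"

inputs  = tokens[:-1]   # model sees:     [464, 3139, 286, 4881, 318]
targets = tokens[1:]    # model predicts: [3139, 286, 4881, 318, 6342]

# At position i the model predicts targets[i] from inputs[0..i].
for i in range(len(inputs)):
    context = inputs[: i + 1]
    print(f"context={context} -> target={targets[i]}")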

Why Next Token Prediction Works So Well

At first glance, predicting the next word seems too simple to produce intelligent behavior. But consider what the model must learn to predict well:

To predict the next word in a sentence about physics:

  • The model must understand physics concepts
  • It must know how these concepts relate
  • It must follow logical reasoning chains

To predict the next token in code:

  • The model must understand syntax
  • It must track variable types and scopes
  • It must follow programming logic

To predict the next word in a dialogue:

  • The model must understand context and intent
  • It must model different perspectives
  • It must follow conversational norms

The next token prediction objective is a "universal task" that requires mastering language at every level—from grammar and spelling to reasoning and world knowledge. The model isn't explicitly taught any of these skills; they emerge from the pressure to predict accurately.

The Mathematical Formulation

The pre-training objective minimizes cross-entropy loss over the training corpus:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    NEXT TOKEN PREDICTION LOSS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  For a sequence of tokens x₁, x₂, ..., xₙ:                              │
│                                                                          │
│  Loss = -∑ log P(xᵢ | x₁, x₂, ..., xᵢ₋₁)                               │
│                                                                          │
│  In words: For each position, how surprised is the model                │
│  by the actual token given everything that came before?                 │
│                                                                          │
│  Lower loss = Better predictions = Better understanding                 │
│                                                                          │
│  Example:                                                                │
│  "The cat sat on the [mat]"                                             │
│                                                                          │
│  If model predicts:                                                      │
│  • P("mat") = 0.3    → Loss contribution = -log(0.3) = 1.2             │
│  • P("mat") = 0.01   → Loss contribution = -log(0.01) = 4.6            │
│  • P("mat") = 0.9    → Loss contribution = -log(0.9) = 0.1             │
│                                                                          │
│  The model is trained to maximize probability of correct tokens.        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
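
A minimal PyTorch sketch of this loss (and the perplexity derived from it), using random logits and token IDs purely for illustration:

Code
import torch
import torch.nn.functional as F

# Toy setup: batch of 1 sequence, 6 tokens, vocabulary of 100 tokens.
vocab_size, seq_len = 100, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # x1..x6
logits = torch.randn(1, seq_len, vocab_size)              # model output at each position

# Shift so that position i predicts token i+1.
shift_logits = logits[:, :-1, :]                           # predictions for x2..x6
shift_labels = token_ids[:, 1:]                            # targets x2..x6

# Cross-entropy = mean of -log P(x_i | x_<i) over positions.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
perplexity = torch.exp(loss)   # exponential of the average per-token loss
print(loss.item(), perplexity.item())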

Perplexity: The Standard Metric

Perplexity is the standard metric for evaluating pre-trained language models. It's the exponential of the average loss:

Code
Perplexity = exp(Loss / N)    where Loss is the summed loss over the N tokens, as defined above

Intuitively, perplexity represents how "confused" the model is. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options.

Typical perplexity ranges:

  • Random guessing (50K vocab): ~50,000
  • Bad language model: ~100-500
  • Good language model: ~10-30
  • State-of-the-art (on common benchmarks): ~5-15

Masked Language Modeling (BERT-style)

An alternative pre-training objective is masked language modeling (MLM), used by BERT and its variants:

Code
Input:  "The [MASK] of France is Paris"
Target: "capital"

Instead of predicting the next token, the model predicts randomly masked tokens. This creates a bidirectional model—it can look both forward and backward when making predictions.
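
A rough sketch of how MLM training examples are constructed; the 15% masking rate follows BERT, and the tokens are illustrative:

Code
import random

MASK, MASK_RATE = "[MASK]", 0.15
tokens = ["The", "capital", "of", "France", "is", "Paris"]

masked, labels = [], []
for tok in tokens:
    if random.random() < MASK_RATE:
        masked.append(MASK)     # model input hides this token
        labels.append(tok)      # loss is computed only on masked positions
    else:
        masked.append(tok)
        labels.append(None)     # ignored by the loss
# (BERT additionally replaces some selected tokens with random words or leaves them unchanged.)

print(masked)   # e.g. ['The', '[MASK]', 'of', 'France', 'is', 'Paris']
print(labels)   # e.g. [None, 'capital', None, None, None, None]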

Comparison:

Aspect              Autoregressive (GPT)        Masked (BERT)
Direction           Left-to-right only          Bidirectional
Use case            Text generation             Understanding/classification
Generation          Natural (token by token)    Unnatural (must iterate)
Context             Only past tokens            Full context
Modern preference   ✅ Dominant for LLMs        Used for embeddings

Modern LLMs almost universally use autoregressive pre-training because generation is natural and the same model works for both understanding and generation.


Data: The Fuel for Pre-training

Pre-training data quality and quantity are arguably more important than architecture or training techniques. A model trained on high-quality data will outperform a larger model trained on low-quality data.

Scale: How Much Data?

Modern LLMs are trained on staggering amounts of text:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PRE-TRAINING DATA SCALE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model                    Training Tokens         Approximate Size       │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2 (2019)            40 billion               ~40 GB text           │
│  GPT-3 (2020)            300 billion              ~570 GB text          │
│  Chinchilla (2022)       1.4 trillion             ~1.4 TB text          │
│  Llama 2 (2023)          2 trillion               ~2 TB text            │
│  Llama 3 (2024)          15 trillion              ~15 TB text           │
│  GPT-4 (estimated)       ~10-13 trillion          ~10+ TB text          │
│                                                                          │
│  For reference:                                                          │
│  • Wikipedia: ~4 billion tokens (~20 GB)                                │
│  • All books ever written: ~500 billion tokens (estimate)               │
│  • Common Crawl (filtered): ~1-3 trillion tokens                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The trend is clear: more data leads to better models, but with diminishing returns. The question becomes: where does all this data come from?

Data Sources

1. Web Crawls (Common Crawl)

The foundation of most pre-training datasets is Common Crawl, a non-profit that has been crawling and archiving the web since 2008.

  • Contains petabytes of raw web data
  • New crawls released monthly
  • Covers billions of web pages

But raw web data is mostly garbage. Common Crawl contains:

  • Spam and SEO content
  • Duplicated pages
  • Boilerplate (navigation, ads, footers)
  • Low-quality machine-generated text
  • Malicious content
  • Personally identifiable information

The art of using web data is in filtering. Models like Llama use only a small fraction of Common Crawl after aggressive filtering.

2. Books

Books provide high-quality, long-form, well-edited text:

  • Books1/Books2: Datasets of digitized books (used by GPT-3)
  • Project Gutenberg: Public domain books
  • Internet Archive: Digital library

Books are valuable because they contain:

  • Sustained reasoning and arguments
  • Diverse writing styles
  • Edited, high-quality prose
  • Long-range dependencies

3. Code

Code has become increasingly important for LLM capabilities:

  • GitHub: Public repositories
  • Stack Overflow: Q&A with code
  • The Stack: Curated code dataset

Code training improves:

  • Reasoning ability (code requires logical thinking)
  • Structured output (JSON, XML, etc.)
  • Instruction following (code is precise)
  • General capability (surprisingly broad transfer)

4. Scientific Papers

Academic content provides high-quality technical knowledge:

  • ArXiv: Pre-prints across sciences
  • PubMed: Medical literature
  • Semantic Scholar: Academic papers

5. Curated/Synthetic Data

Increasingly, pre-training includes curated or synthetic data:

  • Wikipedia: High-quality encyclopedic content (often upweighted)
  • Textbooks: Educational content (Phi models heavily use this)
  • Synthetic data: Generated by other LLMs for specific capabilities

Data Curation Pipeline

Raw data must be processed through an extensive pipeline before training:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA CURATION PIPELINE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAW WEB CRAWL                                                           │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────┐                                                    │
│  │  URL FILTERING  │  Remove known bad domains, adult content,          │
│  │                 │  spam domains, etc.                                 │
│  └────────┬────────┘                                                    │
│           │ ~50% removed                                                 │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │ TEXT EXTRACTION │  Extract text from HTML, remove boilerplate,       │
│  │                 │  navigation, ads, scripts                          │
│  └────────┬────────┘                                                    │
│           │                                                              │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │   LANGUAGE ID   │  Keep only target languages                        │
│  │                 │  (usually English + selected others)               │
│  └────────┬────────┘                                                    │
│           │ ~30% removed                                                 │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │ QUALITY FILTER  │  Remove low-quality text using classifiers         │
│  │                 │  (trained on Wikipedia vs web text)                │
│  └────────┬────────┘                                                    │
│           │ ~60-80% removed                                              │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │  DEDUPLICATION  │  Remove duplicate documents (exact + fuzzy)        │
│  │                 │  Critical for training stability                   │
│  └────────┬────────┘                                                    │
│           │ ~30-50% removed                                              │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │   PII REMOVAL   │  Remove emails, phone numbers, addresses,          │
│  │                 │  social security numbers, etc.                     │
│  └────────┬────────┘                                                    │
│           │                                                              │
│           ▼                                                              │
│  ┌─────────────────┐                                                    │
│  │ TOXICITY FILTER │  Remove hate speech, extreme content               │
│  │                 │  (classifiers or keyword lists)                    │
│  └────────┬────────┘                                                    │
│           │                                                              │
│           ▼                                                              │
│  CLEAN TRAINING DATA                                                     │
│  (typically 1-5% of raw crawl)                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Quality Filtering in Detail

Quality filtering is perhaps the most impactful step. The goal is to keep text that looks like "good" writing and remove text that looks like spam, machine-generated content, or low-effort writing.

Common approaches:

Classifier-based: Train a classifier to distinguish Wikipedia/books (positive) from random web text (negative). Apply to all web data, keep high-scoring documents.

Heuristic-based: Apply rules like:

  • Minimum/maximum document length
  • Ratio of alphabetic characters to total
  • Presence of stop words (real text has "the", "and", "is")
  • Average word length (spam often has unusual patterns)
  • Repetition detection (spam repeats phrases)
  • Symbol ratio (too many special characters = bad)

Perplexity-based: Use a small pre-trained model to score text. Very high perplexity = unusual/bad text.

Human evaluation: Sample and manually rate documents to calibrate filters.
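
A sketch of what the heuristic checks above might look like in code; every threshold here is illustrative rather than taken from a published pipeline:

Code
STOP_WORDS = {"the", "and", "is", "of", "to", "a", "in"}

def passes_heuristics(doc: str) -> bool:
    """Illustrative quality heuristics; thresholds are made up, not from a real pipeline."""
    words = doc.split()
    if not (50 <= len(words) <= 100_000):                  # min/max document length
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.7:                                   # too many symbols/digits
        return False
    if not STOP_WORDS & {w.lower() for w in words}:         # real prose contains stop words
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 10):                       # spam often has unusual word lengths
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:        # heavy line repetition
        return False
    return True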

Deduplication: More Important Than You'd Think

Duplicate and near-duplicate documents cause serious problems:

  1. Wasted compute: Training on the same content twice doesn't help
  2. Memorization: Exact duplicates encourage memorization over generalization
  3. Evaluation contamination: If test data appears in training, benchmarks are invalid
  4. Privacy risks: Duplicated private information is more likely to be memorized

Deduplication methods:

Exact deduplication: Hash documents, remove duplicates. Fast but misses near-duplicates.

MinHash/LSH: Create document fingerprints, find similar documents efficiently. Can catch documents that differ by a few words.

Suffix array: Find repeated substrings across the corpus. Can remove duplicated paragraphs even if documents differ overall.
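
A self-contained sketch of exact plus fuzzy deduplication; real pipelines use MinHash/LSH for scale, but the shingling-and-Jaccard idea is the same:

Code
import hashlib

def exact_hash(doc: str) -> str:
    # Exact dedup: identical documents hash to the same digest.
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

def shingles(doc: str, n: int = 5) -> set:
    # Word n-grams ("shingles") capture near-duplicate structure.
    words = doc.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def deduplicate(docs, near_dup_threshold: float = 0.8):
    seen_hashes, kept, kept_shingles = set(), [], []
    for doc in docs:
        h = exact_hash(doc)
        if h in seen_hashes:
            continue                       # exact duplicate
        s = shingles(doc)
        if any(jaccard(s, prev) >= near_dup_threshold for prev in kept_shingles):
            continue                       # near duplicate (O(n²) here; MinHash/LSH makes this scale)
        seen_hashes.add(h)
        kept.append(doc)
        kept_shingles.append(s)
    return kept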

Data Mixing: The Art of Proportions

Pre-training data comes from multiple sources. The mixture proportions significantly impact model capabilities:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EXAMPLE DATA MIXTURES                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LLAMA 2 (reported):                                                     │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ Web data:       ~89%                                             │   │
│  │ Code:           ~4%                                              │   │
│  │ Wikipedia:      ~4%                                              │   │
│  │ Books:          ~3%                                              │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  GPT-3 (reported):                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ Common Crawl:   60%                                              │   │
│  │ WebText2:       22%                                              │   │
│  │ Books:          8%                                               │   │
│  │ Wikipedia:      3%                                               │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  PHI (Microsoft):                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ Textbooks:      Very high (exact % unknown)                      │   │
│  │ Synthetic:      Very high                                        │   │
│  │ Code:           Significant                                      │   │
│  │ Web data:       Lower than typical                               │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  Key insight: More high-quality data is often better than more          │
│  low-quality data, even if total tokens is lower.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Upweighting: High-quality sources like Wikipedia are often shown to the model multiple times (2-10x). This is called "upweighting" or "oversampling."

Curriculum: Some approaches vary the mixture during training—starting with easier/cleaner data and adding harder/noisier data later.
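
A sketch of weighted source sampling; the weights loosely follow the Llama 2 proportions quoted above and are purely illustrative:

Code
import random

# Illustrative mixture weights (roughly the Llama 2 proportions quoted above).
MIXTURE = {"web": 0.89, "code": 0.04, "wikipedia": 0.04, "books": 0.03}

def sample_source(rng: random.Random) -> str:
    # Each training document (or batch) is drawn from a source with these probabilities;
    # upweighting a source (e.g. Wikipedia 2-10x) simply raises its weight here.
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)   # roughly proportional to the mixture weights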


Architecture: The Transformer and Its Variants

The Transformer Foundation

All modern LLMs are based on the Transformer architecture, introduced in "Attention Is All You Need" (2017). The key components:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRANSFORMER DECODER BLOCK                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                        Input (sequence of tokens)                        │
│                                 │                                        │
│                                 ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      TOKEN EMBEDDING                             │   │
│  │    Convert discrete tokens to continuous vectors                 │   │
│  │    vocab_size × hidden_dim matrix lookup                         │   │
│  └──────────────────────────────┬──────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                   POSITIONAL ENCODING                            │   │
│  │    Add position information (since attention is permutation-     │   │
│  │    invariant without it)                                         │   │
│  └──────────────────────────────┬──────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│           ╔═══════════════════════════════════════════╗                 │
│           ║         TRANSFORMER BLOCK (×N)            ║                 │
│           ║  ┌─────────────────────────────────────┐  ║                 │
│           ║  │     MASKED SELF-ATTENTION           │  ║                 │
│           ║  │  Each position attends to previous  │  ║                 │
│           ║  │  positions only (causal mask)       │  ║                 │
│           ║  └──────────────────┬──────────────────┘  ║                 │
│           ║                     │                      ║                 │
│           ║                     ▼                      ║                 │
│           ║  ┌─────────────────────────────────────┐  ║                 │
│           ║  │        ADD & LAYER NORM             │  ║                 │
│           ║  │   Residual connection + normalize   │  ║                 │
│           ║  └──────────────────┬──────────────────┘  ║                 │
│           ║                     │                      ║                 │
│           ║                     ▼                      ║                 │
│           ║  ┌─────────────────────────────────────┐  ║                 │
│           ║  │      FEED-FORWARD NETWORK           │  ║                 │
│           ║  │  Two linear layers with activation  │  ║                 │
│           ║  │  (typically 4× hidden_dim)          │  ║                 │
│           ║  └──────────────────┬──────────────────┘  ║                 │
│           ║                     │                      ║                 │
│           ║                     ▼                      ║                 │
│           ║  ┌─────────────────────────────────────┐  ║                 │
│           ║  │        ADD & LAYER NORM             │  ║                 │
│           ║  └──────────────────┬──────────────────┘  ║                 │
│           ╚══════════════════════╪════════════════════╝                 │
│                                 │                                        │
│                                 ▼ (repeat N times)                      │
│                                 │                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                     OUTPUT PROJECTION                            │   │
│  │    Project back to vocabulary size for next token prediction     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key Architecture Decisions

1. Model Dimensions

The "size" of a model is determined by several hyperparameters:

Parameter        Description                      Typical Values
hidden_dim       Width of representations         768 - 8192
num_layers       Depth of transformer stack       12 - 96
num_heads        Parallel attention heads         12 - 64
vocab_size       Number of tokens                 32K - 128K
context_length   Maximum sequence length          2K - 128K

Parameter count formula:

Code
params ≈ 12 × num_layers × hidden_dim²

This is approximate; the exact count depends on vocabulary size and other factors.
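
A quick sketch of the approximation, checked against a Llama-2-7B-like configuration (32 layers, hidden_dim 4096, 32K vocabulary):

Code
def approx_params(num_layers: int, hidden_dim: int, vocab_size: int) -> int:
    # Attention (~4·d²) plus FFN (~8·d² with a 4x expansion) per layer ≈ 12·d²,
    # plus input/output embeddings.
    block = 12 * hidden_dim ** 2
    embeddings = 2 * vocab_size * hidden_dim
    return num_layers * block + embeddings

# Llama-2-7B-like configuration: 32 layers, hidden_dim 4096, 32K vocab.
print(approx_params(32, 4096, 32_000) / 1e9)   # ~6.7B, close to the reported 7B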

2. Attention Variants

Standard self-attention is O(n²) in sequence length, which becomes prohibitive for long contexts. Many variants address this:

Multi-Head Attention (standard):

  • Multiple parallel attention "heads"
  • Each head can focus on different relationship types
  • Outputs concatenated and projected

Grouped Query Attention (GQA):

  • Used in Llama 2/3, Mistral
  • Multiple query heads share fewer key-value heads
  • Reduces memory usage during inference
  • Similar quality to full multi-head (see the code sketch after this list)

Multi-Query Attention (MQA):

  • Extreme version: all queries share one KV head
  • Maximum memory efficiency
  • Slight quality trade-off

Sliding Window Attention:

  • Used in Mistral, Mixtral
  • Each position only attends to nearby positions
  • Enables very long sequences efficiently

Ring Attention / Sequence Parallelism:

  • Distribute attention across devices
  • Enables training on very long sequences
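
As referenced above, here is a minimal sketch of the grouped-query idea: a few KV heads shared across many query heads. Shapes are illustrative, and a real implementation would also cache K/V during inference:

Code
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), with n_q_heads % n_kv_heads == 0."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Each KV head is shared by `group` query heads; expand KV to match.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, d = 2, 16, 64
q = torch.randn(B, 32, T, d)    # 32 query heads
k = torch.randn(B, 8, T, d)     # only 8 KV heads (4 query heads per KV head)
v = torch.randn(B, 8, T, d)
out = grouped_query_attention(q, k, v)   # (B, 32, T, d)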

3. Positional Encoding

Transformers need position information added explicitly. Evolution of approaches:

Absolute Positional Embeddings (original):

  • Learned embedding for each position
  • Limited to trained context length
  • Can't extrapolate to longer sequences

Rotary Position Embedding (RoPE):

  • Used in Llama, Mistral, most modern LLMs
  • Encodes relative positions through rotation
  • Better length generalization
  • Can be extended with interpolation (see the code sketch after this list)

ALiBi (Attention with Linear Biases):

  • Used in BLOOM, MPT
  • Adds linear bias based on position distance
  • No learned parameters for position
  • Good length generalization
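
A minimal sketch of RoPE, as referenced above; production implementations precompute and cache the cos/sin tables, and the shapes here are illustrative:

Code
import torch

def rope(x, base: float = 10000.0):
    """Apply rotary position embedding to x of shape (B, heads, T, d), with d even."""
    B, H, T, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))   # (d/2,)
    pos = torch.arange(T, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)                 # (T, d/2): angle per position and pair
    cos, sin = freqs.cos(), freqs.sin()                # broadcast over batch and heads
    x1, x2 = x[..., 0::2], x[..., 1::2]                # pair up adjacent dimensions
    rotated_even = x1 * cos - x2 * sin                 # rotate each pair by a
    rotated_odd  = x1 * sin + x2 * cos                 # position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rotated_even, rotated_odd
    return out

q = torch.randn(1, 8, 32, 64)
q_rot = rope(q)   # applied to queries and keys before attention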

4. Normalization

Where and how to normalize activations:

Post-Norm (original Transformer):

  • Normalize after attention/FFN
  • Can be unstable for deep networks

Pre-Norm (GPT-2 onwards):

  • Normalize before attention/FFN
  • More stable training
  • Now standard for LLMs

RMSNorm (Llama):

  • Simplified LayerNorm (removes mean centering)
  • Slightly faster, similar quality
  • Increasingly standard
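
A minimal RMSNorm sketch in PyTorch (dimensions illustrative):

Code
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in Llama-style models: scale by root-mean-square, no mean centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 16, 4096)
print(RMSNorm(4096)(x).shape)   # (2, 16, 4096)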

5. Feed-Forward Network Variants

Standard FFN: FFN(x) = W₂ · ReLU(W₁ · x)

SwiGLU (Llama, most modern LLMs): FFN(x) = (W₁x ⊗ Swish(W₃x)) · W₂

SwiGLU adds a gating mechanism that empirically improves quality.
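
A minimal sketch of a SwiGLU feed-forward block matching the formula above; the Llama-style hidden size (roughly 8/3 × dim instead of the classic 4×) is one common choice:

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN in the SwiGLU style: FFN(x) = (W1·x ⊗ Swish(W3·x)) · W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)   # value branch
        self.w3 = nn.Linear(dim, hidden, bias=False)   # gate branch
        self.w2 = nn.Linear(hidden, dim, bias=False)   # projection back to dim

    def forward(self, x):
        return self.w2(self.w1(x) * F.silu(self.w3(x)))   # silu == Swish

ffn = SwiGLUFFN(dim=4096, hidden=11008)   # Llama-7B-style sizing
print(ffn(torch.randn(2, 16, 4096)).shape)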

Model Scale Comparison

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODEL ARCHITECTURE COMPARISON                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model          Params   Layers  Hidden   Heads   Context   Year        │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2          1.5B     48      1600     25      1K        2019        │
│  GPT-3          175B     96      12288    96      2K        2020        │
│  Llama 2 7B     7B       32      4096     32      4K        2023        │
│  Llama 2 70B    70B      80      8192     64      4K        2023        │
│  Llama 3 8B     8B       32      4096     32      8K        2024        │
│  Llama 3 70B    70B      80      8192     64      8K        2024        │
│  Llama 3 405B   405B     126     16384    128     128K      2024        │
│  Mistral 7B     7B       32      4096     32      32K       2023        │
│  Mixtral 8×7B   47B*     32      4096     32      32K       2024        │
│  Qwen2.5 72B    72B      80      8192     64      128K      2024        │
│                                                                          │
│  *Mixtral uses Mixture-of-Experts; 47B total but 13B active per token  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Mixture of Experts (MoE)

A major architectural trend is Mixture of Experts, where the model has multiple "expert" feed-forward networks and a router selects which experts to use for each token:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MIXTURE OF EXPERTS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Standard Transformer:                                                   │
│  ┌────────────┐                                                         │
│  │    FFN     │  Same FFN for every token                               │
│  └────────────┘                                                         │
│                                                                          │
│  Mixture of Experts:                                                     │
│                    ┌─────────────┐                                      │
│                    │   Router    │  Decides which experts to use        │
│                    └──────┬──────┘                                      │
│                           │                                              │
│         ┌─────────────────┼─────────────────┐                           │
│         ▼                 ▼                 ▼                            │
│    ┌─────────┐       ┌─────────┐       ┌─────────┐                      │
│    │Expert 1 │       │Expert 2 │  ...  │Expert N │                      │
│    └─────────┘       └─────────┘       └─────────┘                      │
│                                                                          │
│  Typically: 8 experts, top-2 selected per token                         │
│                                                                          │
│  Benefits:                                                               │
│  • More parameters without more compute (sparse activation)             │
│  • Experts can specialize (one for code, one for math, etc.)           │
│  • Better scaling properties                                            │
│                                                                          │
│  Examples:                                                               │
│  • Mixtral 8×7B: 47B total params, 13B active                          │
│  • Grok-1: 314B total, mixture of experts                               │
│  • GPT-4 (rumored): MoE architecture                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
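
A minimal sketch of a token-level top-2 MoE layer; it omits the load-balancing losses and capacity limits that real systems need:

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-2 mixture-of-experts FFN layer (no load balancing, no capacity limits)."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, dim)
        logits = self.router(x)                              # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TopKMoE(dim=512, hidden=2048)
print(moe(torch.randn(10, 512)).shape)   # (10, 512)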

Scaling Laws: The Science of Bigger Models

The Scaling Laws Discovery

One of the most important discoveries in modern AI is that LLM performance follows predictable scaling laws. Given more compute, data, or parameters, we can predict how much better the model will be.

The foundational paper, "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), showed that loss decreases as a power law with scale:

Code
L(N) = (N_c / N)^α_N    for model size
L(D) = (D_c / D)^α_D    for data size
L(C) = (C_c / C)^α_C    for compute

Where:
- L = loss (lower is better)
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
- N_c, D_c, C_c = fitted constants
- α exponents ≈ 0.05-0.1

Key insight: Performance improves smoothly and predictably with scale. There's no magic threshold; just keep scaling.

Chinchilla Scaling Laws

DeepMind's "Chinchilla" paper (2022) refined the scaling laws and revealed that most models were undertrained:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CHINCHILLA OPTIMAL SCALING                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Previous wisdom (GPT-3 era):                                           │
│  "Make models as big as possible, train on fixed data"                  │
│                                                                          │
│  Chinchilla discovery:                                                   │
│  "Optimal allocation: scale data and parameters equally"                │
│                                                                          │
│  OPTIMAL RATIO: ~20 tokens per parameter                                │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │  Model Size    │ Optimal Training Tokens │ Actual (GPT-3 era)  │    │
│  ├────────────────┼─────────────────────────┼─────────────────────│    │
│  │  1B params     │  20B tokens             │  ~300B (overtrained)│    │
│  │  10B params    │  200B tokens            │  ~300B              │    │
│  │  70B params    │  1.4T tokens            │  ~300B (undertrained)│   │
│  │  175B params   │  3.5T tokens            │  ~300B (very under) │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  Implication: GPT-3 (175B, 300B tokens) was massively undertrained.    │
│  Chinchilla (70B, 1.4T tokens) outperformed GPT-3 and the 280B Gopher   │
│  while using the same training compute as Gopher.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
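
A back-of-envelope sketch of the Chinchilla rule of thumb (~20 tokens per parameter) together with the standard ~6·N·D FLOPs estimate:

Code
def chinchilla_optimal_tokens(params: float) -> float:
    # Chinchilla rule of thumb: ~20 training tokens per parameter.
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    # Standard approximation: ~6 FLOPs per parameter per token.
    return 6 * params * tokens

for n in (1e9, 7e9, 70e9, 175e9):
    d = chinchilla_optimal_tokens(n)
    print(f"{n/1e9:>5.0f}B params -> {d/1e9:,.0f}B tokens, "
          f"{training_flops(n, d):.2e} FLOPs")
# e.g. 70B params -> 1,400B tokens, ~5.9e23 FLOPs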

Beyond Chinchilla: Modern Scaling

Post-Chinchilla, the field has moved toward even longer training:

Llama approach: Train smaller models for much longer than Chinchilla-optimal:

  • Llama 2 7B: Trained on 2T tokens (~285 tokens per parameter, ~14× the Chinchilla-optimal ratio)
  • Llama 3 8B: Trained on 15T tokens (~1,875 tokens per parameter, ~90× the Chinchilla-optimal ratio)

Why overtrain? Inference cost matters. A smaller model trained longer:

  • Costs less to run in production
  • Is faster for users
  • Uses less memory
  • Can run on smaller GPUs/edge devices

The trade-off: You spend more compute during training (once) to save compute during inference (millions of times).

Emergent Abilities

An important phenomenon in scaling: emergent abilities appear suddenly at certain scales.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EMERGENT ABILITIES                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Performance                                                             │
│       ▲                                                                  │
│       │                          ┌────── Emergent ability appears       │
│       │                          │       (sudden jump)                  │
│       │                          ▼                                       │
│       │                    ╭─────────────                               │
│       │               ╭────╯                                            │
│       │          ╭────╯                                                 │
│       │     ╭────╯                                                      │
│       │╭────╯                                                           │
│       └──────────────────────────────────────▶ Scale                    │
│              Small    Medium    Large    Very Large                     │
│                                                                          │
│  Examples of emergent abilities:                                         │
│  • Chain-of-thought reasoning (~100B params)                            │
│  • Multi-step arithmetic (~10B params)                                  │
│  • Word unscrambling (~10B params)                                      │
│  • Instruction following (~1B params)                                   │
│                                                                          │
│  Note: Recent research suggests emergence may be a measurement          │
│  artifact—abilities may improve smoothly but our metrics have          │
│  thresholds that create apparent discontinuities.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Training Infrastructure: Making It Work at Scale

Pre-training a large language model requires massive distributed computing infrastructure. The challenges are immense:

The Scale of the Problem

Consider training a 70B parameter model on 2T tokens:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRAINING COMPUTE REQUIREMENTS                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model: 70B parameters, 2T tokens                                       │
│                                                                          │
│  FLOPs required: ~6 × N × D = 6 × 70B × 2T = 8.4 × 10²³ FLOPs          │
│                                                                          │
│  Single A100 (80GB):                                                     │
│  • Peak: 312 TFLOPS (bfloat16)                                          │
│  • Realistic: ~150 TFLOPS (50% utilization)                             │
│  • Time: 8.4×10²³ / 150×10¹² = 5.6×10⁹ seconds = 177 years             │
│                                                                          │
│  1024 A100s (typical large cluster):                                    │
│  • Time: 177 years / 1024 ≈ 63 days                                     │
│  • Cost: ~$5-10M (cloud pricing)                                        │
│                                                                          │
│  Memory requirement:                                                     │
│  • Model parameters: 70B × 2 bytes = 140 GB                             │
│  • Optimizer states: 70B × 12 bytes = 840 GB (Adam)                     │
│  • Gradients: 70B × 2 bytes = 140 GB                                    │
│  • Activations: Variable, but large                                     │
│  • Total: >1 TB just for model state                                    │
│                                                                          │
│  This is why distributed training is essential.                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
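
A back-of-envelope sketch of the same arithmetic; the 150 TFLOPS sustained throughput is the ~50% utilization assumption from the table:

Code
def training_days(params: float, tokens: float, num_gpus: int,
                  tflops_per_gpu: float = 150.0) -> float:
    """Rough training time, assuming ~6·N·D total FLOPs and a sustained per-GPU
    throughput (150 TFLOPS ≈ 50% utilization of an A100 in BF16)."""
    total_flops = 6 * params * tokens
    flops_per_sec = num_gpus * tflops_per_gpu * 1e12
    return total_flops / flops_per_sec / 86_400   # seconds per day

# The 70B-parameter / 2T-token example above on 1024 A100s:
print(training_days(70e9, 2e12, num_gpus=1024))   # ~63 days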

Distributed Training Strategies

Data Parallelism

The simplest form: replicate the model on each GPU, split data across GPUs.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA PARALLELISM                                      │
│                                                                          │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                       │
│  │  GPU 0  │ │  GPU 1  │ │  GPU 2  │ │  GPU 3  │                       │
│  │ Model   │ │ Model   │ │ Model   │ │ Model   │                       │
│  │ (copy)  │ │ (copy)  │ │ (copy)  │ │ (copy)  │                       │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                       │
│       │          │          │          │                                │
│       ▼          ▼          ▼          ▼                                │
│   Batch 0    Batch 1    Batch 2    Batch 3                             │
│       │          │          │          │                                │
│       └──────────┴──────────┴──────────┘                                │
│                      │                                                   │
│                      ▼                                                   │
│              Gradient AllReduce                                          │
│              (average gradients)                                         │
│                                                                          │
│  Limitation: Model must fit on single GPU                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
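
A minimal PyTorch DDP skeleton along these lines (meant to be launched with torchrun); `MyTransformer` and `train_loader` are hypothetical placeholders:

Code
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyTransformer().cuda(local_rank)          # hypothetical model class
model = DDP(model, device_ids=[local_rank])       # each rank holds a full replica
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for batch in train_loader:                        # a DistributedSampler gives each rank its own shard
    loss = model(batch.cuda(local_rank))          # assume the model returns the LM loss directly
    loss.backward()                               # gradients are all-reduced across ranks here
    optimizer.step()
    optimizer.zero_grad()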

Model Parallelism (Tensor Parallelism)

Split the model's layers across GPUs. Each GPU holds part of each layer.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TENSOR PARALLELISM                                    │
│                                                                          │
│  Linear layer: Y = XW                                                    │
│                                                                          │
│  Split W across GPUs:                                                    │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                       │
│  │  GPU 0  │ │  GPU 1  │ │  GPU 2  │ │  GPU 3  │                       │
│  │   W₀    │ │   W₁    │ │   W₂    │ │   W₃    │                       │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                       │
│       │          │          │          │                                │
│       ▼          ▼          ▼          ▼                                │
│     X·W₀       X·W₁       X·W₂       X·W₃                              │
│       │          │          │          │                                │
│       └──────────┴──────────┴──────────┘                                │
│                      │                                                   │
│                      ▼                                                   │
│              AllGather [Y₀, Y₁, Y₂, Y₃]                                 │
│                                                                          │
│  Benefit: Can train models larger than single GPU memory                │
│  Cost: Communication overhead between GPUs                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Pipeline Parallelism

Split the model's layers across GPUs sequentially. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PIPELINE PARALLELISM                                  │
│                                                                          │
│  GPU 0        GPU 1        GPU 2        GPU 3                           │
│  Layers       Layers       Layers       Layers                          │
│   1-8          9-16        17-24        25-32                           │
│    │            │            │            │                              │
│    ▼            │            │            │                              │
│  [===]──────────▼            │            │                              │
│    │         [===]───────────▼            │                              │
│    │            │         [===]───────────▼                              │
│    │            │            │         [===]                             │
│    │            │            │            │                              │
│    ◄────────────◄────────────◄────────────┘  (backward pass)            │
│                                                                          │
│  Problem: "Pipeline bubble" - GPUs idle waiting for others              │
│                                                                          │
│  Solution: Micro-batching (split batch into micro-batches)              │
│                                                                          │
│  GPU 0: [=1=][=2=][=3=][=4=][   ][   ][=4=][=3=][=2=][=1=]              │
│  GPU 1: [   ][=1=][=2=][=3=][=4=][=4=][=3=][=2=][=1=][   ]              │
│  GPU 2: [   ][   ][=1=][=2=][=3=][=3=][=2=][=1=][   ][   ]              │
│  GPU 3: [   ][   ][   ][=1=][=2=][=2=][=1=][   ][   ][   ]              │
│          └── forward ──┘  └── backward ──┘                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Fully Sharded Data Parallelism (FSDP) / ZeRO

The modern standard: shard model parameters, gradients, and optimizer states across GPUs.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FSDP / ZeRO                                           │
│                                                                          │
│  Key insight: Each GPU only needs all parameters during forward/back    │
│  pass. Rest of the time, keep only a shard.                             │
│                                                                          │
│  Standard Data Parallel:                                                 │
│  GPU 0: [Full Model] [Full Optimizer] [Full Gradients]                  │
│  GPU 1: [Full Model] [Full Optimizer] [Full Gradients]                  │
│  GPU 2: [Full Model] [Full Optimizer] [Full Gradients]                  │
│  GPU 3: [Full Model] [Full Optimizer] [Full Gradients]                  │
│  Memory: 4× redundant                                                    │
│                                                                          │
│  FSDP / ZeRO Stage 3:                                                   │
│  GPU 0: [Model¼] [Optim¼] [Grad¼]                                       │
│  GPU 1: [Model¼] [Optim¼] [Grad¼]                                       │
│  GPU 2: [Model¼] [Optim¼] [Grad¼]                                       │
│  GPU 3: [Model¼] [Optim¼] [Grad¼]                                       │
│  Memory: ~4× reduction                                                   │
│                                                                          │
│  During forward: AllGather to reconstruct full layer, compute, discard │
│  During backward: AllGather weights, compute gradients, ReduceScatter  │
│                                                                          │
│  Trade-off: More communication, but can train much larger models        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
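
A minimal PyTorch FSDP sketch along these lines; `MyTransformer` is a hypothetical placeholder, and real setups add wrapping policies, activation checkpointing, and sharded checkpointing:

Code
# Assumes torch.distributed has already been initialized (e.g. torchrun + init_process_group).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, MixedPrecision

bf16 = MixedPrecision(
    param_dtype=torch.bfloat16,      # compute in BF16
    reduce_dtype=torch.bfloat16,     # gradient reduction in BF16
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    MyTransformer(),                                # hypothetical model class
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state (ZeRO-3 style)
    mixed_precision=bf16,
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)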

Mixed Precision Training

Modern training uses multiple numeric precisions to balance speed and stability:

  • FP32 (32-bit float): Full precision, 4 bytes per parameter
  • FP16 (16-bit float): Half precision, 2 bytes, but limited range
  • BF16 (bfloat16): Half precision with FP32's range, 2 bytes
  • FP8: Emerging, even smaller, requires careful handling

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MIXED PRECISION TRAINING                              │
│                                                                          │
│  Typical setup:                                                          │
│  • Master weights: FP32 (for accuracy in optimizer)                     │
│  • Forward pass: BF16 (speed)                                           │
│  • Gradients: BF16 (speed)                                              │
│  • Optimizer states: FP32 (stability)                                   │
│                                                                          │
│  Why BF16 over FP16?                                                    │
│  • FP16 range: ±65,504 (overflow risk)                                  │
│  • BF16 range: ±3.4×10³⁸ (same as FP32)                                │
│  • BF16 has less precision but rarely causes issues                     │
│                                                                          │
│  Memory:                                                                 │
│  • FP32 training: ~16 bytes/param (weights, grads, 2 Adam states)       │
│  • Mixed precision keeps FP32 master weights and optimizer states, so   │
│    model state stays ~16 bytes/param; the savings come from BF16        │
│    activations, which often dominate at long sequence lengths           │
│                                                                          │
│  Speed improvement:                                                      │
│  • Modern GPUs have tensor cores optimized for FP16/BF16               │
│  • ~2× throughput for matrix operations                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
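
A minimal sketch of a BF16 autocast training step in PyTorch; the model and batch are toy placeholders:

Code
import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # optimizer states stay FP32
batch = torch.randn(8, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)                 # matmuls run in BF16 on tensor cores
    loss = out.float().pow(2).mean()   # toy loss, kept in FP32 for stability

loss.backward()                        # no GradScaler needed with BF16 (unlike FP16)
optimizer.step()
optimizer.zero_grad()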

Checkpointing and Fault Tolerance

Training runs for weeks or months. Hardware fails. Checkpointing is essential:

Regular checkpoints: Save model state every N steps (e.g., every 1000 steps)

Includes:

  • Model parameters
  • Optimizer states
  • Learning rate scheduler state
  • Random number generator states
  • Current step/epoch
  • Data loader state (which samples have been seen)

Checkpoint size: For a 70B model, each checkpoint is ~1-2 TB

Fault tolerance strategies:

  • Redundant storage (multiple copies)
  • Automatic restart from latest checkpoint
  • Elastic training (can add/remove GPUs)
  • Preemption handling (for cloud spot instances)
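
A sketch of what saving and restoring such a checkpoint might look like for a single-process setup; large distributed runs use sharded checkpoint formats instead:

Code
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    # Everything needed to resume exactly; real setups also save data-loader position.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
        "rng_cpu": torch.get_rng_state(),
        "rng_cuda": torch.cuda.get_rng_state_all(),
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["rng_cpu"])
    torch.cuda.set_rng_state_all(ckpt["rng_cuda"])
    return ckpt["step"]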

Training Dynamics: What Happens During Pre-training

The Loss Curve

Pre-training produces a characteristic loss curve:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TYPICAL TRAINING LOSS CURVE                           │
│                                                                          │
│  Loss                                                                    │
│   12 │╲                                                                 │
│      │ ╲                                                                │
│   10 │  ╲                                                               │
│      │   ╲                                                              │
│    8 │    ╲                                                             │
│      │     ╲                                                            │
│    6 │      ╲                                                           │
│      │       ╲                                                          │
│    4 │        ╲____                                                     │
│      │             ╲____                                                │
│    2 │                  ╲________________________________________       │
│      │                                                                  │
│    0 └──────────────────────────────────────────────────────────────    │
│       0    100K   200K   300K   400K   500K   Steps                     │
│                                                                          │
│  Phases:                                                                 │
│  1. Rapid initial decrease (learning basic patterns)                    │
│  2. Steady improvement (learning more complex patterns)                 │
│  3. Diminishing returns (approaching data/model limits)                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Learning Rate Schedule

The learning rate is the most important hyperparameter. Modern LLMs use a warmup + cosine decay schedule:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LEARNING RATE SCHEDULE                                │
│                                                                          │
│  LR                                                                      │
│  3e-4 │        ╭────────╮                                               │
│       │       ╱          ╲                                              │
│  2e-4 │      ╱            ╲                                             │
│       │     ╱              ╲                                            │
│  1e-4 │    ╱                ╲_____                                      │
│       │   ╱                       ╲____                                 │
│    0  │──╱                             ╲_________________________       │
│       └──────────────────────────────────────────────────────────       │
│          │←─ Warmup ─→│←──────── Cosine Decay ─────────────→│          │
│          (~2K steps)        (rest of training)                          │
│                                                                          │
│  Warmup: Gradually increase LR to avoid instability at start            │
│  Peak: Maximum learning rate (model-size dependent)                     │
│  Decay: Smoothly decrease to ~10% of peak                               │
│                                                                          │
│  Typical peak LR by model size:                                         │
│  • 1B params:  3e-4                                                     │
│  • 7B params:  3e-4                                                     │
│  • 70B params: 1.5e-4                                                   │
│  • 175B+ params: 1e-4 or lower                                          │
│                                                                          │
│  Larger models need lower learning rates for stability.                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
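
A small sketch of the warmup-plus-cosine schedule above; the step counts and peak learning rate are illustrative:

Code
import math

def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2_000,
               total_steps: int = 500_000, min_lr_frac: float = 0.1) -> float:
    """Linear warmup, then cosine decay to ~10% of the peak, as in the schedule above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))         # goes from 1 to 0
    min_lr = peak_lr * min_lr_frac
    return min_lr + (peak_lr - min_lr) * cosine

for s in (0, 1_000, 2_000, 250_000, 500_000):
    print(s, f"{lr_at_step(s):.2e}")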

Training Instabilities

Large-scale training is notoriously unstable. Common issues:

Loss spikes: Sudden increases in loss, often due to bad data batches or numeric issues.

Divergence: Loss increases and never recovers. Usually requires restarting from earlier checkpoint with lower LR.

Gradient explosion: Gradients become NaN/Inf. Requires gradient clipping.

Gradient vanishing: Gradients become zero, training stalls. Often an architecture issue.

Mitigation strategies:

  • Gradient clipping (max norm, typically 1.0)
  • Learning rate warmup
  • Careful initialization
  • Monitoring gradient norms, activation statistics
  • Data quality checks
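
A sketch of gradient clipping plus a simple spike guard inside a training step; the skip threshold is illustrative:

Code
import torch

MAX_GRAD_NORM = 1.0
SKIP_THRESHOLD = 10.0   # illustrative: skip the update if gradients are wildly large

def clipped_step(model, optimizer, loss):
    loss.backward()
    # clip_grad_norm_ returns the total gradient norm *before* clipping (useful to log).
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    if torch.isfinite(grad_norm) and grad_norm < SKIP_THRESHOLD:
        optimizer.step()
    # else: skip this update (bad batch / numeric issue) and investigate the logged grad_norm
    optimizer.zero_grad()
    return grad_norm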

What the Model Learns (Progression)

Research suggests models learn different capabilities at different stages:

Early training (0-10% of tokens):

  • Basic syntax and grammar
  • Common word associations
  • Simple patterns

Mid training (10-50% of tokens):

  • Factual knowledge
  • More complex reasoning
  • Domain knowledge

Late training (50-100% of tokens):

  • Subtle linguistic patterns
  • Complex reasoning chains
  • Edge cases and rare patterns

Implication: You can probe model capabilities throughout training. Some capabilities emerge suddenly; others improve gradually.


Continued Pre-training vs. Pre-training from Scratch

What is Continued Pre-training?

Instead of training a model from random initialization, continued pre-training takes an existing pre-trained model and trains it further on new data.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONTINUED PRE-TRAINING                                │
│                                                                          │
│  FROM SCRATCH:                                                           │
│  Random Init ──────────────────────────────────────────→ Final Model    │
│              Train on 15T tokens                                        │
│              (months of compute)                                         │
│                                                                          │
│  CONTINUED PRE-TRAINING:                                                 │
│  Llama 3 Base ─────────────────────→ Domain-Specific Model              │
│               Train on 100B domain tokens                                │
│               (days of compute)                                          │
│                                                                          │
│  Use cases:                                                              │
│  • Domain adaptation (medical, legal, financial)                        │
│  • Language adaptation (add new language)                               │
│  • Knowledge updating (more recent data)                                │
│  • Context length extension                                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

When to Use Each Approach

Pre-train from scratch when:

  • You have massive compute budget
  • Existing models don't fit your needs
  • You need full control over data and capabilities
  • You're pushing the frontier

Continue pre-training when:

  • You need domain specialization
  • You want to add capabilities to existing model
  • Budget is limited (10-100× cheaper)
  • Base model is close to what you need

Continued Pre-training Considerations

Catastrophic forgetting: The model may "forget" general capabilities while learning domain-specific ones. Mitigation: mix domain data with general data (e.g., 50/50).

Learning rate: Use a lower learning rate than original pre-training (typically 10-30% of original peak).

Data quality: Domain data must be high quality; the model arrives with strong priors from its original training, and low-quality domain data will degrade it rather than specialize it.
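
A minimal sketch of the replay-mixture and learning-rate points above. The corpora here are tiny placeholder lists standing in for streams of tokenized documents:

Python
import random

# Placeholder corpora standing in for real tokenized document streams
domain_docs = ["domain doc 1", "domain doc 2", "domain doc 3"]
general_docs = ["general doc 1", "general doc 2", "general doc 3"]

def mixed_stream(p_domain=0.5):
    """Yield a 50/50 mixture of domain and general documents (with replacement)."""
    while True:
        pool = domain_docs if random.random() < p_domain else general_docs
        yield random.choice(pool)

# Continued pre-training typically runs at 10-30% of the original peak LR
original_peak_lr = 3e-4
continued_peak_lr = 0.2 * original_peak_lr

stream = mixed_stream()
print([next(stream) for _ in range(4)], continued_peak_lr)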


Practical Considerations

Compute Requirements by Model Size

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    COMPUTE REQUIREMENTS GUIDE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model Size  │ Min GPUs │ Typical GPUs │ Training Time │ Est. Cost     │
│  ───────────────────────────────────────────────────────────────────    │
│  1B          │ 1-4      │ 8 A100s      │ 1-2 days      │ $5K-20K       │
│  7B          │ 8        │ 32-64 A100s  │ 1-2 weeks     │ $50K-200K     │
│  13B         │ 16       │ 64-128 A100s │ 2-4 weeks     │ $100K-500K    │
│  70B         │ 64       │ 256-512 A100s│ 1-2 months    │ $1M-5M        │
│  175B+       │ 256+     │ 1000+ A100s  │ 2-4 months    │ $5M-50M+      │
│                                                                          │
│  Notes:                                                                  │
│  • Costs assume cloud pricing; owned hardware is cheaper long-term     │
│  • Times assume Chinchilla-optimal; overtraining takes longer          │
│  • H100s are ~2× faster than A100s                                     │
│  • Costs dropping ~50% per year                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Requirements

Rough guidelines for tokens needed (a worked example follows the list):

  • Chinchilla-optimal: 20× parameter count
  • Modern practice: 50-200× parameter count
  • Maximum useful: Unknown, but >1000× seems wasteful
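
Applying these rules to a hypothetical 7B-parameter model:

Python
params = 7e9  # hypothetical 7B-parameter model

chinchilla = 20 * params      # compute-optimal: ~140B tokens
modern_low = 50 * params      # ~350B tokens
modern_high = 200 * params    # ~1.4T tokens (heavily overtrained, Llama-style)

print(f"Chinchilla-optimal: {chinchilla / 1e9:.0f}B tokens")
print(f"Modern practice:    {modern_low / 1e9:.0f}B to {modern_high / 1e12:.1f}T tokens")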

Data collection is expensive:

  • Licensing costs for books, papers
  • Compute for web crawling and filtering
  • Human annotation for quality assessment
  • Legal review for compliance

Common Failure Modes

  1. Data contamination: Test set data appears in training, invalidating benchmarks
  2. Under-filtering: Low-quality data hurts model quality
  3. Over-filtering: Too aggressive filtering loses valuable content
  4. Imbalanced mixture: Wrong proportions hurt specific capabilities
  5. Training instability: Loss spikes, divergence requiring restarts
  6. Infrastructure failures: Hardware failures, networking issues

Summary: The Pre-training Recipe

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PRE-TRAINING RECIPE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA                                                                 │
│     ├── Collect diverse sources (web, books, code, papers)              │
│     ├── Filter aggressively for quality                                 │
│     ├── Deduplicate thoroughly                                          │
│     ├── Remove PII and toxic content                                    │
│     └── Mix in appropriate proportions                                  │
│                                                                          │
│  2. ARCHITECTURE                                                         │
│     ├── Transformer decoder (standard)                                  │
│     ├── Modern improvements (RoPE, SwiGLU, RMSNorm, GQA)               │
│     ├── Size based on compute budget                                    │
│     └── Consider MoE for efficiency                                     │
│                                                                          │
│  3. TRAINING                                                             │
│     ├── Distributed training (FSDP/ZeRO + tensor/pipeline parallel)    │
│     ├── Mixed precision (BF16)                                          │
│     ├── Adam optimizer (or variants)                                    │
│     ├── Warmup + cosine LR schedule                                     │
│     ├── Gradient clipping                                               │
│     └── Regular checkpointing                                           │
│                                                                          │
│  4. MONITORING                                                           │
│     ├── Loss curves                                                     │
│     ├── Gradient norms                                                  │
│     ├── Periodic evaluation on benchmarks                               │
│     └── Hardware utilization                                            │
│                                                                          │
│  5. OUTPUT                                                               │
│     └── Base model ready for SFT and RLHF                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Tokenization: The Foundation of Everything

Before any training can happen, text must be converted into tokens—discrete units the model can process. Tokenization decisions have profound effects on model capabilities.

Why Tokenization Matters

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION IMPACT                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE SAME TEXT, DIFFERENT TOKENIZATIONS:                                │
│                                                                          │
│  Text: "tokenization is important"                                      │
│                                                                          │
│  Word-level:   ["tokenization", "is", "important"]     → 3 tokens       │
│  Character:    ["t","o","k","e","n","i","z",...]       → 25 tokens      │
│  BPE (typical):["token", "ization", "is", "important"] → 4 tokens       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS MATTERS:                                                       │
│                                                                          │
│  1. EFFECTIVE CONTEXT LENGTH                                             │
│     A 4K context window holds:                                          │
│     • ~3,200 words with efficient tokenizer                            │
│     • ~2,000 words with inefficient tokenizer                          │
│     • The difference is massive for long documents                     │
│                                                                          │
│  2. TRAINING EFFICIENCY                                                  │
│     More tokens per word = more compute per concept learned            │
│     Efficient tokenization = faster, cheaper training                  │
│                                                                          │
│  3. MULTILINGUAL CAPABILITY                                              │
│     Poor tokenizers fragment non-English text:                         │
│     • "Hello" → 1 token                                                │
│     • "你好" → 3-4 tokens (same meaning!)                              │
│     This makes non-English slower and uses more context                │
│                                                                          │
│  4. CAPABILITY BOUNDARIES                                                │
│     Tokenizers affect what the model "sees":                           │
│     • "1234" as one token vs "12" "34" → different arithmetic         │
│     • Code symbols tokenized together vs split → affects coding       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
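
You can measure these effects directly with the tiktoken library; the exact counts depend on which vocabulary you load, so treat the snippet as a way to inspect your own tokenizer rather than as authoritative numbers:

Python
import tiktoken  # pip install tiktoken

# Load an OpenAI BPE vocabulary and count tokens for a few strings
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization is important", "Hello", "你好", "1234"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(ids)} tokens -> {ids}")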

Modern Tokenization: BPE and Beyond

Byte Pair Encoding (BPE) is the dominant tokenization algorithm. It learns a vocabulary from data by iteratively merging frequent character pairs:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BPE ALGORITHM                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRAINING (learning the vocabulary):                                     │
│                                                                          │
│  Start: Each character is a token                                       │
│  Corpus: "low lower lowest"                                             │
│  Initial: ['l', 'o', 'w', ' ', 'e', 'r', 's', 't']                     │
│                                                                          │
│  Iteration 1: Most common pair = ('l', 'o') → merge to 'lo'            │
│  Iteration 2: Most common pair = ('lo', 'w') → merge to 'low'          │
│  Iteration 3: Most common pair = ('e', 'r') → merge to 'er'            │
│  Iteration 4: Most common pair = ('low', 'er') → merge to 'lower'      │
│  ...continue until vocabulary size reached                              │
│                                                                          │
│  Final vocabulary includes: 'low', 'lower', 'lowest', 'er', 'est', ... │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ENCODING (using the vocabulary):                                        │
│                                                                          │
│  Greedy longest match from learned vocabulary:                          │
│  "lowest prices" → ["lowest", " ", "prices"]                           │
│                                                                          │
│  Unknown characters fall back to byte representation                   │
│  (handles any Unicode, even emoji)                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
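
A compact, self-contained version of this training loop on the toy corpus is shown below. It is purely illustrative: the exact merge order can differ from the schematic above when pair counts tie, and production tokenizers (SentencePiece, tiktoken) add byte fallback, pre-tokenization rules, and vocabularies built from tens of thousands of merges:

Python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer over a whitespace-separated corpus (illustrative only)."""
    # Each word starts as a tuple of single-character symbols
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Re-segment every word with the new merge applied
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

print(train_bpe("low lower lowest", num_merges=5))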

SentencePiece is a popular implementation that treats the input as a raw stream of characters (no language-specific pre-tokenization needed), making it language-agnostic. It was used by Llama 2, Mistral, and many other open models.

Tiktoken (used by OpenAI) is a fast, open-source BPE implementation optimized for encoding speed.

Vocabulary Size Trade-offs

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    VOCABULARY SIZE DECISIONS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Smaller Vocabulary (e.g., 32K tokens):                                 │
│  ├── ✓ Smaller embedding matrix (less memory)                          │
│  ├── ✓ Each token seen more often (better learning)                    │
│  ├── ✗ More tokens per word (longer sequences)                         │
│  └── ✗ Worse at rare words (more fragmentation)                        │
│                                                                          │
│  Larger Vocabulary (e.g., 128K tokens):                                 │
│  ├── ✓ Fewer tokens per word (shorter sequences)                       │
│  ├── ✓ Better coverage of words and phrases                            │
│  ├── ✗ Larger embedding matrix (more memory)                           │
│  └── ✗ Rare tokens undertrained                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TYPICAL VOCABULARY SIZES:                                               │
│                                                                          │
│  GPT-2:       50,257 tokens                                             │
│  Llama 2:     32,000 tokens                                             │
│  Llama 3:     128,000 tokens                                            │
│  GPT-4:       100,000+ tokens (estimated)                               │
│  Mistral:     32,000 tokens                                             │
│                                                                          │
│  Trend: Larger vocabularies as models scale (memory less constrained)  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
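
The memory cost in this trade-off is easy to estimate: the input embedding matrix (and the tied output projection, if weights are shared) has vocab_size × hidden_dim parameters. A quick calculation for a hypothetical 4,096-dimensional model:

Python
hidden_dim = 4_096  # hypothetical hidden size (roughly 7B-class)

for vocab_size in [32_000, 128_000]:
    params = vocab_size * hidden_dim
    bf16_mib = params * 2 / 2**20   # 2 bytes per parameter in BF16
    print(f"{vocab_size:>7} vocab: {params / 1e6:5.0f}M embedding params, "
          f"~{bf16_mib:,.0f} MiB in BF16")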

Special Tokens

Models use special tokens for control and structure:

  • <|begin_of_text|>: Start of sequence
  • <|end_of_text|>: End of sequence (important for knowing when to stop)
  • <|pad|>: Padding for batching variable-length sequences
  • <|user|>, <|assistant|>: Role markers (added during SFT, not pre-training)

Getting special tokens right is crucial—mistakes here cause mysterious failures.
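
For intuition, here is a minimal sketch of how pre-training data is typically packed: documents are concatenated with an end-of-text token in between and then chopped into fixed-length sequences. The token IDs and sequence length below are made up for illustration:

Python
EOT_ID = 0       # hypothetical <|end_of_text|> id; the real id depends on the tokenizer
SEQ_LEN = 8      # tiny for illustration; real runs use 4K-128K

# Pretend these are tokenized documents
docs = [[5, 9, 3], [7, 7, 2, 8, 1], [4, 6]]

# Concatenate documents, separated by the end-of-text token
stream = []
for doc in docs:
    stream.extend(doc + [EOT_ID])

# Chop the stream into fixed-length training examples (leftover tokens dropped here)
examples = [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]
print(examples)   # [[5, 9, 3, 0, 7, 7, 2, 8]]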


Evaluation During Pre-training

Pre-training runs for months. How do you know if things are going well?

Continuous Monitoring

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MONITORING METRICS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LOSS METRICS (every step):                                              │
│  ├── Training loss (main signal)                                        │
│  ├── Validation loss (generalization check)                            │
│  ├── Per-domain loss (track different data sources)                    │
│  └── Loss variance (stability indicator)                               │
│                                                                          │
│  GRADIENT METRICS (every step):                                          │
│  ├── Gradient norm (explosion/vanishing detection)                     │
│  ├── Per-layer gradient norms (identify problematic layers)            │
│  └── Gradient noise scale (batch size adequacy)                        │
│                                                                          │
│  THROUGHPUT METRICS (every step):                                        │
│  ├── Tokens per second                                                  │
│  ├── GPU utilization                                                    │
│  ├── Memory usage                                                       │
│  └── Communication overhead                                            │
│                                                                          │
│  CAPABILITY METRICS (periodic):                                          │
│  ├── Benchmark evaluations (every few thousand steps)                  │
│  ├── Perplexity on held-out sets                                       │
│  ├── Few-shot performance on key tasks                                 │
│  └── Generation quality samples                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Early Warning Signs

Loss spikes: Sudden increases in loss. Often caused by:

  • Bad data batches (quality filter failures)
  • Numeric instability (need gradient clipping)
  • Learning rate too high
  • Hardware issues (bit flips, memory errors)

Response: Check recent data, reduce LR, increase gradient clipping, rollback if needed.
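
A minimal sketch of an automated check: compare each incoming loss against the rolling median of recent steps and flag anything far above it. The threshold and window size here are illustrative:

Python
from collections import deque

def make_spike_detector(window=100, threshold=1.5):
    """Flag a step whose loss exceeds threshold x the median of recent losses."""
    recent = deque(maxlen=window)
    def check(loss):
        is_spike = bool(recent) and loss > threshold * sorted(recent)[len(recent) // 2]
        recent.append(loss)
        return is_spike
    return check

check = make_spike_detector()
for step, loss in enumerate([2.10, 2.05, 2.02, 2.00, 5.70, 2.01]):
    if check(loss):
        print(f"step {step}: loss spike ({loss:.2f}) -> inspect recent batches, "
              f"consider LR reduction or rollback")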

Gradient norm spikes: Large gradient norms without loss spikes. Often precedes instability. Consider preemptive LR reduction.

Validation/training gap growing: Model is overfitting. Could indicate:

  • Need more data
  • Need more regularization
  • Training too long

Benchmark Evaluation During Training

Running full evaluations is expensive, but periodic checks are valuable:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EVALUATION STRATEGY                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FREQUENCY vs THOROUGHNESS TRADE-OFF:                                   │
│                                                                          │
│  Every 1K steps:   Quick proxy metrics (5-10 minutes)                  │
│  ├── Perplexity on validation set                                      │
│  ├── 5-shot accuracy on 2-3 key benchmarks                             │
│  └── Generation samples (qualitative check)                            │
│                                                                          │
│  Every 10K steps:  Moderate evaluation (1-2 hours)                     │
│  ├── Full perplexity evaluation                                        │
│  ├── Common benchmarks (MMLU, HellaSwag, ARC, etc.)                   │
│  └── Longer generation samples                                         │
│                                                                          │
│  Every 50K steps:  Comprehensive evaluation (4-8 hours)                │
│  ├── Full benchmark suite                                              │
│  ├── Coding benchmarks (HumanEval)                                     │
│  ├── Math benchmarks (GSM8K)                                           │
│  └── Human evaluation samples                                          │
│                                                                          │
│  Note: Evaluations use separate GPU allocation to not slow training    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
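
The perplexity checks above follow directly from the validation loss: perplexity is the exponential of the average per-token negative log-likelihood, so it can be computed from quantities the training loop already tracks. A minimal example with made-up numbers:

Python
import math

def perplexity(total_nll, num_tokens):
    """Perplexity = exp(average per-token negative log-likelihood, in nats)."""
    return math.exp(total_nll / num_tokens)

# Hypothetical held-out set: 1M tokens with an average NLL of 2.0 nats/token
print(perplexity(total_nll=2.0e6, num_tokens=1_000_000))   # exp(2.0) ≈ 7.39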

The Data Quality vs Quantity Debate

A fundamental question in pre-training: Is it better to have more data or better data?

The Phi Model Insight

Microsoft's Phi models demonstrated that data quality can substitute for scale. Phi-1.5 (1.3B parameters) achieved GPT-3.5-level performance on some benchmarks by:

  1. Using heavily filtered, high-quality web data
  2. Creating synthetic "textbook-quality" data
  3. Focusing on educational, well-structured content

This suggests the scaling laws may overstate data quantity needs if quality is high enough.

Quality Indicators

What makes data "high quality" for pre-training?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA QUALITY DIMENSIONS                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LINGUISTIC QUALITY:                                                     │
│  ├── Grammatical correctness                                           │
│  ├── Coherent structure                                                │
│  ├── Rich vocabulary                                                   │
│  └── Clear expression                                                  │
│                                                                          │
│  INFORMATIONAL QUALITY:                                                  │
│  ├── Factual accuracy                                                  │
│  ├── Depth of explanation                                              │
│  ├── Logical reasoning present                                         │
│  └── Novel information (not repetitive)                                │
│                                                                          │
│  STRUCTURAL QUALITY:                                                     │
│  ├── Well-organized content                                            │
│  ├── Clear paragraph structure                                         │
│  ├── Appropriate use of formatting                                     │
│  └── Good signal-to-noise ratio                                        │
│                                                                          │
│  DIVERSITY QUALITY:                                                      │
│  ├── Covers many topics                                                │
│  ├── Multiple perspectives represented                                 │
│  ├── Different writing styles                                          │
│  └── Range of complexity levels                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The Diminishing Returns of Data Scale

Research increasingly shows diminishing returns from raw data scale:

  • First 1T tokens: Massive capability gains
  • 1T-5T tokens: Significant but smaller gains
  • 5T-15T tokens: Incremental improvements
  • Beyond 15T: Unclear if more helps much

This is why data quality and diversity become more important than raw scale at frontier levels. Everyone has access to similar web crawls; differentiation comes from curation.


Context Length: Training for Long Documents

Modern models need to handle long contexts (documents, codebases, conversations). This creates unique training challenges.

The Challenge of Long Contexts

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LONG CONTEXT CHALLENGES                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MEMORY SCALING:                                                         │
│  Standard attention: O(n²) memory in sequence length                    │
│                                                                          │
│  Sequence Length    │ Attention Memory (per layer, per batch)           │
│  ───────────────────┼────────────────────────────────────────           │
│  2K tokens          │ 16 MB                                             │
│  8K tokens          │ 256 MB                                            │
│  32K tokens         │ 4 GB                                              │
│  128K tokens        │ 64 GB                                             │
│  1M tokens          │ 4 TB (!)                                          │
│                                                                          │
│  This is per layer, per sample. 32 layers × 8 batch = impossible       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TRAINING CHALLENGES:                                                    │
│                                                                          │
│  1. Memory: Can't fit long sequences without special techniques        │
│  2. Data: Need documents actually that long (most aren't)              │
│  3. Learning: Model must learn to use distant context                  │
│  4. Cost: Long sequences = expensive training                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
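
The table's numbers follow from storing a single full n × n attention-score matrix in 32-bit floats (4 bytes per element), per layer and per sample:

Python
# One full n x n attention map in fp32 (4 bytes/element), per layer per sample
for n in [2_048, 8_192, 32_768, 131_072, 1_048_576]:
    mib = n * n * 4 / 2**20
    print(f"{n:>9} tokens: {mib:>12,.0f} MiB")

# Multiply by layers (e.g., 32) and batch size (e.g., 8) to see why exact
# attention at long context needs Flash/ring attention or sliding windows.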

Techniques for Long Context Training

Staged context extension: Train at a shorter context first, then extend (see the RoPE-scaling sketch after this list):

  1. Pre-train at 4K context
  2. Continue training at 32K with adjusted RoPE scaling
  3. Final phase at 128K with ring attention
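
Step 2's "adjusted RoPE scaling" usually means position interpolation: positions in the extended window are rescaled so they fall back inside the range the model saw during original training. The sketch below shows the linear variant with illustrative sizes; NTK-aware and YaRN scaling modify the frequencies differently:

Python
def rope_angles(position, dim=8, base=10_000.0, scale=1.0):
    """Rotary angles for one position; scale < 1 implements linear position
    interpolation (positions are squeezed into the originally trained range)."""
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [position * scale * f for f in inv_freq]

# Extending 4K -> 32K context: scale positions by 4096 / 32768 = 1/8
orig_ctx, new_ctx = 4_096, 32_768
scale = orig_ctx / new_ctx
print(rope_angles(position=20_000, scale=scale))  # behaves like original position 2,500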

Memory-efficient attention:

  • Flash Attention: Fused kernels, never materializes full attention matrix
  • Ring Attention: Distribute attention computation across devices
  • Sliding window: Only attend to nearby tokens (Mistral)

Data for long context:

  • Concatenate related documents
  • Use naturally long documents (books, papers, code repos)
  • Synthetic long-range dependency tasks

Reproducibility and Open Science

The Reproducibility Challenge

Pre-training is expensive and difficult to reproduce:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REPRODUCIBILITY CHALLENGES                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WHAT'S TYPICALLY DISCLOSED:                                            │
│  ├── Model architecture                                                │
│  ├── Parameter count                                                   │
│  ├── Token count (approximate)                                         │
│  └── General data sources                                              │
│                                                                          │
│  WHAT'S TYPICALLY NOT DISCLOSED:                                        │
│  ├── Exact data mixture proportions                                    │
│  ├── Filtering pipeline details                                        │
│  ├── Hyperparameters (learning rate schedules, etc.)                  │
│  ├── Training instabilities and how they were resolved                │
│  ├── Checkpoint selection criteria                                     │
│  └── Compute infrastructure details                                    │
│                                                                          │
│  RESULT:                                                                 │
│  Even with model weights released, training process can't be          │
│  reproduced. Published scaling laws may not transfer to your setup.   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Open Efforts

Several initiatives aim to improve pre-training reproducibility:

Open datasets:

  • RedPajama: Attempt to replicate Llama training data
  • The Stack: Permissively licensed code
  • Dolma: Open pre-training dataset with documentation

Open training:

  • BLOOM: Fully documented training process
  • OLMo: Open language model with training code and data
  • Pythia: Suite of models with training data and checkpoints released
