
nanoGPT: Andrej Karpathy's Minimal GPT Training Framework

A comprehensive, equation-complete analysis of nanoGPT—Andrej Karpathy's influential minimal GPT implementation. Deep dive into the ~300-line model definition (model.py), training loop (train.py), Flash Attention, weight initialization, and the mathematical foundations behind every component.


Repository Overview

nanoGPT is Andrej Karpathy's minimal GPT-2 implementation, demonstrating that training a language model requires only ~600 lines of Python. The repository structure:

| File | Lines | Purpose |
| --- | --- | --- |
| model.py | 331 | Complete GPT architecture: LayerNorm, CausalSelfAttention, MLP, Block, GPT |
| train.py | 337 | Training loop: DDP, mixed precision, gradient accumulation, checkpointing |
| sample.py | ~50 | Text generation with temperature and top-k |
| configurator.py | ~20 | CLI argument parsing via exec() |
| config/*.py | varies | Hyperparameter presets for different runs |
| data/*/prepare.py | varies | Dataset preprocessing scripts |

GitHub: github.com/karpathy/nanoGPT

The Language Modeling Objective

Cross-Entropy Loss

Given a sequence of tokens x_1, x_2, \ldots, x_T, the model learns to predict each token from its prefix. The training objective minimizes the negative log-likelihood:

\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})

where P_\theta is the model's predicted probability distribution over the vocabulary. This is equivalent to the cross-entropy between the true one-hot distribution and the predicted distribution.

Implementation (model.py:187):

Python
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

The ignore_index=-1 allows masking certain positions (e.g., padding tokens).

Perplexity

Loss is often reported as perplexity, the exponential of cross-entropy:

\text{PPL} = e^{\mathcal{L}} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)

Intuition: Perplexity represents the "effective vocabulary size" the model is choosing from. A perplexity of 20 means the model is, on average, as uncertain as uniformly choosing among 20 tokens.

| Model | Perplexity (WikiText-103) | Loss (nats) |
| --- | --- | --- |
| Random (50K vocab) | 50,000 | 10.8 |
| n-gram LM | ~150 | 5.0 |
| GPT-2 124M | ~29 | 3.4 |
| GPT-2 1.5B | ~18 | 2.9 |

Bits Per Byte (BPB)

For tokenizer-independent comparison, convert loss to bits per byte:

\text{BPB} = \frac{\mathcal{L}}{\ln 2} \cdot \frac{\text{tokens}}{\text{bytes}}

Typical compression ratios are 3.5-4.5 bytes per token for English text.
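
A small sketch of both conversions, with an illustrative loss value and an assumed token-to-byte ratio (neither number comes from the repository):

Python
import math

loss_nats = 3.4          # cross-entropy per token, in nats (illustrative value)
tokens_per_byte = 0.25   # ~4 bytes of raw text per BPE token (assumed ratio)

perplexity = math.exp(loss_nats)                  # PPL = e^L, ~30 here
bits_per_token = loss_nats / math.log(2)          # convert nats to bits
bits_per_byte = bits_per_token * tokens_per_byte  # BPB = (L / ln 2) * tokens/bytes

print(f"perplexity ~ {perplexity:.1f}, BPB ~ {bits_per_byte:.2f}")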

Model Configuration

The architecture is fully specified by a dataclass (model.py:108-116):

Python
@dataclass
class GPTConfig:
    block_size: int = 1024      # Maximum sequence length (context window)
    vocab_size: int = 50304     # Vocabulary size (50257 padded for efficiency)
    n_layer: int = 12           # Number of transformer layers
    n_head: int = 12            # Number of attention heads
    n_embd: int = 768           # Embedding/hidden dimension
    dropout: float = 0.0        # Dropout probability (0 for pretraining)
    bias: bool = True           # Use bias in Linear/LayerNorm layers

Vocabulary Size Padding

The padding from 50257 to 50304 (nearest multiple of 64) is a GPU optimization. CUDA matrix multiplication is most efficient when dimensions align with warp sizes and memory access patterns. The extra 47 unused tokens add negligible parameters but provide measurable speedup (typically 3-5%).
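
The rounding itself is a one-liner; a sketch (the pad_vocab helper is mine, not part of the repository):

Python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple of 64 (50257 -> 50304)
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304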

GPT-2 Model Family

| Model | Layers | Heads | Dim | Parameters | Head Dim |
| --- | --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 12 | 768 | 124M | 64 |
| GPT-2 Medium | 24 | 16 | 1024 | 350M | 64 |
| GPT-2 Large | 36 | 20 | 1280 | 774M | 64 |
| GPT-2 XL | 48 | 25 | 1600 | 1558M | 64 |

Implementation (model.py:216-220):

Python
config_args = {
    'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
    'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
    'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
    'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
}

Parameter Count Formula

The total parameter count can be estimated as:

N \approx 12 L d^2 + V d

where L = layers, d = model dimension, V = vocabulary size.

Detailed breakdown for GPT-2 Small (768d, 12L):

| Component | Formula | Count |
| --- | --- | --- |
| Token embeddings | V \times d | 50,304 × 768 = 38.6M |
| Position embeddings | T \times d | 1,024 × 768 = 0.8M |
| Attention Q,K,V | 3 \times d^2 \times L | 3 × 768² × 12 = 21.2M |
| Attention output | d^2 \times L | 768² × 12 = 7.1M |
| MLP fc | d \times 4d \times L | 768 × 3072 × 12 = 28.3M |
| MLP proj | 4d \times d \times L | 3072 × 768 × 12 = 28.3M |
| LayerNorm (2 per block) | 2 \times 2d \times L | 4 × 768 × 12 = 37K |
| Final LayerNorm | 2d | 2 × 768 = 1.5K |
| Total |  | ~124M |

Note: lm_head shares weights with token embeddings (weight tying), so it's not counted separately.

Implementation (model.py:150-160):

Python
def get_num_params(self, non_embedding=True):
    n_params = sum(p.numel() for p in self.parameters())
    if non_embedding:
        n_params -= self.transformer.wpe.weight.numel()
    return n_params

Position embeddings are subtracted because they are a pure lookup and contribute no matmul FLOPs; token embeddings are kept because, with weight tying, the same matrix is reused by lm_head for the final projection.
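
A back-of-the-envelope check of the table above, a sketch using the GPT-2 Small dimensions (biases are ignored here; they add well under 0.1M):

Python
# Rough parameter count for GPT-2 Small; lm_head is tied to the token embeddings.
V, d, L, T = 50304, 768, 12, 1024

tok_emb = V * d                                        # 38.6M, shared with lm_head
pos_emb = T * d                                        # 0.8M
per_block = 3*d*d + d*d + d*4*d + 4*d*d + 4*d          # qkv, attn proj, mlp fc, mlp proj, 2 LayerNorms
final_ln = 2 * d

total = tok_emb + pos_emb + L * per_block + final_ln
print(f"{total/1e6:.1f}M parameters")                  # ~124.4M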

Token and Position Embeddings

Token Embeddings

Each input token x_t \in \{0, 1, \ldots, V-1\} is mapped to a dense vector:

e_t = W_e[x_t] \in \mathbb{R}^d

where W_e \in \mathbb{R}^{V \times d} is the embedding matrix. This is a simple lookup (indexing), not a matrix multiplication.

Learned Positional Embeddings

GPT-2 uses learned absolute position embeddings:

p_t = W_p[t] \in \mathbb{R}^d

where W_p \in \mathbb{R}^{T_{\max} \times d} contains a learned vector for each position up to the maximum sequence length T_{\max} = 1024.

Combined Input Representation

The input to the transformer is the sum of token and position embeddings:

h_t^{(0)} = e_t + p_t = W_e[x_t] + W_p[t]

Implementation (model.py:126-132, model.py:174-179):

Python
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    ...
))
# Forward pass:
tok_emb = self.transformer.wte(idx)   # (b, t, n_embd)
pos_emb = self.transformer.wpe(pos)   # (t, n_embd), broadcasts over batch
x = self.transformer.drop(tok_emb + pos_emb)

Why Position Encoding is Necessary

Without position encoding, self-attention is permutation-equivariant: given the input sequence [x_1, x_2, x_3], it produces the same outputs (just reordered) as for [x_3, x_1, x_2]. Position embeddings break this symmetry, allowing the model to distinguish "the cat sat" from "sat cat the".

Mathematical proof: Let \pi be a permutation. Without position encoding:

\text{Attention}(\pi(Q), \pi(K), \pi(V)) = \pi(\text{Attention}(Q, K, V))

The output is just a permutation of the input—the model cannot learn word order!
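
A small sketch that makes this concrete: with no position information and no causal mask, permuting the input rows of a single attention layer simply permutes its output rows.

Python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 16)        # 5 tokens, 16-dim
perm = torch.randperm(5)

def attn(h):
    # single-head attention with identity Q/K/V projections, no mask
    return F.softmax(h @ h.T / 16**0.5, dim=-1) @ h

out, out_perm = attn(x), attn(x[perm])
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True: output is just permuted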

Layer Normalization

Mathematical Definition

Layer normalization computes statistics over the feature dimension for each position independently:

\mu = \frac{1}{d} \sum_{i=1}^{d} x_i

\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

where:

  • \gamma \in \mathbb{R}^d is the learned scale parameter
  • \beta \in \mathbb{R}^d is the learned shift parameter
  • \epsilon = 10^{-5} prevents division by zero
  • \odot denotes element-wise multiplication

Why LayerNorm, Not BatchNorm?

| Property | BatchNorm | LayerNorm |
| --- | --- | --- |
| Statistics computed over | Batch dimension | Feature dimension |
| Batch size 1 at inference | Breaks | Works |
| Variable sequence lengths | Problematic | No issue |
| Running statistics needed | Yes | No |

LayerNorm is essential for autoregressive generation where batch size is typically 1.

Custom Implementation

PyTorch's built-in LayerNorm doesn't support bias=False independently from scale. nanoGPT implements a custom version (model.py:18-27):

Python
class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

Setting bias=False throughout provides a small efficiency gain—empirically, models perform equally well without biases.

Multi-Head Self-Attention

Query, Key, Value Projections

For input hidden states h \in \mathbb{R}^{T \times d}, we project to queries, keys, and values:

Q = hW_Q, \quad K = hW_K, \quad V = hW_V

where W_Q, W_K, W_V \in \mathbb{R}^{d \times d}.

Intuition:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide if selected?"

nanoGPT computes all three in a single fused projection (model.py:35, model.py:56):

Python
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
# Forward:
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

Scaled Dot-Product Attention

The attention weights are computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where d_k = d / n_{\text{heads}} is the dimension per head.

Why Scale by \sqrt{d_k}?

Consider query and key vectors with independent components, each with zero mean and unit variance. The dot product:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i

has variance d_k (a sum of d_k independent products, each with variance 1). The standard deviation is \sqrt{d_k}.

Without scaling, as d_k grows, dot products become large, pushing softmax into saturation:

\text{softmax}([100, 0, 0]) \approx [1.0, 0.0, 0.0]

In saturation, gradients vanish (all outputs ~0 except one ~1). Scaling by \frac{1}{\sqrt{d_k}} normalizes the variance to 1, keeping softmax in a well-behaved regime.
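
A quick empirical check of this variance argument, a sketch with random unit-variance vectors:

Python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100_000, d_k)   # components have zero mean, unit variance
k = torch.randn(100_000, d_k)

dots = (q * k).sum(dim=-1)
print(dots.var().item())                # ~64: variance grows with d_k
print((dots / d_k**0.5).var().item())   # ~1: scaling restores unit variance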

Softmax Numerical Stability

The naive softmax can overflow for large logits:

\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

The stable version subtracts the maximum:

\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This is mathematically equivalent (numerator and denominator both scale by e^{-\max(x)}), but all exponents are now \leq 0, preventing overflow. PyTorch's F.softmax implements this automatically.
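
A small sketch of the failure mode and the fix, using deliberately large logits:

Python
import torch

x = torch.tensor([1000.0, 999.0, 998.0])

naive = torch.exp(x) / torch.exp(x).sum()                    # exp(1000) overflows -> nan
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()

print(naive)                     # tensor([nan, nan, nan])
print(stable)                    # ~[0.665, 0.245, 0.090]
print(torch.softmax(x, dim=0))   # matches the stable version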

Causal Masking

For autoregressive language modeling, position t can only attend to positions \leq t. This is enforced by adding a mask before the softmax:

M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}

\text{Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V

Since e^{-\infty} = 0, masked positions receive zero attention weight.

Implementation (model.py:48-50, model.py:67-68):

Python
# Register causal mask buffer
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                            .view(1, 1, config.block_size, config.block_size))
# Apply mask
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))

Multi-Head Attention

Rather than a single attention function, transformers use multiple "heads" operating on different subspaces:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O

where each head operates on a d_k = d/h dimensional slice:

\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)

With 12 heads and 768-dimensional embeddings, each head works with 64 dimensions. Different heads can specialize:

  • Syntactic relationships (subject-verb agreement)
  • Semantic dependencies (coreference)
  • Positional patterns (nearby tokens)

Implementation (model.py:57-59):

Python
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

Flash Attention

Standard attention materializes the full T \times T attention matrix, requiring O(T^2) memory. Flash Attention (PyTorch 2.0+) computes attention in tiles that fit in GPU SRAM, reducing memory to O(T) and providing a 2-4× speedup.

Implementation (model.py:45, model.py:62-64):

Python
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
# Forward:
if self.flash:
    y = torch.nn.functional.scaled_dot_product_attention(
        q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)

Memory comparison (12 heads, seq_len=1024):

  • Standard: 12 \times 1024^2 \times 4 bytes ≈ 50 MB per layer
  • Flash: O(T) ≈ 50 KB per layer (~1000× reduction)
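
The fused kernel and the manual masked-softmax path compute the same function; a small sketch verifying that on random tensors, assuming PyTorch 2.0+ (no dropout):

Python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, nh, T, hs = 2, 4, 8, 16
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Fused / Flash path
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Manual path: explicit causal mask and softmax
att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

print(torch.allclose(y_flash, y_manual, atol=1e-5))  # True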

Complete Attention Module

The full implementation (model.py:29-76):

Python
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        if self.flash:
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.resid_dropout(self.c_proj(y))
        return y

Feedforward Network

Each transformer block contains a position-wise feedforward network (MLP) that processes each position independently.

Mathematical Definition

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2

where:

  • W_1 \in \mathbb{R}^{d \times 4d} expands to 4× the dimension
  • W_2 \in \mathbb{R}^{4d \times d} projects back

GELU Activation

The Gaussian Error Linear Unit:

\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

where \Phi(x) is the CDF of the standard normal distribution.

Intuition: GELU is a smooth, probabilistic ReLU. The input x is multiplied by the probability that a standard normal variable is less than x. Large positive values pass through; large negative values are zeroed; the transition is smooth.

Fast approximation:

\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)
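
A small sketch comparing the two forms; PyTorch exposes both through F.gelu's approximate argument (nanoGPT itself simply uses nn.GELU()):

Python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
exact = F.gelu(x)                          # x * Phi(x), via erf
approx = F.gelu(x, approximate='tanh')     # tanh-based approximation

print((exact - approx).abs().max().item())  # tiny difference, on the order of 1e-3 or less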

Comparison of Activations

| Activation | Formula | Smooth at 0 | Sparse | Dead Neurons |
| --- | --- | --- | --- | --- |
| ReLU | \max(0, x) | No | High | Yes |
| GELU | x \cdot \Phi(x) | Yes | Low | No |
| SiLU/Swish | x \cdot \sigma(x) | Yes | Low | No |
| ReLU² | (\max(0,x))^2 | Yes | High | Yes |

Implementation

model.py:78-92:

Python
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

Transformer Block with Pre-Normalization

Pre-Norm vs Post-Norm

The original transformer (2017) used post-normalization:

h' = \text{LayerNorm}(h + \text{Attention}(h))

GPT-2 and nanoGPT use pre-normalization:

h' = h + \text{Attention}(\text{LayerNorm}(h))

Pre-norm provides better gradient flow, especially in deep networks. Normalizing the input to each sublayer ensures consistent activation scales regardless of depth.

Complete Block Equations

h' = h + \text{Attention}(\text{LayerNorm}(h))
h'' = h' + \text{FFN}(\text{LayerNorm}(h'))

Residual Connections and Gradient Flow

The residual connection allows gradients to flow directly through the network. For a single residual step h' = h + f(h):

\frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial h'} \cdot \left(1 + \frac{\partial f(h)}{\partial h}\right)

The "1" term provides a gradient highway regardless of how the sublayer f behaves. Without residuals, gradients must flow through every transformation, often vanishing in deep networks.

Implementation

model.py:94-106:

Python
class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

Weight Initialization

Standard Initialization

Proper initialization prevents exploding or vanishing signals. nanoGPT uses normal initialization with \sigma = 0.02 (model.py:162-168):

Python
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Residual Projection Scaling

The output projections (c_proj) receive special scaled initialization (model.py:143-145):

Python
for pn, p in self.named_parameters():
    if pn.endswith('c_proj.weight'):
        torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

Why 1/\sqrt{2N}?

Each block has two residual additions (attention and MLP). After N blocks:

\text{Var}(h^{(N)}) = \text{Var}(h^{(0)}) + \sum_{i=1}^{2N} \text{Var}(\text{residual}_i)

Without scaling, if each residual branch has variance 1, the total variance grows as 2N. The scaling ensures:

\text{Var}(\text{residual}) \approx \frac{1}{2N}

maintaining approximately unit variance regardless of depth.
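
A toy sketch of this effect (a linear stand-in for the residual branches, not nanoGPT code): stacking 2N branches initialized with std 0.02 inflates the variance of the residual stream, while the 0.02/\sqrt{2N} scale keeps it near the input variance.

Python
import torch

torch.manual_seed(0)
d, n_layer = 768, 12
h = torch.randn(1024, d)   # unit-variance residual stream

def final_variance(proj_std):
    x = h.clone()
    for _ in range(2 * n_layer):                                                # two residual adds per block
        branch_in = (x - x.mean(-1, keepdim=True)) / x.std(-1, keepdim=True)    # stand-in for LayerNorm
        W = torch.randn(d, d) * proj_std                                        # stand-in for the c_proj weight
        x = x + branch_in @ W
    return x.var().item()

print(final_variance(0.02))                           # grows roughly linearly with the 2N adds
print(final_variance(0.02 / (2 * n_layer) ** 0.5))    # stays close to the input variance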

Weight Tying

nanoGPT ties the input embedding matrix to the output projection (model.py:138):

Python
self.transformer.wte.weight = self.lm_head.weight  # https://paperswithcode.com/method/weight-tying

The logit for token i at position t is:

\text{logit}_i = h_t^{(N)} \cdot W_e[i]^T

Benefits:

  1. Parameter efficiency: Saves V \times d parameters (~38.6M for GPT-2)
  2. Implicit regularization: Forces embeddings useful for both input and output
  3. Semantic consistency: Related tokens have similar input and output representations

Complete Forward Pass

model.py:170-193:

Python
def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
    pos = torch.arange(0, t, dtype=torch.long, device=device)

    # Forward the GPT model
    tok_emb = self.transformer.wte(idx)   # (b, t, n_embd)
    pos_emb = self.transformer.wpe(pos)   # (t, n_embd)
    x = self.transformer.drop(tok_emb + pos_emb)
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)

    if targets is not None:
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
        # Inference optimization: only compute logits for last position
        logits = self.lm_head(x[:, [-1], :])
        loss = None

    return logits, loss

The inference optimization x[:, [-1], :] saves O(T \times V) computation per generation step.

Loading Pretrained GPT-2 Weights

nanoGPT can load OpenAI's pretrained GPT-2 weights via HuggingFace (model.py:206-261):

Python
@classmethod
def from_pretrained(cls, model_type, override_args=None):
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    from transformers import GPT2LMHeadModel

    # Create nanoGPT model with matching config
    config = GPTConfig(**config_args)
    model = GPT(config)
    sd = model.state_dict()

    # Load HuggingFace model
    model_hf = GPT2LMHeadModel.from_pretrained(model_type)
    sd_hf = model_hf.state_dict()

    # Copy weights, transposing Conv1D weights
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            sd[k].copy_(sd_hf[k].t())  # Transpose Conv1D -> Linear
        else:
            sd[k].copy_(sd_hf[k])

    return model

The transposition is needed because OpenAI's checkpoint uses a Conv1D module (weight stored as [in_features, out_features]) while nanoGPT uses nn.Linear (weight stored as [out_features, in_features]).

The Complete Training Loop

The training loop (train.py:249-336) implements a production-ready optimization procedure with gradient accumulation, checkpointing, and distributed training support.

Training Loop Structure

train.py:249-336:

Python
# training loop
X, Y = get_batch('train')  # fetch the very first batch
t0 = time.time()
local_iter_num = 0
raw_model = model.module if ddp else model  # unwrap DDP container if needed
running_mfu = -1.0

while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # forward backward update, with optional gradient accumulation
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps  # scale the loss
        X, Y = get_batch('train')  # async prefetch next batch
        scaler.scale(loss).backward()

    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

    # step the optimizer and scaler
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    iter_num += 1
    if iter_num > max_iters:
        break

Gradient Accumulation

Gradient accumulation simulates larger batch sizes when GPU memory is limited. The key equation:

\text{Effective batch size} = \text{micro\_batch} \times \text{grad\_accum\_steps} \times \text{world\_size}

Example (config/train_gpt2.py:9-13):

Python
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8

The loss is scaled by 1/\text{grad\_accum\_steps} before each backward pass:

\mathcal{L}_{\text{scaled}} = \frac{\mathcal{L}}{G}

This ensures gradients sum correctly across accumulation steps.
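
A sketch of why the 1/G scaling works: accumulating G scaled micro-batch losses yields the same gradient as one large batch. The toy model and data here are assumed purely for illustration:

Python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
X, Y = torch.randn(32, 8), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# One full batch
model.zero_grad()
loss_fn(model(X), Y).backward()
full_grad = model.weight.grad.clone()

# Four micro-batches, each loss scaled by 1/G, gradients accumulated
G = 4
model.zero_grad()
for xb, yb in zip(X.chunk(G), Y.chunk(G)):
    (loss_fn(model(xb), yb) / G).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True (up to float error)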

Evaluation During Training

train.py:215-228:

Python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Default hyperparameters (train.py:36-41):

  • eval_interval = 2000 steps between evaluations
  • eval_iters = 200 batches per evaluation
  • eval_only = False to continue training after first eval

Checkpointing

The checkpoint contains everything needed to resume training (train.py:277-286):

Python
checkpoint = {
    'model': raw_model.state_dict(),        # Model weights
    'optimizer': optimizer.state_dict(),    # Optimizer state (momentum, variance)
    'model_args': model_args,               # Architecture config
    'iter_num': iter_num,                   # Training progress
    'best_val_loss': best_val_loss,         # Best validation loss
    'config': config,                       # All hyperparameters
}
torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

Resuming from checkpoint (train.py:158-180):

Python
if init_from == 'resume':
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    # Force architecture params to match
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # Load model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    model.load_state_dict(checkpoint['model'])
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']

Tokens Per Iteration Calculation

train.py:101-102:

Python
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

For the default GPT-2 training config (gradient_accumulation_steps is first divided by the world size, 40/8 = 5): \text{tokens/iter} = 5 \times 8 \times 12 \times 1024 = 491{,}520

With 600K iterations, the total is \approx 2.9 \times 10^{11} tokens (about 295 billion).

Data Loading

train.py:114-131:

Python
data_dir = os.path.join('data', dataset)
def get_batch(split):
    # Recreate np.memmap every batch to avoid memory leak
    if split == 'train':
        data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
    else:
        data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

Key optimizations:

  • Memory mapping: Only loads accessed pages, enabling datasets larger than RAM
  • Pinned memory: Page-locked host memory for faster CPU→GPU transfer
  • Non-blocking transfer: Overlaps data transfer with computation
  • uint16 storage: Token IDs < 65536 fit in 2 bytes, halving storage

Why Recreate Memmap Each Batch?

The comment references a memory leak in numpy's memmap. Each memmap object holds references that can accumulate. Recreating ensures clean state.

Learning Rate Schedule

Cosine Decay with Linear Warmup

\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t + 1}{T_{\text{warmup}} + 1} & \text{if } t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi (t - T_{\text{warmup}})}{T_{\text{decay}} - T_{\text{warmup}}}\right)\right) & \text{otherwise} \end{cases}

Implementation (train.py:231-242):

Python
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

Default hyperparameters (train.py:58-68):

  • \eta_{\max} = 6 \times 10^{-4}
  • \eta_{\min} = 6 \times 10^{-5} (a 10× ratio)
  • T_{\text{warmup}} = 2000 steps
  • T_{\text{decay}} = 600{,}000 steps

AdamW Optimizer

Mathematical Formulation

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_t = \theta_{t-1} - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)

Components:

  • m_t: exponential moving average of gradients (momentum)
  • v_t: exponential moving average of squared gradients (adaptive learning rate)
  • \hat{m}_t, \hat{v}_t: bias corrections for the zero initialization
  • \lambda \theta_{t-1}: decoupled weight decay (the "W" in AdamW)

AdamW vs Adam

Original Adam applies weight decay through the gradient (L2 regularization):

g_t' = g_t + \lambda \theta_{t-1}

This couples weight decay with the adaptive learning rate. AdamW decouples them, applying weight decay directly to parameters. This improves generalization.
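
A minimal sketch of a single decoupled update written directly from the equations above, using nanoGPT's default hyperparameters; this is for illustration only, since the repository simply instantiates torch.optim.AdamW:

Python
import torch

def adamw_step(theta, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    # exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias correction for the zero initialization
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adam update plus *decoupled* weight decay applied directly to the parameters
    theta = theta - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta)
    return theta, m, v

theta = torch.randn(4)
m, v = torch.zeros(4), torch.zeros(4)
theta, m, v = adamw_step(theta, torch.randn(4), m, v, t=1)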

Selective Weight Decay

model.py:263-287:

Python
def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
    param_dict = {pn: p for pn, p in self.named_parameters()}
    param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
    # 2D params (weight matrices) get decay, 1D params (biases, norms) don't
    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0}
    ]
    # Use fused AdamW if available
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device_type == 'cuda'
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
    return optimizer

Default hyperparameters (train.py:57-62):

  • \beta_1 = 0.9, \beta_2 = 0.95
  • \epsilon = 10^{-8}
  • \lambda = 0.1

Gradient Clipping

Global Norm Clipping

g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|_2}\right)

where \|g\|_2 = \sqrt{\sum_{\text{all params}} \sum_i g_i^2} is the global gradient norm.

Implementation (train.py:307-309):

Python
if grad_clip != 0.0:
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

Default \tau = 1.0 (train.py:63).

Mixed Precision Training

BF16 vs FP16

| Format | Exponent | Mantissa | Range | Precision |
| --- | --- | --- | --- | --- |
| FP32 | 8 bits | 23 bits | ±3.4×10³⁸ | High |
| BF16 | 8 bits | 7 bits | ±3.4×10³⁸ | Low |
| FP16 | 5 bits | 10 bits | ±65,504 | Medium |

BF16 has the same exponent range as FP32, eliminating gradient underflow issues. FP16 requires dynamic loss scaling via GradScaler.

Implementation (train.py:111-112, train.py:196, train.py:299-312):

Python
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# Training step
with ctx:
    logits, loss = model(X, Y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed Training (DDP)

Setup

train.py:82-100:

Python
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0
    seed_offset = ddp_rank  # Different data per rank
    gradient_accumulation_steps //= ddp_world_size

Gradient Synchronization Optimization

Only sync gradients on the final accumulation step (train.py:292-298):

Python
for micro_step in range(gradient_accumulation_steps):
    if ddp:
        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
    with ctx:
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps
    scaler.scale(loss).backward()

This reduces all-reduce communication by the accumulation factor.

Model Flops Utilization (MFU)

FLOPS Estimation

For transformers, FLOPS per forward-backward pass:

\text{FLOPS} = (6N + 12LHQT) \cdot T \cdot B

where:

  • N = parameters (excluding position embeddings)
  • L = layers, H = heads, Q = head dimension, T = sequence length
  • B = batch size
  • Factor 6: 2× forward (multiply-add), 2× backward for activations, 2× backward for weights

Implementation (model.py:289-303):

Python
def estimate_mfu(self, fwdbwd_per_iter, dt):
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
    flops_per_token = 6*N + 12*L*H*Q*T
    flops_per_fwdbwd = flops_per_token * T
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter * (1.0/dt)
    flops_promised = 312e12  # A100 bfloat16 peak: 312 TFLOPS
    mfu = flops_achieved / flops_promised
    return mfu

Typical MFU values: 40-60% for well-tuned implementations.
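
Plugging GPT-2 Small numbers into the same arithmetic by hand, a sketch (the per-iteration time dt and the per-GPU batch figures are assumed for illustration, not measured):

Python
# Worked MFU example with GPT-2 Small dimensions
N = 124e6                                      # non-embedding parameter count
L, H, Q, T = 12, 12, 64, 1024
flops_per_token = 6 * N + 12 * L * H * Q * T   # ~8.6e8
flops_per_fwdbwd = flops_per_token * T         # ~8.8e11
fwdbwd_per_iter = 12 * 5                       # micro-batch 12, 5 accumulation steps per GPU (assumed)
dt = 0.35                                      # assumed seconds per iteration on one A100

mfu = flops_per_fwdbwd * fwdbwd_per_iter / dt / 312e12
print(f"MFU ~ {mfu:.0%}")                      # roughly 48% under these assumptions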

Text Generation

Temperature Scaling

P(x) = \frac{\exp(\text{logit}_x / \tau)}{\sum_{x'} \exp(\text{logit}_{x'} / \tau)}

| Temperature | Effect |
| --- | --- |
| \tau \to 0 | Greedy (argmax) |
| \tau = 0.7 | Focused but varied |
| \tau = 1.0 | Standard softmax |
| \tau > 1.0 | More random |

Top-k Sampling

Restrict sampling to the k most probable tokens:

P'(x) = \begin{cases} \frac{P(x)}{\sum_{x' \in \text{top-}k} P(x')} & x \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}

Implementation (model.py:305-330):

Python
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
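
A hedged usage sketch combining generate with from_pretrained; the prompt and sampling settings are arbitrary, and tokenization uses tiktoken's GPT-2 encoding, as sample.py does:

Python
import tiktoken
import torch
from model import GPT  # nanoGPT's model.py

enc = tiktoken.get_encoding("gpt2")
model = GPT.from_pretrained('gpt2')   # 124M model via HuggingFace weights
model.eval()

idx = torch.tensor([enc.encode("The meaning of life is")], dtype=torch.long)
out = model.generate(idx, max_new_tokens=50, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))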

Finetuning from Pretrained Models

nanoGPT supports finetuning pretrained GPT-2 models on custom datasets.

Finetuning Configuration

config/finetune_shakespeare.py:

Python
out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False

dataset = 'shakespeare'
init_from = 'gpt2-xl'  # Start from largest GPT-2

# Only save if validation improves
always_save_checkpoint = False

# Small batch size for finetuning
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# Constant LR, much smaller than pretraining
learning_rate = 3e-5
decay_lr = False

Key Finetuning Differences

| Setting | Pretraining | Finetuning |
| --- | --- | --- |
| init_from | 'scratch' | 'gpt2-xl' |
| learning_rate | 6e-4 | 3e-5 (20× smaller) |
| decay_lr | True (cosine) | False (constant) |
| max_iters | 600,000 | 20 |
| dropout | 0.0 | 0.1+ |

Loading Pretrained Weights

train.py:181-188:

Python
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)

Block Size Cropping

If your dataset uses shorter sequences, crop the model's position embeddings:

train.py:189-192:

Python
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size

model.py:262-270:

Python
def crop_block_size(self, block_size):
    assert block_size <= self.config.block_size
    self.config.block_size = block_size
    self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
    for block in self.transformer.h:
        if hasattr(block.attn, 'bias'):
            block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]

Configuration System

nanoGPT uses exec() to load configuration files, allowing any Python expression:

configurator.py:

Python
import sys
for arg in sys.argv[1:]:
    if '=' not in arg:
        # Assume it's a config file
        with open(arg) as f:
            exec(f.read())
    else:
        # Override individual setting
        exec(arg)

Usage:

Bash
# Load config file
python train.py config/train_gpt2.py

# Override settings
python train.py config/train_gpt2.py --batch_size=8 --learning_rate=1e-4

# Multiple config files (later overrides earlier)
python train.py config/train_gpt2.py config/my_overrides.py
