
nanoGPT: Andrej Karpathy's Minimal GPT Training Framework

A comprehensive, equation-complete analysis of nanoGPT—Andrej Karpathy's influential minimal GPT implementation. Deep dive into the ~300-line model definition (model.py), training loop (train.py), Flash Attention, weight initialization, and the mathematical foundations behind every component.


Repository Overview

nanoGPT is Andrej Karpathy's minimal GPT-2 implementation, demonstrating that training a language model requires only ~600 lines of Python. The repository structure:

| File | Lines | Purpose |
| --- | --- | --- |
| model.py | 331 | Complete GPT architecture: LayerNorm, CausalSelfAttention, MLP, Block, GPT |
| train.py | 337 | Training loop: DDP, mixed precision, gradient accumulation, checkpointing |
| sample.py | ~50 | Text generation with temperature and top-k |
| configurator.py | ~20 | CLI argument parsing via exec() |
| config/*.py | varies | Hyperparameter presets for different runs |
| data/*/prepare.py | varies | Dataset preprocessing scripts |

GitHub: github.com/karpathy/nanoGPT

The Language Modeling Objective

Cross-Entropy Loss

Given a sequence of tokens x_1, x_2, \ldots, x_T, the model learns to predict each token from its prefix. The training objective minimizes the negative log-likelihood:

\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})

where P_\theta is the model's predicted probability distribution over the vocabulary. This is equivalent to the cross-entropy between the true one-hot distribution and the predicted distribution.

Implementation (model.py:187):

Python
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

The ignore_index=-1 allows masking certain positions (e.g., padding tokens).

Perplexity

Loss is often reported as perplexity, the exponential of cross-entropy:

\text{PPL} = e^{\mathcal{L}} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)

Intuition: Perplexity represents the "effective vocabulary size" the model is choosing from. A perplexity of 20 means the model is, on average, as uncertain as uniformly choosing among 20 tokens.

| Model | Perplexity (WikiText-103) | Loss (nats) |
| --- | --- | --- |
| Random (50K vocab) | 50,000 | 10.8 |
| n-gram LM | ~150 | 5.0 |
| GPT-2 124M | ~29 | 3.4 |
| GPT-2 1.5B | ~18 | 2.9 |

Bits Per Byte (BPB)

For tokenizer-independent comparison, convert loss to bits per byte:

\text{BPB} = \frac{\mathcal{L}}{\ln 2} \cdot \frac{\text{tokens}}{\text{bytes}}

Typical compression ratios are 3.5-4.5 bytes per token for English text.
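
A small sketch of both conversions, with an illustrative loss value and an assumed token-to-byte ratio (neither number comes from the repository):

Python
import math

loss_nats = 3.4          # cross-entropy per token, in nats (illustrative value)
tokens_per_byte = 0.25   # ~4 bytes of raw text per BPE token (assumed ratio)

perplexity = math.exp(loss_nats)                  # PPL = e^L, ~30 here
bits_per_token = loss_nats / math.log(2)          # convert nats to bits
bits_per_byte = bits_per_token * tokens_per_byte  # BPB = (L / ln 2) * tokens/bytes

print(f"perplexity ~ {perplexity:.1f}, BPB ~ {bits_per_byte:.2f}")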

Model Configuration

The architecture is fully specified by a dataclass (model.py:108-116):

Python
@dataclass
class GPTConfig:
    block_size: int = 1024      # Maximum sequence length (context window)
    vocab_size: int = 50304     # Vocabulary size (50257 padded for efficiency)
    n_layer: int = 12           # Number of transformer layers
    n_head: int = 12            # Number of attention heads
    n_embd: int = 768           # Embedding/hidden dimension
    dropout: float = 0.0        # Dropout probability (0 for pretraining)
    bias: bool = True           # Use bias in Linear/LayerNorm layers

Vocabulary Size Padding

The padding from 50257 to 50304 (nearest multiple of 64) is a GPU optimization. CUDA matrix multiplication is most efficient when dimensions align with warp sizes and memory access patterns. The extra 47 unused tokens add negligible parameters but provide measurable speedup (typically 3-5%).
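
The rounding itself is a one-liner; a sketch (the pad_vocab helper is mine, not part of the repository):

Python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple of 64 (50257 -> 50304)
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304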

GPT-2 Model Family

| Model | Layers | Heads | Dim | Parameters | Head Dim |
| --- | --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 12 | 768 | 124M | 64 |
| GPT-2 Medium | 24 | 16 | 1024 | 350M | 64 |
| GPT-2 Large | 36 | 20 | 1280 | 774M | 64 |
| GPT-2 XL | 48 | 25 | 1600 | 1558M | 64 |

Implementation (model.py:216-220):

Python
config_args = {
    'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
    'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
    'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
    'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
}

Parameter Count Formula

The total parameter count can be estimated as:

N \approx 12 L d^2 + V d

where L = layers, d = model dimension, V = vocabulary size.

Detailed breakdown for GPT-2 Small (768d, 12L):

| Component | Formula | Count |
| --- | --- | --- |
| Token embeddings | V \times d | 50,304 × 768 = 38.6M |
| Position embeddings | T \times d | 1,024 × 768 = 0.8M |
| Attention Q,K,V | 3 \times d^2 \times L | 3 × 768² × 12 = 21.2M |
| Attention output | d^2 \times L | 768² × 12 = 7.1M |
| MLP fc | d \times 4d \times L | 768 × 3072 × 12 = 28.3M |
| MLP proj | 4d \times d \times L | 3072 × 768 × 12 = 28.3M |
| LayerNorm (2 per block) | 2 \times 2d \times L | 4 × 768 × 12 = 37K |
| Final LayerNorm | 2d | 2 × 768 = 1.5K |
| Total |  | ~124M |

Note: lm_head shares weights with token embeddings (weight tying), so it's not counted separately.

Implementation (model.py:150-160):

Python
def get_num_params(self, non_embedding=True):
    n_params = sum(p.numel() for p in self.parameters())
    if non_embedding:
        n_params -= self.transformer.wpe.weight.numel()
    return n_params

Position embeddings are subtracted because they are a pure lookup and contribute no matmul FLOPs; token embeddings are kept because, with weight tying, the same matrix is reused by lm_head for the final projection.
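
A back-of-the-envelope check of the table above, a sketch using the GPT-2 Small dimensions (biases are ignored here; they add well under 0.1M):

Python
# Rough parameter count for GPT-2 Small; lm_head is tied to the token embeddings.
V, d, L, T = 50304, 768, 12, 1024

tok_emb = V * d                                        # 38.6M, shared with lm_head
pos_emb = T * d                                        # 0.8M
per_block = 3*d*d + d*d + d*4*d + 4*d*d + 4*d          # qkv, attn proj, mlp fc, mlp proj, 2 LayerNorms
final_ln = 2 * d

total = tok_emb + pos_emb + L * per_block + final_ln
print(f"{total/1e6:.1f}M parameters")                  # ~124.4M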

Token and Position Embeddings

Token Embeddings

Each input token x_t \in \{0, 1, \ldots, V-1\} is mapped to a dense vector:

e_t = W_e[x_t] \in \mathbb{R}^d

where W_e \in \mathbb{R}^{V \times d} is the embedding matrix. This is a simple lookup (indexing), not a matrix multiplication.

Learned Positional Embeddings

GPT-2 uses learned absolute position embeddings:

p_t = W_p[t] \in \mathbb{R}^d

where W_p \in \mathbb{R}^{T_{\max} \times d} contains a learned vector for each position up to the maximum sequence length T_{\max} = 1024.

Combined Input Representation

The input to the transformer is the sum of token and position embeddings:

h_t^{(0)} = e_t + p_t = W_e[x_t] + W_p[t]

Implementation (model.py:126-132, model.py:174-179):

Python
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    ...
))
# Forward pass:
tok_emb = self.transformer.wte(idx)   # (b, t, n_embd)
pos_emb = self.transformer.wpe(pos)   # (t, n_embd), broadcasts over batch
x = self.transformer.drop(tok_emb + pos_emb)

Why Position Encoding is Necessary

Without position encoding, self-attention is permutation-equivariant: given the input sequence [x_1, x_2, x_3], it produces the same outputs (just reordered) as for [x_3, x_1, x_2]. Position embeddings break this symmetry, allowing the model to distinguish "the cat sat" from "sat cat the".

Mathematical proof: Let \pi be a permutation. Without position encoding:

\text{Attention}(\pi(Q), \pi(K), \pi(V)) = \pi(\text{Attention}(Q, K, V))

The output is just a permutation of the input—the model cannot learn word order!
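
A small sketch that makes this concrete: with no position information and no causal mask, permuting the input rows of a single attention layer simply permutes its output rows.

Python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 16)        # 5 tokens, 16-dim
perm = torch.randperm(5)

def attn(h):
    # single-head attention with identity Q/K/V projections, no mask
    return F.softmax(h @ h.T / 16**0.5, dim=-1) @ h

out, out_perm = attn(x), attn(x[perm])
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True: output is just permuted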

Layer Normalization

Mathematical Definition

Layer normalization computes statistics over the feature dimension for each position independently:

\mu = \frac{1}{d} \sum_{i=1}^{d} x_i

\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

where:

  • \gamma \in \mathbb{R}^d is the learned scale parameter
  • \beta \in \mathbb{R}^d is the learned shift parameter
  • \epsilon = 10^{-5} prevents division by zero
  • \odot denotes element-wise multiplication

Why LayerNorm, Not BatchNorm?

| Property | BatchNorm | LayerNorm |
| --- | --- | --- |
| Statistics computed over | Batch dimension | Feature dimension |
| Batch size 1 at inference | Breaks | Works |
| Variable sequence lengths | Problematic | No issue |
| Running statistics needed | Yes | No |

LayerNorm is essential for autoregressive generation where batch size is typically 1.

Custom Implementation

PyTorch's built-in LayerNorm doesn't support bias=False independently from scale. nanoGPT implements a custom version (model.py:18-27):

Python
class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

Setting bias=False throughout provides a small efficiency gain—empirically, models perform equally well without biases.

Multi-Head Self-Attention

Query, Key, Value Projections

For input hidden states h \in \mathbb{R}^{T \times d}, we project to queries, keys, and values:

Q = hW_Q, \quad K = hW_K, \quad V = hW_V

where W_Q, W_K, W_V \in \mathbb{R}^{d \times d}.

Intuition:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide if selected?"

nanoGPT computes all three in a single fused projection (model.py:35, model.py:56):

Python
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
# Forward:
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

Scaled Dot-Product Attention

The attention weights are computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where d_k = d / n_{\text{heads}} is the dimension per head.

Why Scale by \sqrt{d_k}?

Consider query and key vectors with independent components, each with zero mean and unit variance. The dot product:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i

has variance d_k (a sum of d_k independent products, each with variance 1). The standard deviation is \sqrt{d_k}.

Without scaling, as d_k grows, dot products become large, pushing softmax into saturation:

\text{softmax}([100, 0, 0]) \approx [1.0, 0.0, 0.0]

In saturation, gradients vanish (all outputs ~0 except one ~1). Scaling by \frac{1}{\sqrt{d_k}} normalizes the variance to 1, keeping softmax in a well-behaved regime.
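
A quick empirical check of this variance argument, a sketch with random unit-variance vectors:

Python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100_000, d_k)   # components have zero mean, unit variance
k = torch.randn(100_000, d_k)

dots = (q * k).sum(dim=-1)
print(dots.var().item())                # ~64: variance grows with d_k
print((dots / d_k**0.5).var().item())   # ~1: scaling restores unit variance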

Softmax Numerical Stability

The naive softmax can overflow for large logits:

\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

The stable version subtracts the maximum:

\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This is mathematically equivalent (numerator and denominator both scale by e^{-\max(x)}), but all exponents are now \leq 0, preventing overflow. PyTorch's F.softmax implements this automatically.
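
A small sketch of the failure mode and the fix, using deliberately large logits:

Python
import torch

x = torch.tensor([1000.0, 999.0, 998.0])

naive = torch.exp(x) / torch.exp(x).sum()                    # exp(1000) overflows -> nan
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()

print(naive)                     # tensor([nan, nan, nan])
print(stable)                    # ~[0.665, 0.245, 0.090]
print(torch.softmax(x, dim=0))   # matches the stable version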

Causal Masking

For autoregressive language modeling, position t can only attend to positions \leq t. This is enforced by adding a mask before the softmax:

M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}

\text{Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V

Since e^{-\infty} = 0, masked positions receive zero attention weight.

Implementation (model.py:48-50, model.py:67-68):

Python
# Register causal mask buffer
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                            .view(1, 1, config.block_size, config.block_size))
# Apply mask
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))

Multi-Head Attention

Rather than a single attention function, transformers use multiple "heads" operating on different subspaces:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O

where each head operates on a d_k = d/h dimensional slice:

\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)

With 12 heads and 768-dimensional embeddings, each head works with 64 dimensions. Different heads can specialize:

  • Syntactic relationships (subject-verb agreement)
  • Semantic dependencies (coreference)
  • Positional patterns (nearby tokens)

Implementation (model.py:57-59):

Python
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

Flash Attention

Standard attention materializes the full T \times T attention matrix, requiring O(T^2) memory. Flash Attention (PyTorch 2.0+) computes attention in tiles that fit in GPU SRAM, reducing memory to O(T) and providing a 2-4× speedup.

Implementation (model.py:45, model.py:62-64):

Python
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
# Forward:
if self.flash:
    y = torch.nn.functional.scaled_dot_product_attention(
        q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)

Memory comparison (12 heads, seq_len=1024):

  • Standard: 12 \times 1024^2 \times 4 bytes ≈ 50 MB per layer
  • Flash: O(T) ≈ 50 KB per layer (~1000× reduction)
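
The fused kernel and the manual masked-softmax path compute the same function; a small sketch verifying that on random tensors, assuming PyTorch 2.0+ (no dropout):

Python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, nh, T, hs = 2, 4, 8, 16
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Fused / Flash path
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Manual path: explicit causal mask and softmax
att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

print(torch.allclose(y_flash, y_manual, atol=1e-5))  # True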

Complete Attention Module

The full implementation (model.py:29-76):

Python
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        if self.flash:
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.resid_dropout(self.c_proj(y))
        return y

Feedforward Network

Each transformer block contains a position-wise feedforward network (MLP) that processes each position independently.

Mathematical Definition

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2

where:

  • W_1 \in \mathbb{R}^{d \times 4d} expands to 4× the dimension
  • W_2 \in \mathbb{R}^{4d \times d} projects back

GELU Activation

The Gaussian Error Linear Unit:

\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

where \Phi(x) is the CDF of the standard normal distribution.

Intuition: GELU is a smooth, probabilistic ReLU. The input x is multiplied by the probability that a standard normal variable is less than x. Large positive values pass through; large negative values are zeroed; the transition is smooth.

Fast approximation:

\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)
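
A small sketch comparing the two forms; PyTorch exposes both through F.gelu's approximate argument (nanoGPT itself simply uses nn.GELU()):

Python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
exact = F.gelu(x)                          # x * Phi(x), via erf
approx = F.gelu(x, approximate='tanh')     # tanh-based approximation

print((exact - approx).abs().max().item())  # tiny difference, on the order of 1e-3 or less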

Comparison of Activations

| Activation | Formula | Smooth at 0 | Sparse | Dead Neurons |
| --- | --- | --- | --- | --- |
| ReLU | \max(0, x) | No | High | Yes |
| GELU | x \cdot \Phi(x) | Yes | Low | No |
| SiLU/Swish | x \cdot \sigma(x) | Yes | Low | No |
| ReLU² | (\max(0,x))^2 | Yes | High | Yes |

Implementation

model.py:78-92:

Python
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

Transformer Block with Pre-Normalization

Pre-Norm vs Post-Norm

The original transformer (2017) used post-normalization:

h' = \text{LayerNorm}(h + \text{Attention}(h))

GPT-2 and nanoGPT use pre-normalization:

h' = h + \text{Attention}(\text{LayerNorm}(h))

Pre-norm provides better gradient flow, especially in deep networks. Normalizing the input to each sublayer ensures consistent activation scales regardless of depth.

Complete Block Equations

h' = h + \text{Attention}(\text{LayerNorm}(h))
h'' = h' + \text{FFN}(\text{LayerNorm}(h'))

Residual Connections and Gradient Flow

The residual connection allows gradients to flow directly through the network. For a single residual step h' = h + f(h):

\frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial h'} \cdot \left(1 + \frac{\partial f(h)}{\partial h}\right)

The "1" term provides a gradient highway regardless of how the sublayer f behaves. Without residuals, gradients must flow through every transformation, often vanishing in deep networks.

Implementation

model.py:94-106:

Python
class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

Weight Initialization

Standard Initialization

Proper initialization prevents exploding or vanishing signals. nanoGPT uses normal initialization with \sigma = 0.02 (model.py:162-168):

Python
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Residual Projection Scaling

The output projections (c_proj) receive special scaled initialization (model.py:143-145):

Python
for pn, p in self.named_parameters():
    if pn.endswith('c_proj.weight'):
        torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

Why 1/\sqrt{2N}?

Each block has two residual additions (attention and MLP). After N blocks:

\text{Var}(h^{(N)}) = \text{Var}(h^{(0)}) + \sum_{i=1}^{2N} \text{Var}(\text{residual}_i)

Without scaling, if each residual branch has variance 1, the total variance grows as 2N. The scaling ensures:

\text{Var}(\text{residual}) \approx \frac{1}{2N}

maintaining approximately unit variance regardless of depth.
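
A toy sketch of this effect (a linear stand-in for the residual branches, not nanoGPT code): stacking 2N branches initialized with std 0.02 inflates the variance of the residual stream, while the 0.02/\sqrt{2N} scale keeps it near the input variance.

Python
import torch

torch.manual_seed(0)
d, n_layer = 768, 12
h = torch.randn(1024, d)   # unit-variance residual stream

def final_variance(proj_std):
    x = h.clone()
    for _ in range(2 * n_layer):                                                # two residual adds per block
        branch_in = (x - x.mean(-1, keepdim=True)) / x.std(-1, keepdim=True)    # stand-in for LayerNorm
        W = torch.randn(d, d) * proj_std                                        # stand-in for the c_proj weight
        x = x + branch_in @ W
    return x.var().item()

print(final_variance(0.02))                           # grows roughly linearly with the 2N adds
print(final_variance(0.02 / (2 * n_layer) ** 0.5))    # stays close to the input variance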

Weight Tying

nanoGPT ties the input embedding matrix to the output projection (model.py:138):

Python
self.transformer.wte.weight = self.lm_head.weight  # https://paperswithcode.com/method/weight-tying

The logit for token i at position t is:

\text{logit}_i = h_t^{(N)} \cdot W_e[i]^T

Benefits:

  1. Parameter efficiency: Saves V \times d parameters (~38.6M for GPT-2)
  2. Implicit regularization: Forces embeddings useful for both input and output
  3. Semantic consistency: Related tokens have similar input and output representations

Complete Forward Pass

model.py:170-193:

Python
def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
    pos = torch.arange(0, t, dtype=torch.long, device=device)

    # Forward the GPT model
    tok_emb = self.transformer.wte(idx)   # (b, t, n_embd)
    pos_emb = self.transformer.wpe(pos)   # (t, n_embd)
    x = self.transformer.drop(tok_emb + pos_emb)
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)

    if targets is not None:
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
        # Inference optimization: only compute logits for last position
        logits = self.lm_head(x[:, [-1], :])
        loss = None

    return logits, loss

The inference optimization x[:, [-1], :] saves O(T \times V) computation per generation step.

Loading Pretrained GPT-2 Weights

nanoGPT can load OpenAI's pretrained GPT-2 weights via HuggingFace (model.py:206-261):

Python
@classmethod
def from_pretrained(cls, model_type, override_args=None):
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    from transformers import GPT2LMHeadModel

    # Create nanoGPT model with matching config
    config = GPTConfig(**config_args)
    model = GPT(config)
    sd = model.state_dict()

    # Load HuggingFace model
    model_hf = GPT2LMHeadModel.from_pretrained(model_type)
    sd_hf = model_hf.state_dict()

    # Copy weights, transposing Conv1D weights
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            sd[k].copy_(sd_hf[k].t())  # Transpose Conv1D -> Linear
        else:
            sd[k].copy_(sd_hf[k])

    return model

The transposition is needed because OpenAI's checkpoint uses a Conv1D module (weight stored as [in_features, out_features]) while nanoGPT uses nn.Linear (weight stored as [out_features, in_features]).

The Complete Training Loop

The training loop (train.py:249-336) implements a production-ready optimization procedure with gradient accumulation, checkpointing, and distributed training support.

Training Loop Structure

train.py:249-336:

Python
# training loop
X, Y = get_batch('train')  # fetch the very first batch
t0 = time.time()
local_iter_num = 0
raw_model = model.module if ddp else model  # unwrap DDP container if needed
running_mfu = -1.0

while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # forward backward update, with optional gradient accumulation
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps  # scale the loss
        X, Y = get_batch('train')  # async prefetch next batch
        scaler.scale(loss).backward()

    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

    # step the optimizer and scaler
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    iter_num += 1
    if iter_num > max_iters:
        break

Gradient Accumulation

Gradient accumulation simulates larger batch sizes when GPU memory is limited. The key equation:

\text{Effective batch size} = \text{micro\_batch} \times \text{grad\_accum\_steps} \times \text{world\_size}

Example (config/train_gpt2.py:9-13):

Python
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8

The loss is scaled by 1/\text{grad\_accum\_steps} before each backward pass:

\mathcal{L}_{\text{scaled}} = \frac{\mathcal{L}}{G}

This ensures gradients sum correctly across accumulation steps.
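
A sketch of why the 1/G scaling works: accumulating G scaled micro-batch losses yields the same gradient as one large batch. The toy model and data here are assumed purely for illustration:

Python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
X, Y = torch.randn(32, 8), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# One full batch
model.zero_grad()
loss_fn(model(X), Y).backward()
full_grad = model.weight.grad.clone()

# Four micro-batches, each loss scaled by 1/G, gradients accumulated
G = 4
model.zero_grad()
for xb, yb in zip(X.chunk(G), Y.chunk(G)):
    (loss_fn(model(xb), yb) / G).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True (up to float error)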

Evaluation During Training

train.py:215-228:

Python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Default hyperparameters (train.py:36-41):

  • eval_interval = 2000 steps between evaluations
  • eval_iters = 200 batches per evaluation
  • eval_only = False to continue training after first eval

Checkpointing

The checkpoint contains everything needed to resume training (train.py:277-286):

Python
checkpoint = {
    'model': raw_model.state_dict(),        # Model weights
    'optimizer': optimizer.state_dict(),    # Optimizer state (momentum, variance)
    'model_args': model_args,               # Architecture config
    'iter_num': iter_num,                   # Training progress
    'best_val_loss': best_val_loss,         # Best validation loss
    'config': config,                       # All hyperparameters
}
torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

Resuming from checkpoint (train.py:158-180):

Python
if init_from == 'resume':
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    # Force architecture params to match
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # Load model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    model.load_state_dict(checkpoint['model'])
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']

Tokens Per Iteration Calculation

train.py:101-102:

Python
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

For the default GPT-2 training config (gradient_accumulation_steps is first divided by the world size, 40/8 = 5): \text{tokens/iter} = 5 \times 8 \times 12 \times 1024 = 491{,}520

With 600K iterations, the total is \approx 2.9 \times 10^{11} tokens (about 295 billion).

Data Loading

train.py:114-131:

Python
data_dir = os.path.join('data', dataset)
def get_batch(split):
    # Recreate np.memmap every batch to avoid memory leak
    if split == 'train':
        data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
    else:
        data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

Key optimizations:

  • Memory mapping: Only loads accessed pages, enabling datasets larger than RAM
  • Pinned memory: Page-locked host memory for faster CPU→GPU transfer
  • Non-blocking transfer: Overlaps data transfer with computation
  • uint16 storage: Token IDs < 65536 fit in 2 bytes, halving storage

Why Recreate Memmap Each Batch?

The comment references a memory leak in numpy's memmap. Each memmap object holds references that can accumulate. Recreating ensures clean state.

Learning Rate Schedule

Cosine Decay with Linear Warmup

\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t + 1}{T_{\text{warmup}} + 1} & \text{if } t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi (t - T_{\text{warmup}})}{T_{\text{decay}} - T_{\text{warmup}}}\right)\right) & \text{otherwise} \end{cases}

Implementation (train.py:231-242):

Python
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

Default hyperparameters (train.py:58-68):

  • \eta_{\max} = 6 \times 10^{-4}
  • \eta_{\min} = 6 \times 10^{-5} (a 10× ratio)
  • T_{\text{warmup}} = 2000 steps
  • T_{\text{decay}} = 600{,}000 steps

AdamW Optimizer

Mathematical Formulation

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_t = \theta_{t-1} - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right)

Components:

  • m_t: exponential moving average of gradients (momentum)
  • v_t: exponential moving average of squared gradients (adaptive learning rate)
  • \hat{m}_t, \hat{v}_t: bias corrections for the zero initialization
  • \lambda \theta_{t-1}: decoupled weight decay (the "W" in AdamW)

AdamW vs Adam

Original Adam applies weight decay through the gradient (L2 regularization):

g_t' = g_t + \lambda \theta_{t-1}

This couples weight decay with the adaptive learning rate. AdamW decouples them, applying weight decay directly to parameters. This improves generalization.
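
A minimal sketch of a single decoupled update written directly from the equations above, using nanoGPT's default hyperparameters; this is for illustration only, since the repository simply instantiates torch.optim.AdamW:

Python
import torch

def adamw_step(theta, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    # exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias correction for the zero initialization
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adam update plus *decoupled* weight decay applied directly to the parameters
    theta = theta - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta)
    return theta, m, v

theta = torch.randn(4)
m, v = torch.zeros(4), torch.zeros(4)
theta, m, v = adamw_step(theta, torch.randn(4), m, v, t=1)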

Selective Weight Decay

model.py:263-287:

Python
def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
    param_dict = {pn: p for pn, p in self.named_parameters()}
    param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
    # 2D params (weight matrices) get decay, 1D params (biases, norms) don't
    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0}
    ]
    # Use fused AdamW if available
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device_type == 'cuda'
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
    return optimizer

Default hyperparameters (train.py:57-62):

  • \beta_1 = 0.9, \beta_2 = 0.95
  • \epsilon = 10^{-8}
  • \lambda = 0.1

Gradient Clipping

Global Norm Clipping

g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|_2}\right)

where \|g\|_2 = \sqrt{\sum_{\text{all params}} \sum_i g_i^2} is the global gradient norm.

Implementation (train.py:307-309):

Python
if grad_clip != 0.0:
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

Default \tau = 1.0 (train.py:63).

Mixed Precision Training

BF16 vs FP16

| Format | Exponent | Mantissa | Range | Precision |
| --- | --- | --- | --- | --- |
| FP32 | 8 bits | 23 bits | ±3.4×10³⁸ | High |
| BF16 | 8 bits | 7 bits | ±3.4×10³⁸ | Low |
| FP16 | 5 bits | 10 bits | ±65,504 | Medium |

BF16 has the same exponent range as FP32, eliminating gradient underflow issues. FP16 requires dynamic loss scaling via GradScaler.

Implementation (train.py:111-112, train.py:196, train.py:299-312):

Python
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# Training step
with ctx:
    logits, loss = model(X, Y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed Training (DDP)

Setup

train.py:82-100:

Python
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0
    seed_offset = ddp_rank  # Different data per rank
    gradient_accumulation_steps //= ddp_world_size

Gradient Synchronization Optimization

Only sync gradients on the final accumulation step (train.py:292-298):

Python
for micro_step in range(gradient_accumulation_steps):
    if ddp:
        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
    with ctx:
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps
    scaler.scale(loss).backward()

This reduces all-reduce communication by the accumulation factor.

Model Flops Utilization (MFU)

FLOPS Estimation

For transformers, FLOPS per forward-backward pass:

\text{FLOPS} = (6N + 12LHQT) \cdot T \cdot B

where:

  • N = parameters (excluding position embeddings)
  • L = layers, H = heads, Q = head dimension, T = sequence length
  • B = batch size
  • Factor 6: 2× forward (multiply-add), 2× backward for activations, 2× backward for weights

Implementation (model.py:289-303):

Python
def estimate_mfu(self, fwdbwd_per_iter, dt):
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
    flops_per_token = 6*N + 12*L*H*Q*T
    flops_per_fwdbwd = flops_per_token * T
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter * (1.0/dt)
    flops_promised = 312e12  # A100 bfloat16 peak: 312 TFLOPS
    mfu = flops_achieved / flops_promised
    return mfu

Typical MFU values: 40-60% for well-tuned implementations.
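
Plugging GPT-2 Small numbers into the same arithmetic by hand, a sketch (the per-iteration time dt and the per-GPU batch figures are assumed for illustration, not measured):

Python
# Worked MFU example with GPT-2 Small dimensions
N = 124e6                                      # non-embedding parameter count
L, H, Q, T = 12, 12, 64, 1024
flops_per_token = 6 * N + 12 * L * H * Q * T   # ~8.6e8
flops_per_fwdbwd = flops_per_token * T         # ~8.8e11
fwdbwd_per_iter = 12 * 5                       # micro-batch 12, 5 accumulation steps per GPU (assumed)
dt = 0.35                                      # assumed seconds per iteration on one A100

mfu = flops_per_fwdbwd * fwdbwd_per_iter / dt / 312e12
print(f"MFU ~ {mfu:.0%}")                      # roughly 48% under these assumptions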

Text Generation

Temperature Scaling

P(x) = \frac{\exp(\text{logit}_x / \tau)}{\sum_{x'} \exp(\text{logit}_{x'} / \tau)}

| Temperature | Effect |
| --- | --- |
| \tau \to 0 | Greedy (argmax) |
| \tau = 0.7 | Focused but varied |
| \tau = 1.0 | Standard softmax |
| \tau > 1.0 | More random |

Top-k Sampling

Restrict sampling to the k most probable tokens:

P'(x) = \begin{cases} \frac{P(x)}{\sum_{x' \in \text{top-}k} P(x')} & x \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}

Implementation (model.py:305-330):

Python
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
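
A hedged usage sketch combining generate with from_pretrained; the prompt and sampling settings are arbitrary, and tokenization uses tiktoken's GPT-2 encoding, as sample.py does:

Python
import tiktoken
import torch
from model import GPT  # nanoGPT's model.py

enc = tiktoken.get_encoding("gpt2")
model = GPT.from_pretrained('gpt2')   # 124M model via HuggingFace weights
model.eval()

idx = torch.tensor([enc.encode("The meaning of life is")], dtype=torch.long)
out = model.generate(idx, max_new_tokens=50, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))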

Finetuning from Pretrained Models

nanoGPT supports finetuning pretrained GPT-2 models on custom datasets.

Finetuning Configuration

config/finetune_shakespeare.py:

Python
out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False

dataset = 'shakespeare'
init_from = 'gpt2-xl'  # Start from largest GPT-2

# Only save if validation improves
always_save_checkpoint = False

# Small batch size for finetuning
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# Constant LR, much smaller than pretraining
learning_rate = 3e-5
decay_lr = False

Key Finetuning Differences

| Setting | Pretraining | Finetuning |
| --- | --- | --- |
| init_from | 'scratch' | 'gpt2-xl' |
| learning_rate | 6e-4 | 3e-5 (20× smaller) |
| decay_lr | True (cosine) | False (constant) |
| max_iters | 600,000 | 20 |
| dropout | 0.0 | 0.1+ |

Loading Pretrained Weights

train.py:181-188:

Python
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)

Block Size Cropping

If your dataset uses shorter sequences, crop the model's position embeddings:

train.py:189-192:

Python
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size

model.py:262-270:

Python
def crop_block_size(self, block_size):
    assert block_size <= self.config.block_size
    self.config.block_size = block_size
    self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
    for block in self.transformer.h:
        if hasattr(block.attn, 'bias'):
            block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]

Configuration System

nanoGPT uses exec() to load configuration files, allowing any Python expression:

configurator.py:

Python
import sys
for arg in sys.argv[1:]:
    if '=' not in arg:
        # Assume it's a config file
        with open(arg) as f:
            exec(f.read())
    else:
        # Override individual setting
        exec(arg)

Usage:

Bash
# Load config file
python train.py config/train_gpt2.py

# Override settings
python train.py config/train_gpt2.py --batch_size=8 --learning_rate=1e-4

# Multiple config files (later overrides earlier)
python train.py config/train_gpt2.py config/my_overrides.py
