nanoGPT: Andrej Karpathy's Minimal GPT Training Framework
A comprehensive, equation-complete analysis of nanoGPT—Andrej Karpathy's influential minimal GPT implementation. Deep dive into the ~300-line model definition (model.py), training loop (train.py), Flash Attention, weight initialization, and the mathematical foundations behind every component.
Repository Overview
nanoGPT is Andrej Karpathy's minimal GPT-2 implementation, demonstrating that training a language model requires only ~600 lines of Python. The repository structure:
| File | Lines | Purpose |
|---|---|---|
| model.py | 331 | Complete GPT architecture: LayerNorm, CausalSelfAttention, MLP, Block, GPT |
| train.py | 337 | Training loop: DDP, mixed precision, gradient accumulation, checkpointing |
| sample.py | ~50 | Text generation with temperature and top-k |
| configurator.py | ~20 | CLI argument parsing via exec() |
| config/*.py | varies | Hyperparameter presets for different runs |
| data/*/prepare.py | varies | Dataset preprocessing scripts |
GitHub: github.com/karpathy/nanoGPT
The Language Modeling Objective
Cross-Entropy Loss
Given a sequence of tokens $x_1, x_2, \ldots, x_T$, the model learns to predict each token from its prefix. The training objective minimizes the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

where $p_\theta(x_t \mid x_{<t})$ is the model's predicted probability distribution over the vocabulary. This is equivalent to cross-entropy between the true one-hot distribution and the predicted distribution.
Implementation (model.py:187):
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
The ignore_index=-1 allows masking certain positions (e.g., padding tokens).
Perplexity
Loss is often reported as perplexity, the exponential of cross-entropy:

$$\text{PPL} = \exp(\mathcal{L})$$
Intuition: Perplexity represents the "effective vocabulary size" the model is choosing from. A perplexity of 20 means the model is, on average, as uncertain as uniformly choosing among 20 tokens.
| Model | Perplexity (WikiText-103) | Loss (nats) |
|---|---|---|
| Random (50K vocab) | 50,000 | 10.8 |
| n-gram LM | ~150 | 5.0 |
| GPT-2 124M | ~29 | 3.4 |
| GPT-2 1.5B | ~18 | 2.9 |
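As a quick check of the table above, converting a loss in nats to perplexity is a one-liner (a standalone sketch, not part of nanoGPT):

import math

def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the per-token cross-entropy (in nats)."""
    return math.exp(loss_nats)

print(perplexity(3.4))   # ~30, roughly the GPT-2 124M row above
print(perplexity(10.8))  # ~49,000, near-uniform over a ~50K vocabulary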
Bits Per Byte (BPB)
For tokenizer-independent comparison, convert loss to bits per byte:

$$\text{BPB} = \frac{\mathcal{L}}{\ln 2} \cdot \frac{N_{\text{tokens}}}{N_{\text{bytes}}}$$

Typical compression ratios are 3.5-4.5 bytes per token for English text.
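A small sketch of the conversion, assuming you know your tokenizer's average bytes-per-token ratio (the function name and numbers here are illustrative, not from the repo):

import math

def bits_per_byte(loss_nats: float, bytes_per_token: float) -> float:
    """Convert per-token cross-entropy (nats) to bits per byte."""
    bits_per_token = loss_nats / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token    # spread over the bytes each token covers

# e.g. a 3.4-nat loss with ~4 bytes per GPT-2 BPE token
print(round(bits_per_byte(3.4, 4.0), 3))       # ~1.226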
Model Configuration
The architecture is fully specified by a dataclass (model.py:108-116):
@dataclass
class GPTConfig:
block_size: int = 1024 # Maximum sequence length (context window)
vocab_size: int = 50304 # Vocabulary size (50257 padded for efficiency)
n_layer: int = 12 # Number of transformer layers
n_head: int = 12 # Number of attention heads
n_embd: int = 768 # Embedding/hidden dimension
dropout: float = 0.0 # Dropout probability (0 for pretraining)
bias: bool = True # Use bias in Linear/LayerNorm layers
Vocabulary Size Padding
The padding from 50257 to 50304 (nearest multiple of 64) is a GPU optimization. CUDA matrix multiplication is most efficient when dimensions align with warp sizes and memory access patterns. The extra 47 unused tokens add negligible parameters but provide measurable speedup (typically 3-5%).
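The padding itself is just rounding up to the next multiple of 64 (a one-line sketch, not a quote from the repo):

vocab_size = 50257                       # actual GPT-2 BPE vocabulary
padded = ((vocab_size + 63) // 64) * 64  # round up to the next multiple of 64
print(padded)                            # 50304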
GPT-2 Model Family
| Model | Layers | Heads | Dim | Parameters | Head Dim |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 12 | 768 | 124M | 64 |
| GPT-2 Medium | 24 | 16 | 1024 | 350M | 64 |
| GPT-2 Large | 36 | 20 | 1280 | 774M | 64 |
| GPT-2 XL | 48 | 25 | 1600 | 1558M | 64 |
Implementation (model.py:216-220):
config_args = {
'gpt2': dict(n_layer=12, n_head=12, n_embd=768), # 124M params
'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
'gpt2-large': dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
'gpt2-xl': dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
}
Parameter Count Formula
The total parameter count can be estimated as:

$$N \approx 12 \, L \, d^2 + (V + T_{\max}) \, d$$

where $L$ = layers, $d$ = model dimension, $V$ = vocabulary size, and $T_{\max}$ = maximum sequence length.
Detailed breakdown for GPT-2 Small (768d, 12L):
| Component | Formula | Count |
|---|---|---|
| Token embeddings | $V \times d$ | 50,304 × 768 = 38.6M |
| Position embeddings | $T_{\max} \times d$ | 1,024 × 768 = 0.8M |
| Attention Q,K,V | $3 d^2 L$ | 3 × 768² × 12 = 21.2M |
| Attention output | $d^2 L$ | 768² × 12 = 7.1M |
| MLP fc | $4 d^2 L$ | 768 × 3072 × 12 = 28.3M |
| MLP proj | $4 d^2 L$ | 3072 × 768 × 12 = 28.3M |
| LayerNorm (2 per block) | $4 d L$ | 4 × 768 × 12 = 37K |
| Final LayerNorm | $2 d$ | 2 × 768 = 1.5K |
| Total | | ~124M |
Note: lm_head shares weights with token embeddings (weight tying), so it's not counted separately.
Implementation (model.py:150-160):
def get_num_params(self, non_embedding=True):
n_params = sum(p.numel() for p in self.parameters())
if non_embedding:
n_params -= self.transformer.wpe.weight.numel()
return n_params
Position embeddings are subtracted because they don't contribute to FLOPS (just lookup), but token embeddings are kept because they're tied with lm_head.
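The breakdown can be verified with a few lines of arithmetic (a standalone sketch using the GPT-2 Small dimensions, not code from model.py; bias terms are ignored, as in the table above):

V, T, d, L = 50304, 1024, 768, 12           # vocab, block_size, n_embd, n_layer

tok_emb = V * d                              # token embeddings (tied with lm_head)
pos_emb = T * d                              # position embeddings
attn    = (3 * d * d + d * d) * L            # QKV projection + output projection
mlp     = (d * 4 * d + 4 * d * d) * L        # c_fc + c_proj
ln      = (2 * 2 * d) * L + 2 * d            # two LayerNorms per block + final ln_f

total = tok_emb + pos_emb + attn + mlp + ln
print(f"{total/1e6:.1f}M total, {(total - pos_emb)/1e6:.1f}M excluding position embeddings")
# ~124.4M total; ~123.6M after subtracting position embeddings (cf. get_num_params)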
Token and Position Embeddings
Token Embeddings
Each input token $x_t \in \{1, \ldots, V\}$ is mapped to a dense vector:

$$e_t = W_E[x_t] \in \mathbb{R}^{d}$$

where $W_E \in \mathbb{R}^{V \times d}$ is the embedding matrix. This is a simple lookup (indexing), not matrix multiplication.
Learned Positional Embeddings
GPT-2 uses learned absolute position embeddings:

$$p_t = W_P[t] \in \mathbb{R}^{d}$$

where $W_P \in \mathbb{R}^{T_{\max} \times d}$ contains a learned vector for each position up to maximum sequence length $T_{\max}$ (the block_size).
Combined Input Representation
The input to the transformer is the sum of token and position embeddings:

$$h_t^{(0)} = \text{Dropout}(e_t + p_t)$$
Implementation (model.py:126-132, model.py:174-179):
self.transformer = nn.ModuleDict(dict(
wte = nn.Embedding(config.vocab_size, config.n_embd),
wpe = nn.Embedding(config.block_size, config.n_embd),
drop = nn.Dropout(config.dropout),
...
))
# Forward pass:
tok_emb = self.transformer.wte(idx) # (b, t, n_embd)
pos_emb = self.transformer.wpe(pos) # (t, n_embd), broadcasts over batch
x = self.transformer.drop(tok_emb + pos_emb)
Why Position Encoding is Necessary
Attention is permutation-equivariant without position encoding. Given the input sequence $(x_1, x_2, \ldots, x_T)$, the attention mechanism treats it the same as any reordering $(x_{\pi(1)}, \ldots, x_{\pi(T)})$. Position embeddings break this symmetry, allowing the model to distinguish "the cat sat" from "sat cat the".
Mathematical argument: let $\pi$ be a permutation and $X_\pi$ the row-permuted input. Without position encoding, $Q$, $K$, and $V$ are permuted identically, so

$$\text{Attention}(X_\pi) = \text{softmax}\!\left(\frac{Q_\pi K_\pi^\top}{\sqrt{d_k}}\right) V_\pi = \big(\text{Attention}(X)\big)_\pi$$

The output is just a permutation of the original output; the model cannot learn word order!
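A tiny PyTorch experiment (illustrative only, not from the repo) makes this concrete: with no position embeddings and no causal mask, permuting the input tokens simply permutes the attention output.

import torch

torch.manual_seed(0)
T, d = 5, 16
x = torch.randn(T, d)                                # a "sequence" of 5 token vectors
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # no causal mask, no positions
    return att @ v

perm = torch.randperm(T)
out, out_perm = attention(x), attention(x[perm])
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True: output is just permuted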
Layer Normalization
Mathematical Definition
Layer normalization computes statistics over the feature dimension for each position independently:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2, \qquad \text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:
- $\gamma \in \mathbb{R}^d$ is the learned scale parameter
- $\beta \in \mathbb{R}^d$ is the learned shift parameter
- $\epsilon$ (here $10^{-5}$) prevents division by zero
- $\odot$ denotes element-wise multiplication
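The definition maps directly onto a few tensor operations; this sketch (eps = 1e-5, matching nanoGPT's module) checks a manual computation against F.layer_norm:

import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 768)                         # (batch, time, features)
gamma, beta, eps = torch.ones(768), torch.zeros(768), 1e-5

mu = x.mean(dim=-1, keepdim=True)                  # per-position mean over features
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in layer norm
manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

builtin = F.layer_norm(x, (768,), gamma, beta, eps)
print(torch.allclose(manual, builtin, atol=1e-5))  # True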
Why LayerNorm, Not BatchNorm?
| Property | BatchNorm | LayerNorm |
|---|---|---|
| Statistics computed over | Batch dimension | Feature dimension |
| Batch size 1 at inference | Breaks | Works |
| Variable sequence lengths | Problematic | No issue |
| Running statistics needed | Yes | No |
LayerNorm is essential for autoregressive generation where batch size is typically 1.
Custom Implementation
PyTorch's built-in LayerNorm doesn't support bias=False independently from scale. nanoGPT implements a custom version (model.py:18-27):
class LayerNorm(nn.Module):
""" LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """
def __init__(self, ndim, bias):
super().__init__()
self.weight = nn.Parameter(torch.ones(ndim))
self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
def forward(self, input):
return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
Setting bias=False throughout provides a small efficiency gain—empirically, models perform equally well without biases.
Multi-Head Self-Attention
Query, Key, Value Projections
For input hidden states $X \in \mathbb{R}^{T \times d}$, we project to queries, keys, and values:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$.
Intuition:
- Query ($Q$): "What am I looking for?"
- Key ($K$): "What do I contain?"
- Value ($V$): "What information do I provide if selected?"
nanoGPT computes all three in a single fused projection (model.py:35, model.py:56):
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
# Forward:
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
Scaled Dot-Product Attention
The attention weights are computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k = d / n_{\text{head}}$ is the dimension per head.
Why Scale by $\sqrt{d_k}$?
Consider query and key vectors with $d_k$ independent components, each with zero mean and unit variance. The dot product

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

has variance $d_k$ (a sum of $d_k$ independent products, each with variance 1). The standard deviation is $\sqrt{d_k}$.
Without scaling, as $d_k$ grows, dot products become large, pushing the softmax into saturation.
In saturation, gradients vanish (all outputs ≈ 0 except one ≈ 1). Scaling by $1/\sqrt{d_k}$ normalizes the variance to 1, keeping softmax in a well-behaved regime.
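A quick numerical check of the variance argument (standalone sketch, not from the repo):

import torch

for d_k in (16, 64, 256):
    q = torch.randn(100_000, d_k)               # unit-variance query components
    k = torch.randn(100_000, d_k)               # unit-variance key components
    dots = (q * k).sum(dim=-1)
    print(d_k, round(dots.var().item(), 1), round((dots / d_k ** 0.5).var().item(), 2))
    # raw dot-product variance ~= d_k; after scaling by 1/sqrt(d_k) it is ~= 1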
Softmax Numerical Stability
The naive softmax can overflow for large logits:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The stable version subtracts the maximum:

$$\text{softmax}(z)_i = \frac{e^{z_i - \max_j z_j}}{\sum_j e^{z_j - \max_j z_j}}$$

This is mathematically equivalent (numerator and denominator both scale by $e^{-\max_j z_j}$), but all exponents are now $\leq 0$, preventing overflow. PyTorch's F.softmax implements this automatically.
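For reference, the stable form written out by hand (PyTorch already does this internally; the snippet is only illustrative):

import torch

def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    z = z - z.max(dim=dim, keepdim=True).values   # shift so the largest logit is 0
    e = torch.exp(z)
    return e / e.sum(dim=dim, keepdim=True)

logits = torch.tensor([1000.0, 1001.0, 1002.0])   # naive exp() would overflow to inf
print(stable_softmax(logits))                      # tensor([0.0900, 0.2447, 0.6652])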
Causal Masking
For autoregressive language modeling, position $i$ can only attend to positions $j \leq i$. This is enforced by masking before the softmax:

$$\text{att}_{ij} = \begin{cases} \dfrac{q_i \cdot k_j}{\sqrt{d_k}} & j \leq i \\ -\infty & j > i \end{cases}$$

Since $e^{-\infty} = 0$, masked positions contribute zero weight.
Implementation (model.py:48-50, model.py:67-68):
# Register causal mask buffer
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size))
# Apply mask
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
Multi-Head Attention
Rather than a single attention function, transformers use multiple "heads" operating on different subspaces:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O, \qquad \text{head}_i = \text{Attention}(X W_Q^{(i)}, X W_K^{(i)}, X W_V^{(i)})$$

where each head operates on a $d_k = d / h$ dimensional slice.
With 12 heads and 768-dimensional embeddings, each head works with 64 dimensions. Different heads can specialize:
- Syntactic relationships (subject-verb agreement)
- Semantic dependencies (coreference)
- Positional patterns (nearby tokens)
Implementation (model.py:57-59):
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
Flash Attention
Standard attention materializes the full $T \times T$ attention matrix, requiring $O(T^2)$ memory. Flash Attention (PyTorch 2.0+) computes attention in tiles that fit in GPU SRAM, reducing memory to $O(T)$ and providing a 2-4× speedup.
Implementation (model.py:45, model.py:62-64):
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
# Forward:
if self.flash:
y = torch.nn.functional.scaled_dot_product_attention(
q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
Memory comparison (12 heads, seq_len=1024):
- Standard: $12 \times 1024 \times 1024 \times 4$ bytes ≈ 50 MB of fp32 attention scores per layer
- Flash: ≈ 50 KB per layer (a ~1000× reduction)
Complete Attention Module
The full implementation (model.py:29-76):
class CausalSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
assert config.n_embd % config.n_head == 0
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
self.attn_dropout = nn.Dropout(config.dropout)
self.resid_dropout = nn.Dropout(config.dropout)
self.n_head = config.n_head
self.n_embd = config.n_embd
self.dropout = config.dropout
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
if not self.flash:
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size))
def forward(self, x):
B, T, C = x.size()
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
if self.flash:
y = torch.nn.functional.scaled_dot_product_attention(
q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
else:
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)
y = att @ v
y = y.transpose(1, 2).contiguous().view(B, T, C)
y = self.resid_dropout(self.c_proj(y))
return y
Feedforward Network
Each transformer block contains a position-wise feedforward network (MLP) that processes each position independently.
Mathematical Definition
$$\text{MLP}(x) = W_2 \, \text{GELU}(W_1 x + b_1) + b_2$$

where:
- $W_1 \in \mathbb{R}^{4d \times d}$ expands to 4× the dimension
- $W_2 \in \mathbb{R}^{d \times 4d}$ projects back to $d$
GELU Activation
The Gaussian Error Linear Unit:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the CDF of the standard normal distribution.
Intuition: GELU is a smooth, probabilistic ReLU. The input $x$ is multiplied by the probability that a standard normal random variable is less than $x$. Large positive values pass through; large negative values are zeroed; the transition is smooth.
Fast approximation:

$$\text{GELU}(x) \approx 0.5\, x \left(1 + \tanh\!\left[\sqrt{2/\pi}\,\left(x + 0.044715\, x^3\right)\right]\right)$$
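The exact and tanh forms agree closely; a quick comparison (illustrative, using PyTorch's built-in approximate flag):

import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=101)
exact = F.gelu(x)                            # x * Phi(x), computed via erf
approx = F.gelu(x, approximate='tanh')       # the tanh approximation above
print((exact - approx).abs().max().item())   # small: well below 1e-2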
Comparison of Activations
| Activation | Formula | Smooth at 0 | Sparse | Dead Neurons |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | No | High | Yes |
| GELU | $x \cdot \Phi(x)$ | Yes | Low | No |
| SiLU/Swish | $x \cdot \sigma(x)$ | Yes | Low | No |
| ReLU² | $\max(0, x)^2$ | Yes | High | Yes |
Implementation
model.py:78-92:
class MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
self.gelu = nn.GELU()
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
self.dropout = nn.Dropout(config.dropout)
def forward(self, x):
x = self.c_fc(x)
x = self.gelu(x)
x = self.c_proj(x)
x = self.dropout(x)
return x
Transformer Block with Pre-Normalization
Pre-Norm vs Post-Norm
The original transformer (2017) used post-normalization:

$$x \leftarrow \text{LN}(x + \text{Sublayer}(x))$$

GPT-2 and nanoGPT use pre-normalization:

$$x \leftarrow x + \text{Sublayer}(\text{LN}(x))$$

Pre-norm provides better gradient flow, especially in deep networks. Normalizing the input to each sublayer ensures consistent activation scales regardless of depth.
Complete Block Equations

$$x \leftarrow x + \text{Attn}(\text{LN}_1(x))$$
$$x \leftarrow x + \text{MLP}(\text{LN}_2(x))$$
Residual Connections and Gradient Flow
The residual connection $y = x + F(x)$ allows gradients to flow directly through the network:

$$\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}$$

The identity ("1") term provides a gradient highway regardless of how the sublayer behaves. Without residuals, gradients must flow through every transformation, often vanishing in deep networks.
Implementation
model.py:94-106:
class Block(nn.Module):
def __init__(self, config):
super().__init__()
self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
self.attn = CausalSelfAttention(config)
self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
self.mlp = MLP(config)
def forward(self, x):
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
Weight Initialization
Standard Initialization
Proper initialization prevents exploding or vanishing signals. nanoGPT uses normal initialization with $\mu = 0$, $\sigma = 0.02$ (model.py:162-168):
def _init_weights(self, module):
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
Residual Projection Scaling
The output projections (c_proj) receive special scaled initialization (model.py:143-145):
for pn, p in self.named_parameters():
if pn.endswith('c_proj.weight'):
torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
Why $0.02/\sqrt{2\, n_{\text{layer}}}$?
Each block has two residual additions (attention and MLP). After $L$ blocks, the residual stream is a sum of $2L$ sublayer outputs:

$$x_L = x_0 + \sum_{i=1}^{2L} F_i(\cdot)$$

Without scaling, if each residual contribution has variance 1, total variance grows as $2L$. The $1/\sqrt{2L}$ scaling ensures:

$$\text{Var}\!\left(\sum_{i=1}^{2L} F_i\right) \approx 2L \cdot \frac{1}{2L} = 1$$

maintaining approximately unit variance regardless of depth.
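A toy simulation of the residual stream (illustrative only, not from model.py) shows the effect:

import torch

torch.manual_seed(0)
n_layer, d = 48, 768                              # GPT-2 XL depth
for scale in (1.0, 1.0 / (2 * n_layer) ** 0.5):
    x = torch.randn(d)                            # unit-variance input
    for _ in range(2 * n_layer):                  # two residual additions per block
        x = x + scale * torch.randn(d)            # stand-in for a sublayer output
    print(f"scale={scale:.3f}  final std={x.std().item():.2f}")
# unscaled: std grows to ~sqrt(1 + 2L) ~ 9.8; scaled: stays near sqrt(2) ~ 1.4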
Weight Tying
nanoGPT ties the input embedding matrix to the output projection (model.py:138):
self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying
The logit for token $v$ at position $t$ is:

$$\text{logit}_{t,v} = h_t \cdot W_E[v]$$
Benefits:
- Parameter efficiency: Saves $V \times d \approx 38.6$M parameters for GPT-2 Small
- Implicit regularization: Forces embeddings useful for both input and output
- Semantic consistency: Related tokens have similar input and output representations
Complete Forward Pass
model.py:170-193:
def forward(self, idx, targets=None):
device = idx.device
b, t = idx.size()
assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
pos = torch.arange(0, t, dtype=torch.long, device=device)
# Forward the GPT model
tok_emb = self.transformer.wte(idx) # (b, t, n_embd)
pos_emb = self.transformer.wpe(pos) # (t, n_embd)
x = self.transformer.drop(tok_emb + pos_emb)
for block in self.transformer.h:
x = block(x)
x = self.transformer.ln_f(x)
if targets is not None:
logits = self.lm_head(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
else:
# Inference optimization: only compute logits for last position
logits = self.lm_head(x[:, [-1], :])
loss = None
return logits, loss
The inference optimization x[:, [-1], :] computes logits only for the final position, skipping the (T, vocab_size) projection for all earlier positions at each generation step.
Loading Pretrained GPT-2 Weights
nanoGPT can load OpenAI's pretrained GPT-2 weights via HuggingFace (model.py:206-261):
@classmethod
def from_pretrained(cls, model_type, override_args=None):
assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
from transformers import GPT2LMHeadModel
# Create nanoGPT model with matching config
config = GPTConfig(**config_args)
model = GPT(config)
sd = model.state_dict()
# Load HuggingFace model
model_hf = GPT2LMHeadModel.from_pretrained(model_type)
sd_hf = model_hf.state_dict()
# Copy weights, transposing Conv1D weights
transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
for k in sd_keys_hf:
if any(k.endswith(w) for w in transposed):
sd[k].copy_(sd_hf[k].t()) # Transpose Conv1D -> Linear
else:
sd[k].copy_(sd_hf[k])
return model
The transposition is needed because OpenAI's implementation uses a Conv1D module (weight shape [in, out]) while nanoGPT uses nn.Linear (weight shape [out, in]).
The Complete Training Loop
The training loop (train.py:249-336) implements a production-ready optimization procedure with gradient accumulation, checkpointing, and distributed training support.
Training Loop Structure
train.py:249-336:
# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0
raw_model = model.module if ddp else model # unwrap DDP container if needed
running_mfu = -1.0
while True:
# determine and set the learning rate for this iteration
lr = get_lr(iter_num) if decay_lr else learning_rate
for param_group in optimizer.param_groups:
param_group['lr'] = lr
# evaluate the loss on train/val sets and write checkpoints
if iter_num % eval_interval == 0 and master_process:
losses = estimate_loss()
print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
if losses['val'] < best_val_loss or always_save_checkpoint:
best_val_loss = losses['val']
if iter_num > 0:
checkpoint = {
'model': raw_model.state_dict(),
'optimizer': optimizer.state_dict(),
'model_args': model_args,
'iter_num': iter_num,
'best_val_loss': best_val_loss,
'config': config,
}
torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
# forward backward update, with optional gradient accumulation
for micro_step in range(gradient_accumulation_steps):
if ddp:
model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
with ctx:
logits, loss = model(X, Y)
loss = loss / gradient_accumulation_steps # scale the loss
X, Y = get_batch('train') # async prefetch next batch
scaler.scale(loss).backward()
# clip the gradient
if grad_clip != 0.0:
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
# step the optimizer and scaler
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
iter_num += 1
if iter_num > max_iters:
break
Gradient Accumulation
Gradient accumulation simulates larger batch sizes when GPU memory is limited. The key relationship:

$$\text{effective batch size} = \text{batch\_size} \times \text{gradient\_accumulation\_steps} \times \text{num\_GPUs}$$
Example (config/train_gpt2.py:9-13):
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8
The loss is scaled by $1/\text{gradient\_accumulation\_steps}$ before each backward pass:

$$\nabla_\theta \left( \sum_{k=1}^{K} \frac{\mathcal{L}_k}{K} \right) = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k$$

This ensures the accumulated gradient equals the gradient of the mean loss over the full effective batch.
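A minimal sketch (plain PyTorch, independent of train.py) showing that accumulating K scaled micro-batch losses reproduces the gradient of one large batch:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 1)
big_x, big_y = torch.randn(8, 10), torch.randn(8, 1)

# One large batch
model.zero_grad()
F.mse_loss(model(big_x), big_y).backward()
big_grad = model.weight.grad.clone()

# The same data as K=4 micro-batches, each loss scaled by 1/K
model.zero_grad()
K = 4
for xb, yb in zip(big_x.chunk(K), big_y.chunk(K)):
    (F.mse_loss(model(xb), yb) / K).backward()    # gradients accumulate in .grad
print(torch.allclose(big_grad, model.weight.grad, atol=1e-6))  # True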
Evaluation During Training
train.py:215-228:
@torch.no_grad()
def estimate_loss():
out = {}
model.eval()
for split in ['train', 'val']:
losses = torch.zeros(eval_iters)
for k in range(eval_iters):
X, Y = get_batch(split)
with ctx:
logits, loss = model(X, Y)
losses[k] = loss.item()
out[split] = losses.mean()
model.train()
return out
Default hyperparameters (train.py:36-41):
- eval_interval = 2000: steps between evaluations
- eval_iters = 200: batches averaged per evaluation
- eval_only = False: continue training after the first evaluation
Checkpointing
The checkpoint contains everything needed to resume training (train.py:277-286):
checkpoint = {
'model': raw_model.state_dict(), # Model weights
'optimizer': optimizer.state_dict(), # Optimizer state (momentum, variance)
'model_args': model_args, # Architecture config
'iter_num': iter_num, # Training progress
'best_val_loss': best_val_loss, # Best validation loss
'config': config, # All hyperparameters
}
torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
Resuming from checkpoint (train.py:158-180):
if init_from == 'resume':
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)
# Force architecture params to match
for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
model_args[k] = checkpoint_model_args[k]
# Load model
gptconf = GPTConfig(**model_args)
model = GPT(gptconf)
model.load_state_dict(checkpoint['model'])
iter_num = checkpoint['iter_num']
best_val_loss = checkpoint['best_val_loss']
Tokens Per Iteration Calculation
train.py:101-102:
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")
For the default GPT-2 training config:

$$\text{tokens\_per\_iter} = 5 \times 8 \times 12 \times 1024 = 491{,}520$$

(per-GPU gradient accumulation of 5, 8 GPUs, batch_size 12, block_size 1024). With 600K iterations, total tokens $\approx 491{,}520 \times 600{,}000 \approx 2.9 \times 10^{11}$ (roughly 300 billion).
Data Loading
train.py:114-131:
data_dir = os.path.join('data', dataset)
def get_batch(split):
# Recreate np.memmap every batch to avoid memory leak
if split == 'train':
data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
else:
data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
if device_type == 'cuda':
x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
else:
x, y = x.to(device), y.to(device)
return x, y
Key optimizations:
- Memory mapping: Only loads accessed pages, enabling datasets larger than RAM
- Pinned memory: Page-locked host memory for faster CPU→GPU transfer
- Non-blocking transfer: Overlaps data transfer with computation
- uint16 storage: Token IDs < 65536 fit in 2 bytes, halving storage
Why Recreate Memmap Each Batch?
The comment references a memory leak in numpy's memmap. Each memmap object holds references that can accumulate. Recreating ensures clean state.
Learning Rate Schedule
Cosine Decay with Linear Warmup
Implementation (train.py:231-242):
def get_lr(it):
# 1) linear warmup for warmup_iters steps
if it < warmup_iters:
return learning_rate * (it + 1) / (warmup_iters + 1)
# 2) if it > lr_decay_iters, return min learning rate
if it > lr_decay_iters:
return min_lr
# 3) in between, use cosine decay down to min learning rate
decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
assert 0 <= decay_ratio <= 1
coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # ranges 0..1
return min_lr + coeff * (learning_rate - min_lr)
Default hyperparameters (train.py:58-68):
- learning_rate = 6e-4, min_lr = 6e-5 (a ratio of 10)
- warmup_iters = 2000 steps
- lr_decay_iters = 600,000 steps
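Plugging these defaults into get_lr (restated here so the snippet runs standalone) gives the expected warmup/decay shape:

import math

learning_rate, min_lr = 6e-4, 6e-5
warmup_iters, lr_decay_iters = 2000, 600_000

def get_lr(it):
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0 over the decay window
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 1000, 2000, 300_000, 600_000, 700_000):
    print(it, f"{get_lr(it):.2e}")
# ~3e-7 at step 0, ~6e-4 at the end of warmup, ~3.3e-4 halfway, 6e-5 after decay ends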
AdamW Optimizer
Mathematical Formulation

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$

Components:
- $m_t$: Exponential moving average of gradients (momentum)
- $v_t$: Exponential moving average of squared gradients (adaptive LR)
- $\hat{m}_t, \hat{v}_t$: Bias correction for the zero initialization
- $\lambda\, \theta_{t-1}$: Decoupled weight decay (the "W" in AdamW)
AdamW vs Adam
Original Adam applies weight decay through the gradient (L2 regularization):

$$g_t \leftarrow g_t + \lambda\, \theta_{t-1}$$

This couples weight decay with the adaptive learning rate, since the decay term gets divided by $\sqrt{\hat{v}_t}$. AdamW decouples them, applying weight decay directly to parameters. This improves generalization.
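A from-scratch sketch of a single update on one tensor, following the equations above (illustrative; training uses torch.optim.AdamW):

import torch

def adamw_step(p, grad, state, lr=6e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1):
    """One decoupled-weight-decay Adam update, modifying p in place."""
    state['t'] += 1
    t = state['t']
    state['m'] = betas[0] * state['m'] + (1 - betas[0]) * grad          # first moment
    state['v'] = betas[1] * state['v'] + (1 - betas[1]) * grad * grad   # second moment
    m_hat = state['m'] / (1 - betas[0] ** t)                            # bias correction
    v_hat = state['v'] / (1 - betas[1] ** t)
    p -= lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * p)         # decoupled decay
    return p

p = torch.randn(4)
state = {'t': 0, 'm': torch.zeros(4), 'v': torch.zeros(4)}
adamw_step(p, torch.randn(4), state)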
Selective Weight Decay
model.py:263-287:
def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
param_dict = {pn: p for pn, p in self.named_parameters()}
param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
# 2D params (weight matrices) get decay, 1D params (biases, norms) don't
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
{'params': decay_params, 'weight_decay': weight_decay},
{'params': nodecay_params, 'weight_decay': 0.0}
]
# Use fused AdamW if available
fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
use_fused = fused_available and device_type == 'cuda'
optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
return optimizer
Default hyperparameters (train.py:57-62):
- $\beta_1 = 0.9$, $\beta_2 = 0.95$
- weight_decay = 0.1
Gradient Clipping
Global Norm Clipping
The full gradient vector $g$ (concatenated over all parameters) is rescaled when its norm exceeds the clip value $c$:

$$g \leftarrow g \cdot \min\!\left(1, \frac{c}{\|g\|_2}\right)$$

where $\|g\|_2$ is the global norm.
Implementation (train.py:307-309):
if grad_clip != 0.0:
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
Default: grad_clip = 1.0 (train.py:63).
Mixed Precision Training
BF16 vs FP16
| Format | Exponent | Mantissa | Range | Precision |
|---|---|---|---|---|
| FP32 | 8 bits | 23 bits | ±3.4×10³⁸ | High |
| BF16 | 8 bits | 7 bits | ±3.4×10³⁸ | Low |
| FP16 | 5 bits | 10 bits | ±65,504 | Medium |
BF16 has the same exponent range as FP32, eliminating gradient underflow issues. FP16 requires dynamic loss scaling via GradScaler.
Implementation (train.py:111-112, train.py:196, train.py:299-312):
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
# Training step
with ctx:
logits, loss = model(X, Y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
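The range/precision trade-off in the table above can be inspected directly (illustrative, not from train.py):

import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)
# bfloat16 keeps float32's ~3.4e38 range but with coarse precision (eps ~ 7.8e-3);
# float16 tops out at 65504, which is why FP16 training needs the GradScaler.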
Distributed Training (DDP)
Setup
train.py:82-100:
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
init_process_group(backend=backend)
ddp_rank = int(os.environ['RANK'])
ddp_local_rank = int(os.environ['LOCAL_RANK'])
ddp_world_size = int(os.environ['WORLD_SIZE'])
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)
master_process = ddp_rank == 0
seed_offset = ddp_rank # Different data per rank
gradient_accumulation_steps //= ddp_world_size
Gradient Synchronization Optimization
Only sync gradients on the final accumulation step (train.py:292-298):
for micro_step in range(gradient_accumulation_steps):
if ddp:
model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
with ctx:
logits, loss = model(X, Y)
loss = loss / gradient_accumulation_steps
scaler.scale(loss).backward()
This reduces all-reduce communication by the accumulation factor.
Model Flops Utilization (MFU)
FLOPS Estimation
For transformers, the FLOPs of one forward-backward pass over a batch of $B$ sequences of length $T$ are approximately:

$$\text{FLOPs} \approx \left(6N + 12\,L\,H\,Q\,T\right) \cdot T \cdot B$$

where:
- $N$ = parameters (excluding position embeddings)
- $L$ = layers, $H$ = heads, $Q$ = head dimension, $T$ = sequence length
- $B$ = batch size
- Factor 6: 2× forward (multiply-add), 2× backward for activations, 2× backward for weights
- The $12LHQT$ term adds the attention-score work, which is not captured by the parameter count
Implementation (model.py:289-303):
def estimate_mfu(self, fwdbwd_per_iter, dt):
N = self.get_num_params()
cfg = self.config
L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
flops_per_token = 6*N + 12*L*H*Q*T
flops_per_fwdbwd = flops_per_token * T
flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
flops_achieved = flops_per_iter * (1.0/dt)
flops_promised = 312e12 # A100 bfloat16 peak: 312 TFLOPS
mfu = flops_achieved / flops_promised
return mfu
Typical MFU values: 40-60% for well-tuned implementations.
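Plugging GPT-2 Small numbers into the same formula (a standalone check; the iteration time here is hypothetical, and 312 TFLOPS is the A100 bf16 peak used in the code):

N = 124e6                        # parameter count excluding position embeddings (approx.)
L, H, Q, T = 12, 12, 64, 1024    # layers, heads, head dim, sequence length

flops_per_token = 6 * N + 12 * L * H * Q * T
flops_per_fwdbwd = flops_per_token * T            # one sequence of T tokens
fwdbwd_per_iter = 12 * 40                         # batch_size * gradient_accumulation_steps
flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter

dt = 3.0                                          # hypothetical seconds per iteration
mfu = (flops_per_iter / dt) / 312e12
print(f"{mfu:.1%}")                               # ~45% with these assumed numbers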
Text Generation
Temperature Scaling
The logits are divided by a temperature $T$ before the softmax:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

| Temperature | Effect |
|---|---|
| $T \to 0$ | Greedy (argmax) |
| $T < 1$ | Focused but varied |
| $T = 1$ | Standard softmax |
| $T > 1$ | More random |
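A quick illustration of how temperature reshapes a distribution (standalone sketch):

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
for T in (0.1, 0.7, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])
# low T concentrates probability mass on the argmax; high T flattens the distribution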
Top-k Sampling
Restrict sampling to the $k$ most probable tokens:

$$p_i' = \begin{cases} \dfrac{p_i}{\sum_{j \in \text{top-}k} p_j} & i \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}$$
Implementation (model.py:305-330):
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
for _ in range(max_new_tokens):
idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :] / temperature
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
return idx
Finetuning from Pretrained Models
nanoGPT supports finetuning pretrained GPT-2 models on custom datasets.
Finetuning Configuration
config/finetune_shakespeare.py:
out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False
dataset = 'shakespeare'
init_from = 'gpt2-xl' # Start from largest GPT-2
# Only save if validation improves
always_save_checkpoint = False
# Small batch size for finetuning
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20
# Constant LR, much smaller than pretraining
learning_rate = 3e-5
decay_lr = False
Key Finetuning Differences
| Setting | Pretraining | Finetuning |
|---|---|---|
| init_from | 'scratch' | 'gpt2-xl' |
| learning_rate | 6e-4 | 3e-5 (20× smaller) |
| decay_lr | True (cosine) | False (constant) |
| max_iters | 600,000 | 20 |
| dropout | 0.0 | 0.1+ |
Loading Pretrained Weights
train.py:181-188:
elif init_from.startswith('gpt2'):
print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
override_args = dict(dropout=dropout)
model = GPT.from_pretrained(init_from, override_args)
for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
model_args[k] = getattr(model.config, k)
Block Size Cropping
If your dataset uses shorter sequences, crop the model's position embeddings:
train.py:189-192:
if block_size < model.config.block_size:
model.crop_block_size(block_size)
model_args['block_size'] = block_size
model.py:262-270:
def crop_block_size(self, block_size):
assert block_size <= self.config.block_size
self.config.block_size = block_size
self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
for block in self.transformer.h:
if hasattr(block.attn, 'bias'):
block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]
Configuration System
nanoGPT uses exec() to load configuration files, allowing any Python expression:
configurator.py (simplified):
import sys
from ast import literal_eval

for arg in sys.argv[1:]:
    if '=' not in arg:
        # Assume it's the name of a config file: execute its contents
        with open(arg) as f:
            exec(f.read())
    else:
        # Assume it's a --key=value override of a global setting
        assert arg.startswith('--')
        key, val = arg[2:].split('=', 1)
        try:
            val = literal_eval(val)   # parse numbers/booleans; fall back to raw string
        except (SyntaxError, ValueError):
            pass
        globals()[key] = val
Usage:
# Load config file
python train.py config/train_gpt2.py
# Override settings
python train.py config/train_gpt2.py --batch_size=8 --learning_rate=1e-4
# Multiple config files (later overrides earlier)
python train.py config/train_gpt2.py config/my_overrides.py
Related Articles
nanochat: Andrej Karpathy's Full-Stack ChatGPT Clone
A comprehensive, equation-complete analysis of nanochat—the complete ChatGPT pipeline from tokenization through reinforcement learning. Deep dive into the modern GPT architecture (RoPE, RMSNorm, GQA, QK-norm, ReLU²), the Muon optimizer with Newton-Schulz orthogonalization, KV cache inference, and tool use.
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
Distributed Training: How to Train 70B+ Parameter Models
A comprehensive deep dive into distributed training—how to train models that don't fit on a single GPU. Understand data parallelism, tensor parallelism, pipeline parallelism, ZeRO optimization, and the engineering behind training frontier LLMs.
Text Generation & Decoding Strategies: A Complete Guide
A comprehensive guide to how LLMs actually generate text—from greedy decoding to beam search, temperature scaling, nucleus sampling, speculative decoding, and structured generation. Master the techniques that control LLM output quality, creativity, and speed.