RoPE: Rotary Position Embeddings Explained
A comprehensive mathematical deep dive into Rotary Position Embeddings (RoPE)—the position encoding method that powers Llama, Mistral, Qwen, and most modern LLMs. Complete derivations, proofs, implementation, and the mathematics of context extension.
Table of Contents
Overview
Rotary Position Embedding (RoPE), introduced by Su et al. in the 2021 paper "RoFormer: Enhanced Transformer with Rotary Position Embedding," has become the dominant position encoding method in modern large language models. Llama 1/2/3/4, Mistral, Mixtral, Qwen, Yi, DeepSeek, Gemma, Phi—virtually every major open-source LLM uses RoPE.
| Model | RoPE Base | Max Context | Notes |
|---|---|---|---|
| Llama 1 | 10,000 | 2,048 | Original implementation |
| Llama 2 | 10,000 | 4,096 | Extended context |
| Llama 3 | 500,000 | 8,192 | Higher base for extension |
| Llama 3.1 | 500,000 | 128,000 | Long-context RoPE frequency scaling |
| Mistral 7B | 10,000 | 8,192 | Sliding window attention |
| Qwen 2.5 | 1,000,000 | 32,768+ | Very high base |
| DeepSeek-V2 | 10,000 | 128,000 | With YaRN |
This post provides a complete mathematical treatment of RoPE: foundational concepts, full derivations, production implementation, and context extension methods.
Prerequisites: Linear algebra (matrices, dot products), complex numbers, basic calculus. See Positional Embeddings for background.
Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
Part I: Mathematical Foundations
The Position Encoding Problem
In self-attention, we compute scores between queries and keys:
$$\text{score}(m, n) = \frac{q_m^\top k_n}{\sqrt{d}}$$
where $q_m$ is the query at position $m$, $k_n$ is the key at position $n$, and $d$ is the head dimension.
The core attention operation is the inner product $q_m^\top k_n$. Without position information, this depends only on the content of the tokens at positions $m$ and $n$, not their positions.
Goal: Design a function $f(x, m)$ that incorporates position $m$ into vector $x$ such that:
$$\langle f(q, m),\, f(k, n) \rangle = g(q, k, m - n)$$
The inner product should depend on the relative position $m - n$, not the absolute positions $m$ and $n$ individually.
Complex Numbers and Rotation
RoPE's key insight: rotation naturally encodes relative position.
Euler's Formula
Derived from the Taylor series of $e^x$, $\cos x$, and $\sin x$:
$$e^{i\theta} = \cos\theta + i\sin\theta$$
Complex Multiplication as Rotation
A complex number in polar form:
$$z = x + iy = r e^{i\phi}$$
where $r = \sqrt{x^2 + y^2}$ (magnitude) and $\phi = \operatorname{atan2}(y, x)$ (phase).
Multiplying by $e^{i\theta}$ rotates $z$ by angle $\theta$:
$$z \cdot e^{i\theta} = r e^{i(\phi + \theta)}$$
Magnitude preserved, angle shifted—this is rotation.
The 2D Rotation Matrix
For a real 2D vector $v = [x, y]^\top$, rotation by angle $\theta$:
$$R(\theta)\, v = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$$
Key properties:
| Property | Formula | Implication |
|---|---|---|
| Orthogonality | $R(\theta)^\top R(\theta) = I$ | Inverse is the transpose |
| Composition | $R(\alpha)\,R(\beta) = R(\alpha + \beta)$ | Rotations add |
| Determinant | $\det R(\theta) = 1$ | Orientation preserved |
| Norm preservation | $\lVert R(\theta)\,v \rVert = \lVert v \rVert$ | Length unchanged |
Equivalence: Complex ↔ Matrix
Viewing $v = [x, y]^\top$ as the complex number $z = x + iy$ and multiplying by $e^{i\theta}$:
$$z \cdot e^{i\theta} = (x + iy)(\cos\theta + i\sin\theta) = (x\cos\theta - y\sin\theta) + i\,(x\sin\theta + y\cos\theta)$$
This equals the matrix rotation result. Complex multiplication = matrix rotation.
┌─────────────────────────────────────────────────────────────────────────┐
│ ROTATION: TWO EQUIVALENT VIEWS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ COMPLEX VIEW: MATRIX VIEW: │
│ ───────────── ──────────── │
│ │
│ z = x + iy v = [x, y]ᵀ │
│ │
│ Rotate by θ: Rotate by θ: │
│ │
│ z' = z · e^(iθ) v' = R(θ) · v │
│ = z · (cosθ + i·sinθ) │
│ R(θ) = [cosθ -sinθ] │
│ [sinθ cosθ] │
│ │
│ Result: Result: │
│ z' = (x·cosθ - y·sinθ) v' = [x·cosθ - y·sinθ] │
│ + i(x·sinθ + y·cosθ) [x·sinθ + y·cosθ] │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SAME RESULT: Real part = first component, Imaginary = second │
│ │
│ Implementation choice: │
│ • Complex: elegant, single multiplication │
│ • Matrix: explicit, works without complex support │
│ │
└─────────────────────────────────────────────────────────────────────────┘
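A quick numerical check of this equivalence (a standalone sketch, not taken from the RoFormer code):
import math
import torch
theta, x, y = 0.7, 1.5, -2.0
# Matrix view: v' = R(θ) · v
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
v_rot = R @ torch.tensor([x, y])
# Complex view: z' = z · e^(iθ)
z_rot = complex(x, y) * complex(math.cos(theta), math.sin(theta))
print(v_rot)                   # tensor([x·cosθ - y·sinθ, x·sinθ + y·cosθ])
print(z_rot.real, z_rot.imag)  # the same two numbers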
Part II: Deriving RoPE
The Inner Product Under Rotation
Consider two 2D vectors $q$ and $k$.
Rotate $q$ by angle $\alpha$ and $k$ by angle $\beta$:
$$q' = R(\alpha)\, q, \qquad k' = R(\beta)\, k$$
Their inner product:
$$\langle q', k' \rangle = (R(\alpha)\, q)^\top (R(\beta)\, k)$$
Using $(AB)^\top = B^\top A^\top$:
$$= q^\top R(\alpha)^\top R(\beta)\, k$$
Using $R(\alpha)^\top = R(-\alpha)$ (orthogonality):
$$= q^\top R(-\alpha)\, R(\beta)\, k$$
Using $R(-\alpha)\, R(\beta) = R(\beta - \alpha)$ (composition):
$$\langle q', k' \rangle = q^\top R(\beta - \alpha)\, k$$
Key result: The inner product depends only on $\beta - \alpha$, the difference of the rotation angles.
Applying to Position Encoding
Set the rotation angle proportional to position:
- Query at position $m$: rotate by $m\theta$
- Key at position $n$: rotate by $n\theta$
Then:
$$\langle R(m\theta)\, q,\; R(n\theta)\, k \rangle = q^\top R\big((n - m)\theta\big)\, k$$
The inner product depends only on the relative position $n - m$.
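A tiny numerical check (standalone sketch): rotating fixed $q$ and $k$ by position-proportional angles gives the same score whenever the offset $n - m$ is the same.
import torch
torch.manual_seed(0)
theta = 0.05
q, k = torch.randn(2), torch.randn(2)
def rotate(v, angle):
    c, s = torch.cos(torch.tensor(angle)), torch.sin(torch.tensor(angle))
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])]) @ v
# Same relative offset (n - m = 5) at two different absolute positions
score_a = rotate(q, 10 * theta) @ rotate(k, 15 * theta)
score_b = rotate(q, 100 * theta) @ rotate(k, 105 * theta)
print(score_a.item(), score_b.item())  # equal up to floating-point error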
The Complete RoPE Formulation
For high-dimensional vectors (dimension $d$), apply independent 2D rotations to pairs of dimensions: split $x$ into $d/2$ pairs $(x_{2j}, x_{2j+1})$, each rotated by a different frequency.
Block-Diagonal Rotation Matrix
The full rotation matrix $R_m$ is block-diagonal with $d/2$ rotation blocks:
$$R_m = \begin{pmatrix} R(m\theta_0) & & & \\ & R(m\theta_1) & & \\ & & \ddots & \\ & & & R(m\theta_{d/2-1}) \end{pmatrix}$$
where each $R(m\theta_j)$ is a $2 \times 2$ rotation matrix:
$$R(m\theta_j) = \begin{pmatrix} \cos m\theta_j & -\sin m\theta_j \\ \sin m\theta_j & \cos m\theta_j \end{pmatrix}$$
Expanded form (showing all elements):
$$R_m = \begin{pmatrix} \cos m\theta_0 & -\sin m\theta_0 & 0 & 0 & \cdots \\ \sin m\theta_0 & \cos m\theta_0 & 0 & 0 & \cdots \\ 0 & 0 & \cos m\theta_1 & -\sin m\theta_1 & \cdots \\ 0 & 0 & \sin m\theta_1 & \cos m\theta_1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$
Compact Notation
Using direct-sum (block-diagonal) notation:
$$R_m = \bigoplus_{j=0}^{d/2 - 1} R(m\theta_j), \qquad \text{RoPE}(x, m) = R_m\, x$$
The Frequency Schedule
Each dimension pair $j$ rotates at frequency $\theta_j$:
$$\theta_j = b^{-2j/d}, \qquad j = 0, 1, \ldots, d/2 - 1$$
where $b$ is the base (a hyperparameter, typically $10{,}000$).
Frequencies form a geometric sequence:
$$\theta_0 = 1, \quad \theta_1 = b^{-2/d}, \quad \theta_2 = b^{-4/d}, \quad \ldots, \quad \theta_{d/2-1} = b^{-(d-2)/d}$$
| Dimension pair | Frequency (for $b = 10{,}000$, $d = 128$) | Formula |
|---|---|---|
| $j = 0$ | $\theta_0 = 1$ | $b^{0}$ |
| $j = 1$ | $\theta_1 \approx 0.866$ | $b^{-2/d}$ |
| $j = 2$ | $\theta_2 \approx 0.750$ | $b^{-4/d}$ |
| $j = d/2 - 1 = 63$ | $\theta_{63} \approx 1.2 \times 10^{-4}$ | $b^{-(d-2)/d}$ |
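A quick check of the schedule for a typical configuration ($d = 128$, $b = 10{,}000$), matching the values in the table above:
import torch
d, base = 128, 10000.0
j = torch.arange(0, d, 2).float()      # 0, 2, 4, ..., 126 (i.e., 2j in the formula)
theta = base ** (-j / d)               # θ_j = b^(-2j/d)
print(theta[:3])                       # tensor([1.0000, 0.8660, 0.7499])
print(theta[-1])                       # ≈ 1.15e-4 (lowest frequency)
print(theta[1] / theta[0], theta[2] / theta[1])  # constant ratio b^(-2/d): a geometric sequence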
Part III: The Core Theorem
Theorem: RoPE Encodes Relative Position
Statement: For a query $q$ at position $m$ and a key $k$ at position $n$, with RoPE-transformed vectors $\tilde{q} = R_m q$ and $\tilde{k} = R_n k$:
$$\tilde{q}^\top \tilde{k} = q^\top R_{n-m}\, k$$
The inner product depends only on the relative position $n - m$.
Proof:
Step 1: Start with the inner product definition.
$$\tilde{q}^\top \tilde{k} = (R_m q)^\top (R_n k)$$
Step 2: Apply the transpose property $(AB)^\top = B^\top A^\top$.
$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k$$
Step 3: Each 2×2 block is orthogonal, so $R(m\theta_j)^\top = R(-m\theta_j)$ and hence $R_m^\top = R_{-m}$.
Step 4: Substitute.
$$q^\top R_m^\top R_n\, k = q^\top R_{-m} R_n\, k$$
Step 5: Apply the composition property $R(-m\theta_j)\, R(n\theta_j) = R\big((n - m)\theta_j\big)$ to each block, so $R_{-m} R_n = R_{n-m}$.
Step 6: Final result.
$$\tilde{q}^\top \tilde{k} = q^\top R_{n-m}\, k \qquad \blacksquare$$
Alternative: Complex Number Proof
For each 2D pair, write the components as complex numbers: $q_j = q_{2j} + i\, q_{2j+1}$ and $k_j = k_{2j} + i\, k_{2j+1}$.
RoPE rotation in complex form:
$$\tilde{q}_j = q_j\, e^{i m \theta_j}, \qquad \tilde{k}_j = k_j\, e^{i n \theta_j}$$
The real inner product of the 2D pair equals:
$$\langle \tilde{q}_j, \tilde{k}_j \rangle_{\mathbb{R}^2} = \operatorname{Re}\!\big(\tilde{q}_j\, \overline{\tilde{k}_j}\big)$$
Computing:
$$\tilde{q}_j\, \overline{\tilde{k}_j} = q_j\, \overline{k_j}\, e^{i (m - n) \theta_j}$$
Summing over all pairs:
$$\tilde{q}^\top \tilde{k} = \sum_{j=0}^{d/2-1} \operatorname{Re}\!\big(q_j\, \overline{k_j}\, e^{i (m - n) \theta_j}\big)$$
This depends only on $m - n$.
Part IV: Frequency Analysis
Wavelength and Period
Each dimension pair $j$ has frequency $\theta_j = b^{-2j/d}$. The wavelength $\lambda_j$ (positions per complete rotation):
$$\lambda_j = \frac{2\pi}{\theta_j} = 2\pi\, b^{2j/d}$$
For $b = 10{,}000$ and $d = 128$ (a typical head dimension):
| Dimension pair | Frequency | Wavelength | Interpretation |
|---|---|---|---|
| $j = 0$ | $\theta_0 = 1$ | $\lambda_0 \approx 6.3$ | Distinguishes positions 1 apart |
| $j = 16$ | $\theta_{16} = 0.1$ | $\lambda_{16} \approx 63$ | Medium-range patterns |
| $j = 32$ | $\theta_{32} = 0.01$ | $\lambda_{32} \approx 628$ | Long-range patterns |
| $j = 63$ | $\theta_{63} \approx 1.2 \times 10^{-4}$ | $\lambda_{63} \approx 54{,}000$ | Global position |
Maximum Distinguishable Position
Two positions become indistinguishable only when every dimension pair has completed a whole number of rotations. The limiting factor is the lowest frequency, $\theta_{d/2-1} \approx 1/b$, whose wavelength sets the largest unambiguous span:
$$\lambda_{\max} = \frac{2\pi}{\theta_{d/2-1}} \approx 2\pi b$$
For $b = 10{,}000$: $\lambda_{\max} \approx 63{,}000$ positions.
This is why higher base values enable longer contexts: Llama 3 uses $b = 500{,}000$, giving $\lambda_{\max} \approx 3.1$ million positions.
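A one-liner to see how the base controls the largest unambiguous span (approximate, using $\lambda_{\max} \approx 2\pi b$):
import math
for base in (10_000, 500_000, 1_000_000):
    print(f"base={base:>9,}  max unambiguous span ≈ {2 * math.pi * base:,.0f} positions")
# base=   10,000  max unambiguous span ≈ 62,832 positions
# base=  500,000  max unambiguous span ≈ 3,141,593 positions
# base=1,000,000  max unambiguous span ≈ 6,283,185 positions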
Fourier Interpretation
The RoPE inner product can be viewed as a Fourier-like decomposition:
$$\tilde{q}^\top \tilde{k} = \sum_{j=0}^{d/2-1} \operatorname{Re}\!\big(q_j\, \overline{k_j}\, e^{i (m - n) \theta_j}\big)$$
Expanding:
$$\tilde{q}^\top \tilde{k} = \sum_{j=0}^{d/2-1} |q_j|\, |k_j| \cos\!\big((m - n)\theta_j + \phi_j\big)$$
where $\phi_j = \arg(q_j) - \arg(k_j)$ is the phase difference between the query and key pair.
Interpretation: Sum of cosines at different frequencies—a Fourier series where:
- Amplitudes ($|q_j|\,|k_j|$) and phases ($\phi_j$) depend on content
- Frequencies ($\theta_j$) are fixed by the architecture
- The variable is the relative position $m - n$
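The decomposition can be verified numerically: the dot product of the rotated pairs equals the sum of content-weighted cosines (standalone sketch):
import torch
torch.manual_seed(0)
d, m, n, base = 8, 7, 3, 10000.0
q, k = torch.randn(d), torch.randn(d)
theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
qc = torch.view_as_complex(q.reshape(-1, 2))     # pairs as complex numbers
kc = torch.view_as_complex(k.reshape(-1, 2))
q_rot = qc * torch.polar(torch.ones_like(theta), m * theta)   # rotate by m·θ_j
k_rot = kc * torch.polar(torch.ones_like(theta), n * theta)   # rotate by n·θ_j
# Direct inner product of the rotated (real) vectors
direct = torch.view_as_real(q_rot).flatten() @ torch.view_as_real(k_rot).flatten()
# Fourier-style sum of cosines at frequencies θ_j
phi = torch.angle(qc) - torch.angle(kc)
fourier = (qc.abs() * kc.abs() * torch.cos((m - n) * theta + phi)).sum()
print(direct.item(), fourier.item())   # identical up to floating-point error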
┌─────────────────────────────────────────────────────────────────────────┐
│ ROPE FREQUENCY SPECTRUM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ θ_j = b^(-2j/d) where b = 10000, d = head_dim │
│ │
│ Frequency │
│ (log scale) │
│ │ │
│ 1 │ ● θ₀ = 1 (high freq: local patterns) │
│ │ ╲ │
│ 0.1 │ ╲ │
│ │ ╲ │
│ 0.01 │ ● θ_{d/4} (medium freq: sentence-level) │
│ │ ╲ │
│0.001 │ ╲ │
│ │ ╲ │
│0.0001│ ● θ_{d/2-1} (low freq: document-level) │
│ └────────────────────────────────────────────→ Dimension pair j │
│ 0 d/4 d/2-1 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WAVELENGTH λ_j = 2π / θ_j: │
│ │
│ • j = 0: λ ≈ 6 positions → distinguishes adjacent tokens │
│ • j = d/4: λ ≈ 600 positions → paragraph-level patterns │
│ • j = d/2-1: λ ≈ 60,000 positions → document-level patterns │
│ │
│ Together: unique position "fingerprint" at ALL scales │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part V: Implementation
Complexity Analysis
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Naive matrix multiply | $O(d^2)$ per vector | $O(d^2)$ for the matrix |
| Efficient (element-wise) | $O(d)$ per vector | $O(d)$ for sin/cos |
| Precomputed | $O(d)$ per vector | $O(L \cdot d)$ for all positions |
The block-diagonal structure enables $O(d)$ computation—no full matrix multiply needed.
Core Implementation
Frequency computation (rope.py:1-20):
import torch
import math
def compute_rope_frequencies(
dim: int,
max_seq_len: int,
base: float = 10000.0,
device: torch.device = None
) -> torch.Tensor:
"""
Precompute RoPE frequencies for all positions.
θ_j = 1 / base^(2j/dim) for j ∈ [0, dim/2)
Returns:
freqs: (max_seq_len, dim/2) - angles m·θ_j for each position m
"""
# Compute θ_j = base^(-2j/dim)
j = torch.arange(0, dim, 2, dtype=torch.float32, device=device)
theta = 1.0 / (base ** (j / dim)) # Shape: (dim/2,)
# Position indices
m = torch.arange(max_seq_len, dtype=torch.float32, device=device)
# Outer product: angles[m, j] = m * θ_j
angles = torch.outer(m, theta) # Shape: (max_seq_len, dim/2)
return angles
Complex implementation (rope.py:22-60):
def precompute_freqs_cis(
dim: int,
max_seq_len: int,
base: float = 10000.0,
device: torch.device = None
) -> torch.Tensor:
"""
Precompute complex exponentials e^(i·m·θ_j).
This is the Llama-style implementation using complex numbers.
Returns:
freqs_cis: Complex tensor (max_seq_len, dim/2)
freqs_cis[m, j] = cos(m·θ_j) + i·sin(m·θ_j)
"""
angles = compute_rope_frequencies(dim, max_seq_len, base, device)
# Convert to complex: e^(i·angle) = cos(angle) + i·sin(angle)
freqs_cis = torch.polar(torch.ones_like(angles), angles)
return freqs_cis
def apply_rotary_emb_complex(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor,
start_pos: int = 0
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Apply RoPE using complex multiplication.
Args:
xq: Query tensor (batch, seq_len, n_heads, head_dim)
xk: Key tensor (batch, seq_len, n_kv_heads, head_dim)
freqs_cis: Precomputed frequencies (max_seq_len, head_dim/2)
start_pos: Starting position (for KV cache)
Returns:
Rotated (query, key) tensors, same shapes as input
"""
batch, seq_len, n_heads, head_dim = xq.shape
n_kv_heads = xk.shape[2]
# Reshape to complex: (batch, seq, heads, dim/2, 2) -> complex (batch, seq, heads, dim/2)
xq_complex = torch.view_as_complex(xq.float().reshape(batch, seq_len, n_heads, -1, 2))
xk_complex = torch.view_as_complex(xk.float().reshape(batch, seq_len, n_kv_heads, -1, 2))
# Get frequencies for this sequence
freqs = freqs_cis[start_pos : start_pos + seq_len] # (seq_len, dim/2)
freqs = freqs.unsqueeze(0).unsqueeze(2) # (1, seq_len, 1, dim/2)
# Complex multiplication = rotation
xq_rot = xq_complex * freqs
xk_rot = xk_complex * freqs
# Back to real
xq_out = torch.view_as_real(xq_rot).reshape(batch, seq_len, n_heads, head_dim)
xk_out = torch.view_as_real(xk_rot).reshape(batch, seq_len, n_kv_heads, head_dim)
return xq_out.type_as(xq), xk_out.type_as(xk)
Real-number implementation (rope.py:62-110):
def precompute_cos_sin(
dim: int,
max_seq_len: int,
base: float = 10000.0,
device: torch.device = None
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Precompute cos and sin for real-number RoPE.
Returns:
cos: (max_seq_len, dim/2) - cos(m·θ_j)
sin: (max_seq_len, dim/2) - sin(m·θ_j)
"""
angles = compute_rope_frequencies(dim, max_seq_len, base, device)
return torch.cos(angles), torch.sin(angles)
def apply_rotary_emb_real(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
start_pos: int = 0
) -> torch.Tensor:
"""
Apply RoPE using only real arithmetic.
Implements the rotation formula:
x̃_{2j} = x_{2j}·cos(mθ_j) - x_{2j+1}·sin(mθ_j)
x̃_{2j+1} = x_{2j}·sin(mθ_j) + x_{2j+1}·cos(mθ_j)
Args:
        x: Input tensor (batch, seq_len, n_heads, head_dim)
cos: Cosine values (max_seq_len, head_dim/2)
sin: Sine values (max_seq_len, head_dim/2)
start_pos: Starting position
Returns:
Rotated tensor, same shape as input
"""
seq_len = x.shape[1]
# Get cos/sin for this sequence
cos = cos[start_pos : start_pos + seq_len] # (seq_len, dim/2)
sin = sin[start_pos : start_pos + seq_len]
# Reshape for broadcasting: (1, seq_len, 1, dim/2)
cos = cos.unsqueeze(0).unsqueeze(2)
sin = sin.unsqueeze(0).unsqueeze(2)
# Split into even/odd pairs
x_even = x[..., 0::2] # x_0, x_2, x_4, ...
x_odd = x[..., 1::2] # x_1, x_3, x_5, ...
# Apply 2D rotation to each pair
x_even_rot = x_even * cos - x_odd * sin
x_odd_rot = x_even * sin + x_odd * cos
# Interleave back
out = torch.stack([x_even_rot, x_odd_rot], dim=-1)
return out.reshape(x.shape)
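As a consistency check, the complex-number and real-number paths should produce identical rotations (sketch, assuming the functions defined above are in scope):
import torch
batch, seq_len, n_heads, head_dim = 2, 16, 4, 64
xq = torch.randn(batch, seq_len, n_heads, head_dim)
xk = torch.randn(batch, seq_len, n_heads, head_dim)
freqs_cis = precompute_freqs_cis(head_dim, 128)
cos, sin = precompute_cos_sin(head_dim, 128)
q_c, k_c = apply_rotary_emb_complex(xq, xk, freqs_cis)
q_r = apply_rotary_emb_real(xq, cos, sin)
k_r = apply_rotary_emb_real(xk, cos, sin)
print(torch.allclose(q_c, q_r, atol=1e-5), torch.allclose(k_c, k_r, atol=1e-5))  # True True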
Full Attention Module
Complete attention with RoPE (attention.py:1-80):
class RoPEMultiHeadAttention(torch.nn.Module):
"""
Multi-head attention with Rotary Position Embeddings.
Attention computation:
Q_rot, K_rot = RoPE(Q, pos), RoPE(K, pos)
Attention = softmax(Q_rot · K_rot^T / √d) · V
Note: V is NOT rotated - position affects attention weights, not values.
"""
def __init__(
self,
dim: int,
n_heads: int,
n_kv_heads: int | None = None,
max_seq_len: int = 4096,
rope_base: float = 10000.0,
dropout: float = 0.0,
):
super().__init__()
self.dim = dim
self.n_heads = n_heads
self.n_kv_heads = n_kv_heads or n_heads
self.head_dim = dim // n_heads
self.scale = 1.0 / math.sqrt(self.head_dim)
# Check dimensions
assert dim % n_heads == 0, f"dim {dim} must be divisible by n_heads {n_heads}"
assert n_heads % self.n_kv_heads == 0, "n_heads must be divisible by n_kv_heads"
# Projections (no bias, following modern LLM convention)
self.wq = torch.nn.Linear(dim, n_heads * self.head_dim, bias=False)
self.wk = torch.nn.Linear(dim, self.n_kv_heads * self.head_dim, bias=False)
self.wv = torch.nn.Linear(dim, self.n_kv_heads * self.head_dim, bias=False)
self.wo = torch.nn.Linear(n_heads * self.head_dim, dim, bias=False)
self.dropout = torch.nn.Dropout(dropout)
# Precompute RoPE frequencies
self.register_buffer(
"freqs_cis",
precompute_freqs_cis(self.head_dim, max_seq_len, rope_base)
)
def forward(
self,
x: torch.Tensor,
start_pos: int = 0,
mask: torch.Tensor | None = None,
kv_cache: tuple[torch.Tensor, torch.Tensor] | None = None
) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
batch, seq_len, _ = x.shape
# Project to Q, K, V
q = self.wq(x).view(batch, seq_len, self.n_heads, self.head_dim)
k = self.wk(x).view(batch, seq_len, self.n_kv_heads, self.head_dim)
v = self.wv(x).view(batch, seq_len, self.n_kv_heads, self.head_dim)
# Apply RoPE to Q and K (NOT V!)
q, k = apply_rotary_emb_complex(q, k, self.freqs_cis, start_pos)
# KV cache handling
        if kv_cache is not None:
            k_cache, v_cache = kv_cache
            k = torch.cat([k_cache, k], dim=1)
            v = torch.cat([v_cache, v], dim=1)
        # Cache K/V before GQA expansion and transpose so they can be re-fed on the next step
        new_kv_cache = (k, v)
# GQA: expand KV heads to match query heads
if self.n_kv_heads < self.n_heads:
n_rep = self.n_heads // self.n_kv_heads
k = k.unsqueeze(3).expand(-1, -1, -1, n_rep, -1).reshape(batch, -1, self.n_heads, self.head_dim)
v = v.unsqueeze(3).expand(-1, -1, -1, n_rep, -1).reshape(batch, -1, self.n_heads, self.head_dim)
# Transpose for attention: (batch, n_heads, seq_len, head_dim)
q, k, v = [t.transpose(1, 2) for t in (q, k, v)]
# Scaled dot-product attention
scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
if mask is not None:
scores = scores + mask
attn = torch.softmax(scores, dim=-1)
attn = self.dropout(attn)
out = torch.matmul(attn, v)
out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.wo(out), new_kv_cache
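A minimal usage sketch of the module above (randomly initialized weights, shapes only; the causal mask is omitted for brevity):
import torch
attn = RoPEMultiHeadAttention(dim=512, n_heads=8, n_kv_heads=2, max_seq_len=1024)
x = torch.randn(2, 10, 512)
# Prefill: process a 10-token prompt starting at position 0
out, (k_cache, v_cache) = attn(x, start_pos=0)
print(out.shape, k_cache.shape)    # (2, 10, 512) and cached K of shape (2, 10, 2, 64)
# Decode: feed one new token at position 10, reusing the cache
x_next = torch.randn(2, 1, 512)
out_next, _ = attn(x_next, start_pos=10, kv_cache=(k_cache, v_cache))
print(out_next.shape)              # (2, 1, 512)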
┌─────────────────────────────────────────────────────────────────────────┐
│ ROPE IN ATTENTION: DATA FLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: x (batch, seq_len, dim) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Linear Projections │ │
│ │ Q = x @ W_Q K = x @ W_K V = x @ W_V │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌─────────────────────────────────┐ │ │
│ │ Apply RoPE │ │ │
│ │ Q_rot = R(m·θ) · Q │ │ V is NOT rotated! │
│ │ K_rot = R(n·θ) · K │ │ (content unchanged) │
│ └─────────────────────────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Scaled Dot-Product Attention │ │
│ │ │ │
│ │ scores = Q_rot · K_rot^T / √d │ │
│ │ └── depends on (m-n)·θ = RELATIVE POSITION │ │
│ │ │ │
│ │ attn = softmax(mask(scores)) │ │
│ │ output = attn · V │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ output = output @ W_O │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT: (batch, seq_len, dim) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part VI: Context Extension Methods
RoPE enables sophisticated context extension through mathematical manipulation of rotation frequencies.
Position Interpolation (PI)
Paper: "Extending Context Window of Large Language Models via Positional Interpolation" (Chen et al., 2023)
Problem: The model is trained with max position $L$. At a position $m > L$, the angles $m\theta_j$ exceed the training distribution.
Solution: Scale positions so that the target length $L'$ fits within $[0, L)$:
$$m' = m \cdot \frac{L}{L'}$$
Modified RoPE:
$$f_{\text{PI}}(x, m) = R_{m L / L'}\, x$$
Effect: All rotation angles are scaled by $s$:
$$m\theta_j \;\longrightarrow\; s \cdot m\theta_j$$
where $s = L / L' < 1$.
Implementation:
def apply_position_interpolation(
    dim: int,
    original_max_len: int,
    target_max_len: int,
    base: float = 10000.0,
    device: torch.device = None
) -> torch.Tensor:
    """
    Position Interpolation: scale all positions by L/L'.
    Every rotation angle is multiplied by the same factor, so angles at the
    extended length L' stay within the range seen during training.
    """
    scale = original_max_len / target_max_len  # L / L' < 1
    j = torch.arange(0, dim, 2, dtype=torch.float32, device=device)
    theta = 1.0 / (base ** (j / dim))
    # Scaled positions m · L/L'
    m = torch.arange(target_max_len, dtype=torch.float32, device=device) * scale
    return torch.outer(m, theta)  # (target_max_len, dim/2) angle table
Limitation: High-frequency dimensions (large $\theta_j$) are hit hardest—after scaling, adjacent positions are squeezed into a small fraction of a rotation, degrading local position discrimination.
NTK-Aware Interpolation
Key insight: Instead of uniformly scaling positions, increase the base $b$ so that frequencies are lowered non-uniformly.
Modified frequency:
$$\theta_j' = (b \cdot \alpha)^{-2j/d}$$
Effect by dimension:
| Dimension | Scaling factor $\theta_j' / \theta_j$ | Effect |
|---|---|---|
| $j = 0$ (highest frequency) | $\alpha^{0} = 1$ | Unchanged (local patterns preserved) |
| Intermediate $j$ | $\alpha^{-2j/d}$ | Moderately scaled |
| $j = d/2 - 1$ (lowest frequency) | $\approx \alpha^{-1}$ | Fully scaled (global patterns extended) |
Choosing $\alpha$: To extend from context $L$ to $L' = s \cdot L$:
$$\alpha = s^{\,d/(d-2)}$$
Implementation:
def compute_ntk_frequencies(
dim: int,
max_seq_len: int,
base: float = 10000.0,
original_max_len: int = 4096,
device: torch.device = None
) -> torch.Tensor:
"""
NTK-aware interpolation: scale base instead of positions.
High frequencies (local patterns) preserved.
Low frequencies (global patterns) scaled.
"""
scale = max_seq_len / original_max_len
# α = scale^(d/(d-2))
alpha = scale ** (dim / (dim - 2))
# Modified base
ntk_base = base * alpha
return compute_rope_frequencies(dim, max_seq_len, ntk_base, device)
YaRN (Yet another RoPE extensioN)
Paper: "YaRN: Efficient Context Window Extension of Large Language Models" (Peng et al., 2023)
YaRN combines multiple techniques:
1. Frequency Partitioning
Divide the dimension pairs into three groups based on how many full rotations fit in the original context, $r_j = L / \lambda_j$:
| Group | Condition | Interpolation |
|---|---|---|
| High-frequency | $r_j > 32$ (many rotations within $L$) | None (preserve local) |
| Medium-frequency | $1 < r_j < 32$ | Partial (ramp) |
| Low-frequency | $r_j < 1$ (less than one rotation) | Full PI |
Interpolation factor $\gamma_j$ (a smooth ramp from 0 = no interpolation to 1 = full interpolation), expressed over the dimension index as in the code below:
$$\gamma_j = \operatorname{clip}\!\left(\frac{j - j_{\text{low}}}{j_{\text{high}} - j_{\text{low}}},\; 0,\; 1\right)$$
Modified frequency (with extension ratio $s = L' / L$):
$$\theta_j' = \left[(1 - \gamma_j) + \frac{\gamma_j}{s}\right] \theta_j$$
2. Attention Temperature Scaling
Scale the attention logits to compensate for the changed score distribution:
$$\operatorname{softmax}\!\left(\frac{q^\top k}{t\,\sqrt{d}}\right)$$
where the temperature $t$ satisfies:
$$\sqrt{1/t} = 0.1 \ln(s) + 1$$
(this is the mscale factor returned by the implementation below; in practice it is folded into the cos/sin tables or applied directly to the logits).
Implementation:
def yarn_find_correction_range(
dim: int,
base: float,
original_max_len: int
) -> tuple[float, float]:
"""
Find dimension range for partial interpolation.
Returns (low_dim, high_dim) where:
- dims < low_dim: no interpolation
- low_dim <= dims <= high_dim: partial interpolation
- dims > high_dim: full interpolation
"""
# λ_j = 2π · base^(2j/dim)
# Find j where λ_j = original_max_len
# For β rotations, wavelength threshold = original_max_len / β
# High freq cutoff: β = 32 (many rotations)
# Low freq cutoff: β = 1 (one rotation)
def find_dim(num_rotations):
# Solve: 2π · base^(2j/dim) = original_max_len / num_rotations
return (dim * math.log(original_max_len / (num_rotations * 2 * math.pi))) / (2 * math.log(base))
low_dim = find_dim(32) # High-frequency boundary
high_dim = find_dim(1) # Low-frequency boundary
return max(0, low_dim), min(dim // 2 - 1, high_dim)
def compute_yarn_frequencies(
dim: int,
max_seq_len: int,
base: float = 10000.0,
original_max_len: int = 4096,
device: torch.device = None
) -> tuple[torch.Tensor, float]:
"""
YaRN: frequency partitioning + attention scaling.
Returns:
freqs: Modified frequency tensor
mscale: Attention temperature scale
"""
scale = max_seq_len / original_max_len
if scale <= 1:
freqs = compute_rope_frequencies(dim, max_seq_len, base, device)
return freqs, 1.0
# Find interpolation boundaries
low_dim, high_dim = yarn_find_correction_range(dim, base, original_max_len)
# Base frequencies
j = torch.arange(0, dim, 2, dtype=torch.float32, device=device)
theta = 1.0 / (base ** (j / dim))
# Compute interpolation ramp
ramp = torch.clamp((j / 2 - low_dim) / (high_dim - low_dim), 0, 1)
# Interpolated scale: 1 for low j, 1/scale for high j
freq_scale = (1 - ramp) + ramp / scale
# Apply scaling
theta_scaled = theta * freq_scale
# Position angles
m = torch.arange(max_seq_len, dtype=torch.float32, device=device)
freqs = torch.outer(m, theta_scaled)
# Attention temperature
mscale = 0.1 * math.log(scale) + 1.0
return freqs, mscale
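A usage sketch of the helpers above, extending a hypothetical model trained at 4K to 32K:
import torch
freqs, mscale = compute_yarn_frequencies(dim=128, max_seq_len=32768, base=10000.0, original_max_len=4096)
print(freqs.shape)   # (32768, 64): angles m·θ'_j per position and dimension pair
print(mscale)        # ≈ 1.21: scale attention logits (or fold into cos/sin) by this factor
# Convert to complex form if the attention code expects freqs_cis
freqs_cis = torch.polar(torch.ones_like(freqs), freqs)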
LongRoPE
Key innovations:
- Search-based factors: Instead of analytical formulas, search for optimal per-dimension interpolation factors by minimizing perplexity on long-context data.
- Progressive extension: Extend in stages (4K → 128K → 256K → 2M), fine-tuning at each stage.
- Short context recovery: Use original RoPE for short sequences, scaled RoPE for long—prevents short-context degradation.
Comparison
| Method | High-freq | Low-freq | Training | Complexity |
|---|---|---|---|---|
| Position Interpolation | Degraded | Scaled | 1000 steps | Simple |
| NTK-aware | Preserved | Scaled | Optional (often training-free) | Simple |
| YaRN | Preserved | Scaled | 400 steps | Medium |
| LongRoPE | Optimized | Optimized | Search + fine-tune | Complex |
Part VII: Mathematical Properties
Norm Preservation
RoPE preserves vector norms:
$$\lVert R_m\, x \rVert = \lVert x \rVert$$
Proof: Each 2×2 rotation block is orthogonal and therefore norm-preserving, and the blocks act on disjoint coordinate pairs:
$$\lVert R_m\, x \rVert^2 = \sum_{j=0}^{d/2-1} \lVert R(m\theta_j)\, x^{(j)} \rVert^2 = \sum_{j=0}^{d/2-1} \lVert x^{(j)} \rVert^2 = \lVert x \rVert^2$$
where $x^{(j)} = (x_{2j}, x_{2j+1})$.
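A quick numerical check of norm preservation (sketch, assuming the Part V helpers are in scope):
import torch
x = torch.randn(1, 32, 4, 64)
freqs_cis = precompute_freqs_cis(64, 32)
x_rot, _ = apply_rotary_emb_complex(x, x, freqs_cis)
print(torch.allclose(x.norm(dim=-1), x_rot.norm(dim=-1), atol=1e-5))  # True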
Linearity
RoPE is linear in content:
$$R_m (\alpha x + \beta y) = \alpha\, R_m x + \beta\, R_m y$$
This follows from the linearity of matrix multiplication.
Position Additivity
Rotation matrices for sequential positions compose additively:
$$R_m\, R_n = R_{m+n}$$
This follows from $R(m\theta_j)\, R(n\theta_j) = R\big((m + n)\theta_j\big)$ for each block.
Implication: Relative position emerges naturally from rotation group structure.
Dimension Independence
Different dimension pairs are rotated independently:
$$\big(R_m\, x\big)^{(j)} = R(m\theta_j)\, x^{(j)}, \qquad x^{(j)} = (x_{2j}, x_{2j+1})$$
Each pair's rotation depends only on that pair's input, not on other dimensions, so the model can learn different patterns at different frequency scales.
Part VIII: Comparison with Other Methods
RoPE vs Sinusoidal
| Aspect | Sinusoidal | RoPE |
|---|---|---|
| Operation | Additive: $x_m = x + p_m$ | Multiplicative: $\tilde{x}_m = R_m\, x$ |
| Relative position | Implicit (model must learn) | Explicit (mathematical property) |
| Content-position mixing | Mixed in same vector | Separated (rotation vs. magnitude) |
| Length extrapolation | Poor | Scalable with PI/YaRN/etc. |
RoPE vs ALiBi
| Aspect | ALiBi | RoPE |
|---|---|---|
| Where applied | Attention scores only | Q and K embeddings |
| Formula | $\text{score}_{ij} - m \cdot \lvert i - j \rvert$ (per-head slope $m$) | $(R_i\, q_i)^\top (R_j\, k_j)$ |
| Content interaction | None (bias is fixed) | Content modulates position |
| Extrapolation | Excellent (linear extends) | Good with scaling methods |
| Expressiveness | Linear decay only | Arbitrary learned functions |
RoPE vs Learned
| Aspect | Learned | RoPE |
|---|---|---|
| Parameters | $L_{\max} \times d_{\text{model}}$ learned embeddings | Zero |
| Max length | Fixed (embedding table size) | Extensible |
| Relative position | Must be learned implicitly | Built-in mathematical property |
| Flexibility | Can learn any pattern | Constrained to rotation |
Part IX: Advanced Topics and Variations
2D RoPE for Vision Transformers
Vision transformers process images as sequences of patches. Each patch has a 2D position rather than a 1D position. RoPE can be extended to 2D by splitting dimensions between row and column encoding.
Approach: Allocate half the dimensions to row position, half to column position.
$$R^{\text{2D}}_{(h, w)} = R^{\text{row}}_{h} \oplus R^{\text{col}}_{w}$$
where $R^{\text{row}}_{h}$ rotates the first $d/2$ dimensions by row position $h$, and $R^{\text{col}}_{w}$ rotates the remaining $d/2$ dimensions by column position $w$.
def precompute_freqs_2d(
dim: int,
height: int,
width: int,
base: float = 10000.0
) -> torch.Tensor:
"""
Precompute 2D RoPE frequencies for vision transformers.
Returns: (height * width, dim/2) complex tensor
"""
# Split dimensions: half for rows, half for columns
half_dim = dim // 2
# Frequencies for each dimension
theta_h = 1.0 / (base ** (torch.arange(0, half_dim, 2).float() / half_dim))
theta_w = 1.0 / (base ** (torch.arange(0, half_dim, 2).float() / half_dim))
# Position grids
h_pos = torch.arange(height)
w_pos = torch.arange(width)
# Compute angles
angles_h = torch.outer(h_pos, theta_h) # (height, half_dim/2)
angles_w = torch.outer(w_pos, theta_w) # (width, half_dim/2)
# Create 2D grid of frequencies
# For each (h, w) position, concatenate row and column frequencies
freqs_h = torch.polar(torch.ones_like(angles_h), angles_h) # (height, half_dim/2)
freqs_w = torch.polar(torch.ones_like(angles_w), angles_w) # (width, half_dim/2)
# Expand to (height, width, half_dim/2) then combine
freqs_h = freqs_h.unsqueeze(1).expand(-1, width, -1) # (height, width, half_dim/2)
freqs_w = freqs_w.unsqueeze(0).expand(height, -1, -1) # (height, width, half_dim/2)
# Concatenate row and column frequencies
freqs_2d = torch.cat([freqs_h, freqs_w], dim=-1) # (height, width, half_dim)
# Flatten to (height * width, half_dim)
return freqs_2d.reshape(height * width, -1)
def apply_rope_2d(
x: torch.Tensor, # (batch, num_patches, heads, head_dim)
freqs_2d: torch.Tensor, # (num_patches, head_dim/2)
patch_height: int,
patch_width: int
) -> torch.Tensor:
"""Apply 2D RoPE to image patch embeddings."""
batch, num_patches, heads, dim = x.shape
# Reshape to complex
x_complex = torch.view_as_complex(
x.float().reshape(batch, num_patches, heads, dim // 2, 2)
)
# Apply rotation
freqs = freqs_2d.unsqueeze(0).unsqueeze(2) # (1, num_patches, 1, dim/2)
x_rot = x_complex * freqs
# Back to real
return torch.view_as_real(x_rot).reshape(batch, num_patches, heads, dim).type_as(x)
Key insight: 2D RoPE preserves the relative position property in both dimensions independently:
- Attention between patches at $(h_1, w_1)$ and $(h_2, w_2)$ depends only on $h_1 - h_2$ and $w_1 - w_2$, as the sketch below checks numerically
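A small sanity check of that property, using the precompute_freqs_2d helper above—the relative rotation between two patches depends only on their row/column offsets, not their absolute positions:
import torch
H, W, dim = 8, 8, 32
freqs = precompute_freqs_2d(dim, H, W)          # (H*W, dim/2) complex
def rel(h1, w1, h2, w2):
    # Relative rotation factor between patch (h1, w1) and patch (h2, w2)
    return freqs[h1 * W + w1] * freqs[h2 * W + w2].conj()
a = rel(1, 2, 3, 5)    # offset (-2, -3)
b = rel(4, 1, 6, 4)    # same offset (-2, -3), different absolute positions
print(torch.allclose(a, b))   # True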
3D RoPE for Video
For video (time × height × width), extend to 3D by splitting the dimensions three ways:
$$R^{\text{3D}}_{(t, h, w)} = R^{\text{time}}_{t} \oplus R^{\text{row}}_{h} \oplus R^{\text{col}}_{w}$$
This enables relative position encoding in all three dimensions: temporal, vertical, and horizontal.
RoPE with Multi-Head Latent Attention (MLA)
DeepSeek-V2 and V3 use Multi-Head Latent Attention (MLA) with RoPE. MLA compresses KV cache by projecting keys and values through a low-rank bottleneck.
The challenge: In standard RoPE, we rotate $q$ and $k$ by position. In MLA, keys are reconstructed from a low-rank latent:
$$c_{KV} = W^{DKV} x, \qquad k = W^{UK} c_{KV}$$
If we apply RoPE to the full key and then compress, position information is mixed with content inside the compressed representation.
DeepSeek's solution: Split the query/key dimensions into position-dependent and position-independent parts.
class MLAWithRoPE(nn.Module):
"""
Multi-Head Latent Attention with RoPE.
Key insight: Separate RoPE dimensions from compressed dimensions.
"""
def __init__(
self,
d_model: int,
n_heads: int,
d_latent: int, # Compressed KV dimension
d_rope: int, # Dimensions for RoPE (not compressed)
max_seq_len: int = 4096
):
super().__init__()
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.d_latent = d_latent
self.d_rope = d_rope
self.d_nope = self.head_dim - d_rope # Non-RoPE dimensions
# Query projection
self.wq = nn.Linear(d_model, n_heads * self.head_dim)
# Key/Value projections with latent compression
self.wkv_down = nn.Linear(d_model, d_latent) # Compress
self.wk_up = nn.Linear(d_latent, n_heads * self.d_nope) # Expand K (non-RoPE part)
self.wk_rope = nn.Linear(d_model, n_heads * d_rope) # K RoPE part (not compressed)
self.wv_up = nn.Linear(d_latent, n_heads * self.head_dim) # Expand V
self.wo = nn.Linear(n_heads * self.head_dim, d_model)
# RoPE frequencies
self.register_buffer(
"freqs_cis",
precompute_freqs_cis(d_rope, max_seq_len)
)
def forward(self, x: torch.Tensor, start_pos: int = 0):
batch, seq_len, _ = x.shape
# Query: full projection, apply RoPE to first d_rope dims
q = self.wq(x).view(batch, seq_len, self.n_heads, self.head_dim)
q_rope = q[..., :self.d_rope]
q_nope = q[..., self.d_rope:]
# Apply RoPE to query's RoPE dimensions
freqs = self.freqs_cis[start_pos : start_pos + seq_len]
q_rope = apply_rope(q_rope, freqs)
q = torch.cat([q_rope, q_nope], dim=-1)
# Key: compressed part (no RoPE) + RoPE part (not compressed)
kv_latent = self.wkv_down(x) # Compress
k_nope = self.wk_up(kv_latent).view(batch, seq_len, self.n_heads, self.d_nope)
k_rope = self.wk_rope(x).view(batch, seq_len, self.n_heads, self.d_rope)
# Apply RoPE to key's RoPE dimensions
k_rope = apply_rope(k_rope, freqs)
k = torch.cat([k_rope, k_nope], dim=-1)
# Value: from compressed representation
v = self.wv_up(kv_latent).view(batch, seq_len, self.n_heads, self.head_dim)
# Standard attention from here...
# (KV cache stores kv_latent and k_rope separately for efficiency)
return self.attention(q, k, v)
Why this works:
- RoPE dimensions carry position information, not compressed
- Non-RoPE dimensions carry content, can be compressed
- Best of both: position accuracy + memory efficiency
Partial RoPE
Some models apply RoPE to only a subset of dimensions, leaving others position-agnostic.
Motivation: Not all attention patterns need position. Some heads might learn position-independent patterns (e.g., "always attend to [SEP] token").
def apply_partial_rope(
x: torch.Tensor, # (batch, seq, heads, head_dim)
freqs: torch.Tensor,
rope_dims: int # How many dimensions to rotate
) -> torch.Tensor:
"""
Apply RoPE to only the first rope_dims dimensions.
Remaining dimensions are position-agnostic.
"""
# Split dimensions
x_rope = x[..., :rope_dims]
x_pass = x[..., rope_dims:]
# Apply RoPE to subset
x_rope_rot = apply_rope(x_rope, freqs)
# Recombine
return torch.cat([x_rope_rot, x_pass], dim=-1)
Used by: GPT-NeoX- and GPT-J-style models (rotary applied to only a fraction of each head's dimensions), and DeepSeek's MLA (RoPE restricted to a dedicated slice of dimensions, as above). CodeLlama, by contrast, extends context by raising the RoPE base to $10^6$ rather than by partial RoPE.
Part X: Visualization and Intuition
Visualizing Rotation in 2D
Consider a single dimension pair. The query and key vectors rotate in a 2D plane as position changes.
┌─────────────────────────────────────────────────────────────────────────┐
│ VISUALIZING ROPE ROTATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Consider dimension pair 0 (highest frequency, θ₀ = 1): │
│ │
│ Position 0 │
│ │ │
│ ▼ │
│ ──────→ q (no rotation) │
│ │
│ Position 1 │
│ │ │
│ ▼ │
│ ╱ │
│ ╱ → q rotated by θ = 1 radian ≈ 57° │
│ │
│ Position 2 │
│ │ │
│ ▼ │
│ │ │
│ ↓ q rotated by 2θ ≈ 114° │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ When computing attention between positions m and n: │
│ │
│ Position m=0 Position n=2 │
│ │ │ │
│ ▼ ▼ │
│ ──────→ │ │
│ q_0 ↓ k_2 │
│ │
│ Dot product depends on ANGLE BETWEEN them = 2θ │
│ This is (n - m) × θ = relative position × frequency │
│ │
│ Same pattern holds for m=100, n=102: angle is still 2θ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Frequency Spectrum Visualization
┌─────────────────────────────────────────────────────────────────────────┐
│ ROPE FREQUENCY SPECTRUM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Position 0 10 20 30 40 50 60 70 80 │
│ │ │ │ │ │ │ │ │ │ │
│ Dim 0: ─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ ╭─╮ │
│ (θ=1) ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ ╰─╯ │
│ Fast oscillation (~6 positions per cycle) │
│ │
│ Dim 16: ───────╮ ╭───────╮ ╭───────╮ ╭───────╮ │
│ (θ≈0.06) ╰─────╯ ╰─────╯ ╰─────╯ │
│ Medium oscillation (~100 positions per cycle) │
│ │
│ Dim 32: ─────────────────────╮ ╭───────── │
│ (θ≈0.003) ╰───────────────────╯ │
│ Slow oscillation (~2000 positions per cycle) │
│ │
│ Dim 63: ─────────────────────────────────────────────────── │
│ (θ≈0.0001) │
│ Very slow (~60000 positions per cycle) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMBINED EFFECT: Each position has a unique "fingerprint" │
│ │
│ Position 0: [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, ...] │
│ Position 1: [0.54, 0.84, 0.998, 0.06, 0.9999, 0.006, ...] │
│ Position 10: [-0.84, 0.54, 0.82, 0.57, 0.997, 0.06, ...] │
│ │
│ High freq dims: change rapidly, distinguish neighbors │
│ Low freq dims: change slowly, distinguish distant positions │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part XI: Common Bugs and Debugging
Bug 1: Wrong Dimension Pairing
Symptom: Model trains but performance is poor, especially on long sequences.
Cause: Pairing wrong dimensions (e.g., [0,1], [1,2], [2,3] instead of [0,1], [2,3], [4,5]).
# WRONG: Overlapping pairs
x_pairs_wrong = x.unfold(-1, 2, 1) # Creates overlapping windows!
# CORRECT: Non-overlapping pairs
x_pairs = x.reshape(*x.shape[:-1], dim // 2, 2)  # Proper pairing: (0,1), (2,3), (4,5), ...
Bug 2: Forgetting to Apply RoPE to Keys
Symptom: Model ignores position entirely.
Cause: Only rotating queries, not keys.
# WRONG
q = apply_rope(q, freqs, positions)
# k is not rotated!
# CORRECT
q = apply_rope(q, freqs, positions)
k = apply_rope(k, freqs, positions) # Both must be rotated!
Bug 3: Wrong Position Indices with KV Cache
Symptom: During inference with KV cache, model outputs degrade after a few tokens.
Cause: Using wrong position indices for new tokens.
# WRONG: Always using positions [0, 1, 2, ...] for new tokens
positions = torch.arange(seq_len)
q = apply_rope(q, freqs, positions)
# CORRECT: Use actual positions accounting for cache
positions = torch.arange(start_pos, start_pos + seq_len)
q = apply_rope(q, freqs[start_pos:start_pos + seq_len])
Bug 4: Rotating Values (V)
Symptom: Strange outputs, model doesn't learn well.
Cause: Applying RoPE to values in addition to queries and keys.
# WRONG
q = apply_rope(q, freqs, positions)
k = apply_rope(k, freqs, positions)
v = apply_rope(v, freqs, positions) # NO! V should not be rotated
# CORRECT
q = apply_rope(q, freqs, positions)
k = apply_rope(k, freqs, positions)
# v unchanged - position affects attention weights, not values
Bug 5: Base Frequency Mismatch
Symptom: Loading a pretrained model gives garbage outputs.
Cause: Using different RoPE base than the model was trained with.
# Model was trained with base=500000 (Llama 3)
# But you're using base=10000 (default)
# WRONG
freqs = precompute_freqs(dim, max_len, base=10000)
# CORRECT: Match the model's training configuration
freqs = precompute_freqs(dim, max_len, base=500000) # Check model config!
Bug 6: Context Extension Without Scaling
Symptom: Model works fine up to training length, then output quality drops sharply.
Cause: Using positions beyond training range without scaling.
# Model trained on 4096 tokens, now using 8192
# WRONG: Raw RoPE at extended positions
freqs = precompute_freqs(dim, 8192, base=10000) # Angles exceed training distribution
# CORRECT: Apply scaling
freqs = compute_yarn_freqs(dim, 8192, base=10000, original_max=4096)
Debugging Checklist
| Check | How to Verify |
|---|---|
| Dimension pairing | Print shape after reshape: should be (..., dim/2, 2) |
| Q and K both rotated | Add print statements or breakpoints in attention |
| Position indices | Print positions tensor, verify matches actual token positions |
| V unchanged | Verify V is not passed through any RoPE function |
| Base frequency | Check model config file for rope_theta or rope_base |
| Context scaling | Compare perplexity at training vs extended lengths |
Part XII: LongRoPE2 and Recent Advances (2024-2025)
LongRoPE2
LongRoPE2 improves on LongRoPE with more sophisticated search for optimal interpolation factors.
Key innovations:
- Needle-driven evaluation: Instead of just perplexity, use "needle-in-a-haystack" tasks to evaluate position encoding quality. Can the model find specific information at various positions?
- Evolutionary search: Use genetic algorithms to search the space of per-dimension interpolation factors, evolving populations of factor configurations.
- Multi-objective optimization: Balance perplexity, needle accuracy, and computational cost.
def longrope2_search(
    model,
    base_freqs: torch.Tensor,
    target_length: int,
    original_length: int,
    dim: int,
    population_size: int = 50,
    generations: int = 100,
    needle_weight: float = 0.7,
    perplexity_weight: float = 0.3
):
    """
    Evolutionary search for optimal RoPE scaling factors (sketch).
    The helpers apply_longrope_factors, evaluate_needle_in_haystack,
    evaluate_perplexity, and evolve stand in for the real evaluation pipeline.
    Returns: (dim/2,) tensor of per-dimension scaling factors
    """
    # Initialize population with random factors around 1.0
    population = torch.randn(population_size, dim // 2) * 0.1 + 1.0
    best_factors, best_fitness = population[0], float("-inf")
    for gen in range(generations):
        # Evaluate each individual
        fitness = []
        for factors in population:
            # Apply factors to RoPE
            scaled_freqs = apply_longrope_factors(base_freqs, factors)
            # Needle tasks: can the model find info at various positions?
            needle_score = evaluate_needle_in_haystack(model, scaled_freqs, target_length)
            # Perplexity on a long-context validation set
            ppl = evaluate_perplexity(model, scaled_freqs, target_length)
            # Combined fitness
            fit = needle_weight * needle_score - perplexity_weight * math.log(ppl)
            fitness.append(fit)
            # Track the best configuration seen so far
            if fit > best_fitness:
                best_fitness, best_factors = fit, factors.clone()
        # Selection, crossover, mutation (standard genetic algorithm)
        population = evolve(population, fitness)
    # Return the best individual seen across all generations
    return best_factors
Other Recent Advances
Dynamic RoPE (2024): Adjust RoPE frequencies dynamically based on input content, not just position.
Learned frequency schedules: Instead of the fixed geometric sequence $\theta_j = b^{-2j/d}$, learn the frequencies during training.
Hybrid approaches: Combine RoPE with ALiBi-style biases for the best of both worlds.
Related Articles
Positional Embeddings: How Transformers Understand Word Order
A comprehensive deep dive into positional embeddings—how transformers encode sequence order. From sinusoidal encodings to learned embeddings, relative positions to ALiBi, understand the evolution that led to modern approaches like RoPE.
Context Extension: How LLMs Scale Beyond Training Length
A comprehensive deep dive into context extension techniques—how models trained on 4K tokens extrapolate to 128K+. Understand RoPE scaling, Position Interpolation, NTK-aware scaling, YaRN, and the mathematics of long-context LLMs.
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
nanoGPT: Andrej Karpathy's Minimal GPT Training Framework
A comprehensive, equation-complete analysis of nanoGPT—Andrej Karpathy's influential minimal GPT implementation. Deep dive into the ~300-line model definition (model.py), training loop (train.py), Flash Attention, weight initialization, and the mathematical foundations behind every component.