Vision Transformers (ViT): Applying Transformers to Images
A comprehensive deep dive into Vision Transformers—how the transformer architecture adapts from text to images. Understand patch embeddings, position encoding for images, and why ViT has revolutionized computer vision.
From Text to Images
The transformer architecture, originally designed for text, has proven remarkably general. Vision Transformer (ViT) applies transformers to images with minimal modifications, achieving state-of-the-art results in computer vision.
The key insight is treating images like sequences: split an image into patches, flatten each patch into a vector, and process them with a standard transformer. This simple idea, combined with large-scale pretraining, matches or exceeds CNN performance.
2025: The era of vision foundation models. DINOv3, released by Meta in August 2025, represents the new frontier—6.7 billion parameters, trained on 1.7 billion images, with breakthrough innovations:
- RoPE position embeddings enabling variable resolution (256×256 to 4096×4096) without retraining
- Gram anchoring for stable dense features in segmentation tasks
- 88.4% ImageNet accuracy self-supervised, outperforming DINOv2 (87.3%)
- 86.6 mIoU on PASCAL VOC segmentation vs DINOv2's 83.1
The self-supervised vs supervised tradeoff: According to DINOv3 analysis, weakly-supervised models like SigLIP-2 (89.1% ImageNet) win on classification, but self-supervised DINOv3 dominates on dense tasks (segmentation, depth, 3D awareness). Text supervision is great for "what's in the image" but not for purely visual tasks.
Understanding ViT is important because it bridges NLP and vision, enables multimodal models (like GPT-4V), and has become the foundation for modern vision architectures. This post covers how ViT works, why it works, and how it's evolved.
Part I: The ViT Architecture
From Pixels to Patches
The fundamental challenge: images are 2D grids of pixels, but transformers process sequences of tokens. ViT's solution: split the image into patches and treat patches as tokens.
┌─────────────────────────────────────────────────────────────────────────┐
│ IMAGE TO PATCHES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROCESS: │
│ ──────────── │
│ │
│ Input image: 224 × 224 × 3 (H × W × RGB) │
│ Patch size: 16 × 16 │
│ Number of patches: (224/16) × (224/16) = 14 × 14 = 196 │
│ │
│ Each patch: 16 × 16 × 3 = 768 values │
│ Sequence length: 196 patches + 1 [CLS] token = 197 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ VISUALIZATION: │
│ ────────────── │
│ │
│ Original Image (224×224): │
│ ┌────────────────────────────────┐ │
│ │ │ │
│ │ [Photo of a cat] │ │
│ │ │ │
│ │ │ │
│ └────────────────────────────────┘ │
│ │
│ Split into 14×14 patches: │
│ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│ │ 0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│ 8│ 9│10│11│12│13│ (row 0) │
│ ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤ │
│ │14│15│16│17│18│19│20│21│22│23│24│25│26│27│ (row 1) │
│ ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ... │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│ │
│ Each patch becomes a "token": │
│ [CLS] [P0] [P1] [P2] ... [P195] │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PATCH SIZES: │
│ ──────────── │
│ │
│ Patch Size Patches (224²) Sequence Length │
│ ───────────────────────────────────────────────── │
│ 32 × 32 7 × 7 = 49 50 (+ [CLS]) │
│ 16 × 16 14 × 14 = 196 197 │
│ 14 × 14 16 × 16 = 256 257 │
│ 8 × 8 28 × 28 = 784 785 │
│ │
│ Smaller patches = more tokens = more compute but finer detail. │
│ 16×16 is the most common default. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
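To make the patchify arithmetic concrete, here is a minimal sketch using plain tensor reshapes (the patchify helper is ours, not from any library); it emits patches in the same left-to-right, top-to-bottom order as the numbering above.
import torch

# Minimal patchify sketch (assumes H and W are divisible by the patch size).
# Patches come out in row-major order, matching the numbering in the figure.
def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    B, C, H, W = images.shape
    ph, pw = H // patch_size, W // patch_size
    x = images.reshape(B, C, ph, patch_size, pw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                 # (B, ph, pw, p, p, C)
    return x.reshape(B, ph * pw, patch_size * patch_size * C)

x = torch.randn(2, 3, 224, 224)
print(patchify(x).shape)  # torch.Size([2, 196, 768])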
Patch Embedding
Each patch is flattened and projected to the model's hidden dimension:
┌─────────────────────────────────────────────────────────────────────────┐
│ PATCH EMBEDDING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METHOD 1: Linear Projection (Original ViT) │
│ ─────────────────────────────────────────── │
│ │
│ 1. Flatten each patch: 16×16×3 = 768 values │
│ 2. Linear projection: 768 → hidden_dim (e.g., 768) │
│ │
│ patch_embed = nn.Linear(patch_size² × channels, hidden_dim) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ METHOD 2: Convolutional Projection (Common variant) │
│ ──────────────────────────────────────────────────── │
│ │
│ Use Conv2d with kernel_size = patch_size, stride = patch_size │
│ │
│ patch_embed = nn.Conv2d( │
│ in_channels=3, │
│ out_channels=hidden_dim, │
│ kernel_size=16, │
│ stride=16 │
│ ) │
│ │
│ This is mathematically equivalent to linear projection │
│ but more efficient (single operation, not flatten + matmul). │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ IMPLEMENTATION: │
│ ─────────────── │
│ │
│ class PatchEmbed(nn.Module): │
│ def __init__(self, img_size=224, patch_size=16, in_chans=3, │
│ embed_dim=768): │
│ super().__init__() │
│ self.img_size = img_size │
│ self.patch_size = patch_size │
│ self.n_patches = (img_size // patch_size) ** 2 │
│ │
│ self.proj = nn.Conv2d( │
│ in_chans, embed_dim, │
│ kernel_size=patch_size, stride=patch_size │
│ ) │
│ │
│ def forward(self, x): │
│ # x: (B, 3, 224, 224) │
│ x = self.proj(x) # (B, embed_dim, 14, 14) │
│ x = x.flatten(2) # (B, embed_dim, 196) │
│ x = x.transpose(1, 2) # (B, 196, embed_dim) │
│ return x │
│ │
└─────────────────────────────────────────────────────────────────────────┘
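The equivalence claimed above is easy to check numerically. The sketch below (illustrative, not from any ViT codebase) copies a Conv2d patch-embedding kernel into a Linear layer and confirms both give the same embeddings when patches are flattened in the conv's channel-first order.
import torch
import torch.nn as nn

# Check that Conv2d with kernel_size = stride = patch_size matches a Linear
# layer applied to flattened patches (weights copied from the conv kernel).
torch.manual_seed(0)
patch, dim = 16, 768
conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

linear = nn.Linear(3 * patch * patch, dim)
linear.weight.data = conv.weight.data.reshape(dim, -1)   # (dim, 3*16*16)
linear.bias.data = conv.bias.data

x = torch.randn(1, 3, 224, 224)
out_conv = conv(x).flatten(2).transpose(1, 2)            # (1, 196, dim)

# Flatten each patch in the same (C, kh, kw) order the conv kernel uses
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, -1)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))   # True (up to float error)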
Position Embeddings for Images
Transformers need position information. The original ViT uses simple learned 1D position embeddings; the main options for images are compared below:
┌─────────────────────────────────────────────────────────────────────────┐
│ POSITION EMBEDDINGS IN VIT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OPTION 1: Learned 1D Positions (Original ViT) │
│ ───────────────────────────────────────────── │
│ │
│ Treat patches as a 1D sequence with learned embeddings: │
│ │
│ pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim)) │
│ │
│ Patches are ordered left-to-right, top-to-bottom: │
│ [CLS] [0,0] [0,1] ... [0,13] [1,0] [1,1] ... [13,13] │
│ │
│ Model learns to associate position with spatial location. │
│ Surprisingly, this works as well as explicit 2D encodings! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 2: Learned 2D Positions │
│ ────────────────────────────── │
│ │
│ Separate embeddings for row and column: │
│ │
│ row_embed = nn.Parameter(torch.zeros(1, n_rows, embed_dim // 2)) │
│ col_embed = nn.Parameter(torch.zeros(1, n_cols, embed_dim // 2)) │
│ │
│ For patch at (i, j): │
│ pos_embed[i,j] = concat(row_embed[i], col_embed[j]) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 3: Sinusoidal 2D Positions │
│ ───────────────────────────────── │
│ │
│ Like original transformer, but for 2D: │
│ │
│ def get_2d_sincos_pos_embed(embed_dim, grid_size): │
│ grid_h = np.arange(grid_size) │
│ grid_w = np.arange(grid_size) │
│ grid = np.meshgrid(grid_w, grid_h) │
│ grid = np.stack(grid, axis=0).reshape(2, -1) │
│ │
│ # Half dims for each axis │
│ pos_embed_h = get_1d_sincos(embed_dim // 2, grid[0]) │
│ pos_embed_w = get_1d_sincos(embed_dim // 2, grid[1]) │
│ return np.concatenate([pos_embed_h, pos_embed_w], axis=1) │
│ │
│ Used by: MAE, some CLIP variants │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 4: RoPE for Vision (Modern approach) │
│ ──────────────────────────────────────────── │
│ │
│ 2D Rotary Position Embedding: │
│ • Separate rotations for x and y axes │
│ • Better extrapolation to different resolutions │
│ • Used by: EVA, some newer ViTs │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON: │
│ │
│ Method Resolution Flex Performance Complexity │
│ ─────────────────────────────────────────────────────────── │
│ Learned 1D Limited Good Simple │
│ Learned 2D Better Good Moderate │
│ Sinusoidal 2D Good Good Simple │
│ 2D RoPE Best Best Moderate │
│ │
│ For fixed resolution: Learned 1D works fine. │
│ For variable resolution: Sinusoidal or RoPE preferred. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
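For Option 3, here is a minimal sketch that fills in the get_1d_sincos helper assumed by the snippet above. It follows the MAE-style formulation, so treat it as one reasonable implementation rather than the canonical one.
import numpy as np

# 2D sin-cos position embeddings (a sketch in the MAE style). Half of the
# embedding encodes one grid axis, the other half encodes the other axis.
def get_1d_sincos(embed_dim: int, pos: np.ndarray) -> np.ndarray:
    # embed_dim must be even: half sine, half cosine
    omega = 1.0 / (10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2.0)))
    out = np.einsum("p,d->pd", pos.astype(np.float64), omega)   # (n_pos, dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)   # (n_pos, dim)

def get_2d_sincos_pos_embed(embed_dim: int, grid_size: int) -> np.ndarray:
    grid_h = np.arange(grid_size)
    grid_w = np.arange(grid_size)
    grid = np.stack(np.meshgrid(grid_w, grid_h), axis=0).reshape(2, -1)
    emb_0 = get_1d_sincos(embed_dim // 2, grid[0])   # one grid axis
    emb_1 = get_1d_sincos(embed_dim // 2, grid[1])   # the other grid axis
    return np.concatenate([emb_0, emb_1], axis=1)    # (grid_size**2, embed_dim)

print(get_2d_sincos_pos_embed(768, 14).shape)  # (196, 768)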
The [CLS] Token
Like BERT, ViT uses a special [CLS] token for classification:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE [CLS] TOKEN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PURPOSE: │
│ ──────── │
│ │
│ [CLS] token aggregates information from all patches for classification│
│ │
│ Input sequence: [CLS] [P0] [P1] [P2] ... [P195] │
│ After transformer: [CLS'] [P0'] [P1'] [P2'] ... [P195'] │
│ │
│ For classification, use only [CLS']: │
│ output = classifier_head(transformer_output[:, 0, :]) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY [CLS]? │
│ ────────── │
│ │
│ 1. AGGREGATION POINT │
│ [CLS] attends to all patches, gathers global information │
│ Doesn't have to choose which patch to use for output │
│ │
│ 2. CONSISTENT REPRESENTATION │
│ Output is always at position 0 │
│ Independent of image size / patch count │
│ │
│ 3. SAME AS BERT │
│ Makes transfer learning from NLP easier │
│ Same architecture patterns │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ALTERNATIVES: │
│ ───────────── │
│ │
│ GLOBAL AVERAGE POOLING: │
│ output = transformer_output[:, 1:, :].mean(dim=1) │
│ │
│ Average all patch representations (excluding [CLS]). │
│ Works comparably for classification. │
│ Some models (DeiT) use both [CLS] and pooling. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FOR DENSE PREDICTION: │
│ ───────────────────── │
│ │
│ For segmentation, detection: use all patch outputs. │
│ Reshape (B, 196, D) → (B, D, 14, 14) and decode. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
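A small sketch of the two readouts side by side (shapes follow ViT-Base at 224×224; tokens stands in for the transformer output):
import torch
import torch.nn as nn

# `tokens` stands for the transformer output (B, 1 + n_patches, D),
# with the [CLS] token at index 0.
B, N, D, num_classes = 8, 197, 768, 1000
tokens = torch.randn(B, N, D)
head = nn.Linear(D, num_classes)

logits_cls = head(tokens[:, 0])                  # 1) [CLS] readout (original ViT)
logits_gap = head(tokens[:, 1:].mean(dim=1))     # 2) global average pooling

# For dense prediction: drop [CLS] and reshape patch tokens to a 14x14 grid
feat_map = tokens[:, 1:].transpose(1, 2).reshape(B, D, 14, 14)
print(logits_cls.shape, logits_gap.shape, feat_map.shape)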
Full ViT Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT ARCHITECTURE OVERVIEW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT PROCESSING: │
│ ───────────────── │
│ │
│ Image (224×224×3) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │Patch Embed │ Conv2d(3, 768, k=16, s=16) │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Patches (196 × 768) │
│ │ │
│ ├──── + [CLS] token (1 × 768) │
│ │ │
│ ▼ │
│ Tokens (197 × 768) │
│ │ │
│ ├──── + Position Embeddings (197 × 768) │
│ │ │
│ ▼ │
│ Embedded Tokens (197 × 768) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRANSFORMER BLOCKS (×12 for ViT-Base): │
│ ─────────────────────────────────────── │
│ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Multi-Head │ 12 heads, head_dim = 64 │
│ │ Self-Attn │ │
│ └─────────────┘ │
│ │ │
│ ├──── + Residual │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ MLP │ 768 → 3072 → 768 (4× expansion) │
│ │ (GELU) │ │
│ └─────────────┘ │
│ │ │
│ ├──── + Residual │
│ │ │
│ ▼ │
│ (repeat 12×) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OUTPUT HEAD: │
│ ──────────── │
│ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Extract [CLS] token (1 × 768) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Classifier │ Linear(768, num_classes) │
│ │ Head │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Class logits (1 × 1000) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part II: Implementation
Complete ViT Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class PatchEmbed(nn.Module):
"""Image to Patch Embedding."""
def __init__(
self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
embed_dim: int = 768,
):
super().__init__()
self.img_size = img_size
self.patch_size = patch_size
self.n_patches = (img_size // patch_size) ** 2
# Convolutional projection (equivalent to linear on flattened patches)
self.proj = nn.Conv2d(
in_chans, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
# x: (B, 3, H, W)
x = self.proj(x) # (B, embed_dim, H/patch, W/patch)
x = x.flatten(2) # (B, embed_dim, n_patches)
x = x.transpose(1, 2) # (B, n_patches, embed_dim)
return x
class Attention(nn.Module):
"""Multi-Head Self-Attention."""
def __init__(
self,
dim: int,
n_heads: int = 12,
qkv_bias: bool = True,
attn_drop: float = 0.0,
proj_drop: float = 0.0,
):
super().__init__()
self.n_heads = n_heads
self.head_dim = dim // n_heads
self.scale = self.head_dim ** -0.5
# Combined QKV projection
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
# Generate Q, K, V
qkv = self.qkv(x) # (B, N, 3*dim)
qkv = qkv.reshape(B, N, 3, self.n_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, N, head_dim)
q, k, v = qkv[0], qkv[1], qkv[2]
# Attention
attn = (q @ k.transpose(-2, -1)) * self.scale # (B, heads, N, N)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# Combine heads
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
class MLP(nn.Module):
"""Feed-Forward Network with GELU."""
def __init__(
self,
in_features: int,
hidden_features: int = None,
out_features: int = None,
drop: float = 0.0,
):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features * 4
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = nn.GELU()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class TransformerBlock(nn.Module):
"""Transformer block with pre-normalization."""
def __init__(
self,
dim: int,
n_heads: int,
mlp_ratio: float = 4.0,
qkv_bias: bool = True,
drop: float = 0.0,
attn_drop: float = 0.0,
):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = Attention(
dim, n_heads=n_heads, qkv_bias=qkv_bias,
attn_drop=attn_drop, proj_drop=drop
)
self.norm2 = nn.LayerNorm(dim)
self.mlp = MLP(
in_features=dim,
hidden_features=int(dim * mlp_ratio),
drop=drop
)
def forward(self, x):
x = x + self.attn(self.norm1(x))
x = x + self.mlp(self.norm2(x))
return x
class VisionTransformer(nn.Module):
"""Vision Transformer (ViT)."""
def __init__(
self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
num_classes: int = 1000,
embed_dim: int = 768,
depth: int = 12,
n_heads: int = 12,
mlp_ratio: float = 4.0,
qkv_bias: bool = True,
drop_rate: float = 0.0,
attn_drop_rate: float = 0.0,
):
super().__init__()
self.num_classes = num_classes
self.embed_dim = embed_dim
# Patch embedding
self.patch_embed = PatchEmbed(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim,
)
n_patches = self.patch_embed.n_patches
# CLS token and position embeddings
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
# Transformer blocks
self.blocks = nn.ModuleList([
TransformerBlock(
dim=embed_dim,
n_heads=n_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
drop=drop_rate,
attn_drop=attn_drop_rate,
)
for _ in range(depth)
])
# Output
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
# Initialize weights
self._init_weights()
def _init_weights(self):
# Initialize position embeddings
nn.init.trunc_normal_(self.pos_embed, std=0.02)
nn.init.trunc_normal_(self.cls_token, std=0.02)
# Initialize linear layers
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.zeros_(m.bias)
def forward(self, x):
B = x.shape[0]
# Patch embedding
x = self.patch_embed(x) # (B, n_patches, embed_dim)
# Add CLS token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # (B, n_patches + 1, embed_dim)
# Add position embeddings
x = x + self.pos_embed
x = self.pos_drop(x)
# Transformer blocks
for block in self.blocks:
x = block(x)
# Output
x = self.norm(x)
cls_output = x[:, 0] # CLS token
logits = self.head(cls_output)
return logits
# Model variants
def vit_tiny(num_classes=1000):
return VisionTransformer(
embed_dim=192, depth=12, n_heads=3, num_classes=num_classes
)
def vit_small(num_classes=1000):
return VisionTransformer(
embed_dim=384, depth=12, n_heads=6, num_classes=num_classes
)
def vit_base(num_classes=1000):
return VisionTransformer(
embed_dim=768, depth=12, n_heads=12, num_classes=num_classes
)
def vit_large(num_classes=1000):
return VisionTransformer(
embed_dim=1024, depth=24, n_heads=16, num_classes=num_classes
)
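A quick smoke test of the implementation above (assuming the classes and factory functions are in scope):
# Quick smoke test of the implementation above
model = vit_base(num_classes=1000)
x = torch.randn(2, 3, 224, 224)
logits = model(x)
print(logits.shape)  # torch.Size([2, 1000])

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~86M for ViT-Base/16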
Part III: Why ViT Works
Scale is Key
The original ViT paper showed that transformers need lots of data to outperform CNNs:
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT SCALING BEHAVIOR │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY FINDING: │
│ ──────────── │
│ │
│ ViT underperforms CNNs on small datasets. │
│ ViT outperforms CNNs on large datasets. │
│ │
│ The crossover point: ~10-100 million images. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS (ImageNet-1K accuracy): │
│ │
│ Training Data ResNet-152 ViT-Large │
│ ──────────────────────────────────────────────────── │
│ ImageNet-1K 78.5% 76.5% (ViT worse) │
│ ImageNet-21K 80.0% 82.0% (ViT better) │
│ JFT-300M 81.0% 87.8% (ViT much better) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY SCALE MATTERS: │
│ ────────────────── │
│ │
│ CNNs have strong inductive biases: │
│ • Translation equivariance (convolutions) │
│ • Locality (small receptive fields initially) │
│ • Hierarchical features (pooling) │
│ │
│ These biases help when data is limited. │
│ │
│ ViT has weak inductive biases: │
│ • Only bias: 2D structure of patches │
│ • Attention can learn any pattern │
│ • More flexible but needs more data to learn │
│ │
│ With enough data, ViT's flexibility becomes an advantage. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MODERN SOLUTION: Better training, not more data │
│ ─────────────────────────────────────────────── │
│ │
│ DeiT (Data-efficient Image Transformers) showed: │
│ With proper training (augmentation, regularization), │
│ ViT can match CNNs on ImageNet-1K alone! │
│ │
│ Key improvements: │
│ • Strong augmentation (RandAugment, Mixup, CutMix) │
│ • Regularization (DropPath, LayerScale) │
│ • Knowledge distillation from CNN teachers │
│ │
└─────────────────────────────────────────────────────────────────────────┘
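The regularizers in the DeiT recipe are small additions to the architecture above. As one example, here is a minimal sketch of DropPath (stochastic depth); the Part II implementation omits it, and DeiT-style recipes wrap the attention and MLP branches with it, e.g. x = x + drop_path(attn(norm1(x))).
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth (a sketch): randomly skip a residual branch per sample."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over tokens and channels
        mask = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device) < keep_prob
        return x * mask.to(x.dtype) / keep_prob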
What Does ViT Learn?
┌─────────────────────────────────────────────────────────────────────────┐
│ WHAT VIT LEARNS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ATTENTION PATTERNS: │
│ ─────────────────── │
│ │
│ Early layers: │
│ • Local attention (attend to nearby patches) │
│ • Similar to early CNN layers │
│ • Capture edges, textures │
│ │
│ Middle layers: │
│ • Mix of local and global attention │
│ • Attend to semantically related patches │
│ • E.g., all patches containing "dog" attend to each other │
│ │
│ Later layers: │
│ • Global attention patterns │
│ • [CLS] attends to discriminative regions │
│ • Task-relevant feature aggregation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ POSITION EMBEDDINGS: │
│ ──────────────────── │
│ │
│ Learned position embeddings show 2D structure! │
│ │
│ Similarity between position embeddings: │
│ • Nearby positions have similar embeddings │
│ • Horizontal/vertical neighbors more similar than diagonal │
│ • Model learns 2D grid from 1D indexing │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON TO CNNs: │
│ ─────────────────── │
│ │
│ CNNs: Forced locality in early layers → Gradual increase │
│ ViT: Can attend globally from layer 1 → Chooses to be local │
│ │
│ ViT LEARNS the inductive biases that CNNs have hardcoded! │
│ With enough data, learned biases can be more appropriate. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
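The position-embedding structure described above is easy to reproduce with the Part II model: compute the cosine similarity between one patch's learned position embedding and all others, then reshape it back to the 14×14 grid. With trained weights the map shows clear row/column structure; with the randomly initialized model it is just noise.
import torch.nn.functional as F

# Similarity of one learned position embedding to all others (a sketch).
model = vit_base()
pos = model.pos_embed[0, 1:]                      # (196, 768), skip [CLS]
query = pos[6 * 14 + 7]                           # patch at (row 6, col 7)
sim = F.cosine_similarity(query[None, :], pos, dim=-1)   # (196,)
sim_grid = sim.reshape(14, 14).detach()           # visualize e.g. with plt.imshow
print(sim_grid.shape)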
Part IV: ViT Variants and Evolution
Key ViT Variants
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT VARIANTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DeiT (Data-efficient Image Transformer): │
│ ───────────────────────────────────────── │
│ • Same architecture as ViT │
│ • Better training recipe │
│ • Distillation token for knowledge transfer │
│ • Matches ViT-JFT using only ImageNet-1K │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Swin Transformer (Hierarchical): │
│ ───────────────────────────────── │
│ • Hierarchical feature maps (like CNNs) │
│ • Shifted window attention (local, efficient) │
│ • Good for dense prediction (detection, segmentation) │
│ • Complexity: O(n) instead of O(n²) │
│ │
│ Architecture: │
│ Stage 1: 56×56, 96 dim → Patch merge → │
│ Stage 2: 28×28, 192 dim → Patch merge → │
│ Stage 3: 14×14, 384 dim → Patch merge → │
│ Stage 4: 7×7, 768 dim │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BEiT (BERT pre-training for images): │
│ ────────────────────────────────── │
│ • Masked image modeling (like BERT) │
│ • Discrete visual tokens via VQ-VAE │
│ • Strong transfer learning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MAE (Masked Autoencoder): │
│ ───────────────────────── │
│ • Mask 75% of patches during training │
│ • Reconstruct masked patches │
│ • Very efficient pre-training │
│ • Enables huge models with less compute │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CLIP ViT (Contrastive Language-Image Pre-training): │
│ ───────────────────────────────────────────────────── │
│ • Pre-trained with image-text pairs │
│ • Zero-shot classification │
│ • Foundation for multimodal models │
│ • Powers DALL-E, GPT-4V, etc. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
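The MAE entry above comes down to one operation: random masking. Below is a minimal sketch of per-sample 75% masking in the shuffle-and-slice style of the MAE paper (function name and shapes are illustrative).
import torch

# MAE-style random masking sketch: keep 25% of patch tokens per sample.
def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)      # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)             # ascending: low noise = keep
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    mask = torch.ones(B, N, device=x.device)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = visible, 1 = masked
    return x_visible, mask

tokens = torch.randn(4, 196, 768)
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))  # (4, 49, 768); 147 masked per sample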
Model Size Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT MODEL SIZES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Layers Hidden Heads Patch │
│ ───────────────────────────────────────────────────────────────────── │
│ ViT-Ti/16 6M 12 192 3 16 │
│ ViT-S/16 22M 12 384 6 16 │
│ ViT-B/16 86M 12 768 12 16 │
│ ViT-B/32 88M 12 768 12 32 │
│ ViT-L/16 307M 24 1024 16 16 │
│ ViT-L/32 306M 24 1024 16 32 │
│ ViT-H/14 632M 32 1280 16 14 │
│ ViT-G/14 1.8B 40 1664 16 14 │
│ ViT-22B 22B 48 6144 48 14 │
│ │
│ Notation: ViT-{Size}/{Patch_Size} │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPUTE COMPARISON (224×224 input): │
│ │
│ Model GFLOPs Tokens │
│ ─────────────────────────────────── │
│ ViT-B/32 4.4 50 │
│ ViT-B/16 17.6 197 │
│ ViT-L/16 61.6 197 │
│ ViT-H/14 167.4 257 │
│ │
│ ResNet-50 4.1 - │
│ ResNet-152 11.5 - │
│ │
│ ViT-B/16 is ~4× more compute than ResNet-50 for similar accuracy. │
│ But ViT scales better: ViT-H beats everything despite more compute. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part V: Multimodal Vision Transformers
CLIP and Vision-Language Models
┌─────────────────────────────────────────────────────────────────────────┐
│ CLIP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLIP TRAINING: │
│ ────────────── │
│ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Image │ │ Text │ │
│ │ (224×224) │ │ "a dog..." │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ ViT Encoder │ │ Text Encoder │ │
│ │ (ViT-L/14) │ │ (Transformer) │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ Image embedding Text embedding │
│ (768-dim) (768-dim) │
│ │ │ │
│ └──────────────┬─────────────────────┘ │
│ ▼ │
│ Contrastive Loss │
│ (match image-text pairs in batch) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ZERO-SHOT CLASSIFICATION: │
│ ───────────────────────── │
│ │
│ No training on downstream dataset! │
│ │
│ 1. Create text prompts for each class: │
│ "a photo of a cat", "a photo of a dog", ... │
│ │
│ 2. Encode all prompts with text encoder │
│ │
│ 3. Encode image with image encoder │
│ │
│ 4. Compute similarity: image_emb · text_emb │
│ │
│ 5. Highest similarity = predicted class │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY CLIP MATTERS: │
│ ───────────────── │
│ │
│ • Pre-trained on 400M image-text pairs from internet │
│ • Generalizes to novel concepts ("zero-shot") │
│ • Foundation for image generation (DALL-E, Stable Diffusion) │
│ • Foundation for multimodal LLMs (GPT-4V, LLaVA) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
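Zero-shot classification as described above takes only a few lines with the Hugging Face transformers CLIP wrappers; the model id, image path, and prompts below are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification sketch (model id, image path, and labels are examples)
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in labels]
image = Image.open("cat.jpg")

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)   # (1, num_prompts)
print(dict(zip(labels, probs[0].tolist())))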
Vision Language Models (GPT-4V, LLaVA)
┌─────────────────────────────────────────────────────────────────────────┐
│ VISION LANGUAGE MODELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE PATTERN: │
│ ──────────────────── │
│ │
│ Image Text │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ViT Encoder │ │ Tokenizer │ │
│ │ (CLIP/EVA) │ │ │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ Projector │ (MLP to align dims) │ │
│ └─────────────┘ │ │
│ │ │ │
│ └──────────────┬───────────────────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ LLM │ │
│ │ (Llama, │ │
│ │ Mistral) │ │
│ └───────────┘ │
│ │ │
│ ▼ │
│ Response │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY COMPONENTS: │
│ ─────────────── │
│ │
│ 1. Vision Encoder: Pre-trained ViT (CLIP, SigLIP, EVA) │
│ - Often frozen during training │
│ - Converts image to sequence of embeddings │
│ │
│ 2. Projector: Aligns vision and text embedding spaces │
│ - Simple: Linear layer │
│ - Complex: MLP, cross-attention │
│ │
│ 3. LLM: Pre-trained language model │
│ - May be fine-tuned or frozen │
│ - Processes image tokens + text tokens together │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: LLaVA Input Sequence │
│ ───────────────────────────── │
│ │
│ [Image tokens] [BOS] User: What's in this image? [/INST] │
│ Assistant: This image shows a cat sitting on a windowsill. │
│ │
│ Image tokens (e.g., 576 tokens from 24×24 patches) │
│ are prepended to text tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
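A minimal sketch of the glue described above, with illustrative dimensions (a frozen ViT emitting 576 tokens of size 1024, an LLM hidden size of 4096, and a LLaVA-1.5-style two-layer MLP projector):
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                    # illustrative sizes

projector = nn.Sequential(                          # LLaVA-1.5-style MLP projector
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vision_dim)    # from a frozen ViT encoder
text_embeds = torch.randn(1, 32, llm_dim)           # from the LLM's embedding table

image_tokens = projector(image_features)            # (1, 576, llm_dim)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)   # (1, 608, llm_dim)
print(llm_input.shape)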
Part VI: Recent Innovations (2024-2025)
SigLIP and SigLIP 2
SigLIP (Sigmoid Loss for Language-Image Pre-training) improved CLIP's training efficiency:
┌─────────────────────────────────────────────────────────────────────────┐
│ SIGLIP: IMPROVED CONTRASTIVE LEARNING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLIP's LIMITATION: │
│ ────────────────── │
│ │
│ CLIP uses softmax contrastive loss: │
│ - Requires large batch sizes (32K+) for good negatives │
│ - Expensive distributed training │
│ - Loss requires all-to-all comparison in batch │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SIGLIP'S SOLUTION: │
│ ────────────────── │
│ │
│ Replace softmax with sigmoid (binary classification): │
│ │
│ CLIP loss: softmax(img_i · text_j / τ) for all j │
│ SigLIP loss: sigmoid(img_i · text_i / τ) for positive pairs │
│ sigmoid(-img_i · text_j / τ) for negative pairs │
│ │
│ Each pair is classified independently! │
│ No need for all-to-all comparison. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BENEFITS: │
│ ───────── │
│ │
│ • Works with smaller batch sizes (4K vs 32K) │
│ • 4× faster training at same quality │
│ • Better scaling to larger models │
│ • Simpler distributed training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│  SIGLIP 2 IMPROVEMENTS (2025):                                         │
│ ───────────────────────────── │
│ │
│ 1. CAPTIONING LOSS: │
│ Add image captioning objective alongside contrastive │
│ Better fine-grained understanding │
│ │
│ 2. SELF-FILTERING: │
│ Use model to filter low-quality image-text pairs │
│ Improves training data quality automatically │
│ │
│ 3. NATIVE RESOLUTION: │
│ Train at multiple resolutions │
│ Better handling of aspect ratios │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS: │
│ │
│ Model ImageNet Zero-shot Training Cost │
│ ───────────────────────────────────────────────────────────── │
│ CLIP ViT-L/14 75.5% ~12K GPU-hours │
│ SigLIP ViT-L 78.2% ~3K GPU-hours │
│ SigLIP 2 ViT-L 80.1% ~4K GPU-hours │
│ │
│ SigLIP 2 is the default vision encoder for many VLMs in 2024-2025. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
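For concreteness, here is a sketch of the pairwise sigmoid loss described above. It follows the formulation in the SigLIP paper, with the learnable temperature and bias fixed to constants and the embeddings assumed to be L2-normalized.
import torch
import torch.nn.functional as F

# Pairwise sigmoid loss sketch. In SigLIP, t and b are learnable scalars;
# they are fixed here for simplicity.
def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    logits = img_emb @ txt_emb.T * t + b           # (B, B) pairwise similarities
    labels = 2 * torch.eye(logits.shape[0]) - 1    # +1 on the diagonal, -1 off it
    # Every (image, text) pair is an independent binary classification
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_loss(img, txt))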
DINOv2: Self-Supervised Vision Foundation Models
┌─────────────────────────────────────────────────────────────────────────┐
│ DINOV2: UNIVERSAL VISUAL FEATURES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DINO (Self-Distillation with No Labels): │
│ ───────────────────────────────────────── │
│ │
│ Student-teacher self-supervised learning: │
│ • Student sees augmented crops │
│ • Teacher sees different augmented crops │
│ • Student learns to match teacher's output │
│ • Teacher is exponential moving average of student │
│ │
│ No labels needed! Learns from image structure alone. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DINOV2 IMPROVEMENTS: │
│ ──────────────────── │
│ │
│ 1. SCALE: │
│ • Trained on 142M curated images (LVD-142M) │
│ • Model sizes up to ViT-g (1.1B params) │
│ │
│ 2. DATA CURATION: │
│ • Automatic deduplication │
│ • Quality filtering │
│ • Balanced sampling │
│ │
│ 3. TRAINING: │
│ • Combined self-distillation + masked image modeling │
│ • Multi-crop strategy at different resolutions │
│ • KoLeo regularizer for uniform feature distribution │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY PROPERTIES: │
│ ─────────────── │
│ │
│ • Semantic segmentation emerges without training! │
│ PCA of patch features shows object boundaries │
│ │
│ • Depth estimation from features alone │
│ Linear probe achieves good depth prediction │
│ │
│ • Cross-domain transfer │
│ Features work on art, medical images, satellite, etc. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USAGE PATTERN: │
│ ────────────── │
│ │
│ import torch │
│ from transformers import AutoImageProcessor, AutoModel │
│ │
│ processor = AutoImageProcessor.from_pretrained( │
│ 'facebook/dinov2-large' │
│ ) │
│ model = AutoModel.from_pretrained('facebook/dinov2-large') │
│ │
│    # Get features; index 0 is [CLS], rest are patches.                 │
│ inputs = processor(images=image, return_tensors="pt") │
│ features = model(**inputs).last_hidden_state │
│ │
│ # For classification: add linear probe │
│ # For segmentation: reshape features to spatial grid │
│ │
└─────────────────────────────────────────────────────────────────────────┘
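Continuing from the usage snippet in the box, the emergent-segmentation effect can be eyeballed with a PCA of the patch features. Note that last_hidden_state carries the [CLS] token at index 0, so the sketch drops it first (grid size is inferred from the sequence length).
import torch

# PCA of patch features (a sketch). `features` is the last_hidden_state from
# the snippet above, shape (B, 1 + n_patches, D).
def pca_patch_map(features: torch.Tensor, k: int = 3) -> torch.Tensor:
    patch_feats = features[:, 1:].detach()         # drop the [CLS] token
    B, N, D = patch_feats.shape
    grid = int(N ** 0.5)
    flat = patch_feats.reshape(-1, D)
    flat = flat - flat.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(flat, q=k)         # top-k principal directions
    return (flat @ v).reshape(B, grid, grid, k)

# pca_map = pca_patch_map(features)   # visualize pca_map[0] as an RGB image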
EVA and EVA-02: Scaling Vision Encoders
┌─────────────────────────────────────────────────────────────────────────┐
│ EVA: EXPLORING VISION TRANSFORMERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EVA APPROACH: │
│ ───────────── │
│ │
│ Pre-train ViT with masked image modeling using CLIP features as target│
│ Then fine-tune with contrastive learning. │
│ │
│ Stage 1: MIM Pre-training │
│ • Mask 40% of patches │
│ • Predict CLIP features of masked patches │
│ • Faster convergence than predicting pixels │
│ │
│ Stage 2: Contrastive Fine-tuning │
│ • Standard CLIP-style training │
│ • Uses pre-trained weights from Stage 1 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EVA-02 IMPROVEMENTS: │
│ ──────────────────── │
│ │
│ • Scaled to 4.4B parameters (EVA-02-E) │
│ • Native 448×448 resolution │
│ • Multi-scale training │
│ • Improved architecture details (RMSNorm, SwiGLU) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USE IN VLMS: │
│ ──────────── │
│ │
│  EVA-CLIP is a popular vision encoder for VLMs:                        │
│  • BLIP-2 and MiniGPT-4 use EVA-CLIP (ViT-g)                           │
│ • Many open-source VLMs prefer EVA over original CLIP │
│ • Better fine-grained understanding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
InternViT and InternVL
┌─────────────────────────────────────────────────────────────────────────┐
│ INTERNVIT: LARGE-SCALE VISION FOUNDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INTERNVIT-6B: │
│ ───────────── │
│ │
│ Largest open-source vision encoder (6B parameters) │
│ │
│ Architecture: │
│ • 48 layers, hidden dim 3200 │
│ • 25 attention heads │
│ • 448×448 native resolution, 14×14 patches │
│ • 1024 patch tokens │
│ │
│ Training: │
│ • Contrastive (CLIP-style) + generative objectives │
│ • 1B image-text pairs │
│ • Dynamic resolution training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTERNVL (Vision-Language): │
│ ─────────────────────────── │
│ │
│ InternViT + InternLM2 = InternVL │
│ │
│ • State-of-the-art open-source VLM │
│ • Dynamic high-resolution processing │
│ • Supports up to 4K resolution via tiling │
│ │
│ High-resolution strategy: │
│ • Divide large image into 448×448 tiles │
│ • Process each tile with InternViT │
│ • Concatenate tile features │
│ • Enables document/chart understanding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
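The tiling strategy described above reduces to a simple tensor operation before the encoder. A minimal sketch (tile size and image shape are illustrative; real pipelines usually add a downscaled thumbnail tile for global context):
import torch

# Split a large image into non-overlapping 448x448 tiles (a sketch; assumes
# the image has already been resized so H and W are multiples of the tile size).
def tile_image(image: torch.Tensor, tile: int = 448) -> torch.Tensor:
    B, C, H, W = image.shape
    tiles = image.unfold(2, tile, tile).unfold(3, tile, tile)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, tile, tile)
    return tiles                                   # (B * n_tiles, C, tile, tile)

image = torch.randn(1, 3, 896, 1344)               # e.g. a document scan
tiles = tile_image(image)
print(tiles.shape)                                 # torch.Size([6, 3, 448, 448])
# Each tile goes through the vision encoder; the token sequences are then
# concatenated (plus the thumbnail) before the projector and the LLM.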
Updated Vision Encoder Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ VISION ENCODERS (2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Training Best Use Case │
│ ───────────────────────────────────────────────────────────────────── │
│ CLIP ViT-L 400M Contrastive Zero-shot, legacy │
│ SigLIP 2 400M Sigmoid + Cap VLMs (efficient) │
│ PaliGemma 2 400M+ SigLIP + Gemma2 OCR, documents (2025) │
│ DINOv2-L 300M Self-supervised Dense prediction │
│ DINOv3-7B 7B Self-supervised Best dense features (2025) │
│ EVA-02-L 300M MIM + Contrast VLMs (quality) │
│ InternVL 3 6B Native MM Best open VLM (Apr 2025) │
│ InternViT-6B 6B Mixed Highest resolution │
│ │
│ RECOMMENDATIONS: │
│ ──────────────── │
│ │
│ For VLMs (general): │
│ • SigLIP 2: Best efficiency/quality tradeoff │
│ • EVA-CLIP: Better fine-grained understanding │
│ │
│ For dense prediction (segmentation, depth): │
│ • DINOv3 (7B): +6 mIoU vs DINOv2, best dense features (Aug 2025) │
│ • DINOv2: Good transfer, smaller model option │
│ │
│ For high-resolution documents/charts: │
│ • InternViT with tiling │
│ │
│ For research/maximum quality: │
│ • EVA-02-E (4.4B) or InternViT-6B │
│ │
│ TREND: VLMs increasingly use SigLIP 2 as default encoder │
│ (PaliGemma, Qwen-VL-2, many open-source models) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
Vision Transformers apply the transformer architecture to images with minimal modification:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CORE IDEA: │
│ • Split image into patches (16×16 typical) │
│ • Embed patches as tokens │
│ • Add position embeddings and [CLS] token │
│ • Process with standard transformer │
│ │
│ SCALING BEHAVIOR: │
│ • ViT needs more data than CNNs (weaker inductive bias) │
│ • With enough data/training, ViT outperforms CNNs │
│ • Modern training recipes (DeiT) close the data gap │
│ │
│ KEY VARIANTS: │
│ • ViT: Original, global attention │
│ • Swin: Hierarchical, local attention, efficient │
│ • MAE: Masked pre-training, very efficient │
│ • CLIP: Vision-language pre-training, zero-shot │
│ │
│ MULTIMODAL: │
│ • ViT is the vision backbone for most multimodal models │
│ • GPT-4V, LLaVA, etc. use ViT + LLM │
│ • CLIP provides aligned vision-text representations │
│ │
│ 2024-2025 INNOVATIONS: │
│ • DINOv3 (Aug 2025): 7B params, 1.7B images, Gram Anchoring │
│ - +6 mIoU on segmentation, +6.7 J&F on video tracking │
│ - Distills to smaller models (ViT-B, ViT-L, ConvNeXt) │
│ • InternVL 3 (Apr 2025): V2PE, Native Multimodal Pre-Training │
│ - 72.2 MMMU (first open-source >70%), 89.7% ChartQA │
│ - Variable Visual Position Encoding for long context │
│ • PaliGemma 2: SigLIP-So400m + Gemma 2, SOTA on OCR │
│ - 9 variants: 2B/9B/27B × 224/448/896px resolution │
│ • SigLIP 2: 4× faster training, default for VLMs │
│ • TokenFlow (CVPR 2025): Unified image tokenizer │
│ - Dual-codebook: semantic + pixel-level features │
│ • OpenVision (UC Santa Cruz, May 2025): │
│ - Open-source alternative to CLIP/SigLIP │
│ - 2-3× faster training, matches/beats on TextVQA, ChartQA │
│ • Janus-Pro-7B: DeepSeek unified multimodal │
│ - Decoupled visual encoding for understanding vs generation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Related Articles
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
Multimodal LLMs: Vision, Audio, and Beyond
A comprehensive guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.