Vision Transformers (ViT): Applying Transformers to Images
A comprehensive deep dive into Vision Transformers—how the transformer architecture adapts from text to images. Understand patch embeddings, position encoding for images, and why ViT has revolutionized computer vision.
From Text to Images
The transformer architecture, originally designed for text, has proven remarkably general. Vision Transformer (ViT) applies transformers to images with minimal modifications, achieving state-of-the-art results in computer vision.
The key insight is treating images like sequences: split an image into patches, flatten each patch into a vector, and process them with a standard transformer. This simple idea, combined with large-scale pretraining, matches or exceeds CNN performance.
2025: The era of vision foundation models. DINOv3, released by Meta in August 2025, represents the new frontier—6.7 billion parameters, trained on 1.7 billion images, with breakthrough innovations:
- RoPE position embeddings enabling variable resolution (256×256 to 4096×4096) without retraining
- Gram anchoring for stable dense features in segmentation tasks
- 88.4% ImageNet accuracy self-supervised, outperforming DINOv2 (87.3%)
- 86.6 mIoU on PASCAL VOC segmentation vs DINOv2's 83.1
The self-supervised vs supervised tradeoff: According to DINOv3 analysis, weakly-supervised models like SigLIP-2 (89.1% ImageNet) win on classification, but self-supervised DINOv3 dominates on dense tasks (segmentation, depth, 3D awareness). Text supervision is great for "what's in the image" but not for purely visual tasks.
Understanding ViT is important because it bridges NLP and vision, enables multimodal models (like GPT-4V), and has become the foundation for modern vision architectures. This post covers how ViT works, why it works, and how it's evolved.
Part I: The ViT Architecture
From Pixels to Patches
The fundamental challenge: images are 2D grids of pixels, but transformers process sequences of tokens. ViT's solution: split the image into patches and treat patches as tokens.
┌─────────────────────────────────────────────────────────────────────────┐
│ IMAGE TO PATCHES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROCESS: │
│ ──────────── │
│ │
│ Input image: 224 × 224 × 3 (H × W × RGB) │
│ Patch size: 16 × 16 │
│ Number of patches: (224/16) × (224/16) = 14 × 14 = 196 │
│ │
│ Each patch: 16 × 16 × 3 = 768 values │
│ Sequence length: 196 patches + 1 [CLS] token = 197 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ VISUALIZATION: │
│ ────────────── │
│ │
│ Original Image (224×224): │
│ ┌────────────────────────────────┐ │
│ │ │ │
│ │ [Photo of a cat] │ │
│ │ │ │
│ │ │ │
│ └────────────────────────────────┘ │
│ │
│ Split into 14×14 patches: │
│ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│ │ 0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│ 8│ 9│10│11│12│13│ (row 0) │
│ ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤ │
│ │14│15│16│17│18│19│20│21│22│23│24│25│26│27│ (row 1) │
│ ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ... │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│ │
│ Each patch becomes a "token": │
│ [CLS] [P0] [P1] [P2] ... [P195] │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PATCH SIZES: │
│ ──────────── │
│ │
│ Patch Size Patches (224²) Sequence Length │
│ ───────────────────────────────────────────────── │
│ 32 × 32 7 × 7 = 49 50 (+ [CLS]) │
│ 16 × 16 14 × 14 = 196 197 │
│ 14 × 14 16 × 16 = 256 257 │
│ 8 × 8 28 × 28 = 784 785 │
│ │
│ Smaller patches = more tokens = more compute but finer detail. │
│ 16×16 is the most common default. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
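To make the patchify arithmetic concrete, here is a minimal sketch using plain tensor reshapes (the patchify helper is ours, not from any library); it emits patches in the same left-to-right, top-to-bottom order as the numbering above.
import torch

# Minimal patchify sketch (assumes H and W are divisible by the patch size).
# Patches come out in row-major order, matching the numbering in the figure.
def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    B, C, H, W = images.shape
    ph, pw = H // patch_size, W // patch_size
    x = images.reshape(B, C, ph, patch_size, pw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                 # (B, ph, pw, p, p, C)
    return x.reshape(B, ph * pw, patch_size * patch_size * C)

x = torch.randn(2, 3, 224, 224)
print(patchify(x).shape)  # torch.Size([2, 196, 768])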
Patch Embedding
Each patch is flattened and projected to the model's hidden dimension:
┌─────────────────────────────────────────────────────────────────────────┐
│ PATCH EMBEDDING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METHOD 1: Linear Projection (Original ViT) │
│ ─────────────────────────────────────────── │
│ │
│ 1. Flatten each patch: 16×16×3 = 768 values │
│ 2. Linear projection: 768 → hidden_dim (e.g., 768) │
│ │
│ patch_embed = nn.Linear(patch_size² × channels, hidden_dim) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ METHOD 2: Convolutional Projection (Common variant) │
│ ──────────────────────────────────────────────────── │
│ │
│ Use Conv2d with kernel_size = patch_size, stride = patch_size │
│ │
│ patch_embed = nn.Conv2d( │
│ in_channels=3, │
│ out_channels=hidden_dim, │
│ kernel_size=16, │
│ stride=16 │
│ ) │
│ │
│ This is mathematically equivalent to linear projection │
│ but more efficient (single operation, not flatten + matmul). │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ IMPLEMENTATION: │
│ ─────────────── │
│ │
│ class PatchEmbed(nn.Module): │
│ def __init__(self, img_size=224, patch_size=16, in_chans=3, │
│ embed_dim=768): │
│ super().__init__() │
│ self.img_size = img_size │
│ self.patch_size = patch_size │
│ self.n_patches = (img_size // patch_size) ** 2 │
│ │
│ self.proj = nn.Conv2d( │
│ in_chans, embed_dim, │
│ kernel_size=patch_size, stride=patch_size │
│ ) │
│ │
│ def forward(self, x): │
│ # x: (B, 3, 224, 224) │
│ x = self.proj(x) # (B, embed_dim, 14, 14) │
│ x = x.flatten(2) # (B, embed_dim, 196) │
│ x = x.transpose(1, 2) # (B, 196, embed_dim) │
│ return x │
│ │
└─────────────────────────────────────────────────────────────────────────┘
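The equivalence claimed above is easy to check numerically. The sketch below (illustrative, not from any ViT codebase) copies a Conv2d patch-embedding kernel into a Linear layer and confirms both give the same embeddings when patches are flattened in the conv's channel-first order.
import torch
import torch.nn as nn

# Check that Conv2d with kernel_size = stride = patch_size matches a Linear
# layer applied to flattened patches (weights copied from the conv kernel).
torch.manual_seed(0)
patch, dim = 16, 768
conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

linear = nn.Linear(3 * patch * patch, dim)
linear.weight.data = conv.weight.data.reshape(dim, -1)   # (dim, 3*16*16)
linear.bias.data = conv.bias.data

x = torch.randn(1, 3, 224, 224)
out_conv = conv(x).flatten(2).transpose(1, 2)            # (1, 196, dim)

# Flatten each patch in the same (C, kh, kw) order the conv kernel uses
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, -1)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))   # True (up to float error)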
Position Embeddings for Images
Transformers need position information. The original ViT uses simple learned 1D position embeddings; the main options for images are compared below:
┌─────────────────────────────────────────────────────────────────────────┐
│ POSITION EMBEDDINGS IN VIT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OPTION 1: Learned 1D Positions (Original ViT) │
│ ───────────────────────────────────────────── │
│ │
│ Treat patches as a 1D sequence with learned embeddings: │
│ │
│ pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim)) │
│ │
│ Patches are ordered left-to-right, top-to-bottom: │
│ [CLS] [0,0] [0,1] ... [0,13] [1,0] [1,1] ... [13,13] │
│ │
│ Model learns to associate position with spatial location. │
│ Surprisingly, this works as well as explicit 2D encodings! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 2: Learned 2D Positions │
│ ────────────────────────────── │
│ │
│ Separate embeddings for row and column: │
│ │
│ row_embed = nn.Parameter(torch.zeros(1, n_rows, embed_dim // 2)) │
│ col_embed = nn.Parameter(torch.zeros(1, n_cols, embed_dim // 2)) │
│ │
│ For patch at (i, j): │
│ pos_embed[i,j] = concat(row_embed[i], col_embed[j]) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 3: Sinusoidal 2D Positions │
│ ───────────────────────────────── │
│ │
│ Like original transformer, but for 2D: │
│ │
│ def get_2d_sincos_pos_embed(embed_dim, grid_size): │
│ grid_h = np.arange(grid_size) │
│ grid_w = np.arange(grid_size) │
│ grid = np.meshgrid(grid_w, grid_h) │
│ grid = np.stack(grid, axis=0).reshape(2, -1) │
│ │
│ # Half dims for each axis │
│ pos_embed_h = get_1d_sincos(embed_dim // 2, grid[0]) │
│ pos_embed_w = get_1d_sincos(embed_dim // 2, grid[1]) │
│ return np.concatenate([pos_embed_h, pos_embed_w], axis=1) │
│ │
│ Used by: MAE, some CLIP variants │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 4: RoPE for Vision (Modern approach) │
│ ──────────────────────────────────────────── │
│ │
│ 2D Rotary Position Embedding: │
│ • Separate rotations for x and y axes │
│ • Better extrapolation to different resolutions │
│ • Used by: EVA, some newer ViTs │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON: │
│ │
│ Method Resolution Flex Performance Complexity │
│ ─────────────────────────────────────────────────────────── │
│ Learned 1D Limited Good Simple │
│ Learned 2D Better Good Moderate │
│ Sinusoidal 2D Good Good Simple │
│ 2D RoPE Best Best Moderate │
│ │
│ For fixed resolution: Learned 1D works fine. │
│ For variable resolution: Sinusoidal or RoPE preferred. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
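For Option 3, here is a minimal sketch that fills in the get_1d_sincos helper assumed by the snippet above. It follows the MAE-style formulation, so treat it as one reasonable implementation rather than the canonical one.
import numpy as np

# 2D sin-cos position embeddings (a sketch in the MAE style). Half of the
# embedding encodes one grid axis, the other half encodes the other axis.
def get_1d_sincos(embed_dim: int, pos: np.ndarray) -> np.ndarray:
    # embed_dim must be even: half sine, half cosine
    omega = 1.0 / (10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2.0)))
    out = np.einsum("p,d->pd", pos.astype(np.float64), omega)   # (n_pos, dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)   # (n_pos, dim)

def get_2d_sincos_pos_embed(embed_dim: int, grid_size: int) -> np.ndarray:
    grid_h = np.arange(grid_size)
    grid_w = np.arange(grid_size)
    grid = np.stack(np.meshgrid(grid_w, grid_h), axis=0).reshape(2, -1)
    emb_0 = get_1d_sincos(embed_dim // 2, grid[0])   # one grid axis
    emb_1 = get_1d_sincos(embed_dim // 2, grid[1])   # the other grid axis
    return np.concatenate([emb_0, emb_1], axis=1)    # (grid_size**2, embed_dim)

print(get_2d_sincos_pos_embed(768, 14).shape)  # (196, 768)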
The [CLS] Token
Like BERT, ViT uses a special [CLS] token for classification:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE [CLS] TOKEN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PURPOSE: │
│ ──────── │
│ │
│ [CLS] token aggregates information from all patches for classification│
│ │
│ Input sequence: [CLS] [P0] [P1] [P2] ... [P195] │
│ After transformer: [CLS'] [P0'] [P1'] [P2'] ... [P195'] │
│ │
│ For classification, use only [CLS']: │
│ output = classifier_head(transformer_output[:, 0, :]) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY [CLS]? │
│ ────────── │
│ │
│ 1. AGGREGATION POINT │
│ [CLS] attends to all patches, gathers global information │
│ Doesn't have to choose which patch to use for output │
│ │
│ 2. CONSISTENT REPRESENTATION │
│ Output is always at position 0 │
│ Independent of image size / patch count │
│ │
│ 3. SAME AS BERT │
│ Makes transfer learning from NLP easier │
│ Same architecture patterns │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ALTERNATIVES: │
│ ───────────── │
│ │
│ GLOBAL AVERAGE POOLING: │
│ output = transformer_output[:, 1:, :].mean(dim=1) │
│ │
│ Average all patch representations (excluding [CLS]). │
│ Works comparably for classification. │
│ Some models (DeiT) use both [CLS] and pooling. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FOR DENSE PREDICTION: │
│ ───────────────────── │
│ │
│ For segmentation, detection: use all patch outputs. │
│ Reshape (B, 196, D) → (B, D, 14, 14) and decode. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
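A small sketch of the two readouts side by side (shapes follow ViT-Base at 224×224; tokens stands in for the transformer output):
import torch
import torch.nn as nn

# `tokens` stands for the transformer output (B, 1 + n_patches, D),
# with the [CLS] token at index 0.
B, N, D, num_classes = 8, 197, 768, 1000
tokens = torch.randn(B, N, D)
head = nn.Linear(D, num_classes)

logits_cls = head(tokens[:, 0])                  # 1) [CLS] readout (original ViT)
logits_gap = head(tokens[:, 1:].mean(dim=1))     # 2) global average pooling

# For dense prediction: drop [CLS] and reshape patch tokens to a 14x14 grid
feat_map = tokens[:, 1:].transpose(1, 2).reshape(B, D, 14, 14)
print(logits_cls.shape, logits_gap.shape, feat_map.shape)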
Full ViT Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT ARCHITECTURE OVERVIEW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT PROCESSING: │
│ ───────────────── │
│ │
│ Image (224×224×3) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │Patch Embed │ Conv2d(3, 768, k=16, s=16) │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Patches (196 × 768) │
│ │ │
│ ├──── + [CLS] token (1 × 768) │
│ │ │
│ ▼ │
│ Tokens (197 × 768) │
│ │ │
│ ├──── + Position Embeddings (197 × 768) │
│ │ │
│ ▼ │
│ Embedded Tokens (197 × 768) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRANSFORMER BLOCKS (×12 for ViT-Base): │
│ ─────────────────────────────────────── │
│ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Multi-Head │ 12 heads, head_dim = 64 │
│ │ Self-Attn │ │
│ └─────────────┘ │
│ │ │
│ ├──── + Residual │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ MLP │ 768 → 3072 → 768 (4× expansion) │
│ │ (GELU) │ │
│ └─────────────┘ │
│ │ │
│ ├──── + Residual │
│ │ │
│ ▼ │
│ (repeat 12×) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OUTPUT HEAD: │
│ ──────────── │
│ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LayerNorm │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Extract [CLS] token (1 × 768) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Classifier │ Linear(768, num_classes) │
│ │ Head │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Class logits (1 × 1000) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part II: Implementation
Complete ViT Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class PatchEmbed(nn.Module):
"""Image to Patch Embedding."""
def __init__(
self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
embed_dim: int = 768,
):
super().__init__()
self.img_size = img_size
self.patch_size = patch_size
self.n_patches = (img_size // patch_size) ** 2
# Convolutional projection (equivalent to linear on flattened patches)
self.proj = nn.Conv2d(
in_chans, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
# x: (B, 3, H, W)
x = self.proj(x) # (B, embed_dim, H/patch, W/patch)
x = x.flatten(2) # (B, embed_dim, n_patches)
x = x.transpose(1, 2) # (B, n_patches, embed_dim)
return x
class Attention(nn.Module):
"""Multi-Head Self-Attention."""
def __init__(
self,
dim: int,
n_heads: int = 12,
qkv_bias: bool = True,
attn_drop: float = 0.0,
proj_drop: float = 0.0,
):
super().__init__()
self.n_heads = n_heads
self.head_dim = dim // n_heads
self.scale = self.head_dim ** -0.5
# Combined QKV projection
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
# Generate Q, K, V
qkv = self.qkv(x) # (B, N, 3*dim)
qkv = qkv.reshape(B, N, 3, self.n_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, N, head_dim)
q, k, v = qkv[0], qkv[1], qkv[2]
# Attention
attn = (q @ k.transpose(-2, -1)) * self.scale # (B, heads, N, N)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# Combine heads
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
class MLP(nn.Module):
"""Feed-Forward Network with GELU."""
def __init__(
self,
in_features: int,
hidden_features: int = None,
out_features: int = None,
drop: float = 0.0,
):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features * 4
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = nn.GELU()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class TransformerBlock(nn.Module):
"""Transformer block with pre-normalization."""
def __init__(
self,
dim: int,
n_heads: int,
mlp_ratio: float = 4.0,
qkv_bias: bool = True,
drop: float = 0.0,
attn_drop: float = 0.0,
):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = Attention(
dim, n_heads=n_heads, qkv_bias=qkv_bias,
attn_drop=attn_drop, proj_drop=drop
)
self.norm2 = nn.LayerNorm(dim)
self.mlp = MLP(
in_features=dim,
hidden_features=int(dim * mlp_ratio),
drop=drop
)
def forward(self, x):
x = x + self.attn(self.norm1(x))
x = x + self.mlp(self.norm2(x))
return x
class VisionTransformer(nn.Module):
"""Vision Transformer (ViT)."""
def __init__(
self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
num_classes: int = 1000,
embed_dim: int = 768,
depth: int = 12,
n_heads: int = 12,
mlp_ratio: float = 4.0,
qkv_bias: bool = True,
drop_rate: float = 0.0,
attn_drop_rate: float = 0.0,
):
super().__init__()
self.num_classes = num_classes
self.embed_dim = embed_dim
# Patch embedding
self.patch_embed = PatchEmbed(
img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embed_dim=embed_dim,
)
n_patches = self.patch_embed.n_patches
# CLS token and position embeddings
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
# Transformer blocks
self.blocks = nn.ModuleList([
TransformerBlock(
dim=embed_dim,
n_heads=n_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
drop=drop_rate,
attn_drop=attn_drop_rate,
)
for _ in range(depth)
])
# Output
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
# Initialize weights
self._init_weights()
def _init_weights(self):
# Initialize position embeddings
nn.init.trunc_normal_(self.pos_embed, std=0.02)
nn.init.trunc_normal_(self.cls_token, std=0.02)
# Initialize linear layers
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.zeros_(m.bias)
def forward(self, x):
B = x.shape[0]
# Patch embedding
x = self.patch_embed(x) # (B, n_patches, embed_dim)
# Add CLS token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # (B, n_patches + 1, embed_dim)
# Add position embeddings
x = x + self.pos_embed
x = self.pos_drop(x)
# Transformer blocks
for block in self.blocks:
x = block(x)
# Output
x = self.norm(x)
cls_output = x[:, 0] # CLS token
logits = self.head(cls_output)
return logits
# Model variants
def vit_tiny(num_classes=1000):
return VisionTransformer(
embed_dim=192, depth=12, n_heads=3, num_classes=num_classes
)
def vit_small(num_classes=1000):
return VisionTransformer(
embed_dim=384, depth=12, n_heads=6, num_classes=num_classes
)
def vit_base(num_classes=1000):
return VisionTransformer(
embed_dim=768, depth=12, n_heads=12, num_classes=num_classes
)
def vit_large(num_classes=1000):
return VisionTransformer(
embed_dim=1024, depth=24, n_heads=16, num_classes=num_classes
)
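A quick smoke test of the implementation above (assuming the classes and factory functions are in scope):
# Quick smoke test of the implementation above
model = vit_base(num_classes=1000)
x = torch.randn(2, 3, 224, 224)
logits = model(x)
print(logits.shape)  # torch.Size([2, 1000])

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~86M for ViT-Base/16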
Part III: Why ViT Works
Scale is Key
The original ViT paper showed that transformers need lots of data to outperform CNNs:
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT SCALING BEHAVIOR │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY FINDING: │
│ ──────────── │
│ │
│ ViT underperforms CNNs on small datasets. │
│ ViT outperforms CNNs on large datasets. │
│ │
│ The crossover point: ~10-100 million images. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS (ImageNet-1K accuracy): │
│ │
│ Training Data ResNet-152 ViT-Large │
│ ──────────────────────────────────────────────────── │
│ ImageNet-1K 78.5% 76.5% (ViT worse) │
│ ImageNet-21K 80.0% 82.0% (ViT better) │
│ JFT-300M 81.0% 87.8% (ViT much better) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY SCALE MATTERS: │
│ ────────────────── │
│ │
│ CNNs have strong inductive biases: │
│ • Translation equivariance (convolutions) │
│ • Locality (small receptive fields initially) │
│ • Hierarchical features (pooling) │
│ │
│ These biases help when data is limited. │
│ │
│ ViT has weak inductive biases: │
│ • Only bias: 2D structure of patches │
│ • Attention can learn any pattern │
│ • More flexible but needs more data to learn │
│ │
│ With enough data, ViT's flexibility becomes an advantage. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MODERN SOLUTION: Better training, not more data │
│ ─────────────────────────────────────────────── │
│ │
│ DeiT (Data-efficient Image Transformers) showed: │
│ With proper training (augmentation, regularization), │
│ ViT can match CNNs on ImageNet-1K alone! │
│ │
│ Key improvements: │
│ • Strong augmentation (RandAugment, Mixup, CutMix) │
│ • Regularization (DropPath, LayerScale) │
│ • Knowledge distillation from CNN teachers │
│ │
└─────────────────────────────────────────────────────────────────────────┘
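The regularizers in the DeiT recipe are small additions to the architecture above. As one example, here is a minimal sketch of DropPath (stochastic depth); the Part II implementation omits it, and DeiT-style recipes wrap the attention and MLP branches with it, e.g. x = x + drop_path(attn(norm1(x))).
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth (a sketch): randomly skip a residual branch per sample."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over tokens and channels
        mask = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device) < keep_prob
        return x * mask.to(x.dtype) / keep_prob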
What Does ViT Learn?
┌─────────────────────────────────────────────────────────────────────────┐
│ WHAT VIT LEARNS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ATTENTION PATTERNS: │
│ ─────────────────── │
│ │
│ Early layers: │
│ • Local attention (attend to nearby patches) │
│ • Similar to early CNN layers │
│ • Capture edges, textures │
│ │
│ Middle layers: │
│ • Mix of local and global attention │
│ • Attend to semantically related patches │
│ • E.g., all patches containing "dog" attend to each other │
│ │
│ Later layers: │
│ • Global attention patterns │
│ • [CLS] attends to discriminative regions │
│ • Task-relevant feature aggregation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ POSITION EMBEDDINGS: │
│ ──────────────────── │
│ │
│ Learned position embeddings show 2D structure! │
│ │
│ Similarity between position embeddings: │
│ • Nearby positions have similar embeddings │
│ • Horizontal/vertical neighbors more similar than diagonal │
│ • Model learns 2D grid from 1D indexing │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON TO CNNs: │
│ ─────────────────── │
│ │
│ CNNs: Forced locality in early layers → Gradual increase │
│ ViT: Can attend globally from layer 1 → Chooses to be local │
│ │
│ ViT LEARNS the inductive biases that CNNs have hardcoded! │
│ With enough data, learned biases can be more appropriate. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
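The position-embedding structure described above is easy to reproduce with the Part II model: compute the cosine similarity between one patch's learned position embedding and all others, then reshape it back to the 14×14 grid. With trained weights the map shows clear row/column structure; with the randomly initialized model it is just noise.
import torch.nn.functional as F

# Similarity of one learned position embedding to all others (a sketch).
model = vit_base()
pos = model.pos_embed[0, 1:]                      # (196, 768), skip [CLS]
query = pos[6 * 14 + 7]                           # patch at (row 6, col 7)
sim = F.cosine_similarity(query[None, :], pos, dim=-1)   # (196,)
sim_grid = sim.reshape(14, 14).detach()           # visualize e.g. with plt.imshow
print(sim_grid.shape)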
Part IV: ViT Variants and Evolution
Key ViT Variants
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT VARIANTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DeiT (Data-efficient Image Transformer): │
│ ───────────────────────────────────────── │
│ • Same architecture as ViT │
│ • Better training recipe │
│ • Distillation token for knowledge transfer │
│ • Matches ViT-JFT using only ImageNet-1K │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Swin Transformer (Hierarchical): │
│ ───────────────────────────────── │
│ • Hierarchical feature maps (like CNNs) │
│ • Shifted window attention (local, efficient) │
│ • Good for dense prediction (detection, segmentation) │
│ • Complexity: O(n) instead of O(n²) │
│ │
│ Architecture: │
│ Stage 1: 56×56, 96 dim → Patch merge → │
│ Stage 2: 28×28, 192 dim → Patch merge → │
│ Stage 3: 14×14, 384 dim → Patch merge → │
│ Stage 4: 7×7, 768 dim │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BEiT (BERT pre-training for images): │
│ ────────────────────────────────── │
│ • Masked image modeling (like BERT) │
│ • Discrete visual tokens via VQ-VAE │
│ • Strong transfer learning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MAE (Masked Autoencoder): │
│ ───────────────────────── │
│ • Mask 75% of patches during training │
│ • Reconstruct masked patches │
│ • Very efficient pre-training │
│ • Enables huge models with less compute │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CLIP ViT (Contrastive Language-Image Pre-training): │
│ ───────────────────────────────────────────────────── │
│ • Pre-trained with image-text pairs │
│ • Zero-shot classification │
│ • Foundation for multimodal models │
│ • Powers DALL-E, GPT-4V, etc. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
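The MAE entry above comes down to one operation: random masking. Below is a minimal sketch of per-sample 75% masking in the shuffle-and-slice style of the MAE paper (function name and shapes are illustrative).
import torch

# MAE-style random masking sketch: keep 25% of patch tokens per sample.
def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)      # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)             # ascending: low noise = keep
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    mask = torch.ones(B, N, device=x.device)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = visible, 1 = masked
    return x_visible, mask

tokens = torch.randn(4, 196, 768)
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))  # (4, 49, 768); 147 masked per sample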
Model Size Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ VIT MODEL SIZES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Layers Hidden Heads Patch │
│ ───────────────────────────────────────────────────────────────────── │
│ ViT-Ti/16 6M 12 192 3 16 │
│ ViT-S/16 22M 12 384 6 16 │
│ ViT-B/16 86M 12 768 12 16 │
│ ViT-B/32 88M 12 768 12 32 │
│ ViT-L/16 307M 24 1024 16 16 │
│ ViT-L/32 306M 24 1024 16 32 │
│ ViT-H/14 632M 32 1280 16 14 │
│ ViT-G/14 1.8B 40 1664 16 14 │
│ ViT-22B 22B 48 6144 48 14 │
│ │
│ Notation: ViT-{Size}/{Patch_Size} │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPUTE COMPARISON (224×224 input): │
│ │
│ Model GFLOPs Tokens │
│ ─────────────────────────────────── │
│ ViT-B/32 4.4 50 │
│ ViT-B/16 17.6 197 │
│ ViT-L/16 61.6 197 │
│ ViT-H/14 167.4 257 │
│ │
│ ResNet-50 4.1 - │
│ ResNet-152 11.5 - │
│ │
│ ViT-B/16 is ~4× more compute than ResNet-50 for similar accuracy. │
│ But ViT scales better: ViT-H beats everything despite more compute. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part V: Multimodal Vision Transformers
CLIP and Vision-Language Models
┌─────────────────────────────────────────────────────────────────────────┐
│ CLIP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLIP TRAINING: │
│ ────────────── │
│ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Image │ │ Text │ │
│ │ (224×224) │ │ "a dog..." │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ ViT Encoder │ │ Text Encoder │ │
│ │ (ViT-L/14) │ │ (Transformer) │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ Image embedding Text embedding │
│ (768-dim) (768-dim) │
│ │ │ │
│ └──────────────┬─────────────────────┘ │
│ ▼ │
│ Contrastive Loss │
│ (match image-text pairs in batch) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ZERO-SHOT CLASSIFICATION: │
│ ───────────────────────── │
│ │
│ No training on downstream dataset! │
│ │
│ 1. Create text prompts for each class: │
│ "a photo of a cat", "a photo of a dog", ... │
│ │
│ 2. Encode all prompts with text encoder │
│ │
│ 3. Encode image with image encoder │
│ │
│ 4. Compute similarity: image_emb · text_emb │
│ │
│ 5. Highest similarity = predicted class │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY CLIP MATTERS: │
│ ───────────────── │
│ │
│ • Pre-trained on 400M image-text pairs from internet │
│ • Generalizes to novel concepts ("zero-shot") │
│ • Foundation for image generation (DALL-E, Stable Diffusion) │
│ • Foundation for multimodal LLMs (GPT-4V, LLaVA) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
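Zero-shot classification as described above takes only a few lines with the Hugging Face transformers CLIP wrappers; the model id, image path, and prompts below are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification sketch (model id, image path, and labels are examples)
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in labels]
image = Image.open("cat.jpg")

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)   # (1, num_prompts)
print(dict(zip(labels, probs[0].tolist())))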
Vision Language Models (GPT-4V, LLaVA)
┌─────────────────────────────────────────────────────────────────────────┐
│ VISION LANGUAGE MODELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE PATTERN: │
│ ──────────────────── │
│ │
│ Image Text │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ViT Encoder │ │ Tokenizer │ │
│ │ (CLIP/EVA) │ │ │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────┐ │ │
│ │ Projector │ (MLP to align dims) │ │
│ └─────────────┘ │ │
│ │ │ │
│ └──────────────┬───────────────────┘ │
│ ▼ │
│ ┌───────────┐ │
│ │ LLM │ │
│ │ (Llama, │ │
│ │ Mistral) │ │
│ └───────────┘ │
│ │ │
│ ▼ │
│ Response │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY COMPONENTS: │
│ ─────────────── │
│ │
│ 1. Vision Encoder: Pre-trained ViT (CLIP, SigLIP, EVA) │
│ - Often frozen during training │
│ - Converts image to sequence of embeddings │
│ │
│ 2. Projector: Aligns vision and text embedding spaces │
│ - Simple: Linear layer │
│ - Complex: MLP, cross-attention │
│ │
│ 3. LLM: Pre-trained language model │
│ - May be fine-tuned or frozen │
│ - Processes image tokens + text tokens together │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: LLaVA Input Sequence │
│ ───────────────────────────── │
│ │
│ [Image tokens] [BOS] User: What's in this image? [/INST] │
│ Assistant: This image shows a cat sitting on a windowsill. │
│ │
│ Image tokens (e.g., 576 tokens from 24×24 patches) │
│ are prepended to text tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
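A minimal sketch of the glue described above, with illustrative dimensions (a frozen ViT emitting 576 tokens of size 1024, an LLM hidden size of 4096, and a LLaVA-1.5-style two-layer MLP projector):
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                    # illustrative sizes

projector = nn.Sequential(                          # LLaVA-1.5-style MLP projector
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vision_dim)    # from a frozen ViT encoder
text_embeds = torch.randn(1, 32, llm_dim)           # from the LLM's embedding table

image_tokens = projector(image_features)            # (1, 576, llm_dim)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)   # (1, 608, llm_dim)
print(llm_input.shape)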
Part VI: Recent Innovations (2024-2025)
SigLIP and SigLIP 2
SigLIP (Sigmoid Loss for Language-Image Pre-training) improved CLIP's training efficiency:
┌─────────────────────────────────────────────────────────────────────────┐
│ SIGLIP: IMPROVED CONTRASTIVE LEARNING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLIP's LIMITATION: │
│ ────────────────── │
│ │
│ CLIP uses softmax contrastive loss: │
│ - Requires large batch sizes (32K+) for good negatives │
│ - Expensive distributed training │
│ - Loss requires all-to-all comparison in batch │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SIGLIP'S SOLUTION: │
│ ────────────────── │
│ │
│ Replace softmax with sigmoid (binary classification): │
│ │
│ CLIP loss: softmax(img_i · text_j / τ) for all j │
│ SigLIP loss: sigmoid(img_i · text_i / τ) for positive pairs │
│ sigmoid(-img_i · text_j / τ) for negative pairs │
│ │
│ Each pair is classified independently! │
│ No need for all-to-all comparison. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BENEFITS: │
│ ───────── │
│ │
│ • Works with smaller batch sizes (4K vs 32K) │
│ • 4× faster training at same quality │
│ • Better scaling to larger models │
│ • Simpler distributed training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│  SIGLIP 2 IMPROVEMENTS (2025):                                         │
│ ───────────────────────────── │
│ │
│ 1. CAPTIONING LOSS: │
│ Add image captioning objective alongside contrastive │
│ Better fine-grained understanding │
│ │
│ 2. SELF-FILTERING: │
│ Use model to filter low-quality image-text pairs │
│ Improves training data quality automatically │
│ │
│ 3. NATIVE RESOLUTION: │
│ Train at multiple resolutions │
│ Better handling of aspect ratios │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS: │
│ │
│ Model ImageNet Zero-shot Training Cost │
│ ───────────────────────────────────────────────────────────── │
│ CLIP ViT-L/14 75.5% ~12K GPU-hours │
│ SigLIP ViT-L 78.2% ~3K GPU-hours │
│ SigLIP 2 ViT-L 80.1% ~4K GPU-hours │
│ │
│ SigLIP 2 is the default vision encoder for many VLMs in 2024-2025. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
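For concreteness, here is a sketch of the pairwise sigmoid loss described above. It follows the formulation in the SigLIP paper, with the learnable temperature and bias fixed to constants and the embeddings assumed to be L2-normalized.
import torch
import torch.nn.functional as F

# Pairwise sigmoid loss sketch. In SigLIP, t and b are learnable scalars;
# they are fixed here for simplicity.
def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    logits = img_emb @ txt_emb.T * t + b           # (B, B) pairwise similarities
    labels = 2 * torch.eye(logits.shape[0]) - 1    # +1 on the diagonal, -1 off it
    # Every (image, text) pair is an independent binary classification
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_loss(img, txt))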
DINOv2: Self-Supervised Vision Foundation Models
┌─────────────────────────────────────────────────────────────────────────┐
│ DINOV2: UNIVERSAL VISUAL FEATURES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DINO (Self-Distillation with No Labels): │
│ ───────────────────────────────────────── │
│ │
│ Student-teacher self-supervised learning: │
│ • Student sees augmented crops │
│ • Teacher sees different augmented crops │
│ • Student learns to match teacher's output │
│ • Teacher is exponential moving average of student │
│ │
│ No labels needed! Learns from image structure alone. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DINOV2 IMPROVEMENTS: │
│ ──────────────────── │
│ │
│ 1. SCALE: │
│ • Trained on 142M curated images (LVD-142M) │
│ • Model sizes up to ViT-g (1.1B params) │
│ │
│ 2. DATA CURATION: │
│ • Automatic deduplication │
│ • Quality filtering │
│ • Balanced sampling │
│ │
│ 3. TRAINING: │
│ • Combined self-distillation + masked image modeling │
│ • Multi-crop strategy at different resolutions │
│ • KoLeo regularizer for uniform feature distribution │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY PROPERTIES: │
│ ─────────────── │
│ │
│ • Semantic segmentation emerges without training! │
│ PCA of patch features shows object boundaries │
│ │
│ • Depth estimation from features alone │
│ Linear probe achieves good depth prediction │
│ │
│ • Cross-domain transfer │
│ Features work on art, medical images, satellite, etc. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USAGE PATTERN: │
│ ────────────── │
│ │
│ import torch │
│ from transformers import AutoImageProcessor, AutoModel │
│ │
│ processor = AutoImageProcessor.from_pretrained( │
│ 'facebook/dinov2-large' │
│ ) │
│ model = AutoModel.from_pretrained('facebook/dinov2-large') │
│ │
│    # Get features; index 0 is [CLS], rest are patches.                 │
│ inputs = processor(images=image, return_tensors="pt") │
│ features = model(**inputs).last_hidden_state │
│ │
│ # For classification: add linear probe │
│ # For segmentation: reshape features to spatial grid │
│ │
└─────────────────────────────────────────────────────────────────────────┘
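Continuing from the usage snippet in the box, the emergent-segmentation effect can be eyeballed with a PCA of the patch features. Note that last_hidden_state carries the [CLS] token at index 0, so the sketch drops it first (grid size is inferred from the sequence length).
import torch

# PCA of patch features (a sketch). `features` is the last_hidden_state from
# the snippet above, shape (B, 1 + n_patches, D).
def pca_patch_map(features: torch.Tensor, k: int = 3) -> torch.Tensor:
    patch_feats = features[:, 1:].detach()         # drop the [CLS] token
    B, N, D = patch_feats.shape
    grid = int(N ** 0.5)
    flat = patch_feats.reshape(-1, D)
    flat = flat - flat.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(flat, q=k)         # top-k principal directions
    return (flat @ v).reshape(B, grid, grid, k)

# pca_map = pca_patch_map(features)   # visualize pca_map[0] as an RGB image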
EVA and EVA-02: Scaling Vision Encoders
┌─────────────────────────────────────────────────────────────────────────┐
│ EVA: EXPLORING VISION TRANSFORMERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EVA APPROACH: │
│ ───────────── │
│ │
│ Pre-train ViT with masked image modeling using CLIP features as target│
│ Then fine-tune with contrastive learning. │
│ │
│ Stage 1: MIM Pre-training │
│ • Mask 40% of patches │
│ • Predict CLIP features of masked patches │
│ • Faster convergence than predicting pixels │
│ │
│ Stage 2: Contrastive Fine-tuning │
│ • Standard CLIP-style training │
│ • Uses pre-trained weights from Stage 1 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EVA-02 IMPROVEMENTS: │
│ ──────────────────── │
│ │
│ • Scaled to 4.4B parameters (EVA-02-E) │
│ • Native 448×448 resolution │
│ • Multi-scale training │
│ • Improved architecture details (RMSNorm, SwiGLU) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USE IN VLMS: │
│ ──────────── │
│ │
│  EVA-CLIP is a popular vision encoder for VLMs:                        │
│  • BLIP-2 and MiniGPT-4 use EVA-CLIP (ViT-g)                           │
│ • Many open-source VLMs prefer EVA over original CLIP │
│ • Better fine-grained understanding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
InternViT and InternVL
┌─────────────────────────────────────────────────────────────────────────┐
│ INTERNVIT: LARGE-SCALE VISION FOUNDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INTERNVIT-6B: │
│ ───────────── │
│ │
│ Largest open-source vision encoder (6B parameters) │
│ │
│ Architecture: │
│ • 48 layers, hidden dim 3200 │
│ • 25 attention heads │
│ • 448×448 native resolution, 14×14 patches │
│ • 1024 patch tokens │
│ │
│ Training: │
│ • Contrastive (CLIP-style) + generative objectives │
│ • 1B image-text pairs │
│ • Dynamic resolution training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTERNVL (Vision-Language): │
│ ─────────────────────────── │
│ │
│ InternViT + InternLM2 = InternVL │
│ │
│ • State-of-the-art open-source VLM │
│ • Dynamic high-resolution processing │
│ • Supports up to 4K resolution via tiling │
│ │
│ High-resolution strategy: │
│ • Divide large image into 448×448 tiles │
│ • Process each tile with InternViT │
│ • Concatenate tile features │
│ • Enables document/chart understanding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
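The tiling strategy described above reduces to a simple tensor operation before the encoder. A minimal sketch (tile size and image shape are illustrative; real pipelines usually add a downscaled thumbnail tile for global context):
import torch

# Split a large image into non-overlapping 448x448 tiles (a sketch; assumes
# the image has already been resized so H and W are multiples of the tile size).
def tile_image(image: torch.Tensor, tile: int = 448) -> torch.Tensor:
    B, C, H, W = image.shape
    tiles = image.unfold(2, tile, tile).unfold(3, tile, tile)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, tile, tile)
    return tiles                                   # (B * n_tiles, C, tile, tile)

image = torch.randn(1, 3, 896, 1344)               # e.g. a document scan
tiles = tile_image(image)
print(tiles.shape)                                 # torch.Size([6, 3, 448, 448])
# Each tile goes through the vision encoder; the token sequences are then
# concatenated (plus the thumbnail) before the projector and the LLM.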
Updated Vision Encoder Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ VISION ENCODERS (2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Params Training Best Use Case │
│ ───────────────────────────────────────────────────────────────────── │
│ CLIP ViT-L 400M Contrastive Zero-shot, legacy │
│ SigLIP 2 400M Sigmoid + Cap VLMs (efficient) │
│ PaliGemma 2 400M+ SigLIP + Gemma2 OCR, documents (2025) │
│ DINOv2-L 300M Self-supervised Dense prediction │
│ DINOv3-7B 7B Self-supervised Best dense features (2025) │
│ EVA-02-L 300M MIM + Contrast VLMs (quality) │
│ InternVL 3 6B Native MM Best open VLM (Apr 2025) │
│ InternViT-6B 6B Mixed Highest resolution │
│ │
│ RECOMMENDATIONS: │
│ ──────────────── │
│ │
│ For VLMs (general): │
│ • SigLIP 2: Best efficiency/quality tradeoff │
│ • EVA-CLIP: Better fine-grained understanding │
│ │
│ For dense prediction (segmentation, depth): │
│ • DINOv3 (7B): +6 mIoU vs DINOv2, best dense features (Aug 2025) │
│ • DINOv2: Good transfer, smaller model option │
│ │
│ For high-resolution documents/charts: │
│ • InternViT with tiling │
│ │
│ For research/maximum quality: │
│ • EVA-02-E (4.4B) or InternViT-6B │
│ │
│ TREND: VLMs increasingly use SigLIP 2 as default encoder │
│ (PaliGemma, Qwen-VL-2, many open-source models) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
Vision Transformers apply the transformer architecture to images with minimal modification:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CORE IDEA: │
│ • Split image into patches (16×16 typical) │
│ • Embed patches as tokens │
│ • Add position embeddings and [CLS] token │
│ • Process with standard transformer │
│ │
│ SCALING BEHAVIOR: │
│ • ViT needs more data than CNNs (weaker inductive bias) │
│ • With enough data/training, ViT outperforms CNNs │
│ • Modern training recipes (DeiT) close the data gap │
│ │
│ KEY VARIANTS: │
│ • ViT: Original, global attention │
│ • Swin: Hierarchical, local attention, efficient │
│ • MAE: Masked pre-training, very efficient │
│ • CLIP: Vision-language pre-training, zero-shot │
│ │
│ MULTIMODAL: │
│ • ViT is the vision backbone for most multimodal models │
│ • GPT-4V, LLaVA, etc. use ViT + LLM │
│ • CLIP provides aligned vision-text representations │
│ │
│ 2024-2025 INNOVATIONS: │
│ • DINOv3 (Aug 2025): 7B params, 1.7B images, Gram Anchoring │
│ - +6 mIoU on segmentation, +6.7 J&F on video tracking │
│ - Distills to smaller models (ViT-B, ViT-L, ConvNeXt) │
│ • InternVL 3 (Apr 2025): V2PE, Native Multimodal Pre-Training │
│ - 72.2 MMMU (first open-source >70%), 89.7% ChartQA │
│ - Variable Visual Position Encoding for long context │
│ • PaliGemma 2: SigLIP-So400m + Gemma 2, SOTA on OCR │
│ - 9 variants: 2B/9B/27B × 224/448/896px resolution │
│ • SigLIP 2: 4× faster training, default for VLMs │
│ • TokenFlow (CVPR 2025): Unified image tokenizer │
│ - Dual-codebook: semantic + pixel-level features │
│ • OpenVision (UC Santa Cruz, May 2025): │
│ - Open-source alternative to CLIP/SigLIP │
│ - 2-3× faster training, matches/beats on TextVQA, ChartQA │
│ • Janus-Pro-7B: DeepSeek unified multimodal │
│ - Decoupled visual encoding for understanding vs generation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Related Articles
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
Multimodal LLMs: Vision, Audio, and Beyond
A comprehensive guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.