Why not just train a larger dense model?

Cost. A dense 400B model requires ~5× more FLOPs to train and ~5× more FLOPs per inference token than a 47B MoE with similar quality. At scale, this difference is millions of dollars.

Do experts really specialize?

Partially. Research shows experts develop preferences for certain token types, domains, or syntactic patterns. But it's not clean separation—there's significant overlap. Specialization is emergent, not designed.

How do I run Mixtral locally?

Mixtral 8x7B requires ~94GB VRAM in FP16. Options: 2x 48GB GPUs (A6000, etc.), Quantized (AWQ/GPTQ INT4): ~25GB, fits on single high-end consumer GPU, CPU offloading: Slower but works on 32GB+ RAM systems.

Is MoE harder to fine-tune?

Somewhat. You need larger batch sizes for stable training, and the auxiliary loss coefficient may need tuning. But standard LoRA/QLoRA work fine—they typically only train adapter weights, not experts.

Why top-2 and not top-1 or top-4?

Empirical finding. Top-1 is efficient but loses quality from hard routing decisions. Top-2 gives soft blending with minimal extra cost. Top-4+ shows diminishing returns—the extra computation doesn't proportionally improve quality.

How does MoE affect context length?

MoE itself doesn't directly affect context length—that's determined by attention. But MoE's memory efficiency means more VRAM available for KV cache, potentially enabling longer contexts on the same hardware. ---

Mixture of Experts: Scaling LLMs Beyond Dense Models | Enrico Piovano

Why Mixture of Experts Matters

The scaling laws that have driven LLM progress face a fundamental tension: larger models perform better, but larger models also cost more to run. A 70B parameter model produces better outputs than a 7B model, but requires ~10× more compute per token. This creates a painful tradeoff between quality and cost.

Mixture of Experts (MoE) breaks this tradeoff. An MoE model can have the total parameters of a 400B model but the inference cost of a 70B model. The secret: not all parameters are used for every token. Instead, a "router" dynamically selects which subset of parameters (which "experts") to use, based on the input.

2025: MoE dominates frontier AI. According to NVIDIA, the top 10 most intelligent open-source models on the Artificial Analysis leaderboard all use MoE architecture. The leading MoE models in 2025 include:

Model	Total Params	Active Params	Experts	Configuration
DeepSeek-V3	671B	37B	256 + 1 shared	8 routed + 1 shared
Qwen3-235B	235B	22B	128	Top-8 routing
Llama 4 Maverick	~400B	~17B	128 + 1 shared	1 routed + 1 shared
OpenAI gpt-oss-120B	120B	~15B	128	4 per token

Key architectural innovations in 2025:

Shared experts: DeepSeek, Llama 4, and GLM-4.5 use 1 shared expert activated for all tokens plus routed experts, improving convergence stability
Dense warmup: GLM-4.5 uses 3 dense layers before MoE blocks—early MoE routing can interfere with feature extraction
Varied expert sizes: Llama 4 uses fewer but larger experts (2 × 8192) vs DeepSeek's many smaller experts (9 × 2048)

Understanding MoE is essential because it represents the present and future of LLM scaling. As we push toward larger and more capable models, the economics of dense models become untenable. MoE offers a path to continued scaling without proportional cost increases.

This post covers MoE architecture from first principles: what experts are, how routing works, the challenges of load balancing, and practical implementation details. By the end, you'll understand how MoE models achieve their remarkable efficiency and where the field is heading.

Part I: The Core Idea

What Are "Experts"?

In a standard transformer, every token passes through the same Feed-Forward Network (FFN) in each layer. As we covered in the transformer architecture post, the FFN contains approximately 67% of the model's parameters. It's the parameter-heavy component.

An MoE layer replaces this single FFN with multiple FFNs called "experts," plus a router that decides which expert(s) each token should use:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    DENSE VS MIXTURE OF EXPERTS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DENSE TRANSFORMER (Standard):                                           │
│  ─────────────────────────────                                           │
│                                                                          │
│       Token                                                              │
│         │                                                                │
│         ▼                                                                │
│    ┌─────────┐                                                          │
│    │   FFN   │  ← Every token uses the SAME FFN                        │
│    │ (67% of │                                                          │
│    │ params) │                                                          │
│    └─────────┘                                                          │
│         │                                                                │
│         ▼                                                                │
│      Output                                                              │
│                                                                          │
│  All parameters used for every token.                                   │
│  Cost scales linearly with parameters.                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MIXTURE OF EXPERTS:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│       Token                                                              │
│         │                                                                │
│         ▼                                                                │
│    ┌─────────┐                                                          │
│    │ Router  │  ← Small network decides which experts to use           │
│    └─────────┘                                                          │
│         │                                                                │
│    ┌────┴────┬────────┬────────┐                                       │
│    ▼         ▼        ▼        ▼                                       │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                                   │
│ │Expert│ │Expert│ │Expert│ │Expert│  ← 8 experts (8× FFN params)      │
│ │  1   │ │  2   │ │  3   │ │  4   │                                   │
│ └──────┘ └──────┘ └──────┘ └──────┘                                   │
│    │         │        │        │                                       │
│    │    ┌────┘        │        │                                       │
│    │    │    ┌────────┘        │                                       │
│    ▼    ▼    ▼                 │  (only 2 selected per token)         │
│    ┌─────────┐                 │                                       │
│    │ Combine │ ←───────────────┘                                       │
│    └─────────┘                                                          │
│         │                                                                │
│         ▼                                                                │
│      Output                                                              │
│                                                                          │
│  8× total parameters, but only 2 experts used per token.               │
│  Cost ~2× dense FFN, but capacity ~8× dense FFN.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The Key Insight: Sparse Activation

The breakthrough of MoE is sparse activation: having many parameters but using only a fraction for any given input. This creates a separation between:

Total parameters: All weights stored in memory (determines model capacity/knowledge)
Active parameters: Weights used per forward pass (determines compute cost)

A dense 70B model has 70B total = 70B active parameters. An MoE with 8 experts of 7B each has 56B total parameters but might activate only 14B (2 experts) per token.

This matters because model quality correlates with total parameters (more storage for knowledge), while inference cost correlates with active parameters. MoE gives you more knowledge per FLOP.

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    PARAMETER EFFICIENCY                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DENSE MODEL SCALING:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Model      Total Params    Active Params    Inference Cost             │
│  ──────────────────────────────────────────────────────────             │
│  7B         7B              7B               1×                         │
│  13B        13B             13B              1.9×                       │
│  70B        70B             70B              10×                        │
│  175B       175B            175B             25×                        │
│                                                                          │
│  Quality scales with params, but so does cost. Linear relationship.    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MOE MODEL SCALING:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Model           Total Params    Active Params    Inference Cost        │
│  ──────────────────────────────────────────────────────────────        │
│  Mixtral 8x7B    47B             13B              ~2×                   │
│  GPT-4 (est.)    ~1.8T           ~280B            ~40×                  │
│  Switch-C        1.6T            ~100B            ~15×                  │
│                                                                          │
│  Mixtral: 47B capacity at 13B cost!                                    │
│  7× more parameters per FLOP than dense equivalent.                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS WORKS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  1. Not all knowledge needed for every token                           │
│     "Paris" activates geography experts                                │
│     "def function" activates code experts                              │
│     Different experts for different domains                            │
│                                                                          │
│  2. FFN is where most parameters live                                  │
│     MoE multiplies FFN capacity without multiplying attention          │
│     Since FFN = 67% of params, this is highly effective               │
│                                                                          │
│  3. Inference is memory-bandwidth bound                                │
│     Loading 13B active params is ~4× faster than 47B                  │
│     Compute for 13B vs 47B is similar (GPU has spare compute)         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part II: The Router

How Routing Works

The router is a small neural network that takes a token's hidden state and outputs a probability distribution over experts. The top-k experts (usually k=1 or k=2) are selected, and their outputs are combined weighted by the routing probabilities.

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    ROUTER MECHANISM                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ROUTER ARCHITECTURE:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Input: Token hidden state h ∈ R^d (e.g., d=4096)                      │
│                                                                          │
│  Step 1: Compute router logits                                          │
│  ─────────────────────────────                                          │
│  router_logits = h × W_router    (W_router ∈ R^(d × num_experts))      │
│                                                                          │
│  Example with 8 experts:                                                │
│  h = [0.2, -0.5, 0.8, ...]  (4096 dims)                               │
│  router_logits = [2.1, 0.3, -0.5, 3.2, 0.1, -0.8, 1.5, 0.9]           │
│                   E1   E2    E3   E4   E5   E6   E7   E8              │
│                                                                          │
│  Step 2: Apply softmax to get routing probabilities                    │
│  ─────────────────────────────────────────────────                      │
│  router_probs = softmax(router_logits)                                 │
│  router_probs = [0.15, 0.03, 0.01, 0.45, 0.02, 0.01, 0.08, 0.05]     │
│                                                                          │
│  Step 3: Select top-k experts (k=2 typical)                            │
│  ─────────────────────────────────────────────                          │
│  top_experts = [E4, E1]  (indices 3, 0)                               │
│  top_probs = [0.45, 0.15]                                              │
│                                                                          │
│  Step 4: Renormalize selected probabilities                            │
│  ─────────────────────────────────────────                              │
│  weights = [0.45, 0.15] / (0.45 + 0.15) = [0.75, 0.25]               │
│                                                                          │
│  Step 5: Compute weighted combination                                   │
│  ─────────────────────────────────────                                  │
│  output = 0.75 × Expert4(h) + 0.25 × Expert1(h)                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VISUAL REPRESENTATION:                                                  │
│                                                                          │
│       h (hidden state)                                                  │
│            │                                                            │
│            ▼                                                            │
│    ┌───────────────┐                                                   │
│    │   W_router    │  (d × num_experts)                               │
│    │   (linear)    │                                                   │
│    └───────────────┘                                                   │
│            │                                                            │
│            ▼                                                            │
│    [2.1, 0.3, -0.5, 3.2, 0.1, -0.8, 1.5, 0.9]  (logits)             │
│            │                                                            │
│            ▼                                                            │
│       softmax + top-k                                                  │
│            │                                                            │
│     ┌──────┴──────┐                                                    │
│     ▼             ▼                                                    │
│   E4 (w=0.75)  E1 (w=0.25)                                           │
│     │             │                                                    │
│     ▼             ▼                                                    │
│  Expert4(h)   Expert1(h)                                              │
│     │             │                                                    │
│     └──────┬──────┘                                                    │
│            ▼                                                            │
│     weighted sum                                                       │
│            │                                                            │
│            ▼                                                            │
│         output                                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Top-1 vs Top-2 Routing

The choice of how many experts to activate per token (k) involves tradeoffs:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOP-K SELECTION                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TOP-1 ROUTING (k=1):                                                   │
│  ────────────────────                                                    │
│  Each token goes to exactly one expert.                                │
│                                                                          │
│  Advantages:                                                            │
│  • Lowest compute cost (1 expert per token)                            │
│  • Simplest implementation                                             │
│  • Clear expert specialization                                         │
│                                                                          │
│  Disadvantages:                                                         │
│  • Hard routing decisions (no blending)                                │
│  • More sensitive to routing errors                                    │
│  • Less stable training                                                │
│                                                                          │
│  Used by: Switch Transformer, some efficient MoE variants             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOP-2 ROUTING (k=2):                                                   │
│  ────────────────────                                                    │
│  Each token goes to two experts, outputs combined.                     │
│                                                                          │
│  Advantages:                                                            │
│  • Soft blending between experts                                       │
│  • More robust to routing errors                                       │
│  • Smoother expert utilization                                         │
│  • Generally better quality                                            │
│                                                                          │
│  Disadvantages:                                                         │
│  • 2× compute vs top-1                                                 │
│  • More complex combining logic                                        │
│                                                                          │
│  Used by: Mixtral, most production MoE models                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXPERT CHOICE ROUTING (alternative):                                   │
│  ─────────────────────────────────────                                  │
│  Instead of tokens choosing experts, experts choose tokens.           │
│                                                                          │
│  Each expert selects its top-k tokens from the batch.                 │
│  Guarantees perfect load balance!                                      │
│                                                                          │
│  Problem: Some tokens might not be selected by any expert.            │
│  Solution: Ensure enough experts that all tokens get processed.       │
│                                                                          │
│  Used by: Expert Choice paper (Google), some efficient variants       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part III: The Load Balancing Problem

Why Balance Matters

MoE has a critical failure mode: if the router learns to send all tokens to a small subset of experts, the other experts never get trained and become useless. This "expert collapse" wastes parameters and defeats the purpose of MoE.

The problem emerges naturally from gradient descent. If Expert 1 happens to perform slightly better early in training, more tokens get routed to it. With more tokens, Expert 1 gets more gradient updates and improves further. Meanwhile, underused experts get few updates and stagnate. This positive feedback loop leads to collapse.

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCING PROBLEM                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE COLLAPSE FAILURE MODE:                                              │
│  ──────────────────────────                                              │
│                                                                          │
│  Training begins:                                                       │
│  E1: 12.5%  E2: 12.5%  E3: 12.5%  E4: 12.5%  (8 experts, uniform)    │
│  E5: 12.5%  E6: 12.5%  E7: 12.5%  E8: 12.5%                          │
│                                                                          │
│  After 1000 steps (slight imbalance emerges):                          │
│  E1: 15%  E2: 14%  E3: 13%  E4: 12%                                   │
│  E5: 11%  E6: 10%  E7: 8%   E8: 7%                                    │
│                                                                          │
│  After 10000 steps (rich get richer):                                  │
│  E1: 45%  E2: 30%  E3: 15%  E4: 5%                                    │
│  E5: 3%   E6: 1%   E7: 0.5% E8: 0.5%                                  │
│                                                                          │
│  Collapsed state:                                                       │
│  E1: 90%  E2: 8%   E3: 2%   E4-E8: ~0%                                │
│                                                                          │
│  Result: 8× parameters but effectively 1-2 experts used.              │
│  Massive waste of capacity. Defeats the purpose of MoE.               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY COLLAPSE HAPPENS:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  1. Random initialization gives some experts slight advantage          │
│  2. Better experts get more tokens                                     │
│  3. More tokens = more gradient updates = faster learning             │
│  4. Faster learning = even better performance                         │
│  5. Even better = even more tokens (positive feedback)                │
│  6. Underused experts get few updates, fall further behind           │
│                                                                          │
│  Without intervention, this is the natural equilibrium.               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Auxiliary Losses for Balance

The primary solution is adding a loss term that penalizes imbalanced routing:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    AUXILIARY LOAD BALANCING LOSS                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE BALANCE LOSS:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  For each expert i, compute:                                            │
│  • f_i = fraction of tokens routed to expert i                        │
│  • P_i = average routing probability assigned to expert i             │
│                                                                          │
│  Balance loss = α × num_experts × Σᵢ (f_i × P_i)                      │
│                                                                          │
│  This penalizes both:                                                   │
│  • High f_i: Expert getting too many tokens                           │
│  • High P_i: Router assigning high probability to one expert          │
│                                                                          │
│  The product f_i × P_i is minimized when both are uniform (1/N).     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE CALCULATION:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Batch of 100 tokens, 4 experts, top-1 routing:                       │
│                                                                          │
│  Balanced case:                                                         │
│  f = [0.25, 0.25, 0.25, 0.25]  (25 tokens each)                      │
│  P = [0.25, 0.25, 0.25, 0.25]  (uniform avg probability)             │
│  loss = 4 × (0.25×0.25 + 0.25×0.25 + 0.25×0.25 + 0.25×0.25)        │
│       = 4 × 0.25 = 1.0                                                │
│                                                                          │
│  Imbalanced case:                                                       │
│  f = [0.70, 0.20, 0.08, 0.02]  (70 tokens to E1!)                    │
│  P = [0.65, 0.20, 0.10, 0.05]  (router prefers E1)                   │
│  loss = 4 × (0.70×0.65 + 0.20×0.20 + 0.08×0.10 + 0.02×0.05)        │
│       = 4 × (0.455 + 0.04 + 0.008 + 0.001)                          │
│       = 4 × 0.504 = 2.016                                             │
│                                                                          │
│  Higher loss penalizes the imbalanced state!                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOTAL TRAINING LOSS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  L_total = L_language_modeling + α × L_balance                        │
│                                                                          │
│  Typical α values: 0.01 - 0.1                                         │
│  Too low: Collapse still happens                                       │
│  Too high: Router becomes uniform regardless of input                 │
│                                                                          │
│  Finding the right α requires experimentation.                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Capacity Factor and Token Dropping

Another approach limits how many tokens each expert can handle:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY FACTOR                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE CONCEPT:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Set a maximum capacity for each expert:                               │
│                                                                          │
│  capacity = (num_tokens / num_experts) × capacity_factor               │
│                                                                          │
│  Example: 100 tokens, 8 experts, capacity_factor=1.25                 │
│  capacity = (100 / 8) × 1.25 = 15.6 ≈ 16 tokens per expert           │
│                                                                          │
│  If more than 16 tokens route to an expert, excess are DROPPED.       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKEN DROPPING:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  When capacity exceeded:                                               │
│  • Keep tokens with highest routing probability                       │
│  • Drop tokens with lowest probability                                │
│  • Dropped tokens use residual connection only (skip FFN)            │
│                                                                          │
│  100 tokens want Expert 1:                                             │
│  Capacity = 16                                                         │
│  Keep: top 16 by router probability                                   │
│  Drop: remaining 84 tokens (they skip this FFN entirely)             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CAPACITY FACTOR TRADEOFFS:                                             │
│  ──────────────────────────                                              │
│                                                                          │
│  capacity_factor = 1.0:                                                │
│  • Perfect balance forced (each expert gets exactly N/E tokens)       │
│  • High drop rate when routing is imbalanced                          │
│  • May hurt quality if good tokens are dropped                        │
│                                                                          │
│  capacity_factor = 1.25:                                               │
│  • Common default                                                      │
│  • Allows 25% imbalance before dropping                               │
│  • Good balance between utilization and quality                       │
│                                                                          │
│  capacity_factor = 2.0:                                                │
│  • Very permissive                                                     │
│  • Rarely drops tokens                                                 │
│  • May allow significant imbalance                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TRAINING VS INFERENCE:                                                  │
│  ─────────────────────                                                   │
│                                                                          │
│  Training: Use capacity_factor to enforce balance                     │
│  Inference: Often disable capacity limits (process all tokens)       │
│                                                                          │
│  At inference, we want best quality, not training stability.         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part IV: Expert Specialization

What Do Experts Learn?

A natural question: do different experts actually specialize in different things? Research shows they do, to varying degrees:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    EXPERT SPECIALIZATION                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EMPIRICAL FINDINGS FROM MIXTRAL ANALYSIS:                              │
│  ──────────────────────────────────────────                              │
│                                                                          │
│  Researchers analyzed which experts activate for different content:    │
│                                                                          │
│  DOMAIN SPECIALIZATION (partial):                                       │
│  ─────────────────────────────────                                       │
│  • Some experts prefer code tokens                                     │
│  • Some experts prefer mathematical notation                           │
│  • Some experts prefer certain languages                               │
│  • But overlap is significant—not clean separation                    │
│                                                                          │
│  SYNTACTIC PATTERNS (stronger):                                         │
│  ──────────────────────────────                                          │
│  • Punctuation often routed to specific experts                       │
│  • Certain experts handle sentence boundaries                          │
│  • Some experts specialize in rare tokens                             │
│                                                                          │
│  POSITIONAL PATTERNS:                                                    │
│  ────────────────────                                                    │
│  • Early sequence positions may prefer different experts              │
│  • Some experts more active at sentence starts                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VISUALIZATION (conceptual):                                            │
│  ──────────────────────────                                              │
│                                                                          │
│  Token Type          Primary Experts        Secondary                  │
│  ──────────────────────────────────────────────────────────           │
│  Python code         E2, E5                 E1, E7                    │
│  JavaScript          E2, E7                 E5, E1                    │
│  Math formulas       E4, E6                 E3                        │
│  English prose       E1, E3, E8             E6                        │
│  French text         E3, E1                 E8                        │
│  Punctuation         E8                     E3                        │
│  Numbers             E4, E6                 E2                        │
│                                                                          │
│  Note: This is illustrative. Real patterns are more complex.          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Specialization is EMERGENT, not designed.                             │
│  The router learns to route similar tokens to similar experts.        │
│  This happens because:                                                 │
│  • Similar tokens benefit from similar transformations                │
│  • Experts become good at what they see frequently                   │
│  • Positive feedback reinforces specialization                        │
│                                                                          │
│  We don't tell Expert 2 to handle code—it discovers this pattern.    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Layer-Wise Routing Patterns

Different layers of an MoE model show different routing behaviors:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    LAYER-WISE PATTERNS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EARLY LAYERS (1-8):                                                     │
│  ───────────────────                                                     │
│  • More uniform routing (less specialization)                          │
│  • Handle basic token processing                                       │
│  • Lower entropy in router decisions                                   │
│                                                                          │
│  MIDDLE LAYERS (8-24):                                                   │
│  ────────────────────                                                    │
│  • Strongest specialization                                            │
│  • Most varied routing patterns                                        │
│  • Domain/topic-specific routing emerges here                         │
│                                                                          │
│  LATE LAYERS (24-32):                                                    │
│  ─────────────────────                                                   │
│  • More task-specific routing                                          │
│  • Preparing for output                                                │
│  • Some convergence in patterns                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ROUTING CONSISTENCY:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Same token in same context often routes to same experts.             │
│  But routing can change based on:                                      │
│  • Surrounding context                                                 │
│  • Position in sequence                                                │
│  • Layer depth                                                         │
│                                                                          │
│  "function" in Python context → code experts                          │
│  "function" in math context → math experts                            │
│                                                                          │
│  Context-dependent routing is a feature, not a bug.                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part V: Architecture Variants

Where to Place MoE Layers

Not every layer needs to be an MoE layer. Common patterns:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MOE LAYER PLACEMENT                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EVERY LAYER (Dense MoE):                                               │
│  ────────────────────────                                                │
│  [MoE] [MoE] [MoE] [MoE] [MoE] [MoE] [MoE] [MoE]                      │
│                                                                          │
│  • Maximum capacity                                                     │
│  • Highest memory usage                                                │
│  • Used by: Some research models                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EVERY OTHER LAYER (Sparse MoE):                                        │
│  ───────────────────────────────                                         │
│  [Dense] [MoE] [Dense] [MoE] [Dense] [MoE] [Dense] [MoE]              │
│                                                                          │
│  • Balance between capacity and efficiency                             │
│  • Dense layers provide "shared" computation                          │
│  • Used by: Mixtral, most practical MoE models                        │
│                                                                          │
│  Why this works:                                                        │
│  Dense layers process all tokens uniformly (global features)          │
│  MoE layers specialize processing (specific features)                 │
│  Alternating captures both patterns                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EVERY 4TH LAYER (Very Sparse MoE):                                    │
│  ──────────────────────────────────                                      │
│  [Dense] [Dense] [Dense] [MoE] [Dense] [Dense] [Dense] [MoE]          │
│                                                                          │
│  • Minimal memory overhead                                             │
│  • Still significant capacity boost                                    │
│  • Used by: Some efficient variants                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MIXTRAL ARCHITECTURE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  32 layers total                                                        │
│  Every layer has MoE FFN                                               │
│  8 experts per layer                                                   │
│  Top-2 routing                                                          │
│                                                                          │
│  Total params: 8 × 7B (experts) + shared = ~47B                       │
│  Active params: 2 × 7B (FFN) + shared = ~13B                          │
│                                                                          │
│  Attention is always dense (shared across all tokens)                 │
│  Only FFN is sparse (different experts for different tokens)          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Expert Granularity

Experts can be full FFNs or smaller units:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    EXPERT GRANULARITY                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  COARSE-GRAINED (Few Large Experts):                                    │
│  ────────────────────────────────────                                    │
│                                                                          │
│  8 experts, each is a full FFN (4096 → 14336 → 4096)                  │
│                                                                          │
│  ┌────────────────────────────────┐                                    │
│  │      Expert 1 (Full FFN)       │                                    │
│  │   W_up: 4096 × 14336           │                                    │
│  │   W_down: 14336 × 4096         │                                    │
│  └────────────────────────────────┘                                    │
│                                                                          │
│  Advantages:                                                            │
│  • Each expert has significant capacity                               │
│  • Clear specialization possible                                       │
│  • Simpler implementation                                              │
│                                                                          │
│  Disadvantages:                                                         │
│  • Coarse routing granularity                                         │
│  • Memory: must load full expert                                       │
│                                                                          │
│  Used by: Mixtral, GShard, most MoE models                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FINE-GRAINED (Many Small Experts):                                     │
│  ───────────────────────────────────                                     │
│                                                                          │
│  64 experts, each is a small FFN (4096 → 1792 → 4096)                 │
│  Route to top-8, getting similar total capacity to top-2 with 8       │
│                                                                          │
│  ┌──────┐ ┌──────┐ ┌──────┐       (64 small experts)                 │
│  │ E1   │ │ E2   │ │ E3   │ ...                                       │
│  │ tiny │ │ tiny │ │ tiny │                                           │
│  └──────┘ └──────┘ └──────┘                                           │
│                                                                          │
│  Advantages:                                                            │
│  • More precise routing                                                │
│  • Better load balancing                                               │
│  • More flexible combinations                                          │
│                                                                          │
│  Disadvantages:                                                         │
│  • More routing overhead                                               │
│  • Less capacity per expert                                           │
│  • More complex to implement efficiently                              │
│                                                                          │
│  Used by: Some research models (DeepSeek-MoE)                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEEPSEEK-MOE APPROACH (Hybrid):                                        │
│  ───────────────────────────────                                         │
│                                                                          │
│  1 shared expert (always active) + many routed experts               │
│                                                                          │
│  Output = SharedExpert(x) + Σᵢ wᵢ × RoutedExpertᵢ(x)                  │
│                                                                          │
│  The shared expert handles common patterns                            │
│  Routed experts handle specialized patterns                           │
│  Best of both worlds                                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VI: Implementation Details

Efficient Routing Implementation

Implementing MoE efficiently requires careful attention to GPU computation patterns:

Python

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """
    Mixture of Experts layer with top-k routing.

    This replaces the standard FFN in a transformer block.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        intermediate_size: int = 14336,
        num_experts: int = 8,
        num_experts_per_token: int = 2,
        aux_loss_coef: float = 0.01,
    ):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_experts = num_experts
        self.num_experts_per_token = num_experts_per_token
        self.aux_loss_coef = aux_loss_coef

        # Router: maps hidden states to expert scores
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

        # Create experts (each is a small FFN)
        self.experts = nn.ModuleList([
            ExpertFFN(hidden_size, intermediate_size)
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states: torch.Tensor):
        """
        Args:
            hidden_states: (batch_size, seq_len, hidden_size)

        Returns:
            output: (batch_size, seq_len, hidden_size)
            aux_loss: scalar tensor for load balancing
        """
        batch_size, seq_len, _ = hidden_states.shape

        # Flatten batch and sequence dimensions
        # (batch_size * seq_len, hidden_size)
        flat_hidden = hidden_states.view(-1, self.hidden_size)
        num_tokens = flat_hidden.shape[0]

        # Compute routing scores
        # (num_tokens, num_experts)
        router_logits = self.router(flat_hidden)
        router_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts for each token
        # (num_tokens, k)
        top_k_probs, top_k_indices = torch.topk(
            router_probs,
            self.num_experts_per_token,
            dim=-1
        )

        # Renormalize selected probabilities
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Compute auxiliary load balancing loss
        aux_loss = self._compute_aux_loss(router_probs, top_k_indices)

        # Process tokens through selected experts
        # This is the tricky part for efficiency!
        output = self._process_experts(
            flat_hidden, top_k_indices, top_k_probs
        )

        # Reshape back to (batch_size, seq_len, hidden_size)
        output = output.view(batch_size, seq_len, self.hidden_size)

        return output, aux_loss

    def _process_experts(self, hidden, indices, weights):
        """
        Process tokens through their selected experts.

        Naive implementation: loop over experts.
        Efficient implementation: batch tokens by expert.
        """
        num_tokens = hidden.shape[0]
        output = torch.zeros_like(hidden)

        # For each expert, find which tokens selected it
        for expert_idx in range(self.num_experts):
            expert = self.experts[expert_idx]

            # Find (token_idx, slot) pairs where this expert was selected
            # indices shape: (num_tokens, k)
            mask = (indices == expert_idx)  # (num_tokens, k)

            if not mask.any():
                continue

            # Get token indices and their weights for this expert
            token_indices = mask.any(dim=-1).nonzero(as_tuple=True)[0]

            if len(token_indices) == 0:
                continue

            # Get the tokens assigned to this expert
            expert_input = hidden[token_indices]  # (n_tokens, hidden)

            # Process through expert
            expert_output = expert(expert_input)  # (n_tokens, hidden)

            # Get weights for each token-expert pair
            token_weights = weights[mask].view(-1, 1)  # (n_tokens, 1)

            # Accumulate weighted outputs
            output.index_add_(
                0,
                token_indices,
                expert_output * token_weights
            )

        return output

    def _compute_aux_loss(self, router_probs, selected_indices):
        """
        Compute auxiliary loss for load balancing.

        Loss = α * num_experts * Σᵢ (fᵢ * Pᵢ)

        where:
        - fᵢ = fraction of tokens routed to expert i
        - Pᵢ = average routing probability for expert i
        """
        num_tokens = router_probs.shape[0]

        # Compute fraction of tokens per expert (f)
        # Count how many tokens selected each expert (across all k slots)
        expert_counts = torch.zeros(
            self.num_experts,
            device=router_probs.device
        )
        for k in range(self.num_experts_per_token):
            expert_counts.scatter_add_(
                0,
                selected_indices[:, k],
                torch.ones_like(selected_indices[:, k], dtype=torch.float)
            )

        # Normalize to get fractions
        f = expert_counts / (num_tokens * self.num_experts_per_token)

        # Compute average probability per expert (P)
        P = router_probs.mean(dim=0)

        # Compute balance loss
        aux_loss = self.aux_loss_coef * self.num_experts * (f * P).sum()

        return aux_loss

class ExpertFFN(nn.Module):
    """
    Single expert FFN with SwiGLU activation.
    Same architecture as dense transformer FFN.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()

        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: gate * up, then down
        gate = F.silu(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)

Efficient Expert Batching

The naive implementation above loops over experts. Production implementations batch tokens by expert for GPU efficiency:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    EFFICIENT EXPERT BATCHING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  NAIVE APPROACH (Loop):                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  for expert in experts:                                                 │
│      tokens_for_expert = select(all_tokens, expert_id)                 │
│      outputs = expert(tokens_for_expert)                               │
│      scatter_back(outputs)                                             │
│                                                                          │
│  Problem: 8 sequential kernel launches, low GPU utilization           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EFFICIENT APPROACH (Grouped GEMM):                                      │
│  ────────────────────────────────────                                    │
│                                                                          │
│  1. Sort tokens by their primary expert assignment                     │
│  2. Create "groups" of contiguous tokens for each expert               │
│  3. Use grouped GEMM to process all groups in one kernel              │
│                                                                          │
│  Before sorting:                                                        │
│  tokens: [T1, T2, T3, T4, T5, T6, T7, T8]                             │
│  experts: [E2, E1, E2, E3, E1, E2, E3, E1]                            │
│                                                                          │
│  After sorting by expert:                                              │
│  tokens: [T2, T5, T8, T1, T3, T6, T4, T7]                             │
│  experts: [E1, E1, E1, E2, E2, E2, E3, E3]                            │
│  groups:  [----E1----] [----E2----] [--E3--]                          │
│                                                                          │
│  Grouped GEMM processes all three groups efficiently.                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MEGABLOCKS LIBRARY:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  Stanford's megablocks library provides efficient MoE kernels:        │
│                                                                          │
│  from megablocks.layers import dmoe                                    │
│                                                                          │
│  moe_layer = dmoe.dMoE(                                                │
│      hidden_size=4096,                                                 │
│      ffn_hidden_size=14336,                                            │
│      num_experts=8,                                                    │
│      top_k=2,                                                          │
│  )                                                                      │
│                                                                          │
│  Key optimizations:                                                    │
│  • Block-sparse matrix operations                                     │
│  • Efficient token permutation                                         │
│  • Fused kernels for routing + expert computation                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VII: Training MoE Models

Training Considerations

Training MoE models requires attention to several unique challenges:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MOE TRAINING CONSIDERATIONS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. BATCH SIZE REQUIREMENTS:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  MoE needs larger batch sizes than dense models:                       │
│  • Each expert should see enough tokens for meaningful gradients       │
│  • With 8 experts, each sees ~12.5% of tokens on average              │
│  • Need batch_size * seq_len >> num_experts for stable training       │
│                                                                          │
│  Typical: batch_size >= 1024 tokens per expert                        │
│  For 8 experts: total batch >= 8192 tokens                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. AUXILIARY LOSS TUNING:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  L_total = L_LM + α × L_balance                                        │
│                                                                          │
│  α too low (0.001):                                                    │
│  • Load balancing ignored                                              │
│  • Expert collapse likely                                              │
│  • Wasted parameters                                                   │
│                                                                          │
│  α too high (0.1):                                                     │
│  • Router becomes uniform                                              │
│  • No specialization                                                   │
│  • Defeats purpose of MoE                                              │
│                                                                          │
│  Sweet spot: 0.01 - 0.02                                               │
│  May need tuning per model/dataset                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. GRADIENT NOISE:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Each expert sees different subsets of tokens each batch.             │
│  This creates higher gradient variance than dense models.             │
│                                                                          │
│  Mitigations:                                                           │
│  • Larger batch sizes                                                  │
│  • Lower learning rate                                                 │
│  • Gradient clipping                                                   │
│  • Expert parallelism (experts across devices see all tokens)        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. LEARNING RATE SCHEDULES:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Router often needs different learning rate than experts:             │
│                                                                          │
│  • Router LR: 10× lower than experts                                  │
│    - Prevents router from changing too fast                           │
│    - Allows experts to adapt to routing decisions                     │
│                                                                          │
│  • Expert LR: Standard transformer LR                                 │
│    - Same as dense FFN would use                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  5. INITIALIZATION:                                                      │
│  ─────────────────                                                       │
│                                                                          │
│  All experts typically initialized identically.                        │
│  Specialization emerges from:                                          │
│  • Different tokens routed to different experts                       │
│  • Different gradient updates per expert                              │
│  • Self-reinforcing specialization                                    │
│                                                                          │
│  Router initialized to produce uniform distribution initially.        │
│  Small random noise breaks symmetry.                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Distributed Training for MoE

Large MoE models require distributed training strategies:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED MOE TRAINING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXPERT PARALLELISM:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  Place different experts on different devices:                         │
│                                                                          │
│  GPU 0: Experts 1, 2                                                   │
│  GPU 1: Experts 3, 4                                                   │
│  GPU 2: Experts 5, 6                                                   │
│  GPU 3: Experts 7, 8                                                   │
│                                                                          │
│  All-to-all communication:                                             │
│  1. Each GPU has subset of tokens                                     │
│  2. Route tokens: GPU 0 tokens may need Experts on GPU 1-3           │
│  3. All-to-all exchange: send tokens to GPUs with their experts      │
│  4. Process through experts locally                                    │
│  5. All-to-all exchange: return outputs to original GPUs             │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                    ALL-TO-ALL COMMUNICATION                      │  │
│  │                                                                  │  │
│  │   GPU 0          GPU 1          GPU 2          GPU 3            │  │
│  │   [T1,T2]        [T3,T4]        [T5,T6]        [T7,T8]          │  │
│  │     │              │              │              │               │  │
│  │     └──────────────┼──────────────┼──────────────┘               │  │
│  │            ┌───────┴───────┐      │                              │  │
│  │            │  All-to-All   │      │                              │  │
│  │            └───────────────┘      │                              │  │
│  │     ┌──────────────┬──────────────┼──────────────┐               │  │
│  │     ▼              ▼              ▼              ▼               │  │
│  │   [T1,T4]        [T2,T6]        [T3,T7]        [T5,T8]          │  │
│  │   (for E1,2)     (for E3,4)     (for E5,6)     (for E7,8)      │  │
│  │                                                                  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMUNICATION COSTS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Each token: 2 × hidden_size × dtype_size bytes                       │
│  For 4096 hidden, BF16: 2 × 4096 × 2 = 16 KB per token              │
│                                                                          │
│  With 8 GPUs, batch of 8192 tokens:                                   │
│  ~130 MB all-to-all communication per MoE layer                       │
│                                                                          │
│  This is significant! Efficient all-to-all is critical.              │
│  NVLink helps but doesn't eliminate the cost.                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMBINING WITH OTHER PARALLELISM:                                       │
│  ─────────────────────────────────                                       │
│                                                                          │
│  Real training combines:                                               │
│  • Data parallelism: Replicate model, split data                      │
│  • Tensor parallelism: Split attention/FFN within node               │
│  • Expert parallelism: Distribute experts across nodes               │
│  • Pipeline parallelism: Split layers across nodes                   │
│                                                                          │
│  Example: Mixtral training                                             │
│  • 64 GPUs total                                                      │
│  • 8-way expert parallelism (1 expert per GPU)                       │
│  • 8-way data parallelism                                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VIII: Inference Optimization

MoE Inference Challenges

MoE presents unique inference challenges:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MOE INFERENCE CHALLENGES                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. MEMORY BANDWIDTH:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Even though only 2 experts active, ALL experts must fit in memory.   │
│                                                                          │
│  Mixtral 8x7B:                                                         │
│  • Active params: ~13B (2 experts + shared)                           │
│  • Total params: ~47B (8 experts + shared)                            │
│  • Memory needed: ~94GB in FP16                                       │
│                                                                          │
│  Compare to dense 13B: ~26GB in FP16                                  │
│  MoE needs 3.6× more memory for same inference compute!              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. LOAD IMBALANCE AT INFERENCE:                                        │
│  ───────────────────────────────                                         │
│                                                                          │
│  Training uses large batches → good balance across experts            │
│  Inference often uses small batches → high variance in routing       │
│                                                                          │
│  Batch of 1 token: only 2 experts needed, 6 experts idle             │
│  But all 8 experts occupy memory!                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. SPECULATIVE DECODING COMPLICATIONS:                                  │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  Draft model routing != target model routing                          │
│  • Draft routes to different experts than target would               │
│  • Verification must re-route through correct experts                │
│  • Reduces speculation benefits                                       │
│                                                                          │
│  Solutions:                                                            │
│  • Use same routing for draft and target (approximate)               │
│  • Train draft to mimic target's routing                             │
│  • Accept some inefficiency                                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

MoE-Specific Optimizations

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MOE INFERENCE OPTIMIZATIONS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. EXPERT OFFLOADING:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Keep only frequently-used experts on GPU.                            │
│  Load others from CPU/disk on demand.                                 │
│                                                                          │
│  For Mixtral on 24GB GPU:                                             │
│  • Keep 2-4 most popular experts on GPU                               │
│  • Offload rest to CPU RAM                                            │
│  • Load on demand (adds latency but fits in memory)                  │
│                                                                          │
│  Works because routing is often predictable.                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. EXPERT QUANTIZATION:                                                 │
│  ────────────────────────                                                │
│                                                                          │
│  Quantize experts independently:                                       │
│  • Popular experts: FP16/INT8 (quality matters more)                 │
│  • Rare experts: INT4 (less impact on overall quality)               │
│                                                                          │
│  Can also quantize KV cache per-expert.                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. BATCHING FOR MOE:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Continuous batching helps MoE especially:                            │
│  • More tokens → better load balance across experts                  │
│  • Higher GPU utilization                                             │
│                                                                          │
│  Batch 1:  Token routes to E1, E3 → 2 experts used                   │
│  Batch 32: Tokens distributed → all 8 experts used                   │
│                                                                          │
│  vLLM handles this automatically.                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. EXPERT CACHING:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Cache expert outputs for repeated inputs:                            │
│  • Same token in same context → same routing → same output           │
│  • Useful for prefix caching                                          │
│                                                                          │
│  Implementation: Hash (token, context, layer) → cached expert output │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part IX: Recent Innovations (2024-2025)

Multi-head Latent Attention (MLA)

DeepSeek introduced Multi-head Latent Attention to dramatically reduce KV cache memory in MoE models:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-HEAD LATENT ATTENTION (MLA)                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Standard attention stores full K, V for each token:                   │
│  KV cache per token = 2 × num_heads × head_dim × num_layers           │
│                                                                          │
│  For DeepSeek-V3 (without MLA):                                        │
│  = 2 × 128 × 128 × 61 = ~2MB per token!                               │
│  128K context → 256GB KV cache                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MLA SOLUTION:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Compress K, V into low-dimensional latent vector:                     │
│                                                                          │
│  Standard: K, V ∈ R^(num_heads × head_dim) = R^16384                  │
│  MLA:      c_kv ∈ R^512 (compressed latent)                           │
│                                                                          │
│  At inference:                                                          │
│  1. Store only latent c_kv (32× smaller!)                             │
│  2. When computing attention, project back:                           │
│     K = c_kv × W_uk,  V = c_kv × W_uv                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MEMORY SAVINGS:                                                         │
│                                                                          │
│  Method          KV Cache/Token    128K Context                        │
│  ─────────────────────────────────────────────────────                  │
│  Standard MHA    2MB               256 GB                              │
│  GQA (8 groups)  250KB             32 GB                               │
│  MLA             62KB              8 GB                                │
│                                                                          │
│  MLA achieves 32× reduction vs standard MHA!                          │
│  Even better than GQA while maintaining quality.                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Auxiliary-Loss-Free Load Balancing

Traditional MoE uses auxiliary losses to balance expert load, which can hurt model quality. DeepSeek pioneered a loss-free approach:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    AUXILIARY-LOSS-FREE BALANCING                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL APPROACH (Auxiliary Loss):                                  │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  L_total = L_language + α × L_balance                                  │
│                                                                          │
│  Problem: L_balance gradient interferes with L_language.               │
│  Tuning α is difficult—too high hurts quality, too low causes collapse.│
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEEPSEEK'S APPROACH (Bias Adjustment):                                  │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  Instead of auxiliary loss, dynamically adjust expert biases:          │
│                                                                          │
│  router_scores = softmax(h × W_router + bias)                          │
│                                                                          │
│  If expert_i is overloaded: decrease bias_i                           │
│  If expert_i is underused: increase bias_i                            │
│                                                                          │
│  Bias updates DON'T flow gradients to the main loss!                  │
│  Load balancing is decoupled from language modeling.                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│                                                                          │
│  • No hyperparameter α to tune                                        │
│  • No interference with language modeling objective                   │
│  • Achieves better balance than auxiliary loss                        │
│  • Simpler training dynamics                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

FP8 Training at Scale

DeepSeek-V3 was the first model to validate FP8 training at extreme scale (671B parameters):

Mixed precision: FP8 for most operations, higher precision for sensitive ops
Computation-communication overlap: Nearly full overlap in cross-node MoE
Training efficiency: 2.79M H800 GPU hours (vs. estimated 10M+ for comparable dense model)

Part X: Notable MoE Models

Model Comparison

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    NOTABLE MOE MODELS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MIXTRAL 8x7B (Mistral AI, 2023):                                       │
│  ─────────────────────────────────                                       │
│  • 8 experts per layer, top-2 routing                                 │
│  • 47B total, ~13B active                                             │
│  • Matches Llama 2 70B quality at 6× lower inference cost            │
│  • Open weights, very popular                                         │
│                                                                          │
│  MIXTRAL 8x22B (Mistral AI, 2024):                                      │
│  ──────────────────────────────────                                      │
│  • 8 experts per layer, top-2 routing                                 │
│  • 176B total, ~44B active                                            │
│  • State-of-the-art open model at release                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SWITCH TRANSFORMER (Google, 2022):                                     │
│  ───────────────────────────────────                                     │
│  • Up to 2048 experts, top-1 routing                                  │
│  • 1.6T total parameters                                              │
│  • Research model, demonstrated MoE scaling                           │
│                                                                          │
│  GLAIVE (Google, 2023):                                                 │
│  ──────────────────────                                                  │
│  • 64 experts, fine-grained routing                                   │
│  • Efficient training with expert parallelism                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEEPSEEK-V2 (DeepSeek, 2024):                                          │
│  ──────────────────────────────                                          │
│  • 2 shared experts + 160 routed experts                              │
│  • 236B total, ~21B active (6 routed + 2 shared)                     │
│  • Multi-head Latent Attention (MLA) for KV cache compression        │
│  • Pioneered auxiliary-loss-free load balancing                       │
│                                                                          │
│  DEEPSEEK-V3 (DeepSeek, Dec 2024):                                      │
│  ──────────────────────────────────                                      │
│  • 671B total, ~37B active (1 shared + 8 routed per token)           │
│  • MLA: Compresses KV cache to low-dimensional latent vectors        │
│  • FP8 mixed precision training (first at this scale)                │
│  • Trained on 14.8T tokens in only 2.79M H800 GPU hours              │
│  • Outperforms GPT-4 at 1/10th the cost                              │
│                                                                          │
│  QWEN3 MOE (Alibaba, 2025):                                             │
│  ───────────────────────────                                             │
│  • Qwen3-235B-A22B and Qwen3-30B-A3B (similar sparsity to DeepSeek)  │
│  • Removed shared expert (earlier Qwen2.5-MoE used shared expert)    │
│  • Trained on 36T tokens, supports 119 languages                     │
│  • Qwen3 Next 80B-A3B released September 2025                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LLAMA 4 (Meta, April 2025):                                            │
│  ────────────────────────────                                            │
│  • Scout: 109B total, 17B active, 16 experts, up to 10M context      │
│  • Maverick: 400B total, 17B active, 128 experts, 1M context         │
│  • Behemoth (teacher): 2T params, 288B active, 16 experts            │
│  • Uses shared + routed experts (like DeepSeek-V3)                   │
│  • Trained on 40T tokens, natively multimodal                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KIMI K2 (Moonshot AI, July 2025):                                      │
│  ──────────────────────────────────                                      │
│  • 1.04T total parameters, 32B active                                 │
│  • 384 experts (8 routed + 1 shared per token)                       │
│  • MLA attention (like DeepSeek), 7168 hidden dim                    │
│  • MuonClip optimizer: Stabilizes trillion-param training            │
│  • 128K context, top-1 on LMSYS Arena for open-source                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPENAI GPT-OSS-120B (OpenAI, 2025):                                    │
│  ─────────────────────────────────────                                   │
│  • 128 routed experts, 4 activated per token                         │
│  • First open-weight MoE from OpenAI                                 │
│  • Part of trend toward 256-384 expert configurations                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEEPSEEK-V3.2 (DeepSeek, Dec 2025):                                    │
│  ─────────────────────────────────────                                   │
│  • 685B total, 37B active (same architecture as V3)                  │
│  • V3.2-Speciale variant surpasses GPT-5 on reasoning               │
│  • Gold medal performance on IMO 2025 and IOI                        │
│  • mHC: Manifold-Constrained Hyper-Connections (Dec 31)              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  QWEN3-NEXT (Alibaba, Sept 2025):                                       │
│  ─────────────────────────────────                                       │
│  • 80B total, 3B active (extreme sparsity: 26.7×)                   │
│  • Hybrid: MoE + Gated DeltaNet + Gated Attention                   │
│  • Multi-Token Prediction (MTP) for efficiency                      │
│  • 256K native context, extendable to 1M                            │
│  • 10× inference throughput vs Qwen3-32B at 32K+ context           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  QWEN3-MAX (Preview, 2025):                                             │
│  ──────────────────────────                                              │
│  • 1T+ parameters (largest non-thinking Qwen model)                 │
│  • Ranked #6 in Text Arena                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MIMO-V2-FLASH (Xiaomi, 2025):                                          │
│  ──────────────────────────────                                          │
│  • 309B total, 15B active (20.6× efficiency)                        │
│  • Ultra-fast for reasoning, coding, agentic workflows              │
│  • 256K context window with hybrid "thinking" mode                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ARCHITECTURE COMPARISON (2025):                                        │
│                                                                          │
│  Model             Experts   Top-K   Total     Active    Efficiency    │
│  ────────────────────────────────────────────────────────────────      │
│  Mixtral 8x7B      8         2       47B       13B       3.6×          │
│  Mixtral 8x22B     8         2       176B      44B       4.0×          │
│  DeepSeek-V2       160+2     6+2     236B      21B       11.2×         │
│  DeepSeek-V3/V3.2  256+1     8+1     685B      37B       18.5×         │
│  Qwen3-235B        ~256      8       235B      22B       10.7×         │
│  Qwen3-Next        ~256      MoE+DN  80B       3B        26.7×         │
│  MiMo-V2-Flash     MoE       varies  309B      15B       20.6×         │
│  Llama 4 Scout     16+1      1+1     109B      17B       6.4×          │
│  Llama 4 Maverick  128+1     1+1     400B      17B       23.5×         │
│  Llama 4 Behemoth  16+1      1+1     2T        288B      6.9×          │
│  Kimi K2           384+1     8+1     1.04T     32B       32.5×         │
│  gpt-oss-120B      128       4       ~500B     ~120B     ~4.2×         │
│                                                                          │
│  Note: "+N" = shared experts, MoE+DN = hybrid MoE + Gated DeltaNet    │
│  Efficiency = Total Params / Active Params                             │
│                                                                          │
│  2025 TRENDS:                                                           │
│  • Hybrid architectures (Qwen3-Next: MoE + linear attention)          │
│  • Extreme sparsity (3B active from 80B total = 26× efficiency)       │
│  • Multi-Token Prediction (MTP) for faster inference                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Summary

Mixture of Experts represents a fundamental shift in how we think about model scaling. By decoupling total parameters (capacity) from active parameters (compute cost), MoE enables models with dramatically more knowledge at tractable inference costs.

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CORE CONCEPT:                                                           │
│  • Replace single FFN with multiple "expert" FFNs                      │
│  • Router selects top-k experts per token                              │
│  • Sparse activation: many params, few used per token                  │
│                                                                          │
│  EFFICIENCY GAINS:                                                       │
│  • 4-6× more parameters per FLOP than dense models                    │
│  • Mixtral 8x7B matches Llama 70B at fraction of cost                │
│  • Enables trillion-parameter models                                   │
│                                                                          │
│  CRITICAL CHALLENGES:                                                    │
│  • Load balancing: Prevent expert collapse                            │
│  • Memory: All experts must fit, even if few active                   │
│  • Training: Needs large batches, careful auxiliary loss tuning       │
│                                                                          │
│  SOLUTIONS:                                                              │
│  • Auxiliary balance loss: Penalize uneven routing                    │
│  • Capacity factors: Limit tokens per expert                          │
│  • Expert parallelism: Distribute experts across devices              │
│                                                                          │
│  PRACTICAL IMPLICATIONS:                                                 │
│  • MoE likely the future of LLM scaling                               │
│  • Inference needs MoE-aware optimizations                            │
│  • Memory requirements higher than compute suggests                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Table of Contents

Why Mixture of Experts Matters

Part I: The Core Idea

What Are "Experts"?

The Key Insight: Sparse Activation

Part II: The Router

How Routing Works

Top-1 vs Top-2 Routing

Part III: The Load Balancing Problem

Why Balance Matters

Auxiliary Losses for Balance

Capacity Factor and Token Dropping

Part IV: Expert Specialization

What Do Experts Learn?

Layer-Wise Routing Patterns

Part V: Architecture Variants

Where to Place MoE Layers

Expert Granularity

Part VI: Implementation Details

Efficient Routing Implementation

Efficient Expert Batching

Part VII: Training MoE Models

Training Considerations

Distributed Training for MoE

Part VIII: Inference Optimization

MoE Inference Challenges

MoE-Specific Optimizations

Part IX: Recent Innovations (2024-2025)

Multi-head Latent Attention (MLA)

Auxiliary-Loss-Free Load Balancing

FP8 Training at Scale

Part X: Notable MoE Models

Model Comparison

Summary

Sources

Frequently Asked Questions

Enrico Piovano, PhD

Related Articles

Transformer Architecture: A Complete Deep Dive

LLM Inference Optimization: From Quantization to Speculative Decoding

Distributed Training: How to Train 70B+ Parameter Models

Open-Source LLMs: The Complete 2025 Guide