Mixture of Experts: Scaling LLMs Beyond Dense Models
A comprehensive deep dive into Mixture of Experts (MoE) architecture—how models like Mixtral and GPT-4 achieve massive capacity without proportional compute costs. Understand routing mechanisms, expert specialization, load balancing, and why MoE represents the future of LLM scaling.
Why Mixture of Experts Matters
The scaling laws that have driven LLM progress face a fundamental tension: larger models perform better, but larger models also cost more to run. A 70B parameter model produces better outputs than a 7B model, but requires ~10× more compute per token. This creates a painful tradeoff between quality and cost.
Mixture of Experts (MoE) breaks this tradeoff. An MoE model can have the total parameters of a 400B model but the inference cost of a 70B model. The secret: not all parameters are used for every token. Instead, a "router" dynamically selects which subset of parameters (which "experts") to use, based on the input.
2025: MoE dominates frontier AI. According to NVIDIA, the top 10 most intelligent open-source models on the Artificial Analysis leaderboard all use MoE architecture. The leading MoE models in 2025 include:
| Model | Total Params | Active Params | Experts | Configuration |
|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | 256 + 1 shared | 8 routed + 1 shared |
| Qwen3-235B | 235B | 22B | 128 | Top-8 routing |
| Llama 4 Maverick | ~400B | ~17B | 128 + 1 shared | 1 routed + 1 shared |
| OpenAI gpt-oss-120B | 120B | ~5B | 128 | 4 per token |
Key architectural innovations in 2025:
- Shared experts: DeepSeek, Llama 4, and GLM-4.5 use 1 shared expert activated for all tokens plus routed experts, improving convergence stability
- Dense warmup: GLM-4.5 uses 3 dense layers before MoE blocks—early MoE routing can interfere with feature extraction
- Varied expert sizes: Llama 4 uses fewer but larger experts (2 active experts with an 8192-dim FFN) vs DeepSeek's many smaller experts (9 active experts with a 2048-dim FFN)
Understanding MoE is essential because it represents the present and future of LLM scaling. As we push toward larger and more capable models, the economics of dense models become untenable. MoE offers a path to continued scaling without proportional cost increases.
This post covers MoE architecture from first principles: what experts are, how routing works, the challenges of load balancing, and practical implementation details. By the end, you'll understand how MoE models achieve their remarkable efficiency and where the field is heading.
Part I: The Core Idea
What Are "Experts"?
In a standard transformer, every token passes through the same Feed-Forward Network (FFN) in each layer. As we covered in the transformer architecture post, the FFN contains approximately 67% of the model's parameters. It's the parameter-heavy component.
An MoE layer replaces this single FFN with multiple FFNs called "experts," plus a router that decides which expert(s) each token should use:
┌─────────────────────────────────────────────────────────────────────────┐
│ DENSE VS MIXTURE OF EXPERTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DENSE TRANSFORMER (Standard): │
│ ───────────────────────────── │
│ │
│ Token │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ FFN │ ← Every token uses the SAME FFN │
│ │ (67% of │ │
│ │ params) │ │
│ └─────────┘ │
│ │ │
│ ▼ │
│ Output │
│ │
│ All parameters used for every token. │
│ Cost scales linearly with parameters. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MIXTURE OF EXPERTS: │
│ ─────────────────── │
│ │
│ Token │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Router │ ← Small network decides which experts to use │
│ └─────────┘ │
│ │ │
│ ┌────┴────┬────────┬────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│  │Expert│  │Expert│  │Expert│  │Expert│  ← 8 experts, 4 shown (8× FFN params)  │
│ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │ │ │ │ │
│ │ ┌────┘ │ │ │
│ │ │ ┌────────┘ │ │
│ ▼ ▼ ▼ │ (only 2 selected per token) │
│ ┌─────────┐ │ │
│ │ Combine │ ←───────────────┘ │
│ └─────────┘ │
│ │ │
│ ▼ │
│ Output │
│ │
│ 8× total parameters, but only 2 experts used per token. │
│ Cost ~2× dense FFN, but capacity ~8× dense FFN. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Key Insight: Sparse Activation
The breakthrough of MoE is sparse activation: having many parameters but using only a fraction for any given input. This creates a separation between:
- Total parameters: All weights stored in memory (determines model capacity/knowledge)
- Active parameters: Weights used per forward pass (determines compute cost)
A dense 70B model has 70B total = 70B active parameters. An MoE with 8 experts of 7B each has 56B total parameters but might activate only 14B (2 experts) per token.
This matters because model quality correlates with total parameters (more storage for knowledge), while inference cost correlates with active parameters. MoE gives you more knowledge per FLOP.
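As a rough sanity check of these numbers, here is a back-of-the-envelope parameter count. It is only a sketch with illustrative, Mixtral-like dimensions; real models also have embeddings, norms, and a router, which are left out.
def ffn_params(d_model: int, d_ff: int) -> int:
    # SwiGLU FFN: gate, up, and down projections
    return 3 * d_model * d_ff

d_model, d_ff, n_layers = 4096, 14336, 32
attn_per_layer = 4 * d_model * d_model                   # Q, K, V, O projections

num_experts, top_k = 8, 2
dense_total = n_layers * (attn_per_layer + ffn_params(d_model, d_ff))
moe_total   = n_layers * (attn_per_layer + num_experts * ffn_params(d_model, d_ff))
moe_active  = n_layers * (attn_per_layer + top_k * ffn_params(d_model, d_ff))

print(f"dense model: {dense_total / 1e9:.1f}B total (= active)")   # ~7.8B
print(f"MoE total:   {moe_total / 1e9:.1f}B")                      # ~47B
print(f"MoE active:  {moe_active / 1e9:.1f}B")                     # ~13B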
┌─────────────────────────────────────────────────────────────────────────┐
│ PARAMETER EFFICIENCY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DENSE MODEL SCALING: │
│ ──────────────────── │
│ │
│ Model Total Params Active Params Inference Cost │
│ ────────────────────────────────────────────────────────── │
│ 7B 7B 7B 1× │
│ 13B 13B 13B 1.9× │
│ 70B 70B 70B 10× │
│ 175B 175B 175B 25× │
│ │
│ Quality scales with params, but so does cost. Linear relationship. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MOE MODEL SCALING: │
│ ────────────────── │
│ │
│ Model Total Params Active Params Inference Cost │
│ ────────────────────────────────────────────────────────────── │
│ Mixtral 8x7B 47B 13B ~2× │
│ GPT-4 (est.) ~1.8T ~280B ~40× │
│ Switch-C 1.6T ~100B ~15× │
│ │
│ Mixtral: 47B capacity at 13B cost! │
│          ~3.6× more parameters per FLOP than a dense 13B model.           │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS WORKS: │
│ ─────────────── │
│ │
│ 1. Not all knowledge needed for every token │
│ "Paris" activates geography experts │
│ "def function" activates code experts │
│ Different experts for different domains │
│ │
│ 2. FFN is where most parameters live │
│ MoE multiplies FFN capacity without multiplying attention │
│ Since FFN = 67% of params, this is highly effective │
│ │
│ 3. Inference is memory-bandwidth bound │
│ Loading 13B active params is ~4× faster than 47B │
│ Compute for 13B vs 47B is similar (GPU has spare compute) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part II: The Router
How Routing Works
The router is a small neural network that takes a token's hidden state and outputs a probability distribution over experts. The top-k experts (usually k=1 or k=2) are selected, and their outputs are combined weighted by the routing probabilities.
┌─────────────────────────────────────────────────────────────────────────┐
│ ROUTER MECHANISM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ROUTER ARCHITECTURE: │
│ ──────────────────── │
│ │
│ Input: Token hidden state h ∈ R^d (e.g., d=4096) │
│ │
│ Step 1: Compute router logits │
│ ───────────────────────────── │
│ router_logits = h × W_router (W_router ∈ R^(d × num_experts)) │
│ │
│ Example with 8 experts: │
│ h = [0.2, -0.5, 0.8, ...] (4096 dims) │
│ router_logits = [2.1, 0.3, -0.5, 3.2, 0.1, -0.8, 1.5, 0.9] │
│ E1 E2 E3 E4 E5 E6 E7 E8 │
│ │
│ Step 2: Apply softmax to get routing probabilities │
│ ───────────────────────────────────────────────── │
│ router_probs = softmax(router_logits) │
│ router_probs = [0.15, 0.03, 0.01, 0.45, 0.02, 0.01, 0.08, 0.05] │
│ │
│ Step 3: Select top-k experts (k=2 typical) │
│ ───────────────────────────────────────────── │
│ top_experts = [E4, E1] (indices 3, 0) │
│ top_probs = [0.45, 0.15] │
│ │
│ Step 4: Renormalize selected probabilities │
│ ───────────────────────────────────────── │
│ weights = [0.45, 0.15] / (0.45 + 0.15) = [0.75, 0.25] │
│ │
│ Step 5: Compute weighted combination │
│ ───────────────────────────────────── │
│ output = 0.75 × Expert4(h) + 0.25 × Expert1(h) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ VISUAL REPRESENTATION: │
│ │
│ h (hidden state) │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ W_router │ (d × num_experts) │
│ │ (linear) │ │
│ └───────────────┘ │
│ │ │
│ ▼ │
│ [2.1, 0.3, -0.5, 3.2, 0.1, -0.8, 1.5, 0.9] (logits) │
│ │ │
│ ▼ │
│ softmax + top-k │
│ │ │
│ ┌──────┴──────┐ │
│ ▼ ▼ │
│ E4 (w=0.75) E1 (w=0.25) │
│ │ │ │
│ ▼ ▼ │
│ Expert4(h) Expert1(h) │
│ │ │ │
│ └──────┬──────┘ │
│ ▼ │
│ weighted sum │
│ │ │
│ ▼ │
│ output │
│ │
└─────────────────────────────────────────────────────────────────────────┘
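The five routing steps above can be reproduced in a few lines of PyTorch. The logits are the illustrative values from the walkthrough, and the experts here are stand-in linear layers rather than real FFNs.
import torch
import torch.nn.functional as F

router_logits = torch.tensor([2.1, 0.3, -0.5, 3.2, 0.1, -0.8, 1.5, 0.9])  # Step 1 (given)
router_probs = F.softmax(router_logits, dim=-1)                            # Step 2
top_probs, top_idx = torch.topk(router_probs, k=2)                         # Step 3: experts 4 and 1 (indices 3, 0)
weights = top_probs / top_probs.sum()                                      # Step 4: renormalize to sum to 1

h = torch.randn(4096)                                                      # token hidden state (stand-in)
experts = [torch.nn.Linear(4096, 4096) for _ in range(8)]                  # stand-in experts
output = sum(w * experts[int(i)](h) for w, i in zip(weights, top_idx))     # Step 5: weighted combination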
Top-1 vs Top-2 Routing
The choice of how many experts to activate per token (k) involves tradeoffs:
┌─────────────────────────────────────────────────────────────────────────┐
│ TOP-K SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TOP-1 ROUTING (k=1): │
│ ──────────────────── │
│ Each token goes to exactly one expert. │
│ │
│ Advantages: │
│ • Lowest compute cost (1 expert per token) │
│ • Simplest implementation │
│ • Clear expert specialization │
│ │
│ Disadvantages: │
│ • Hard routing decisions (no blending) │
│ • More sensitive to routing errors │
│ • Less stable training │
│ │
│ Used by: Switch Transformer, some efficient MoE variants │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOP-2 ROUTING (k=2): │
│ ──────────────────── │
│ Each token goes to two experts, outputs combined. │
│ │
│ Advantages: │
│ • Soft blending between experts │
│ • More robust to routing errors │
│ • Smoother expert utilization │
│ • Generally better quality │
│ │
│ Disadvantages: │
│ • 2× compute vs top-1 │
│ • More complex combining logic │
│ │
│ Used by: Mixtral, most production MoE models │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXPERT CHOICE ROUTING (alternative): │
│ ───────────────────────────────────── │
│ Instead of tokens choosing experts, experts choose tokens. │
│ │
│ Each expert selects its top-k tokens from the batch. │
│ Guarantees perfect load balance! │
│ │
│ Problem: Some tokens might not be selected by any expert. │
│ Solution: Ensure enough experts that all tokens get processed. │
│ │
│ Used by: Expert Choice paper (Google), some efficient variants │
│ │
└─────────────────────────────────────────────────────────────────────────┘
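As a contrast to token-choice routing, here is a minimal sketch of expert-choice routing, where each expert selects its own top tokens from the batch. The capacity value and tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def expert_choice_route(router_logits: torch.Tensor, capacity: int):
    """router_logits: (num_tokens, num_experts) scores for one batch.
    Each expert picks its `capacity` highest-scoring tokens."""
    probs = F.softmax(router_logits, dim=-1)           # token -> expert affinities
    per_expert = probs.t()                             # (num_experts, num_tokens)
    weights, token_idx = torch.topk(per_expert, k=capacity, dim=-1)
    return token_idx, weights                          # expert i processes tokens token_idx[i]

logits = torch.randn(16, 4)                            # 16 tokens, 4 experts
token_idx, weights = expert_choice_route(logits, capacity=4)
# Load is perfectly balanced by construction, but a token may be picked by zero experts
# (it then relies on the residual stream) or by several experts.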
Part III: The Load Balancing Problem
Why Balance Matters
MoE has a critical failure mode: if the router learns to send all tokens to a small subset of experts, the other experts never get trained and become useless. This "expert collapse" wastes parameters and defeats the purpose of MoE.
The problem emerges naturally from gradient descent. If Expert 1 happens to perform slightly better early in training, more tokens get routed to it. With more tokens, Expert 1 gets more gradient updates and improves further. Meanwhile, underused experts get few updates and stagnate. This positive feedback loop leads to collapse.
┌─────────────────────────────────────────────────────────────────────────┐
│ LOAD BALANCING PROBLEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE COLLAPSE FAILURE MODE: │
│ ────────────────────────── │
│ │
│ Training begins: │
│ E1: 12.5% E2: 12.5% E3: 12.5% E4: 12.5% (8 experts, uniform) │
│ E5: 12.5% E6: 12.5% E7: 12.5% E8: 12.5% │
│ │
│ After 1000 steps (slight imbalance emerges): │
│ E1: 15% E2: 14% E3: 13% E4: 12% │
│ E5: 11% E6: 10% E7: 8% E8: 7% │
│ │
│ After 10000 steps (rich get richer): │
│ E1: 45% E2: 30% E3: 15% E4: 5% │
│ E5: 3% E6: 1% E7: 0.5% E8: 0.5% │
│ │
│ Collapsed state: │
│ E1: 90% E2: 8% E3: 2% E4-E8: ~0% │
│ │
│ Result: 8× parameters but effectively 1-2 experts used. │
│ Massive waste of capacity. Defeats the purpose of MoE. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY COLLAPSE HAPPENS: │
│ ───────────────────── │
│ │
│ 1. Random initialization gives some experts slight advantage │
│ 2. Better experts get more tokens │
│ 3. More tokens = more gradient updates = faster learning │
│ 4. Faster learning = even better performance │
│ 5. Even better = even more tokens (positive feedback) │
│ 6. Underused experts get few updates, fall further behind │
│ │
│ Without intervention, this is the natural equilibrium. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Auxiliary Losses for Balance
The primary solution is adding a loss term that penalizes imbalanced routing:
┌─────────────────────────────────────────────────────────────────────────┐
│ AUXILIARY LOAD BALANCING LOSS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE BALANCE LOSS: │
│ ───────────────── │
│ │
│ For each expert i, compute: │
│ • f_i = fraction of tokens routed to expert i │
│ • P_i = average routing probability assigned to expert i │
│ │
│     Balance loss = num_experts × Σᵢ (f_i × P_i)   (scaled by α in L_total)  │
│ │
│ This penalizes both: │
│ • High f_i: Expert getting too many tokens │
│ • High P_i: Router assigning high probability to one expert │
│ │
│     The sum Σᵢ (f_i × P_i) is minimized when both are uniform (1/N).        │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE CALCULATION: │
│ ──────────────────── │
│ │
│ Batch of 100 tokens, 4 experts, top-1 routing: │
│ │
│ Balanced case: │
│ f = [0.25, 0.25, 0.25, 0.25] (25 tokens each) │
│ P = [0.25, 0.25, 0.25, 0.25] (uniform avg probability) │
│ loss = 4 × (0.25×0.25 + 0.25×0.25 + 0.25×0.25 + 0.25×0.25) │
│ = 4 × 0.25 = 1.0 │
│ │
│ Imbalanced case: │
│ f = [0.70, 0.20, 0.08, 0.02] (70 tokens to E1!) │
│ P = [0.65, 0.20, 0.10, 0.05] (router prefers E1) │
│ loss = 4 × (0.70×0.65 + 0.20×0.20 + 0.08×0.10 + 0.02×0.05) │
│ = 4 × (0.455 + 0.04 + 0.008 + 0.001) │
│ = 4 × 0.504 = 2.016 │
│ │
│ Higher loss penalizes the imbalanced state! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOTAL TRAINING LOSS: │
│ ──────────────────── │
│ │
│ L_total = L_language_modeling + α × L_balance │
│ │
│ Typical α values: 0.01 - 0.1 │
│ Too low: Collapse still happens │
│ Too high: Router becomes uniform regardless of input │
│ │
│ Finding the right α requires experimentation. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
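The worked example above is easy to verify numerically. Here f and P are the illustrative distributions from the box, and α is left out because it is applied when the term is added to the total loss.
import torch

def balance_loss(f: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    # num_experts * sum_i(f_i * P_i); multiply by alpha when adding to L_total
    return f.numel() * (f * P).sum()

balanced = balance_loss(torch.full((4,), 0.25), torch.full((4,), 0.25))
skewed   = balance_loss(torch.tensor([0.70, 0.20, 0.08, 0.02]),
                        torch.tensor([0.65, 0.20, 0.10, 0.05]))
print(balanced.item(), skewed.item())   # 1.0 vs ~2.016 -- imbalance is penalized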
Capacity Factor and Token Dropping
Another approach limits how many tokens each expert can handle:
┌─────────────────────────────────────────────────────────────────────────┐
│ CAPACITY FACTOR │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE CONCEPT: │
│ ──────────── │
│ │
│ Set a maximum capacity for each expert: │
│ │
│ capacity = (num_tokens / num_experts) × capacity_factor │
│ │
│ Example: 100 tokens, 8 experts, capacity_factor=1.25 │
│ capacity = (100 / 8) × 1.25 = 15.6 ≈ 16 tokens per expert │
│ │
│ If more than 16 tokens route to an expert, excess are DROPPED. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKEN DROPPING: │
│ ─────────────── │
│ │
│ When capacity exceeded: │
│ • Keep tokens with highest routing probability │
│ • Drop tokens with lowest probability │
│ • Dropped tokens use residual connection only (skip FFN) │
│ │
│ 100 tokens want Expert 1: │
│ Capacity = 16 │
│ Keep: top 16 by router probability │
│ Drop: remaining 84 tokens (they skip this FFN entirely) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CAPACITY FACTOR TRADEOFFS: │
│ ────────────────────────── │
│ │
│ capacity_factor = 1.0: │
│ • Perfect balance forced (each expert gets exactly N/E tokens) │
│ • High drop rate when routing is imbalanced │
│ • May hurt quality if good tokens are dropped │
│ │
│ capacity_factor = 1.25: │
│ • Common default │
│ • Allows 25% imbalance before dropping │
│ • Good balance between utilization and quality │
│ │
│ capacity_factor = 2.0: │
│ • Very permissive │
│ • Rarely drops tokens │
│ • May allow significant imbalance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING VS INFERENCE: │
│ ───────────────────── │
│ │
│ Training: Use capacity_factor to enforce balance │
│ Inference: Often disable capacity limits (process all tokens) │
│ │
│ At inference, we want best quality, not training stability. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
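A sketch of capacity enforcement for top-1 routing. The ceiling rounding and the choice to drop by router probability mirror the description above; real implementations differ in details such as rounding and tie-breaking.
import math
import torch

def capacity_mask(expert_ids, router_probs, num_experts, capacity_factor=1.25):
    """expert_ids: (num_tokens,) top-1 expert per token; router_probs: (num_tokens,) its probability.
    Returns a boolean mask of tokens that stay within their expert's capacity."""
    num_tokens = expert_ids.shape[0]
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)   # e.g. 100, 8, 1.25 -> 16
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if len(idx) > capacity:
            order = router_probs[idx].argsort(descending=True)   # keep the highest-probability tokens
            idx = idx[order[:capacity]]
        keep[idx] = True
    return keep   # dropped tokens (keep == False) pass through the residual connection only

ids, probs = torch.randint(0, 8, (100,)), torch.rand(100)
mask = capacity_mask(ids, probs, num_experts=8)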
Part IV: Expert Specialization
What Do Experts Learn?
A natural question: do different experts actually specialize in different things? Research shows they do, to varying degrees:
┌─────────────────────────────────────────────────────────────────────────┐
│ EXPERT SPECIALIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EMPIRICAL FINDINGS FROM MIXTRAL ANALYSIS: │
│ ────────────────────────────────────────── │
│ │
│ Researchers analyzed which experts activate for different content: │
│ │
│ DOMAIN SPECIALIZATION (partial): │
│ ───────────────────────────────── │
│ • Some experts prefer code tokens │
│ • Some experts prefer mathematical notation │
│ • Some experts prefer certain languages │
│ • But overlap is significant—not clean separation │
│ │
│ SYNTACTIC PATTERNS (stronger): │
│ ────────────────────────────── │
│ • Punctuation often routed to specific experts │
│ • Certain experts handle sentence boundaries │
│ • Some experts specialize in rare tokens │
│ │
│ POSITIONAL PATTERNS: │
│ ──────────────────── │
│ • Early sequence positions may prefer different experts │
│ • Some experts more active at sentence starts │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ VISUALIZATION (conceptual): │
│ ────────────────────────── │
│ │
│ Token Type Primary Experts Secondary │
│ ────────────────────────────────────────────────────────── │
│ Python code E2, E5 E1, E7 │
│ JavaScript E2, E7 E5, E1 │
│ Math formulas E4, E6 E3 │
│ English prose E1, E3, E8 E6 │
│ French text E3, E1 E8 │
│ Punctuation E8 E3 │
│ Numbers E4, E6 E2 │
│ │
│ Note: This is illustrative. Real patterns are more complex. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY INSIGHT: │
│ ──────────── │
│ │
│ Specialization is EMERGENT, not designed. │
│ The router learns to route similar tokens to similar experts. │
│ This happens because: │
│ • Similar tokens benefit from similar transformations │
│ • Experts become good at what they see frequently │
│ • Positive feedback reinforces specialization │
│ │
│ We don't tell Expert 2 to handle code—it discovers this pattern. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Layer-Wise Routing Patterns
Different layers of an MoE model show different routing behaviors:
┌─────────────────────────────────────────────────────────────────────────┐
│ LAYER-WISE PATTERNS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EARLY LAYERS (1-8): │
│ ─────────────────── │
│ • More uniform routing (less specialization) │
│ • Handle basic token processing │
│ • Lower entropy in router decisions │
│ │
│ MIDDLE LAYERS (8-24): │
│ ──────────────────── │
│ • Strongest specialization │
│ • Most varied routing patterns │
│ • Domain/topic-specific routing emerges here │
│ │
│ LATE LAYERS (24-32): │
│ ───────────────────── │
│ • More task-specific routing │
│ • Preparing for output │
│ • Some convergence in patterns │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ROUTING CONSISTENCY: │
│ ──────────────────── │
│ │
│ Same token in same context often routes to same experts. │
│ But routing can change based on: │
│ • Surrounding context │
│ • Position in sequence │
│ • Layer depth │
│ │
│ "function" in Python context → code experts │
│ "function" in math context → math experts │
│ │
│ Context-dependent routing is a feature, not a bug. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part V: Architecture Variants
Where to Place MoE Layers
Not every layer needs to be an MoE layer. Common patterns:
┌─────────────────────────────────────────────────────────────────────────┐
│ MOE LAYER PLACEMENT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EVERY LAYER (Dense MoE): │
│ ──────────────────────── │
│ [MoE] [MoE] [MoE] [MoE] [MoE] [MoE] [MoE] [MoE] │
│ │
│ • Maximum capacity │
│ • Highest memory usage │
│ • Used by: Some research models │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EVERY OTHER LAYER (Sparse MoE): │
│ ─────────────────────────────── │
│ [Dense] [MoE] [Dense] [MoE] [Dense] [MoE] [Dense] [MoE] │
│ │
│ • Balance between capacity and efficiency │
│ • Dense layers provide "shared" computation │
│  • Used by: GShard, GLaM, and other early production MoE designs           │
│ │
│ Why this works: │
│ Dense layers process all tokens uniformly (global features) │
│ MoE layers specialize processing (specific features) │
│ Alternating captures both patterns │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EVERY 4TH LAYER (Very Sparse MoE): │
│ ────────────────────────────────── │
│ [Dense] [Dense] [Dense] [MoE] [Dense] [Dense] [Dense] [MoE] │
│ │
│ • Minimal memory overhead │
│ • Still significant capacity boost │
│ • Used by: Some efficient variants │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MIXTRAL ARCHITECTURE: │
│ ───────────────────── │
│ │
│ 32 layers total │
│ Every layer has MoE FFN │
│ 8 experts per layer │
│ Top-2 routing │
│ │
│  Total params:  8 × expert FFN weights + shared attention/embeddings ≈ 47B │
│  Active params: 2 × expert FFN weights + shared ≈ 13B                      │
│ │
│ Attention is always dense (shared across all tokens) │
│ Only FFN is sparse (different experts for different tokens) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Expert Granularity
Experts can be full FFNs or smaller units:
┌─────────────────────────────────────────────────────────────────────────┐
│ EXPERT GRANULARITY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ COARSE-GRAINED (Few Large Experts): │
│ ──────────────────────────────────── │
│ │
│ 8 experts, each is a full FFN (4096 → 14336 → 4096) │
│ │
│ ┌────────────────────────────────┐ │
│ │ Expert 1 (Full FFN) │ │
│ │ W_up: 4096 × 14336 │ │
│ │ W_down: 14336 × 4096 │ │
│ └────────────────────────────────┘ │
│ │
│ Advantages: │
│ • Each expert has significant capacity │
│ • Clear specialization possible │
│ • Simpler implementation │
│ │
│ Disadvantages: │
│ • Coarse routing granularity │
│ • Memory: must load full expert │
│ │
│ Used by: Mixtral, GShard, most MoE models │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FINE-GRAINED (Many Small Experts): │
│ ─────────────────────────────────── │
│ │
│ 64 experts, each is a small FFN (4096 → 1792 → 4096) │
│ Route to top-8, getting similar total capacity to top-2 with 8 │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ (64 small experts) │
│ │ E1 │ │ E2 │ │ E3 │ ... │
│ │ tiny │ │ tiny │ │ tiny │ │
│ └──────┘ └──────┘ └──────┘ │
│ │
│ Advantages: │
│ • More precise routing │
│ • Better load balancing │
│ • More flexible combinations │
│ │
│ Disadvantages: │
│ • More routing overhead │
│ • Less capacity per expert │
│ • More complex to implement efficiently │
│ │
│ Used by: Some research models (DeepSeek-MoE) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEEPSEEK-MOE APPROACH (Hybrid): │
│ ─────────────────────────────── │
│ │
│ 1 shared expert (always active) + many routed experts │
│ │
│ Output = SharedExpert(x) + Σᵢ wᵢ × RoutedExpertᵢ(x) │
│ │
│ The shared expert handles common patterns │
│ Routed experts handle specialized patterns │
│ Best of both worlds │
│ │
└─────────────────────────────────────────────────────────────────────────┘
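A compact sketch of the shared-plus-routed combination described above. The shapes are deliberately small and illustrative, and the gating details differ from DeepSeek's actual design; this only shows the Output = SharedExpert(x) + Σᵢ wᵢ × RoutedExpertᵢ(x) structure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPlusRoutedMoE(nn.Module):
    """Output = SharedExpert(x) + sum_i w_i * RoutedExpert_i(x) for each token."""
    def __init__(self, d_model=1024, d_ff=256, num_routed=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_routed, bias=False)
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = make_ffn()                           # always active
        self.routed = nn.ModuleList(make_ffn() for _ in range(num_routed))

    def forward(self, x):                                  # x: (num_tokens, d_model)
        w, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.top_k, dim=-1)
        w = w / w.sum(dim=-1, keepdim=True)                # renormalize selected weights
        rows = []
        for t in range(x.shape[0]):                        # naive per-token loop, for clarity
            rows.append(sum(w[t, k] * self.routed[int(idx[t, k])](x[t]) for k in range(self.top_k)))
        return self.shared(x) + torch.stack(rows)

y = SharedPlusRoutedMoE()(torch.randn(5, 1024))            # (5, 1024)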
Part VI: Implementation Details
Efficient Routing Implementation
Implementing MoE efficiently requires careful attention to GPU computation patterns:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MoELayer(nn.Module):
"""
Mixture of Experts layer with top-k routing.
This replaces the standard FFN in a transformer block.
"""
def __init__(
self,
hidden_size: int = 4096,
intermediate_size: int = 14336,
num_experts: int = 8,
num_experts_per_token: int = 2,
aux_loss_coef: float = 0.01,
):
super().__init__()
self.hidden_size = hidden_size
self.num_experts = num_experts
self.num_experts_per_token = num_experts_per_token
self.aux_loss_coef = aux_loss_coef
# Router: maps hidden states to expert scores
self.router = nn.Linear(hidden_size, num_experts, bias=False)
# Create experts (each is a small FFN)
self.experts = nn.ModuleList([
ExpertFFN(hidden_size, intermediate_size)
for _ in range(num_experts)
])
def forward(self, hidden_states: torch.Tensor):
"""
Args:
hidden_states: (batch_size, seq_len, hidden_size)
Returns:
output: (batch_size, seq_len, hidden_size)
aux_loss: scalar tensor for load balancing
"""
batch_size, seq_len, _ = hidden_states.shape
# Flatten batch and sequence dimensions
# (batch_size * seq_len, hidden_size)
flat_hidden = hidden_states.view(-1, self.hidden_size)
num_tokens = flat_hidden.shape[0]
# Compute routing scores
# (num_tokens, num_experts)
router_logits = self.router(flat_hidden)
router_probs = F.softmax(router_logits, dim=-1)
# Select top-k experts for each token
# (num_tokens, k)
top_k_probs, top_k_indices = torch.topk(
router_probs,
self.num_experts_per_token,
dim=-1
)
# Renormalize selected probabilities
top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
# Compute auxiliary load balancing loss
aux_loss = self._compute_aux_loss(router_probs, top_k_indices)
# Process tokens through selected experts
# This is the tricky part for efficiency!
output = self._process_experts(
flat_hidden, top_k_indices, top_k_probs
)
# Reshape back to (batch_size, seq_len, hidden_size)
output = output.view(batch_size, seq_len, self.hidden_size)
return output, aux_loss
def _process_experts(self, hidden, indices, weights):
"""
Process tokens through their selected experts.
Naive implementation: loop over experts.
Efficient implementation: batch tokens by expert.
"""
num_tokens = hidden.shape[0]
output = torch.zeros_like(hidden)
# For each expert, find which tokens selected it
for expert_idx in range(self.num_experts):
expert = self.experts[expert_idx]
# Find (token_idx, slot) pairs where this expert was selected
# indices shape: (num_tokens, k)
mask = (indices == expert_idx) # (num_tokens, k)
if not mask.any():
continue
# Get token indices and their weights for this expert
token_indices = mask.any(dim=-1).nonzero(as_tuple=True)[0]
if len(token_indices) == 0:
continue
# Get the tokens assigned to this expert
expert_input = hidden[token_indices] # (n_tokens, hidden)
# Process through expert
expert_output = expert(expert_input) # (n_tokens, hidden)
# Get weights for each token-expert pair
token_weights = weights[mask].view(-1, 1) # (n_tokens, 1)
# Accumulate weighted outputs
output.index_add_(
0,
token_indices,
expert_output * token_weights
)
return output
def _compute_aux_loss(self, router_probs, selected_indices):
"""
Compute auxiliary loss for load balancing.
Loss = α * num_experts * Σᵢ (fᵢ * Pᵢ)
where:
- fᵢ = fraction of tokens routed to expert i
- Pᵢ = average routing probability for expert i
"""
num_tokens = router_probs.shape[0]
# Compute fraction of tokens per expert (f)
# Count how many tokens selected each expert (across all k slots)
expert_counts = torch.zeros(
self.num_experts,
device=router_probs.device
)
for k in range(self.num_experts_per_token):
expert_counts.scatter_add_(
0,
selected_indices[:, k],
torch.ones_like(selected_indices[:, k], dtype=torch.float)
)
# Normalize to get fractions
f = expert_counts / (num_tokens * self.num_experts_per_token)
# Compute average probability per expert (P)
P = router_probs.mean(dim=0)
# Compute balance loss
aux_loss = self.aux_loss_coef * self.num_experts * (f * P).sum()
return aux_loss
class ExpertFFN(nn.Module):
"""
Single expert FFN with SwiGLU activation.
Same architecture as dense transformer FFN.
"""
def __init__(self, hidden_size: int, intermediate_size: int):
super().__init__()
self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
def forward(self, x):
# SwiGLU: gate * up, then down
gate = F.silu(self.gate_proj(x))
up = self.up_proj(x)
return self.down_proj(gate * up)
Efficient Expert Batching
The naive implementation above loops over experts. Production implementations batch tokens by expert for GPU efficiency:
┌─────────────────────────────────────────────────────────────────────────┐
│ EFFICIENT EXPERT BATCHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NAIVE APPROACH (Loop): │
│ ────────────────────── │
│ │
│ for expert in experts: │
│ tokens_for_expert = select(all_tokens, expert_id) │
│ outputs = expert(tokens_for_expert) │
│ scatter_back(outputs) │
│ │
│ Problem: 8 sequential kernel launches, low GPU utilization │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EFFICIENT APPROACH (Grouped GEMM): │
│ ──────────────────────────────────── │
│ │
│ 1. Sort tokens by their primary expert assignment │
│ 2. Create "groups" of contiguous tokens for each expert │
│ 3. Use grouped GEMM to process all groups in one kernel │
│ │
│ Before sorting: │
│ tokens: [T1, T2, T3, T4, T5, T6, T7, T8] │
│ experts: [E2, E1, E2, E3, E1, E2, E3, E1] │
│ │
│ After sorting by expert: │
│ tokens: [T2, T5, T8, T1, T3, T6, T4, T7] │
│ experts: [E1, E1, E1, E2, E2, E2, E3, E3] │
│ groups: [----E1----] [----E2----] [--E3--] │
│ │
│ Grouped GEMM processes all three groups efficiently. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MEGABLOCKS LIBRARY: │
│ ─────────────────── │
│ │
│  Stanford's MegaBlocks library provides efficient MoE kernels              │
│  (usage shown schematically; see its docs for the exact constructor args): │
│ │
│ from megablocks.layers import dmoe │
│ │
│ moe_layer = dmoe.dMoE( │
│ hidden_size=4096, │
│ ffn_hidden_size=14336, │
│ num_experts=8, │
│ top_k=2, │
│ ) │
│ │
│ Key optimizations: │
│ • Block-sparse matrix operations │
│ • Efficient token permutation │
│ • Fused kernels for routing + expert computation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
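The permutation step is simple to express on its own. This sketch uses a top-1 assignment for clarity, and the grouped matmul that a library like MegaBlocks would run over the contiguous groups is left out.
import torch

def sort_by_expert(tokens: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """Permute tokens so all tokens routed to the same expert are contiguous.
    Returns the permuted tokens, the permutation (to undo it), and per-expert counts."""
    perm = torch.argsort(expert_ids)
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return tokens[perm], perm, counts

tokens = torch.randn(8, 4096)
expert_ids = torch.tensor([1, 0, 1, 2, 0, 1, 2, 0])        # T1..T8 -> E2,E1,E2,E3,E1,E2,E3,E1 (0-indexed)
grouped, perm, counts = sort_by_expert(tokens, expert_ids, num_experts=3)
# grouped[:counts[0]] belongs to expert 0, the next counts[1] rows to expert 1, and so on.
# After the expert matmuls, undo the permutation with: output[perm] = expert_outputs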
Part VII: Training MoE Models
Training Considerations
Training MoE models requires attention to several unique challenges:
┌─────────────────────────────────────────────────────────────────────────┐
│ MOE TRAINING CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. BATCH SIZE REQUIREMENTS: │
│ ─────────────────────────── │
│ │
│ MoE needs larger batch sizes than dense models: │
│ • Each expert should see enough tokens for meaningful gradients │
│ • With 8 experts, each sees ~12.5% of tokens on average │
│ • Need batch_size * seq_len >> num_experts for stable training │
│ │
│ Typical: batch_size >= 1024 tokens per expert │
│ For 8 experts: total batch >= 8192 tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. AUXILIARY LOSS TUNING: │
│ ───────────────────────── │
│ │
│ L_total = L_LM + α × L_balance │
│ │
│ α too low (0.001): │
│ • Load balancing ignored │
│ • Expert collapse likely │
│ • Wasted parameters │
│ │
│ α too high (0.1): │
│ • Router becomes uniform │
│ • No specialization │
│ • Defeats purpose of MoE │
│ │
│ Sweet spot: 0.01 - 0.02 │
│ May need tuning per model/dataset │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. GRADIENT NOISE: │
│ ────────────────── │
│ │
│ Each expert sees different subsets of tokens each batch. │
│ This creates higher gradient variance than dense models. │
│ │
│ Mitigations: │
│ • Larger batch sizes │
│ • Lower learning rate │
│ • Gradient clipping │
│ • Expert parallelism (experts across devices see all tokens) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 4. LEARNING RATE SCHEDULES: │
│ ─────────────────────────── │
│ │
│ Router often needs different learning rate than experts: │
│ │
│ • Router LR: 10× lower than experts │
│ - Prevents router from changing too fast │
│ - Allows experts to adapt to routing decisions │
│ │
│ • Expert LR: Standard transformer LR │
│ - Same as dense FFN would use │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 5. INITIALIZATION: │
│ ───────────────── │
│ │
│ All experts typically initialized identically. │
│ Specialization emerges from: │
│ • Different tokens routed to different experts │
│ • Different gradient updates per expert │
│ • Self-reinforcing specialization │
│ │
│ Router initialized to produce uniform distribution initially. │
│ Small random noise breaks symmetry. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
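Point 4 in the box above (a lower learning rate for the router than for the experts) maps directly onto optimizer parameter groups. The 10× ratio and the base learning rate are the heuristics stated above, not universal constants, and the name matching assumes the router module is called `router` as in the MoELayer sketch earlier.
import torch

def build_optimizer(model: torch.nn.Module, base_lr: float = 3e-4):
    router_params, other_params = [], []
    for name, param in model.named_parameters():
        (router_params if "router" in name else other_params).append(param)
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},           # experts, attention, etc.
        {"params": router_params, "lr": base_lr / 10},     # slower router updates
    ], weight_decay=0.1)

# optimizer = build_optimizer(MoELayer())   # works with the MoELayer defined earlier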
Distributed Training for MoE
Large MoE models require distributed training strategies:
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED MOE TRAINING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXPERT PARALLELISM: │
│ ─────────────────── │
│ │
│ Place different experts on different devices: │
│ │
│ GPU 0: Experts 1, 2 │
│ GPU 1: Experts 3, 4 │
│ GPU 2: Experts 5, 6 │
│ GPU 3: Experts 7, 8 │
│ │
│ All-to-all communication: │
│ 1. Each GPU has subset of tokens │
│ 2. Route tokens: GPU 0 tokens may need Experts on GPU 1-3 │
│ 3. All-to-all exchange: send tokens to GPUs with their experts │
│ 4. Process through experts locally │
│ 5. All-to-all exchange: return outputs to original GPUs │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ALL-TO-ALL COMMUNICATION │ │
│ │ │ │
│ │ GPU 0 GPU 1 GPU 2 GPU 3 │ │
│ │ [T1,T2] [T3,T4] [T5,T6] [T7,T8] │ │
│ │ │ │ │ │ │ │
│ │ └──────────────┼──────────────┼──────────────┘ │ │
│ │ ┌───────┴───────┐ │ │ │
│ │ │ All-to-All │ │ │ │
│ │ └───────────────┘ │ │ │
│ │ ┌──────────────┬──────────────┼──────────────┐ │ │
│ │ ▼ ▼ ▼ ▼ │ │
│ │ [T1,T4] [T2,T6] [T3,T7] [T5,T8] │ │
│ │ (for E1,2) (for E3,4) (for E5,6) (for E7,8) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMMUNICATION COSTS: │
│ ──────────────────── │
│ │
│ Each token: 2 × hidden_size × dtype_size bytes │
│ For 4096 hidden, BF16: 2 × 4096 × 2 = 16 KB per token │
│ │
│ With 8 GPUs, batch of 8192 tokens: │
│ ~130 MB all-to-all communication per MoE layer │
│ │
│ This is significant! Efficient all-to-all is critical. │
│ NVLink helps but doesn't eliminate the cost. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMBINING WITH OTHER PARALLELISM: │
│ ───────────────────────────────── │
│ │
│ Real training combines: │
│ • Data parallelism: Replicate model, split data │
│ • Tensor parallelism: Split attention/FFN within node │
│ • Expert parallelism: Distribute experts across nodes │
│ • Pipeline parallelism: Split layers across nodes │
│ │
│  Illustrative example (Mixtral-scale model):                               │
│ • 64 GPUs total │
│ • 8-way expert parallelism (1 expert per GPU) │
│ • 8-way data parallelism │
│ │
└─────────────────────────────────────────────────────────────────────────┘
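A sketch of the dispatch half of the all-to-all exchange using torch.distributed, assuming a process group is already initialized; the return exchange is symmetric. This is a simplified illustration, not how any particular framework implements it.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_per_dest):
    """tokens_per_dest[d]: tokens this rank wants processed by the experts on rank d.
    Returns the tokens other ranks routed to the experts hosted on this rank."""
    device = tokens_per_dest[0].device
    send_counts = torch.tensor([t.shape[0] for t in tokens_per_dest], device=device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)        # exchange token counts first
    recv_buf = torch.empty(int(recv_counts.sum()), tokens_per_dest[0].shape[-1],
                           device=device, dtype=tokens_per_dest[0].dtype)
    dist.all_to_all_single(recv_buf, torch.cat(tokens_per_dest),
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return list(recv_buf.split(recv_counts.tolist()))       # one chunk per source rank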
Part VIII: Inference Optimization
MoE Inference Challenges
MoE presents unique inference challenges:
┌─────────────────────────────────────────────────────────────────────────┐
│ MOE INFERENCE CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MEMORY BANDWIDTH: │
│ ──────────────────── │
│ │
│ Even though only 2 experts active, ALL experts must fit in memory. │
│ │
│ Mixtral 8x7B: │
│ • Active params: ~13B (2 experts + shared) │
│ • Total params: ~47B (8 experts + shared) │
│ • Memory needed: ~94GB in FP16 │
│ │
│ Compare to dense 13B: ~26GB in FP16 │
│ MoE needs 3.6× more memory for same inference compute! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. LOAD IMBALANCE AT INFERENCE: │
│ ─────────────────────────────── │
│ │
│ Training uses large batches → good balance across experts │
│ Inference often uses small batches → high variance in routing │
│ │
│ Batch of 1 token: only 2 experts needed, 6 experts idle │
│ But all 8 experts occupy memory! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. SPECULATIVE DECODING COMPLICATIONS: │
│ ─────────────────────────────────────── │
│ │
│ Draft model routing != target model routing │
│ • Draft routes to different experts than target would │
│ • Verification must re-route through correct experts │
│ • Reduces speculation benefits │
│ │
│ Solutions: │
│ • Use same routing for draft and target (approximate) │
│ • Train draft to mimic target's routing │
│ • Accept some inefficiency │
│ │
└─────────────────────────────────────────────────────────────────────────┘
MoE-Specific Optimizations
┌─────────────────────────────────────────────────────────────────────────┐
│ MOE INFERENCE OPTIMIZATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. EXPERT OFFLOADING: │
│ ───────────────────── │
│ │
│ Keep only frequently-used experts on GPU. │
│ Load others from CPU/disk on demand. │
│ │
│ For Mixtral on 24GB GPU: │
│ • Keep 2-4 most popular experts on GPU │
│ • Offload rest to CPU RAM │
│ • Load on demand (adds latency but fits in memory) │
│ │
│ Works because routing is often predictable. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. EXPERT QUANTIZATION: │
│ ──────────────────────── │
│ │
│ Quantize experts independently: │
│ • Popular experts: FP16/INT8 (quality matters more) │
│ • Rare experts: INT4 (less impact on overall quality) │
│ │
│ Can also quantize KV cache per-expert. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. BATCHING FOR MOE: │
│ ──────────────────── │
│ │
│ Continuous batching helps MoE especially: │
│ • More tokens → better load balance across experts │
│ • Higher GPU utilization │
│ │
│ Batch 1: Token routes to E1, E3 → 2 experts used │
│ Batch 32: Tokens distributed → all 8 experts used │
│ │
│ vLLM handles this automatically. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 4. EXPERT CACHING: │
│ ────────────────── │
│ │
│ Cache expert outputs for repeated inputs: │
│ • Same token in same context → same routing → same output │
│ • Useful for prefix caching │
│ │
│ Implementation: Hash (token, context, layer) → cached expert output │
│ │
└─────────────────────────────────────────────────────────────────────────┘
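Optimization 1 above can be sketched with plain module moves. This is a hypothetical helper for illustration only, not an API from any inference framework, and it ignores the prefetching and asynchronous copies a real system would need.
import torch

class OffloadedExperts:
    """Keep 'hot' experts resident on the GPU; copy 'cold' experts over only when routed to."""
    def __init__(self, experts, hot_ids, device="cuda"):
        self.device, self.experts, self.hot_ids = device, experts, set(hot_ids)
        for i, expert in enumerate(experts):
            expert.to(device if i in self.hot_ids else "cpu")

    def run(self, expert_id: int, x: torch.Tensor) -> torch.Tensor:
        expert = self.experts[expert_id]
        if expert_id not in self.hot_ids:
            expert.to(self.device)                # on-demand load over PCIe (adds latency)
        out = expert(x.to(self.device))
        if expert_id not in self.hot_ids:
            expert.to("cpu")                      # evict again to keep GPU memory bounded
        return out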
Part IX: Recent Innovations (2024-2025)
Multi-head Latent Attention (MLA)
DeepSeek introduced Multi-head Latent Attention to dramatically reduce KV cache memory in MoE models:
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-HEAD LATENT ATTENTION (MLA) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROBLEM: │
│ ──────────── │
│ │
│ Standard attention stores full K, V for each token: │
│    KV cache per token = 2 × num_heads × head_dim × num_layers × bytes      │
│                                                                             │
│    For DeepSeek-V3's shape (128 heads × head_dim 128, 61 layers, BF16):     │
│      = 2 × 128 × 128 × 61 × 2 bytes ≈ 4MB per token!                        │
│      128K context → ~500GB KV cache                                         │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MLA SOLUTION: │
│ ───────────── │
│ │
│ Compress K, V into low-dimensional latent vector: │
│ │
│    Standard: K and V each ∈ R^(num_heads × head_dim) = R^16384 per layer   │
│    MLA:      c_kv ∈ R^512 (compressed latent) per layer                    │
│                                                                             │
│    At inference:                                                            │
│    1. Store only the latent c_kv (~64× fewer values than K + V)             │
│ 2. When computing attention, project back: │
│ K = c_kv × W_uk, V = c_kv × W_uv │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MEMORY SAVINGS: │
│ │
│ Method KV Cache/Token 128K Context │
│ ───────────────────────────────────────────────────── │
│    Standard MHA        ~4MB              ~500 GB                            │
│ GQA (8 groups) 250KB 32 GB │
│ MLA 62KB 8 GB │
│ │
│    MLA achieves a ~64× reduction vs standard MHA!                           │
│ Even better than GQA while maintaining quality. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
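A sketch of the compress/expand split at the heart of MLA, with the dimensions used in the example above. The decoupled RoPE key path and the per-head reshaping are omitted for brevity, so this is an illustration of the caching idea rather than DeepSeek's exact attention module.
import torch
import torch.nn as nn

class LatentKVProjection(nn.Module):
    """Cache only a small latent per token; re-expand K and V from it at attention time."""
    def __init__(self, d_model=4096, d_latent=512, num_heads=128, head_dim=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)                # W_dkv (compress)
        self.up_k = nn.Linear(d_latent, num_heads * head_dim, bias=False)   # W_uk
        self.up_v = nn.Linear(d_latent, num_heads * head_dim, bias=False)   # W_uv

    def compress(self, h):       # h: (seq, d_model) -> this is all that gets cached
        return self.down(h)      # (seq, 512): ~64x fewer values than full K and V

    def expand(self, c_kv):      # reconstruct K, V when attention is computed
        return self.up_k(c_kv), self.up_v(c_kv)

cache = LatentKVProjection()
c_kv = cache.compress(torch.randn(10, 4096))    # store this (10 x 512)
K, V = cache.expand(c_kv)                       # (10, 16384) each, materialized on the fly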
Auxiliary-Loss-Free Load Balancing
Traditional MoE uses auxiliary losses to balance expert load, which can hurt model quality. DeepSeek pioneered a loss-free approach:
┌─────────────────────────────────────────────────────────────────────────┐
│ AUXILIARY-LOSS-FREE BALANCING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL APPROACH (Auxiliary Loss): │
│ ─────────────────────────────────────── │
│ │
│ L_total = L_language + α × L_balance │
│ │
│ Problem: L_balance gradient interferes with L_language. │
│ Tuning α is difficult—too high hurts quality, too low causes collapse.│
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEEPSEEK'S APPROACH (Bias Adjustment): │
│ ─────────────────────────────────────── │
│ │
│ Instead of auxiliary loss, dynamically adjust expert biases: │
│ │
│ router_scores = softmax(h × W_router + bias) │
│ │
│ If expert_i is overloaded: decrease bias_i │
│ If expert_i is underused: increase bias_i │
│ │
│ Bias updates DON'T flow gradients to the main loss! │
│ Load balancing is decoupled from language modeling. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BENEFITS: │
│ ───────── │
│ │
│ • No hyperparameter α to tune │
│ • No interference with language modeling objective │
│ • Achieves better balance than auxiliary loss │
│ • Simpler training dynamics │
│ │
└─────────────────────────────────────────────────────────────────────────┘
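A sketch of the bias-adjustment rule described above. The update-speed constant and its name are ours for illustration; the shape of the rule follows the box: raise the bias of under-loaded experts, lower it for over-loaded ones, outside of backpropagation.
import torch

def update_router_bias(bias: torch.Tensor, tokens_per_expert: torch.Tensor, step: float = 1e-3):
    """bias: (num_experts,) added to router scores for top-k SELECTION only.
    tokens_per_expert: how many tokens each expert received in the last batch."""
    with torch.no_grad():                                    # no gradient flows through this update
        load = tokens_per_expert.float()
        bias += step * torch.sign(load.mean() - load)        # under-loaded -> bias up, over-loaded -> bias down
    return bias

# In the routing step (compare with the earlier MoELayer):
#   selection_scores = router_logits + bias      # biased scores decide WHICH experts are picked
#   combine_weights  = softmax(router_logits)    # but the mixing weights stay unbiased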
FP8 Training at Scale
DeepSeek-V3 was the first model to validate FP8 training at extreme scale (671B parameters):
- Mixed precision: FP8 for most operations, higher precision for sensitive ops
- Computation-communication overlap: Nearly full overlap in cross-node MoE
- Training efficiency: 2.79M H800 GPU hours (vs. estimated 10M+ for comparable dense model)
Part X: Notable MoE Models
Model Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ NOTABLE MOE MODELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MIXTRAL 8x7B (Mistral AI, 2023): │
│ ───────────────────────────────── │
│ • 8 experts per layer, top-2 routing │
│ • 47B total, ~13B active │
│ • Matches Llama 2 70B quality at 6× lower inference cost │
│ • Open weights, very popular │
│ │
│ MIXTRAL 8x22B (Mistral AI, 2024): │
│ ────────────────────────────────── │
│ • 8 experts per layer, top-2 routing │
│  • 141B total, ~39B active                                                  │
│ • State-of-the-art open model at release │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SWITCH TRANSFORMER (Google, 2022): │
│ ─────────────────────────────────── │
│ • Up to 2048 experts, top-1 routing │
│ • 1.6T total parameters │
│ • Research model, demonstrated MoE scaling │
│ │
│  GLAM (Google, 2021):                                                       │
│  ────────────────────                                                       │
│  • 1.2T total parameters, 64 experts per MoE layer, top-2 routing           │
│  • MoE layers alternate with dense layers; ~97B params active per token     │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEEPSEEK-V2 (DeepSeek, 2024): │
│ ────────────────────────────── │
│ • 2 shared experts + 160 routed experts │
│ • 236B total, ~21B active (6 routed + 2 shared) │
│ • Multi-head Latent Attention (MLA) for KV cache compression │
│ • Pioneered auxiliary-loss-free load balancing │
│ │
│ DEEPSEEK-V3 (DeepSeek, Dec 2024): │
│ ────────────────────────────────── │
│ • 671B total, ~37B active (1 shared + 8 routed per token) │
│ • MLA: Compresses KV cache to low-dimensional latent vectors │
│ • FP8 mixed precision training (first at this scale) │
│ • Trained on 14.8T tokens in only 2.79M H800 GPU hours │
│  • Competitive with GPT-4-class models at a fraction of the training cost   │
│ │
│ QWEN3 MOE (Alibaba, 2025): │
│ ─────────────────────────── │
│ • Qwen3-235B-A22B and Qwen3-30B-A3B (similar sparsity to DeepSeek) │
│ • Removed shared expert (earlier Qwen2.5-MoE used shared expert) │
│ • Trained on 36T tokens, supports 119 languages │
│ • Qwen3 Next 80B-A3B released September 2025 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LLAMA 4 (Meta, April 2025): │
│ ──────────────────────────── │
│ • Scout: 109B total, 17B active, 16 experts, up to 10M context │
│ • Maverick: 400B total, 17B active, 128 experts, 1M context │
│ • Behemoth (teacher): 2T params, 288B active, 16 experts │
│ • Uses shared + routed experts (like DeepSeek-V3) │
│ • Trained on 40T tokens, natively multimodal │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KIMI K2 (Moonshot AI, July 2025): │
│ ────────────────────────────────── │
│ • 1.04T total parameters, 32B active │
│ • 384 experts (8 routed + 1 shared per token) │
│ • MLA attention (like DeepSeek), 7168 hidden dim │
│ • MuonClip optimizer: Stabilizes trillion-param training │
│ • 128K context, top-1 on LMSYS Arena for open-source │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPENAI GPT-OSS-120B (OpenAI, 2025): │
│ ───────────────────────────────────── │
│ • 128 routed experts, 4 activated per token │
│ • First open-weight MoE from OpenAI │
│ • Part of trend toward 256-384 expert configurations │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEEPSEEK-V3.2 (DeepSeek, Dec 2025): │
│ ───────────────────────────────────── │
│ • 685B total, 37B active (same architecture as V3) │
│ • V3.2-Speciale variant surpasses GPT-5 on reasoning │
│ • Gold medal performance on IMO 2025 and IOI │
│ • mHC: Manifold-Constrained Hyper-Connections (Dec 31) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QWEN3-NEXT (Alibaba, Sept 2025): │
│ ───────────────────────────────── │
│ • 80B total, 3B active (extreme sparsity: 26.7×) │
│ • Hybrid: MoE + Gated DeltaNet + Gated Attention │
│ • Multi-Token Prediction (MTP) for efficiency │
│ • 256K native context, extendable to 1M │
│ • 10× inference throughput vs Qwen3-32B at 32K+ context │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QWEN3-MAX (Preview, 2025): │
│ ────────────────────────── │
│ • 1T+ parameters (largest non-thinking Qwen model) │
│ • Ranked #6 in Text Arena │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MIMO-V2-FLASH (Xiaomi, 2025): │
│ ────────────────────────────── │
│ • 309B total, 15B active (20.6× efficiency) │
│ • Ultra-fast for reasoning, coding, agentic workflows │
│ • 256K context window with hybrid "thinking" mode │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ARCHITECTURE COMPARISON (2025): │
│ │
│ Model Experts Top-K Total Active Efficiency │
│ ──────────────────────────────────────────────────────────────── │
│ Mixtral 8x7B 8 2 47B 13B 3.6× │
│    Mixtral 8x22B      8        2      141B    39B     3.6×                  │
│ DeepSeek-V2 160+2 6+2 236B 21B 11.2× │
│ DeepSeek-V3/V3.2 256+1 8+1 685B 37B 18.5× │
│    Qwen3-235B         128      8      235B    22B     10.7×                 │
│ Qwen3-Next ~256 MoE+DN 80B 3B 26.7× │
│ MiMo-V2-Flash MoE varies 309B 15B 20.6× │
│ Llama 4 Scout 16+1 1+1 109B 17B 6.4× │
│ Llama 4 Maverick 128+1 1+1 400B 17B 23.5× │
│ Llama 4 Behemoth 16+1 1+1 2T 288B 6.9× │
│ Kimi K2 384+1 8+1 1.04T 32B 32.5× │
│    gpt-oss-120B       128      4      ~120B   ~5B     ~23×                  │
│ │
│ Note: "+N" = shared experts, MoE+DN = hybrid MoE + Gated DeltaNet │
│ Efficiency = Total Params / Active Params │
│ │
│ 2025 TRENDS: │
│ • Hybrid architectures (Qwen3-Next: MoE + linear attention) │
│ • Extreme sparsity (3B active from 80B total = 26× efficiency) │
│ • Multi-Token Prediction (MTP) for faster inference │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
Mixture of Experts represents a fundamental shift in how we think about model scaling. By decoupling total parameters (capacity) from active parameters (compute cost), MoE enables models with dramatically more knowledge at tractable inference costs.
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CORE CONCEPT: │
│ • Replace single FFN with multiple "expert" FFNs │
│ • Router selects top-k experts per token │
│ • Sparse activation: many params, few used per token │
│ │
│ EFFICIENCY GAINS: │
│ • 4-6× more parameters per FLOP than dense models │
│ • Mixtral 8x7B matches Llama 70B at fraction of cost │
│ • Enables trillion-parameter models │
│ │
│ CRITICAL CHALLENGES: │
│ • Load balancing: Prevent expert collapse │
│ • Memory: All experts must fit, even if few active │
│ • Training: Needs large batches, careful auxiliary loss tuning │
│ │
│ SOLUTIONS: │
│ • Auxiliary balance loss: Penalize uneven routing │
│ • Capacity factors: Limit tokens per expert │
│ • Expert parallelism: Distribute experts across devices │
│ │
│ PRACTICAL IMPLICATIONS: │
│ • MoE likely the future of LLM scaling │
│ • Inference needs MoE-aware optimizations │
│ • Memory requirements higher than compute suggests │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Related Articles
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
Distributed Training: How to Train 70B+ Parameter Models
A comprehensive deep dive into distributed training—how to train models that don't fit on a single GPU. Understand data parallelism, tensor parallelism, pipeline parallelism, ZeRO optimization, and the engineering behind training frontier LLMs.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.