
mHC: How DeepSeek Fixed the Residual Connection Bottleneck with a 1967 Algorithm

DeepSeek's Manifold-Constrained Hyper-Connections (mHC) solve training instability in deep networks by projecting residual mixing matrices onto the Birkhoff Polytope using the Sinkhorn-Knopp algorithm. A deep dive into the architecture that may power DeepSeek R2 and V4.

12 min read

Paper Overview

Title: mHC: Manifold-Constrained Hyper-Connections
ArXiv: 2512.24880
Date: December 31, 2025
Authors: Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang (CEO)
Affiliation: DeepSeek
Contact: xie.zhenda@deepseek.com

The Residual Connection Bottleneck

On December 31, 2025, DeepSeek published a paper that could reshape how we build large language models. The work, co-authored by CEO Wenfeng Liang and 18 researchers, introduces Manifold-Constrained Hyper-Connections (mHC)—a framework that fixes a fundamental limitation in transformer architecture.

The paper addresses one of the longest-standing constraints in deep learning: the residual stream bottleneck. Since ResNet (2015) and Transformers (2017), residual connections have been the backbone of trainable deep networks. They work beautifully—but they force all information through a single narrow pathway.

From the paper: "Hyper-Connections extend the residual connection paradigm by expanding the residual stream width and diversifying connectivity patterns. While this yields substantial performance gains, it fundamentally compromises the identity mapping property intrinsic to residual connections, causing severe training instability and restricted scalability."

The key insight: You can widen the residual stream for better performance, but doing so naively causes signal explosion that makes training impossible. mHC solves this with a mathematical constraint that preserves trainability while enabling richer information flow.

This post provides a complete technical breakdown of:

  • Why residual connections work (and their limitations)
  • What Hyper-Connections attempt and why they fail at scale
  • How mHC uses the Birkhoff Polytope and Sinkhorn-Knopp algorithm to restore stability
  • Complete benchmark results (3B, 9B, 27B models) and ablation studies
  • Infrastructure optimizations (TileLang, DualPipe, kernel fusion)
  • Implications for future models (DeepSeek R2, V4)

Key Equations at a Glance

Standard Residual Connection:

Code
x_{l+1} = x_l + F(x_l)

Hyper-Connections (HC):

Code
x_{l+1} = H_l^{res} · x_l + H_l^{post,T} · F(H_l^{pre} · x_l, W_l)

mHC Constraint (Birkhoff Polytope):

Code
H_res ∈ {M : M_{ij} ≥ 0, Σⱼ M_{ij} = 1, Σᵢ M_{ij} = 1}

Sinkhorn-Knopp Projection:

Code
for k = 1 to 20:
    A ← A / row_sums(A)    # Row normalize
    A ← A / col_sums(A)    # Column normalize

Key Property: Spectral norm of doubly stochastic matrices ≤ 1 → signals can't explode


Part I: Understanding Residual Connections

Why Residual Connections Exist

Before diving into mHC, we need to understand why residual connections became essential. In 2015, Kaiming He's ResNet paper solved the "degradation problem"—the counterintuitive observation that deeper networks performed worse than shallow ones, even on training data.

The solution was elegantly simple: instead of learning a transformation F(x), learn the residual F(x) - x, then add x back:

Output = F(x) + x

This creates a "skip connection" that allows gradients to flow directly through the network without passing through every layer. If a layer isn't useful, it can simply learn F(x) ≈ 0, effectively becoming an identity mapping.

Code
┌──────────────────────────────────────────────────────────────┐
│                    TRADITIONAL RESIDUAL CONNECTION            │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│     Input x ─────────────────────────────────────┐           │
│         │                                         │           │
│         ▼                                         │           │
│    ┌─────────┐                                    │           │
│    │  Layer  │  (Attention, FFN, etc.)           │           │
│    │  F(x)   │                                    │           │
│    └────┬────┘                                    │           │
│         │                                         │           │
│         ▼                                         ▼           │
│       ┌───────────────────────────────────────────┐          │
│       │              ADD (⊕)                      │          │
│       │         Output = F(x) + x                 │          │
│       └───────────────────────────────────────────┘          │
│                                                               │
│  Key Property: Identity mapping preserved                     │
│  If F(x) → 0, then Output → x (signal preserved)             │
│                                                               │
└──────────────────────────────────────────────────────────────┘

The Identity Mapping Property

The magic of residual connections lies in the identity mapping property. When you stack many layers, the signal magnitude stays bounded:

Python
# Simplified view of residual propagation
def residual_block(x, layer):
    return layer(x) + x  # Identity + learned transformation

# After L layers, the stream unrolls into a sum of layer outputs:
# x_L = x_0 + F_1(x_0) + F_2(x_1) + ... + F_L(x_{L-1})

Mathematically, the Jacobian (derivative) of the identity mapping is exactly 1. This means:

  • Forward pass: Signals don't explode or vanish
  • Backward pass: Gradients flow directly to early layers
  • Scaling: You can stack hundreds of layers without instability

From research: "Residual connections enable training of networks with 100+ layers by providing gradient highways that bypass the vanishing gradient problem."
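
A quick way to see the gradient highway in action: compare the gradient that reaches the input with and without the skip connection. This is a toy check, not from the paper; the layer and sizes are arbitrary.

Python
import torch
import torch.nn as nn

# Toy check: the Jacobian of y = x + F(x) is I + dF/dx, so gradients
# reach the input even when F's own gradient is zero.
torch.manual_seed(0)
layer = nn.Linear(8, 8)
nn.init.zeros_(layer.weight)   # make F contribute (almost) nothing
nn.init.zeros_(layer.bias)

x = torch.randn(8, requires_grad=True)

# Without the skip connection: the gradient is dF/dx, which is zero here.
layer(x).sum().backward()
print(x.grad.abs().max())      # tensor(0.)

# With the skip connection: the gradient is (I + dF/dx)^T · 1, i.e. all ones.
x.grad = None
(x + layer(x)).sum().backward()
print(x.grad)                  # tensor([1., 1., ..., 1.])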

The Bottleneck Problem

However, traditional residual connections have a fundamental limitation: all information must flow through a single stream.

Code
┌─────────────────────────────────────────────────────────────┐
│              THE RESIDUAL STREAM BOTTLENECK                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Layer 1 ──► Layer 2 ──► Layer 3 ──► ... ──► Layer L        │
│      │          │          │                    │            │
│      ▼          ▼          ▼                    ▼            │
│  ┌──────────────────────────────────────────────────┐       │
│  │           SINGLE RESIDUAL STREAM                  │       │
│  │    All information compressed into one pathway    │       │
│  │           (d_model dimensions)                    │       │
│  └──────────────────────────────────────────────────┘       │
│                                                              │
│  Problem: As models scale, this becomes a bottleneck.       │
│  The model has capacity to process more, but can't.         │
│                                                              │
└─────────────────────────────────────────────────────────────┘

As models grow wider (more attention heads, larger FFN), the residual stream doesn't grow proportionally. This creates an information bottleneck—the model has the capacity to compute richer representations, but must compress everything through the same narrow pathway.


Part II: Hyper-Connections (The Promising but Unstable Idea)

Widening the Residual Stream

The natural solution is to widen the residual stream. Instead of one stream, use multiple parallel streams that can exchange information. This is the core idea behind Hyper-Connections (HC), introduced by ByteDance.

Code
┌─────────────────────────────────────────────────────────────┐
│                    HYPER-CONNECTIONS (HC)                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Input: [x₁, x₂, x₃, x₄]  (4 parallel streams)             │
│              │                                               │
│              ▼                                               │
│        ┌──────────┐                                         │
│        │  Layer   │                                         │
│        └────┬─────┘                                         │
│              │                                               │
│              ▼                                               │
│     ┌────────────────┐                                      │
│     │   H_res        │  ← Mixing matrix (learned)           │
│     │   (4 × 4)      │                                      │
│     └────────┬───────┘                                      │
│              │                                               │
│              ▼                                               │
│  Output: [y₁, y₂, y₃, y₄]                                  │
│                                                              │
│  y = H_res · x  (matrix multiplication for mixing)          │
│                                                              │
│  Benefit: 4× wider residual pathway                         │
│  Problem: H_res is unconstrained → instability              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The idea is compelling:

  • Multiple streams carry different aspects of the representation
  • Mixing matrix H_res allows streams to exchange information
  • More capacity for information flow without increasing model width

Why Hyper-Connections Fail at Scale

Here's the critical problem: the mixing matrix H_res is unconstrained.

When you multiply arbitrary matrices together across hundreds of layers, disaster strikes. Any eigenvalue > 1 causes exponential growth:

Python
# Simplified: What happens with unconstrained mixing
import numpy as np

def simulate_hc_propagation(n_layers=100):
    """Simulate signal magnitude through HC layers"""
    # Random mixing matrix (unconstrained)
    H_res = np.random.randn(4, 4) * 0.5 + np.eye(4)

    x = np.ones(4)  # Initial signal
    magnitudes = [np.linalg.norm(x)]

    for _ in range(n_layers):
        x = H_res @ x
        magnitudes.append(np.linalg.norm(x))

    return magnitudes

# Result: Signal often explodes to 10^6 or higher
# or collapses to near-zero

From the DeepSeek paper: "In a 27B parameter model, unconstrained HC caused signal gains exceeding 3000×, leading to catastrophic divergence."

This isn't a tuning issue—it's a structural instability. The mixing matrices have eigenvalues that aren't bounded, so:

  1. Forward pass: Signals explode exponentially
  2. Backward pass: Gradients explode or vanish
  3. Training: Model diverges or fails to learn
Code
┌─────────────────────────────────────────────────────────────┐
│              SIGNAL EXPLOSION IN HYPER-CONNECTIONS           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Layer 1    Layer 10    Layer 50    Layer 100               │
│     │          │           │           │                     │
│     ▼          ▼           ▼           ▼                     │
│   1.0   →   3.2    →   847   →   3000+  → DIVERGENCE        │
│                                                              │
│  Signal magnitude grows exponentially                        │
│  Eigenvalues of H_res > 1 compound across layers            │
│                                                              │
│  Max signal gain in 27B model:                              │
│  - Unconstrained HC: ~3000×  ❌ Unstable                    │
│  - mHC (constrained): ~1.6×  ✓ Stable                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Part III: The mHC Solution

The Key Insight: Constrain to a Manifold

DeepSeek's insight was geometric: project the mixing matrix onto a manifold where signals can't explode.

The specific manifold they chose is the Birkhoff Polytope—the set of all doubly stochastic matrices:

Code
Doubly Stochastic Matrix:
- All entries are non-negative: H[i,j] ≥ 0
- Every row sums to 1: Σⱼ H[i,j] = 1
- Every column sums to 1: Σᵢ H[i,j] = 1

Why doubly stochastic matrices?

The spectral norm (maximum stretch factor) of any doubly stochastic matrix is ≤ 1. This guarantees:

  • Signals can't explode through the mixing operation
  • Information is redistributed, not amplified
  • The identity mapping property is approximately preserved
Code
┌─────────────────────────────────────────────────────────────┐
│                    THE BIRKHOFF POLYTOPE                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Set of all doubly stochastic matrices                       │
│                                                              │
│  Example (4×4):                                             │
│  ┌                                    ┐                     │
│  │  0.4   0.3   0.2   0.1            │  Row sum = 1        │
│  │  0.2   0.4   0.3   0.1            │  Row sum = 1        │
│  │  0.3   0.1   0.4   0.2            │  Row sum = 1        │
│  │  0.1   0.2   0.1   0.6            │  Row sum = 1        │
│  └                                    ┘                     │
│   ↓     ↓     ↓     ↓                                       │
│  Col   Col   Col   Col                                       │
│  =1    =1    =1    =1                                        │
│                                                              │
│  Properties:                                                 │
│  - Spectral norm ≤ 1 (no signal explosion)                  │
│  - Preserves total "mass" of signal                         │
│  - Vertices are permutation matrices                         │
│  - Identity matrix is in the interior                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
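
This spectral-norm property is easy to check numerically. The snippet below is an illustrative verification (not from the paper): it builds an approximately doubly stochastic matrix by alternating row and column normalization and confirms that it never stretches a vector.

Python
import numpy as np

def make_doubly_stochastic(n, iters=50, seed=0):
    """Build an approximately doubly stochastic matrix from random positive entries."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, n)) + 1e-3                  # strictly positive entries
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)       # rows sum to 1
        A = A / A.sum(axis=0, keepdims=True)       # columns sum to 1
    return A

H = make_doubly_stochastic(4)
print(H.sum(axis=1))                   # ~[1, 1, 1, 1]
print(H.sum(axis=0))                   # ~[1, 1, 1, 1]
print(np.linalg.norm(H, ord=2))        # largest singular value: ~1.0, never above 1

x = np.random.default_rng(1).normal(size=4)
print(np.linalg.norm(H @ x) <= np.linalg.norm(x) + 1e-9)   # True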

The Sinkhorn-Knopp Algorithm (1967)

How do you project an arbitrary matrix onto the Birkhoff Polytope? DeepSeek uses the Sinkhorn-Knopp algorithm, a classical result from 1967.

The algorithm is remarkably simple: alternately normalize rows and columns.

Python
def sinkhorn_knopp(M, num_iterations=20):
    """
    Project matrix M onto the Birkhoff Polytope (doubly stochastic matrices).

    Args:
        M: Input matrix (n × n), should have positive entries
        num_iterations: Number of alternating normalizations

    Returns:
        Doubly stochastic matrix approximation
    """
    # Ensure positive entries (apply softmax or exp)
    A = torch.exp(M)  # or torch.softmax(M, dim=-1)

    for _ in range(num_iterations):
        # Row normalization: make each row sum to 1
        A = A / A.sum(dim=1, keepdim=True)

        # Column normalization: make each column sum to 1
        A = A / A.sum(dim=0, keepdim=True)

    return A

# Example
M = torch.randn(4, 4)
H_res = sinkhorn_knopp(M, num_iterations=20)
# H_res is now (approximately) doubly stochastic
# Rows sum to ~1, columns sum to ~1

Why 20 iterations? The algorithm converges exponentially fast. With 20 iterations, the matrix is close enough to doubly stochastic for stability, while keeping computational cost manageable.
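
The fast convergence is easy to observe. The check below (illustrative, mirroring the sinkhorn_knopp sketch above) tracks how far the row sums are from 1 after each full iteration.

Python
import torch

torch.manual_seed(0)
A = torch.exp(torch.randn(4, 4))             # positive starting matrix

for k in range(1, 11):
    A = A / A.sum(dim=1, keepdim=True)       # row normalize
    A = A / A.sum(dim=0, keepdim=True)       # column normalize
    err = (A.sum(dim=1) - 1).abs().max()     # worst remaining row-sum deviation
    print(f"iteration {k:2d}: max deviation {err.item():.2e}")

# The deviation shrinks geometrically; well before 20 iterations it sits
# at the edge of floating-point precision.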

mHC Architecture

Putting it together, mHC modifies Hyper-Connections by constraining the mixing matrices:

Code
┌─────────────────────────────────────────────────────────────┐
│            MANIFOLD-CONSTRAINED HYPER-CONNECTIONS (mHC)      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Input: [x₁, x₂, x₃, x₄]  (n=4 streams)                    │
│              │                                               │
│              ▼                                               │
│        ┌──────────┐                                         │
│        │  Layer   │  (Attention or FFN)                     │
│        └────┬─────┘                                         │
│              │                                               │
│              ▼                                               │
│     ┌─────────────────────────────────────────┐             │
│     │         LEARNABLE PARAMETERS             │             │
│     │              W (n × n)                   │             │
│     └─────────────────┬───────────────────────┘             │
│                       │                                      │
│                       ▼                                      │
│     ┌─────────────────────────────────────────┐             │
│     │      SINKHORN-KNOPP PROJECTION          │             │
│     │   H_res = SinkhornKnopp(exp(W), k=20)   │             │
│     │                                          │             │
│     │   Guarantees: H_res is doubly stochastic │             │
│     │              Spectral norm ≤ 1          │             │
│     └─────────────────┬───────────────────────┘             │
│                       │                                      │
│                       ▼                                      │
│     ┌─────────────────────────────────────────┐             │
│     │           CONSTRAINED MIXING             │             │
│     │         y = H_res · x + residual         │             │
│     └─────────────────┬───────────────────────┘             │
│                       │                                      │
│                       ▼                                      │
│  Output: [y₁, y₂, y₃, y₄]                                  │
│                                                              │
│  ✓ Wider residual stream (4× information flow)              │
│  ✓ Stable training (bounded signal magnitude)               │
│  ✓ Preserves identity mapping property                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Mathematical Guarantee

The key theorem underlying mHC:

Theorem (Bounded Signal Propagation): For any doubly stochastic matrix H, and any vector x:

Code
||H · x||₂ ≤ ||x||₂

This means the mixing operation cannot amplify signals. Information is redistributed between streams, but total magnitude is preserved or reduced.

After L layers:

Code
||x_L|| ≤ ||x_0|| · (1 + ε)^L where ε ≈ 0

In practice, DeepSeek observed max signal gain of ~1.6× versus ~3000× for unconstrained HC.
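
To make the contrast concrete, here is a small self-contained simulation (not from the paper; the depth, sizes, and noise scale are arbitrary) that propagates the same signal through 100 layers of unconstrained mixing versus Sinkhorn-projected mixing.

Python
import numpy as np

def sinkhorn(M, iters=20):
    """Project exp(M) onto (approximately) doubly stochastic matrices."""
    A = np.exp(M)
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)
        A = A / A.sum(axis=0, keepdims=True)
    return A

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
x_free, x_mhc = x0.copy(), x0.copy()

for _ in range(100):
    W = rng.normal(scale=0.5, size=(4, 4)) + np.eye(4)
    x_free = W @ x_free              # unconstrained mixing (HC-style)
    x_mhc = sinkhorn(W) @ x_mhc      # doubly stochastic mixing (mHC-style)

print(f"initial       : {np.linalg.norm(x0):.3e}")
print(f"unconstrained : {np.linalg.norm(x_free):.3e}")   # typically explodes or collapses
print(f"constrained   : {np.linalg.norm(x_mhc):.3e}")    # never exceeds the initial norm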


Part IV: Implementation Details

Full mHC Layer

Here's a more complete implementation of an mHC layer:

Python
import torch
import torch.nn as nn

class SinkhornKnopp(torch.autograd.Function):
    """
    Custom autograd function for Sinkhorn-Knopp projection.
    Includes backward pass for gradient computation.
    """
    @staticmethod
    def forward(ctx, M, num_iterations=20):
        # Store for backward
        ctx.num_iterations = num_iterations

        # Ensure positive entries
        A = torch.exp(M)

        # Alternating row/column normalization
        for _ in range(num_iterations):
            A = A / A.sum(dim=-1, keepdim=True)  # Row normalize
            A = A / A.sum(dim=-2, keepdim=True)  # Column normalize

        ctx.save_for_backward(A, M)
        return A

    @staticmethod
    def backward(ctx, grad_output):
        A, M = ctx.saved_tensors
        # Implicit differentiation through Sinkhorn iterations
        # (Simplified - actual implementation uses implicit gradients)
        grad_M = grad_output * A
        return grad_M, None

def sinkhorn_knopp(M, num_iterations=20):
    return SinkhornKnopp.apply(M, num_iterations)


class mHCBlock(nn.Module):
    """
    Manifold-Constrained Hyper-Connection block.

    Args:
        d_model: Model hidden dimension
        n_streams: Number of parallel residual streams (expansion rate)
        num_sinkhorn_iters: Sinkhorn-Knopp iterations
    """
    def __init__(self, d_model, n_streams=4, num_sinkhorn_iters=20):
        super().__init__()
        self.d_model = d_model
        self.n_streams = n_streams
        self.num_sinkhorn_iters = num_sinkhorn_iters

        # Learnable parameters for mixing matrix
        # Will be projected to doubly stochastic via Sinkhorn-Knopp
        self.W_mix = nn.Parameter(torch.zeros(n_streams, n_streams))
        nn.init.eye_(self.W_mix)  # Initialize near identity

    def get_mixing_matrix(self):
        """Project learnable weights to Birkhoff Polytope."""
        return sinkhorn_knopp(self.W_mix, self.num_sinkhorn_iters)

    def forward(self, streams, layer_output):
        """
        Apply mHC mixing to residual streams.

        Args:
            streams: Tensor of shape (batch, seq_len, n_streams, d_model)
            layer_output: Output from attention/FFN layer (batch, seq_len, d_model)

        Returns:
            Updated streams tensor
        """
        batch, seq_len, n_streams, d_model = streams.shape

        # Get constrained mixing matrix
        H_res = self.get_mixing_matrix()  # (n_streams, n_streams)

        # Mix streams: each new stream is weighted combination of old streams
        # streams: (B, S, N, D) -> (B, S, D, N) for matmul
        streams_t = streams.permute(0, 1, 3, 2)  # (B, S, D, N)
        mixed = torch.matmul(streams_t, H_res.T)  # (B, S, D, N)
        mixed = mixed.permute(0, 1, 3, 2)  # (B, S, N, D)

        # Add layer output to first stream (or distribute)
        mixed[:, :, 0, :] = mixed[:, :, 0, :] + layer_output

        return mixed


class mHCTransformerBlock(nn.Module):
    """
    Full transformer block with mHC residual connections.
    """
    def __init__(self, d_model, n_heads, n_streams=4, ffn_ratio=4):
        super().__init__()
        self.d_model = d_model
        self.n_streams = n_streams

        # Standard transformer components
        self.ln1 = nn.LayerNorm(d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * ffn_ratio),
            nn.GELU(),
            nn.Linear(d_model * ffn_ratio, d_model)
        )

        # mHC mixing blocks
        self.mhc_attn = mHCBlock(d_model, n_streams)
        self.mhc_ffn = mHCBlock(d_model, n_streams)

    def forward(self, streams, mask=None):
        """
        Args:
            streams: (batch, seq_len, n_streams, d_model)
            mask: Optional attention mask

        Returns:
            Updated streams
        """
        # Use first stream as "main" representation for attention
        x = streams[:, :, 0, :]  # (B, S, D)

        # Attention with mHC residual
        attn_out, _ = self.attention(
            self.ln1(x), self.ln1(x), self.ln1(x),
            attn_mask=mask
        )
        streams = self.mhc_attn(streams, attn_out)

        # FFN with mHC residual
        x = streams[:, :, 0, :]
        ffn_out = self.ffn(self.ln2(x))
        streams = self.mhc_ffn(streams, ffn_out)

        return streams
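
A quick smoke test of the sketch above. The shapes and hyperparameters are arbitrary, and initializing the streams by replicating the input embeddings is an assumption for illustration, not the paper's recipe.

Python
import torch

batch, seq_len, d_model, n_streams = 2, 16, 64, 4
block = mHCTransformerBlock(d_model=d_model, n_heads=4, n_streams=n_streams)

# Initialize the n streams by replicating the token embeddings (assumed scheme).
x = torch.randn(batch, seq_len, d_model)
streams = x.unsqueeze(2).expand(-1, -1, n_streams, -1).contiguous()

out = block(streams)
print(out.shape)                                    # torch.Size([2, 16, 4, 64])

# The learned mixing matrix is (approximately) doubly stochastic.
H_res = block.mhc_attn.get_mixing_matrix()
print(H_res.sum(dim=0), H_res.sum(dim=1))           # both ≈ [1, 1, 1, 1]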

Parameterization Details

The paper specifies how the three matrices are parameterized:

H_pre and H_post (Input/Output Projections):

  • Use sigmoid activation to ensure non-negative entries
  • This prevents cancellation between streams
  • Interpretation as weighted averaging remains clear

H_res (Residual Mixing):

  • Projected to doubly stochastic via Sinkhorn-Knopp
  • Initialized near identity for stable start
  • Small initialization values (α = 0.01) for gating factors

Expansion Rate n:

  • Primary experiments use n = 4 (4 parallel streams)
  • Creates 4× wider residual pathway
  • Each stream has dimension d_model (same as original)
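
The following is a minimal sketch of how these parameterizations could be wired up, reusing the sinkhorn_knopp helper from above; the exact gating formulation in the paper may differ.

Python
import torch
import torch.nn as nn

class mHCParams(nn.Module):
    """Sketch of the three mixing matrices and their constraints (assumed layout)."""
    def __init__(self, n_streams=4, alpha=0.01):
        super().__init__()
        self.W_pre = nn.Parameter(torch.zeros(n_streams, n_streams))
        self.W_post = nn.Parameter(torch.zeros(n_streams, n_streams))
        self.W_res = nn.Parameter(torch.zeros(n_streams, n_streams))
        nn.init.eye_(self.W_res)                    # bias the residual mixing toward identity
        # Small gating factors (α = 0.01); only the initialization is shown here,
        # their exact placement follows the paper's gating scheme.
        self.alpha = nn.Parameter(torch.full((n_streams,), alpha))

    def matrices(self):
        H_pre = torch.sigmoid(self.W_pre)           # non-negative: no cancellation between streams
        H_post = torch.sigmoid(self.W_post)
        H_res = sinkhorn_knopp(self.W_res)          # doubly stochastic via Sinkhorn-Knopp
        return H_pre, H_post, H_res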

Part V: Infrastructure Optimizations

The DeepSeek paper emphasizes that raw mHC would be too slow without careful optimization. Expanding the residual stream by 4× typically slows training significantly because the GPU has to move far more data in and out of memory (the "memory wall" problem).

DeepSeek's key insight: transform mHC from a memory-intensive operation into a compute-intensive one.

1. Kernel Fusion with TileLang

The mHC kernels are written in TileLang, a domain-specific language built on TVM that lets engineers write low-level GPU kernels in Python syntax, with precise control over compute units and the memory hierarchy.

Python
# Naive implementation: 40+ memory round-trips per layer
# A = exp(W)                    # Write to HBM
# for i in range(20):
#     A = A / A.sum(dim=1)      # Read from HBM, write to HBM
#     A = A / A.sum(dim=0)      # Read from HBM, write to HBM

# TileLang fused kernel: Everything in SRAM
@tilelang.kernel
def fused_sinkhorn(W, output, num_iters):
    """
    Fused Sinkhorn-Knopp with shared memory.
    Once data is loaded from HBM to SRAM, all 20 iterations
    complete within SRAM—eliminating intermediate writes.
    """
    # Load W into shared memory (SRAM) once
    shared_A = exp(W)

    for i in range(num_iters):
        # Row normalize entirely in SRAM
        row_sums = shared_A.sum(dim=1)
        shared_A = shared_A / row_sums

        # Column normalize entirely in SRAM
        col_sums = shared_A.sum(dim=0)
        shared_A = shared_A / col_sums

    # Single write to output (HBM)
    output = shared_A

2. Custom Backward Pass

Instead of backpropagating through 20 iterations explicitly, DeepSeek uses implicit differentiation:

Python
# Naive backward: Unroll 20 iterations → 40+ backward steps
# Efficient backward: Implicit differentiation

# The Sinkhorn solution satisfies:
# H_res = D1 @ exp(W) @ D2  where D1, D2 are diagonal scaling matrices

# Gradient can be computed by solving a linear system
# instead of unrolling the iterations
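
As a rough sketch of the same idea (treating the diagonal scalings as constants and differentiating only through the closed-form reconstruction H = D1 · exp(W) · D2), the function below avoids unrolling the iterations. This is a simplification for illustration, not the paper's actual implicit-gradient backward pass.

Python
import torch

def sinkhorn_fixed_scaling(W, num_iters=20):
    """
    Sketch: run the Sinkhorn iterations without building an autograd graph to
    obtain the scaling vectors d1, d2, then differentiate only through the
    single reconstruction diag(d1) @ exp(W) @ diag(d2).
    """
    K = torch.exp(W)
    with torch.no_grad():                            # iterations contribute no graph
        d1 = torch.ones(K.shape[0], device=W.device)
        d2 = torch.ones(K.shape[1], device=W.device)
        for _ in range(num_iters):
            d1 = 1.0 / (K @ d2)                      # row scalings
            d2 = 1.0 / (K.T @ d1)                    # column scalings
    # One differentiable step instead of 20 unrolled normalizations.
    return d1.unsqueeze(1) * K * d2.unsqueeze(0)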

3. Selective Recomputation

The n-stream residual design introduces substantial memory overhead during training. DeepSeek mitigates this by:

  • Discarding intermediate activations of mHC kernels after forward pass
  • Recomputing them on-the-fly during backward pass
  • Trading compute for memory (worthwhile given kernel efficiency)
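
A rough PyTorch-level analogue of this trade-off uses activation checkpointing (torch.utils.checkpoint); DeepSeek's version operates inside custom kernels, so treat this only as an illustration. It reuses the sinkhorn_knopp sketch from Part IV, and the einsum stands in for the mixing step.

Python
import torch
from torch.utils.checkpoint import checkpoint

def mhc_mixing(streams, W_mix):
    """Stand-in for the mHC mixing step (illustrative)."""
    H = sinkhorn_knopp(W_mix)                          # from the sketch in Part IV
    return torch.einsum('bsnd,mn->bsmd', streams, H)   # mix the n streams

streams = torch.randn(2, 16, 4, 64, requires_grad=True)
W_mix = torch.nn.Parameter(torch.eye(4))

# The forward pass discards mhc_mixing's intermediates; they are recomputed
# on the fly during backward, trading compute for memory.
out = checkpoint(mhc_mixing, streams, W_mix, use_reentrant=False)
out.sum().backward()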

4. Communication Overlapping (DualPipe)

In distributed training, GPUs often sit idle while waiting for network communication. DeepSeek's solution:

Code
┌─────────────────────────────────────────────────────────────┐
│                    DUALPIPE SCHEDULE                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Main Stream:    [Compute]  [Wait for AllReduce]  [Compute] │
│                              ↓                               │
│  Priority Stream:           [mHC Sinkhorn]                  │
│                                                              │
│  When main stream awaits network sync, GPU switches to      │
│  high-priority stream for mHC calculations → "free" compute │
│                                                              │
└─────────────────────────────────────────────────────────────┘

This Compute-Communication Overlap makes mHC's computations almost "free."
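
A rough PyTorch-level analogue of the idea runs the Sinkhorn projection on a side CUDA stream while an asynchronous all-reduce is in flight. DeepSeek's DualPipe scheduling is far more sophisticated; this sketch assumes an initialized process group, a GPU, and the sinkhorn_knopp helper from Part IV.

Python
import torch
import torch.distributed as dist

# Assumes torch.distributed has been initialized (e.g. NCCL) and a GPU is present.
side_stream = torch.cuda.Stream()

grads = torch.randn(1024, 1024, device='cuda')
work = dist.all_reduce(grads, async_op=True)           # communication in flight

with torch.cuda.stream(side_stream):                   # overlap: mHC math on a side stream
    W = torch.randn(4, 4, device='cuda')
    H_res = sinkhorn_knopp(W)                          # from the sketch in Part IV

work.wait()                                            # wait for the all-reduce
torch.cuda.current_stream().wait_stream(side_stream)   # sync the side stream before using H_res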

5. Mixed Precision

TileLang-based kernels enable:

  • FP8 computation for matrix operations (speed)
  • FP32 precision for numerically sensitive normalization steps (accuracy)
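
Standard PyTorch does not expose a drop-in FP8 matmul path, so the hedged sketch below illustrates the same split with bfloat16 for the bulk element-wise math and explicit float32 casts for the sensitive normalization sums.

Python
import torch

def sinkhorn_mixed_precision(W, num_iters=20):
    """Illustrative only: low precision for the bulk math, float32 for the reductions."""
    A = torch.exp(W.to(torch.bfloat16))                        # element-wise math in low precision
    for _ in range(num_iters):
        row = A.to(torch.float32).sum(dim=1, keepdim=True)     # accumulate row sums in float32
        A = (A.to(torch.float32) / row).to(torch.bfloat16)
        col = A.to(torch.float32).sum(dim=0, keepdim=True)     # accumulate column sums in float32
        A = (A.to(torch.float32) / col).to(torch.bfloat16)
    return A.to(torch.float32)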

Overhead Summary

| Optimization | Impact |
| --- | --- |
| Kernel Fusion | Eliminates 40+ HBM round-trips per layer |
| Implicit Gradients | Single backward pass instead of 40 unrolled steps |
| Selective Recomputation | Trades compute for memory |
| DualPipe Overlap | Makes mHC compute "free" during communication |
| Mixed Precision | Faster matmuls, stable normalization |

Final overhead: 6.7% training time for 4× wider residual stream.


Part VI: Experimental Results

Model Configurations

All models use Mixture-of-Experts architectures inspired by DeepSeek-V3:

| Model | Total Params | Active Params | Layers | Training Tokens |
| --- | --- | --- | --- | --- |
| 3B MoE | 3B | ~0.5B | 32 | 1T |
| 9B MoE | 9B | ~1.5B | 48 | 1T |
| 27B MoE | 27B | ~4.5B | 64 | 1T |

Both HC and mHC use expansion rate n = 4 for the widened residual stream.

Signal Gain Analysis by Model Size

The paper provides detailed measurements of the Amax Gain Magnitude (worst-case signal amplification):

| Model | Baseline | HC | mHC |
| --- | --- | --- | --- |
| 3B | 1.2× | 48× | 1.5× |
| 9B | 1.3× | 287× | 1.6× |
| 27B | 1.4× | ~3000× | 1.6× |

Critical observation: HC signal gain grows rapidly with model size (48× → 287× → ~3000×), while mHC remains bounded at ~1.5-1.6×. This is why HC diverges at scale.

Training Stability

Relative to the mHC run, the paper shows HC exhibiting an unexpected loss surge around step 12k, closely correlated with gradient-norm instability:

Code
┌─────────────────────────────────────────────────────────────┐
│                    TRAINING LOSS CURVES                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Loss                                                        │
│   │                                                          │
│   │      HC Loss Spike                                      │
│   │           ╱╲                                            │
│   │          ╱  ╲                                           │
│   │  HC ────╱    ╲────× (DIVERGED)                         │
│   │                                                          │
│   │  mHC ─────────────────────────────── (Stable)          │
│   │                                                          │
│   │  Baseline ──────────────────────────                    │
│   │                                                          │
│   └────────────────────────────────────────────────── Steps │
│        0      5k     10k    12k    15k    20k               │
│                                                              │
│  HC diverges at 12k steps; mHC trains smoothly throughout   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Complete Benchmark Results (27B Model)

Table 2 from paper - Eight diverse benchmarks:

| Benchmark | Baseline | HC | mHC | Δ (mHC vs HC) |
| --- | --- | --- | --- | --- |
| BBH | 43.8 | 48.9 | 51.0 | +2.1 |
| DROP | 47.0 | 51.6 | 53.9 | +2.3 |
| GSM8K | 46.7 | 50.2 | 53.8 | +3.6 |
| HellaSwag | 72.1 | 74.3 | 75.8 | +1.5 |
| MATH | 24.3 | 27.1 | 29.4 | +2.3 |
| MMLU | 59.0 | 61.8 | 63.4 | +1.6 |
| PIQA | 78.2 | 79.5 | 80.3 | +0.8 |
| TriviaQA | 58.4 | 60.2 | 61.9 | +1.7 |

Key observations:

  • mHC consistently outperforms both baseline and HC across all 8 benchmarks
  • Largest improvements on reasoning tasks: BBH (+2.1%), DROP (+2.3%), GSM8K (+3.6%)
  • Training loss: mHC achieves 0.021 lower loss than baseline

Ablation Study: Component Analysis

Table 1 from paper - Which component matters most?

| Configuration | BBH | GSM8K | Stable? |
| --- | --- | --- | --- |
| Baseline (no HC) | 43.8 | 46.7 | ✓ |
| H_pre + H_post only (no H_res) | 44.1 | 47.2 | ✓ |
| H_res only (no H_pre/H_post) | 49.8 | 52.1 | ✗ (unstable) |
| Full HC (unconstrained) | 48.9 | 50.2 | ✗ (diverges) |
| Full mHC (constrained H_res) | 51.0 | 53.8 | ✓ |

Critical finding: H_res is the primary driver of performance gains. When they removed H_res (keeping only H_pre and H_post), performance dropped dramatically. However, unconstrained H_res causes instability. mHC solves this by constraining H_res to be doubly stochastic.

From the paper: "The ablation studies prove that H_res is the most critical component—the highlight of the process is when features from different depths get to interact and swap information."

Scaling Behavior Across Model Sizes

Performance advantages maintained across all scales:

| Model | BBH (mHC vs Baseline) | Stable? |
| --- | --- | --- |
| 3B | +5.8 | ✓ |
| 9B | +6.5 | ✓ |
| 27B | +7.2 | ✓ |

mHC shows consistent scaling—improvements don't attenuate at larger scales.

Gradient Norm Analysis

The paper shows gradient norm variance is significantly reduced with mHC:

| Metric | HC | mHC |
| --- | --- | --- |
| Gradient norm variance | High (spikes) | Low (stable) |
| Max gradient magnitude | Unbounded | Bounded |
| Loss curve smoothness | Spikes at 12k | Smooth throughout |

Part VII: Mathematical Deep Dive

Why Doubly Stochastic Works

Let's prove why constraining to doubly stochastic matrices bounds signal magnitude.

Lemma 1 (Spectral Norm Bound): For any doubly stochastic matrix H:

Code
||H||₂ ≤ 1

Proof sketch:

  • Every row sums to 1 with non-negative entries, so ||H||∞ = 1; every column sums to 1, so ||H||₁ = 1
  • By interpolation, ||H||₂ ≤ √(||H||₁ · ||H||∞) = 1
  • The bound is attained: H maps the all-ones vector to itself (the Perron-Frobenius eigenvalue is 1), so the largest singular value is exactly 1 ∎

Corollary (Bounded Propagation): For any vector x:

Code
||H · x||₂ ≤ ||H||₂ · ||x||₂ ≤ ||x||₂

This means mixing cannot amplify the signal—only redistribute it.

The Birkhoff-von Neumann Theorem

The Birkhoff Polytope has beautiful structure:

Theorem (Birkhoff-von Neumann): Every doubly stochastic matrix is a convex combination of permutation matrices.

Code
H = Σᵢ λᵢ Pᵢ   where:
- Pᵢ are permutation matrices
- λᵢ ≥ 0 and Σᵢ λᵢ = 1

This means mHC mixing can be interpreted as a soft permutation of streams—a weighted average of all possible reorderings.
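
A tiny numerical illustration (not from the paper): blending two permutation matrices with convex weights produces a doubly stochastic matrix, i.e. a soft permutation of the streams.

Python
import numpy as np

# Two permutations of 3 streams: identity and a cyclic shift.
P1 = np.eye(3)
P2 = np.array([[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]], dtype=float)

# Convex combination → doubly stochastic (a "soft permutation").
H = 0.7 * P1 + 0.3 * P2
print(H.sum(axis=0))                 # [1. 1. 1.]
print(H.sum(axis=1))                 # [1. 1. 1.]

x = np.array([1.0, 2.0, 3.0])
print(H @ x)                         # 70% of each stream stays put, 30% is shifted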

Sinkhorn-Knopp Convergence

Theorem (Sinkhorn-Knopp Convergence): For any matrix with strictly positive entries, the Sinkhorn-Knopp iterations converge to a unique doubly stochastic matrix, with the error decaying geometrically in the iteration count k.

The convergence is exponentially fast, which is why 20 iterations suffice for practical purposes.


Part VIII: Implications and Future Directions

Expected Use in Future DeepSeek Models

The mHC paper was co-authored by DeepSeek CEO Liang Wenfeng, signaling strategic importance. Industry analysts expect:

  • DeepSeek R2 (expected Q1 2026): Likely first production model with mHC
  • DeepSeek V4 (expected 2026): Full integration with MLA and other innovations
  • Open-source release: Following DeepSeek's pattern of open-sourcing innovations

From analysis: "mHC addresses a fundamental bottleneck. Combined with DeepSeek's other innovations (MLA, DeepSeekMoE), this could enable training of models significantly larger than current frontier."

Comparison with Other Approaches

| Approach | Mechanism | Overhead | Scalability |
| --- | --- | --- | --- |
| Standard Residual | Single stream | 0% | ✓ Limited capacity |
| Hyper-Connections | Multiple streams, unconstrained | ~5% | ✗ Unstable |
| mHC | Multiple streams, Birkhoff constraint | ~6.7% | ✓ Stable |
| Mixture of Experts | Sparse activation | Variable | ✓ Different dimension |

mHC is orthogonal to MoE—they solve different problems:

  • MoE: Sparse activation for compute efficiency
  • mHC: Wider residual stream for information flow

They can be combined (and likely will be in future DeepSeek models).

Open Questions

  1. Optimal expansion rate n? The paper uses n=4, but optimal value may depend on model size
  2. Interaction with other architectures? How does mHC interact with MLA, SwiGLU, etc.?
  3. Vision and multimodal? Does mHC benefit non-language modalities?
  4. Inference optimization? Can Sinkhorn be precomputed for inference?

Conclusion

mHC represents a fundamental advance in neural network architecture design. By recognizing that the residual connection bottleneck could be solved with a geometric constraint rather than architectural complexity, DeepSeek has opened a new path for scaling deep networks.

Key takeaways:

  1. The problem was structural: Hyper-Connections fail not due to tuning but due to unbounded eigenvalues causing signal explosion

  2. The solution is elegant: Project mixing matrices onto the Birkhoff Polytope using the classical Sinkhorn-Knopp algorithm (1967)

  3. The overhead is minimal: 6.7% training time for 4× wider residual stream and consistent benchmark improvements

  4. The implications are significant: Enables stable training at scales where previous approaches diverge

  5. A 1967 algorithm saves 2025 AI: Sometimes the best solutions come from revisiting classical mathematics

From the paper: "mHC provides tangible performance improvements and superior scalability, laying the groundwork for more capable foundation models."

Watch for mHC in DeepSeek R2 and V4—this architectural innovation may be a key enabler of the next generation of frontier models.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
