mHC: How DeepSeek Fixed the Residual Connection Bottleneck with a 1967 Algorithm
DeepSeek's Manifold-Constrained Hyper-Connections (mHC) solve training instability in deep networks by projecting residual mixing matrices onto the Birkhoff Polytope using the Sinkhorn-Knopp algorithm. A deep dive into the architecture that may power DeepSeek R2 and V4.
Paper Overview
| Title | mHC: Manifold-Constrained Hyper-Connections |
| ArXiv | 2512.24880 |
| Date | December 31, 2025 |
| Authors | Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang (CEO) |
| Affiliation | DeepSeek |
| Contact | xie.zhenda@deepseek.com |
The Residual Connection Bottleneck
On December 31, 2025, DeepSeek published a paper that could reshape how we build large language models. The work, co-authored by CEO Wenfeng Liang and 18 researchers, introduces Manifold-Constrained Hyper-Connections (mHC)—a framework that fixes a fundamental limitation in transformer architecture.
The paper addresses one of the longest-standing constraints in deep learning: the residual stream bottleneck. Since ResNet (2015) and Transformers (2017), residual connections have been the backbone of trainable deep networks. They work beautifully—but they force all information through a single narrow pathway.
From the paper: "Hyper-Connections extend the residual connection paradigm by expanding the residual stream width and diversifying connectivity patterns. While this yields substantial performance gains, it fundamentally compromises the identity mapping property intrinsic to residual connections, causing severe training instability and restricted scalability."
The key insight: You can widen the residual stream for better performance, but doing so naively causes signal explosion that makes training impossible. mHC solves this with a mathematical constraint that preserves trainability while enabling richer information flow.
This post provides a complete technical breakdown of:
- Why residual connections work (and their limitations)
- What Hyper-Connections attempt and why they fail at scale
- How mHC uses the Birkhoff Polytope and Sinkhorn-Knopp algorithm to restore stability
- Complete benchmark results (3B, 9B, 27B models) and ablation studies
- Infrastructure optimizations (TileLang, DualPipe, kernel fusion)
- Implications for future models (DeepSeek R2, V4)
Key Equations at a Glance
Standard Residual Connection:
x_{l+1} = x_l + F(x_l)
Hyper-Connections (HC):
x_{l+1} = H_l^{res} · x_l + H_l^{post,T} · F(H_l^{pre} · x_l, W_l)
mHC Constraint (Birkhoff Polytope):
H_res ∈ {M : M_{ij} ≥ 0, Σⱼ M_{ij} = 1, Σᵢ M_{ij} = 1}
Sinkhorn-Knopp Projection:
for k = 1 to 20:
A ← A / row_sums(A) # Row normalize
A ← A / col_sums(A) # Column normalize
Key Property: Spectral norm of doubly stochastic matrices ≤ 1 → signals can't explode
Part I: Understanding Residual Connections
Why Residual Connections Exist
Before diving into mHC, we need to understand why residual connections became essential. In 2015, Kaiming He's ResNet paper solved the "degradation problem"—the counterintuitive observation that deeper networks performed worse than shallow ones, even on training data.
The solution was elegantly simple: instead of asking a block to learn a full transformation H(x), let it learn only the residual F(x) = H(x) - x and add the input back: x_{l+1} = x_l + F(x_l).
This creates a "skip connection" that allows gradients to flow directly through the network without passing through every layer. If a layer isn't useful, it can simply learn F(x) ≈ 0, effectively becoming an identity mapping.
┌──────────────────────────────────────────────────────────────┐
│ TRADITIONAL RESIDUAL CONNECTION │
├──────────────────────────────────────────────────────────────┤
│ │
│ Input x ─────────────────────────────────────┐ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Layer │ (Attention, FFN, etc.) │ │
│ │ F(x) │ │ │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────────────────────────┐ │
│ │ ADD (⊕) │ │
│ │ Output = F(x) + x │ │
│ └───────────────────────────────────────────┘ │
│ │
│ Key Property: Identity mapping preserved │
│ If F(x) → 0, then Output → x (signal preserved) │
│ │
└──────────────────────────────────────────────────────────────┘
The Identity Mapping Property
The magic of residual connections lies in the identity mapping property. When you stack many layers, the signal magnitude stays bounded:
# Simplified view of residual propagation
def residual_block(x, layer):
return layer(x) + x # Identity + learned transformation
# After L layers
x_L = x_0 + F_1(x_0) + F_2(x_1) + ... + F_L(x_{L-1})
Mathematically, the Jacobian of the identity mapping is the identity matrix, so the skip path contributes a factor of exactly 1 to the chain rule. This means:
- Forward pass: Signals don't explode or vanish
- Backward pass: Gradients flow directly to early layers
- Scaling: You can stack hundreds of layers without instability
From research: "Residual connections enable training of networks with 100+ layers by providing gradient highways that bypass the vanishing gradient problem."
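To make the gradient-highway intuition concrete, here is a toy PyTorch comparison (not from the paper; the depth, width, and tanh nonlinearity are arbitrary choices) of the gradient that reaches the input with and without skip connections:

import torch
import torch.nn as nn

def input_grad_norm(depth: int, use_residual: bool, d: int = 64) -> float:
    """Norm of the gradient at the input after backprop through `depth` tanh layers."""
    torch.manual_seed(0)
    layers = nn.ModuleList(nn.Linear(d, d) for _ in range(depth))
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for layer in layers:
        if use_residual:
            h = h + torch.tanh(layer(h))   # skip connection: identity + residual
        else:
            h = torch.tanh(layer(h))       # plain stack: no identity path
    h.sum().backward()
    return x.grad.norm().item()

# Toy illustration: with residual connections the gradient reaching the input
# stays healthy; without them it shrinks rapidly as depth grows.
print("plain 50 layers:   ", input_grad_norm(depth=50, use_residual=False))
print("residual 50 layers:", input_grad_norm(depth=50, use_residual=True))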
The Bottleneck Problem
However, traditional residual connections have a fundamental limitation: all information must flow through a single stream.
┌─────────────────────────────────────────────────────────────┐
│ THE RESIDUAL STREAM BOTTLENECK │
├─────────────────────────────────────────────────────────────┤
│ │
│ Layer 1 ──► Layer 2 ──► Layer 3 ──► ... ──► Layer L │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SINGLE RESIDUAL STREAM │ │
│ │ All information compressed into one pathway │ │
│ │ (d_model dimensions) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Problem: As models scale, this becomes a bottleneck. │
│ The model has capacity to process more, but can't. │
│ │
└─────────────────────────────────────────────────────────────┘
As models grow wider (more attention heads, larger FFN), the residual stream doesn't grow proportionally. This creates an information bottleneck—the model has the capacity to compute richer representations, but must compress everything through the same narrow pathway.
Part II: Hyper-Connections (The Promising but Unstable Idea)
Widening the Residual Stream
The natural solution is to widen the residual stream. Instead of one stream, use multiple parallel streams that can exchange information. This is the core idea behind Hyper-Connections (HC), introduced by ByteDance.
┌─────────────────────────────────────────────────────────────┐
│ HYPER-CONNECTIONS (HC) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input: [x₁, x₂, x₃, x₄] (4 parallel streams) │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Layer │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ H_res │ ← Mixing matrix (learned) │
│ │ (4 × 4) │ │
│ └────────┬───────┘ │
│ │ │
│ ▼ │
│ Output: [y₁, y₂, y₃, y₄] │
│ │
│ y = H_res · x (matrix multiplication for mixing) │
│ │
│ Benefit: 4× wider residual pathway │
│ Problem: H_res is unconstrained → instability │
│ │
└─────────────────────────────────────────────────────────────┘
The idea is compelling:
- Multiple streams carry different aspects of the representation
- Mixing matrix H_res allows streams to exchange information
- More capacity for information flow without increasing model width
Why Hyper-Connections Fail at Scale
Here's the critical problem: the mixing matrix H_res is unconstrained.
When you multiply arbitrary matrices together across hundreds of layers, disaster strikes. Any eigenvalue > 1 causes exponential growth:
# Simplified: What happens with unconstrained mixing
import numpy as np
def simulate_hc_propagation(n_layers=100):
"""Simulate signal magnitude through HC layers"""
# Random mixing matrix (unconstrained)
H_res = np.random.randn(4, 4) * 0.5 + np.eye(4)
x = np.ones(4) # Initial signal
magnitudes = [np.linalg.norm(x)]
for _ in range(n_layers):
x = H_res @ x
magnitudes.append(np.linalg.norm(x))
return magnitudes
# Result: Signal often explodes to 10^6 or higher
# or collapses to near-zero
From the DeepSeek paper: "In a 27B parameter model, unconstrained HC caused signal gains exceeding 3000×, leading to catastrophic divergence."
This isn't a tuning issue—it's a structural instability. The mixing matrices have eigenvalues that aren't bounded, so:
- Forward pass: Signals explode exponentially
- Backward pass: Gradients explode or vanish
- Training: Model diverges or fails to learn
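The eigenvalue claim is easy to check for the kind of random mixing matrix used in the simulation above (a quick NumPy illustration; the seed and scale are arbitrary):

import numpy as np

np.random.seed(0)
H_res = np.random.randn(4, 4) * 0.5 + np.eye(4)   # same construction as above

rho = np.max(np.abs(np.linalg.eigvals(H_res)))     # spectral radius
print(f"spectral radius: {rho:.2f}")               # typically well above 1
print(f"implied growth over 100 layers: ~{rho**100:.1e}")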
┌─────────────────────────────────────────────────────────────┐
│ SIGNAL EXPLOSION IN HYPER-CONNECTIONS │
├─────────────────────────────────────────────────────────────┤
│ │
│ Layer 1 Layer 10 Layer 50 Layer 100 │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ 1.0 → 3.2 → 847 → 3000+ → DIVERGENCE │
│ │
│ Signal magnitude grows exponentially │
│ Eigenvalues of H_res > 1 compound across layers │
│ │
│ Max signal gain in 27B model: │
│ - Unconstrained HC: ~3000× ❌ Unstable │
│ - mHC (constrained): ~1.6× ✓ Stable │
│ │
└─────────────────────────────────────────────────────────────┘
Part III: The mHC Solution
The Key Insight: Constrain to a Manifold
DeepSeek's insight was geometric: project the mixing matrix onto a manifold where signals can't explode.
The specific manifold they chose is the Birkhoff Polytope—the set of all doubly stochastic matrices:
Doubly Stochastic Matrix:
- All entries are non-negative: H[i,j] ≥ 0
- Every row sums to 1: Σⱼ H[i,j] = 1
- Every column sums to 1: Σᵢ H[i,j] = 1
Why doubly stochastic matrices?
The spectral norm (maximum stretch factor) of any doubly stochastic matrix is ≤ 1. This guarantees:
- Signals can't explode through the mixing operation
- Information is redistributed, not amplified
- The identity mapping property is approximately preserved
┌─────────────────────────────────────────────────────────────┐
│ THE BIRKHOFF POLYTOPE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Set of all doubly stochastic matrices │
│ │
│ Example (4×4): │
│ ┌ ┐ │
│ │ 0.4 0.3 0.2 0.1 │ Row sum = 1 │
│ │ 0.2 0.4 0.3 0.1 │ Row sum = 1 │
│ │ 0.3 0.1 0.4 0.2 │ Row sum = 1 │
│ │ 0.1 0.2 0.1 0.6 │ Row sum = 1 │
│ └ ┘ │
│ ↓ ↓ ↓ ↓ │
│ Col Col Col Col │
│ =1 =1 =1 =1 │
│ │
│ Properties: │
│ - Spectral norm ≤ 1 (no signal explosion) │
│ - Preserves total "mass" of signal │
│ - Vertices are permutation matrices │
│ - Identity matrix is in the interior │
│ │
└─────────────────────────────────────────────────────────────┘
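These properties are easy to verify numerically. A minimal check using the 4×4 example above (illustration only):

import numpy as np

H = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.3, 0.1, 0.4, 0.2],
    [0.1, 0.2, 0.1, 0.6],
])

print(H.sum(axis=1))                    # row sums:    [1. 1. 1. 1.]
print(H.sum(axis=0))                    # column sums: [1. 1. 1. 1.]
print(np.linalg.norm(H, ord=2))         # spectral norm: ~1.0, never above 1

x = np.random.randn(4)
print(np.linalg.norm(H @ x) <= np.linalg.norm(x) + 1e-12)   # True: mixing cannot amplify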
The Sinkhorn-Knopp Algorithm (1967)
How do you project an arbitrary matrix onto the Birkhoff Polytope? DeepSeek uses the Sinkhorn-Knopp algorithm, a classical result from 1967.
The algorithm is remarkably simple: alternately normalize rows and columns.
import torch

def sinkhorn_knopp(M, num_iterations=20):
"""
Project matrix M onto the Birkhoff Polytope (doubly stochastic matrices).
Args:
M: Input matrix (n × n), should have positive entries
num_iterations: Number of alternating normalizations
Returns:
Doubly stochastic matrix approximation
"""
# Ensure positive entries (apply softmax or exp)
A = torch.exp(M) # or torch.softmax(M, dim=-1)
for _ in range(num_iterations):
# Row normalization: make each row sum to 1
A = A / A.sum(dim=1, keepdim=True)
# Column normalization: make each column sum to 1
A = A / A.sum(dim=0, keepdim=True)
return A
# Example
M = torch.randn(4, 4)
H_res = sinkhorn_knopp(M, num_iterations=20)
# H_res is now (approximately) doubly stochastic
# Rows sum to ~1, columns sum to ~1
Why 20 iterations? The algorithm converges exponentially fast. With 20 iterations, the matrix is close enough to doubly stochastic for stability, while keeping computational cost manageable.
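A quick empirical check (not from the paper) of how fast the deviation from doubly stochastic shrinks:

import torch

torch.manual_seed(0)
M = torch.randn(4, 4, dtype=torch.float64)
A = torch.exp(M)

for k in range(1, 21):
    A = A / A.sum(dim=1, keepdim=True)           # row normalize
    A = A / A.sum(dim=0, keepdim=True)           # column normalize
    err = (A.sum(dim=1) - 1).abs().max().item()  # row sums drift after the column step
    if k in (1, 5, 10, 20):
        print(f"iteration {k:2d}: max row-sum error = {err:.1e}")
# The error decays geometrically, so 20 iterations leave a negligible residual.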
mHC Architecture
Putting it together, mHC modifies Hyper-Connections by constraining the mixing matrices:
┌─────────────────────────────────────────────────────────────┐
│ MANIFOLD-CONSTRAINED HYPER-CONNECTIONS (mHC) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Input: [x₁, x₂, x₃, x₄] (n=4 streams) │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Layer │ (Attention or FFN) │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ LEARNABLE PARAMETERS │ │
│ │ W (n × n) │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ SINKHORN-KNOPP PROJECTION │ │
│ │ H_res = SinkhornKnopp(exp(W), k=20) │ │
│ │ │ │
│ │ Guarantees: H_res is doubly stochastic │ │
│ │ Spectral norm ≤ 1 │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ CONSTRAINED MIXING │ │
│ │ y = H_res · x + residual │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ▼ │
│ Output: [y₁, y₂, y₃, y₄] │
│ │
│ ✓ Wider residual stream (4× information flow) │
│ ✓ Stable training (bounded signal magnitude) │
│ ✓ Preserves identity mapping property │
│ │
└─────────────────────────────────────────────────────────────┘
Mathematical Guarantee
The key theorem underlying mHC:
Theorem (Bounded Signal Propagation): For any doubly stochastic matrix H, and any vector x:
||H · x||₂ ≤ ||x||₂
This means the mixing operation cannot amplify signals. Information is redistributed between streams, but total magnitude is preserved or reduced.
Stacked over L layers, the mixing steps themselves contribute no amplification; any growth in ||x_L|| comes only from the layer outputs added along the way, exactly as with standard residual connections.
In practice, DeepSeek observed max signal gain of ~1.6× versus ~3000× for unconstrained HC.
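To see the contrast concretely, here is a toy simulation (illustrative only, not the paper's measurement) applying an unconstrained mixing matrix and its Sinkhorn projection across 100 layers:

import numpy as np

def sinkhorn(A, n_iters=20):
    """Alternate row/column normalization (Sinkhorn-Knopp)."""
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)
        A = A / A.sum(axis=0, keepdims=True)
    return A

def magnitude_after(H, n_layers=100):
    """Signal magnitude after repeatedly applying the mixing matrix H."""
    x = np.ones(4)
    for _ in range(n_layers):
        x = H @ x
    return np.linalg.norm(x)

np.random.seed(0)
W = np.random.randn(4, 4) * 0.5 + np.eye(4)

print(f"unconstrained mixing: {magnitude_after(W):.2e}")                    # explodes
print(f"doubly stochastic:    {magnitude_after(sinkhorn(np.exp(W))):.2e}")  # stays bounded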
Part IV: Implementation Details
Full mHC Layer
Here's a more complete implementation of an mHC layer:
import torch
import torch.nn as nn
class SinkhornKnopp(torch.autograd.Function):
"""
Custom autograd function for Sinkhorn-Knopp projection.
Includes backward pass for gradient computation.
"""
@staticmethod
def forward(ctx, M, num_iterations=20):
# Store for backward
ctx.num_iterations = num_iterations
# Ensure positive entries
A = torch.exp(M)
# Alternating row/column normalization
for _ in range(num_iterations):
A = A / A.sum(dim=-1, keepdim=True) # Row normalize
A = A / A.sum(dim=-2, keepdim=True) # Column normalize
ctx.save_for_backward(A, M)
return A
@staticmethod
def backward(ctx, grad_output):
A, M = ctx.saved_tensors
# Implicit differentiation through Sinkhorn iterations
# (Simplified - actual implementation uses implicit gradients)
grad_M = grad_output * A
return grad_M, None
def sinkhorn_knopp(M, num_iterations=20):
return SinkhornKnopp.apply(M, num_iterations)
class mHCBlock(nn.Module):
"""
Manifold-Constrained Hyper-Connection block.
Args:
d_model: Model hidden dimension
n_streams: Number of parallel residual streams (expansion rate)
num_sinkhorn_iters: Sinkhorn-Knopp iterations
"""
def __init__(self, d_model, n_streams=4, num_sinkhorn_iters=20):
super().__init__()
self.d_model = d_model
self.n_streams = n_streams
self.num_sinkhorn_iters = num_sinkhorn_iters
# Learnable parameters for mixing matrix
# Will be projected to doubly stochastic via Sinkhorn-Knopp
self.W_mix = nn.Parameter(torch.zeros(n_streams, n_streams))
nn.init.eye_(self.W_mix) # Initialize near identity
def get_mixing_matrix(self):
"""Project learnable weights to Birkhoff Polytope."""
return sinkhorn_knopp(self.W_mix, self.num_sinkhorn_iters)
def forward(self, streams, layer_output):
"""
Apply mHC mixing to residual streams.
Args:
streams: Tensor of shape (batch, seq_len, n_streams, d_model)
layer_output: Output from attention/FFN layer (batch, seq_len, d_model)
Returns:
Updated streams tensor
"""
batch, seq_len, n_streams, d_model = streams.shape
# Get constrained mixing matrix
H_res = self.get_mixing_matrix() # (n_streams, n_streams)
# Mix streams: each new stream is weighted combination of old streams
# streams: (B, S, N, D) -> (B, S, D, N) for matmul
streams_t = streams.permute(0, 1, 3, 2) # (B, S, D, N)
mixed = torch.matmul(streams_t, H_res.T) # (B, S, D, N)
mixed = mixed.permute(0, 1, 3, 2) # (B, S, N, D)
# Add layer output to first stream (or distribute)
mixed[:, :, 0, :] = mixed[:, :, 0, :] + layer_output
return mixed
class mHCTransformerBlock(nn.Module):
"""
Full transformer block with mHC residual connections.
"""
def __init__(self, d_model, n_heads, n_streams=4, ffn_ratio=4):
super().__init__()
self.d_model = d_model
self.n_streams = n_streams
# Standard transformer components
self.ln1 = nn.LayerNorm(d_model)
self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = nn.Sequential(
nn.Linear(d_model, d_model * ffn_ratio),
nn.GELU(),
nn.Linear(d_model * ffn_ratio, d_model)
)
# mHC mixing blocks
self.mhc_attn = mHCBlock(d_model, n_streams)
self.mhc_ffn = mHCBlock(d_model, n_streams)
def forward(self, streams, mask=None):
"""
Args:
streams: (batch, seq_len, n_streams, d_model)
mask: Optional attention mask
Returns:
Updated streams
"""
# Use first stream as "main" representation for attention
x = streams[:, :, 0, :] # (B, S, D)
# Attention with mHC residual
attn_out, _ = self.attention(
self.ln1(x), self.ln1(x), self.ln1(x),
attn_mask=mask
)
streams = self.mhc_attn(streams, attn_out)
# FFN with mHC residual
x = streams[:, :, 0, :]
ffn_out = self.ffn(self.ln2(x))
streams = self.mhc_ffn(streams, ffn_out)
return streams
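A quick smoke test of the sketch above (assuming the classes are defined as shown; it only checks shapes and the doubly stochastic constraint):

import torch

block = mHCTransformerBlock(d_model=64, n_heads=4, n_streams=4)

streams = torch.randn(2, 16, 4, 64)       # (batch, seq_len, n_streams, d_model)
out = block(streams)
print(out.shape)                          # torch.Size([2, 16, 4, 64])

# The learned mixing matrix should be (approximately) doubly stochastic.
H_res = block.mhc_attn.get_mixing_matrix()
print(H_res.sum(dim=0))                   # each column sums to ~1
print(H_res.sum(dim=1))                   # each row sums to ~1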
Parameterization Details
The paper specifies how the three matrices are parameterized:
H_pre and H_post (Input/Output Projections):
- Use sigmoid activation to ensure non-negative entries
- This prevents cancellation between streams
- Interpretation as weighted averaging remains clear
H_res (Residual Mixing):
- Projected to doubly stochastic via Sinkhorn-Knopp
- Initialized near identity for stable start
- Small initialization values (α = 0.01) for gating factors
Expansion Rate n:
- Primary experiments use n = 4 (4 parallel streams)
- Creates 4× wider residual pathway
- Each stream has dimension d_model (same as original)
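A minimal sketch of this parameterization (the class name, the per-stream vector shapes for H_pre/H_post, and the use of α to scale the initial gating logits are assumptions based on the description above, not the paper's exact code):

import torch
import torch.nn as nn

class mHCParams(nn.Module):
    """Hypothetical container for the three mixing maps of one mHC block."""
    def __init__(self, n_streams=4, alpha=0.01, n_sinkhorn_iters=20):
        super().__init__()
        self.n_iters = n_sinkhorn_iters
        # H_pre / H_post: sigmoid keeps entries non-negative (weighted averaging).
        self.w_pre = nn.Parameter(alpha * torch.randn(n_streams))
        self.w_post = nn.Parameter(alpha * torch.randn(n_streams))
        # H_res: free logits, projected onto the Birkhoff Polytope at every use.
        self.w_res = nn.Parameter(torch.eye(n_streams))   # start near identity

    def h_pre(self):
        return torch.sigmoid(self.w_pre)      # non-negative read weights, one per stream

    def h_post(self):
        return torch.sigmoid(self.w_post)     # non-negative write-back weights

    def h_res(self):
        A = torch.exp(self.w_res)
        for _ in range(self.n_iters):
            A = A / A.sum(dim=-1, keepdim=True)   # rows sum to 1
            A = A / A.sum(dim=-2, keepdim=True)   # columns sum to 1
        return A                                   # approximately doubly stochastic

params = mHCParams()
print(params.h_res().sum(dim=0))   # ~[1, 1, 1, 1]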
Part V: Infrastructure Optimizations
The DeepSeek paper emphasizes that raw mHC would be too slow without careful optimization. Expanding the data stream by 4× typically slows down training significantly because the GPU has to move much more data in and out of memory—the "Memory Wall" problem.
DeepSeek's key insight: transform mHC from a memory-intensive operation into a compute-intensive one.
1. Kernel Fusion with TileLang
DeepSeek relies on TileLang, a domain-specific language built on TVM that lets engineers write low-level GPU kernels in Python-like syntax. This enables precise control over GPU compute units and memory hierarchies.
# Naive implementation: 40+ memory round-trips per layer
# A = exp(W) # Write to HBM
# for i in range(20):
# A = A / A.sum(dim=1) # Read from HBM, write to HBM
# A = A / A.sum(dim=0) # Read from HBM, write to HBM
# TileLang-style fused kernel (illustrative pseudocode): everything stays in SRAM
@tilelang.kernel
def fused_sinkhorn(W, output, num_iters):
"""
Fused Sinkhorn-Knopp with shared memory.
Once data is loaded from HBM to SRAM, all 20 iterations
complete within SRAM—eliminating intermediate writes.
"""
# Load W into shared memory (SRAM) once
shared_A = exp(W)
for i in range(num_iters):
# Row normalize entirely in SRAM
row_sums = shared_A.sum(dim=1)
shared_A = shared_A / row_sums
# Column normalize entirely in SRAM
col_sums = shared_A.sum(dim=0)
shared_A = shared_A / col_sums
# Single write to output (HBM)
output = shared_A
2. Custom Backward Pass
Instead of backpropagating through 20 iterations explicitly, DeepSeek uses implicit differentiation:
# Naive backward: Unroll 20 iterations → 40+ backward steps
# Efficient backward: Implicit differentiation
# The Sinkhorn solution satisfies:
# H_res = D1 @ exp(W) @ D2 where D1, D2 are diagonal scaling matrices
# Gradient can be computed by solving a linear system
# instead of unrolling the iterations
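The factorization in the comment above is easy to verify numerically, because Sinkhorn-Knopp only ever rescales rows and columns. A small sketch (float64 for a clean comparison; the implicit-gradient derivation itself is not reproduced here):

import torch

def sinkhorn_with_scalings(W, n_iters=20):
    """Run Sinkhorn-Knopp while tracking the cumulative row/column scalings."""
    A = torch.exp(W)
    r = torch.ones(A.shape[0], dtype=A.dtype)   # diagonal of D1
    c = torch.ones(A.shape[1], dtype=A.dtype)   # diagonal of D2
    for _ in range(n_iters):
        row_sums = A.sum(dim=1)
        A = A / row_sums[:, None]
        r = r / row_sums
        col_sums = A.sum(dim=0)
        A = A / col_sums[None, :]
        c = c / col_sums
    return A, r, c

W = torch.randn(4, 4, dtype=torch.float64)
H_res, r, c = sinkhorn_with_scalings(W)

# The converged matrix factors as D1 @ exp(W) @ D2, which is what makes
# an implicit (rather than unrolled) backward pass possible.
reconstructed = torch.diag(r) @ torch.exp(W) @ torch.diag(c)
print(torch.allclose(H_res, reconstructed))   # True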
3. Selective Recomputation
The n-stream residual design introduces substantial memory overhead during training. DeepSeek mitigates this by:
- Discarding intermediate activations of mHC kernels after forward pass
- Recomputing them on-the-fly during backward pass
- Trading compute for memory (worthwhile given kernel efficiency); a sketch of this pattern follows
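In PyTorch terms this corresponds to gradient checkpointing. The sketch below is a stand-in (the mixing function, shapes, and einsum layout are assumptions, not DeepSeek's kernels): intermediates of the mixing op are dropped after the forward pass and recomputed during backward.

import torch
from torch.utils.checkpoint import checkpoint

def mhc_mix(streams, layer_output, w_mix, n_iters=20):
    """Sinkhorn projection + stream mixing, same math as the Part IV sketch."""
    A = torch.exp(w_mix)
    for _ in range(n_iters):
        A = A / A.sum(dim=-1, keepdim=True)
        A = A / A.sum(dim=-2, keepdim=True)
    mixed = torch.einsum("bsnd,mn->bsmd", streams, A)        # y = H_res · x per position
    first = mixed[:, :, :1, :] + layer_output.unsqueeze(2)   # add layer output to stream 0
    return torch.cat([first, mixed[:, :, 1:, :]], dim=2)

w_mix = torch.eye(4, requires_grad=True)
streams = torch.randn(2, 16, 4, 64, requires_grad=True)
layer_out = torch.randn(2, 16, 64)

# Activations inside mhc_mix are not stored; they are recomputed on the fly
# during backward, trading a little extra compute for memory.
out = checkpoint(mhc_mix, streams, layer_out, w_mix, use_reentrant=False)
out.sum().backward()
print(streams.grad.shape, w_mix.grad.shape)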
4. Communication Overlapping (DualPipe)
In distributed training, GPUs often sit idle while waiting for network communication. DeepSeek's solution:
┌─────────────────────────────────────────────────────────────┐
│ DUALPIPE SCHEDULE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Main Stream: [Compute] [Wait for AllReduce] [Compute] │
│ ↓ │
│ Priority Stream: [mHC Sinkhorn] │
│ │
│ When main stream awaits network sync, GPU switches to │
│ high-priority stream for mHC calculations → "free" compute │
│ │
└─────────────────────────────────────────────────────────────┘
This Compute-Communication Overlap makes mHC's computations almost "free."
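The mechanics can be sketched with PyTorch CUDA streams (a toy illustration, not DeepSeek's DualPipe scheduler; the high-priority side stream and the matmul standing in for the communication-bound phase are assumptions):

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Lower priority value = higher scheduling priority.
    side_stream = torch.cuda.Stream(device=device, priority=-1)

    W = torch.randn(4, 4, device=device)
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)

    side_stream.wait_stream(torch.cuda.current_stream())  # W must be ready first

    # Main stream: long-running work standing in for a communication wait.
    main_out = a @ b

    # Side stream: the small Sinkhorn projection runs concurrently, "for free".
    with torch.cuda.stream(side_stream):
        A = torch.exp(W)
        for _ in range(20):
            A = A / A.sum(dim=1, keepdim=True)
            A = A / A.sum(dim=0, keepdim=True)

    torch.cuda.current_stream().wait_stream(side_stream)   # sync before using A
    print(main_out.shape, A.sum(dim=0))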
5. Mixed Precision
TileLang-based kernels enable:
- FP8 computation for matrix operations (speed)
- FP32 precision for numerically sensitive normalization steps (accuracy)
Overhead Summary
| Optimization | Impact |
|---|---|
| Kernel Fusion | Eliminates 40+ HBM round-trips per layer |
| Implicit Gradients | Single backward instead of 40 steps |
| Selective Recomputation | Trades compute for memory |
| DualPipe Overlap | Makes mHC compute "free" during comm |
| Mixed Precision | Faster matmuls, stable normalization |
Final overhead: 6.7% training time for 4× wider residual stream.
Part VI: Experimental Results
Model Configurations
All models use Mixture-of-Experts architectures inspired by DeepSeek-V3:
| Model | Total Params | Active Params | Layers | Training Tokens |
|---|---|---|---|---|
| 3B MoE | 3B | ~0.5B | 32 | 1T |
| 9B MoE | 9B | ~1.5B | 48 | 1T |
| 27B MoE | 27B | ~4.5B | 64 | 1T |
Both HC and mHC use expansion rate n = 4 for the widened residual stream.
Signal Gain Analysis by Model Size
The paper provides detailed measurements of the Amax Gain Magnitude (worst-case signal amplification):
| Model | Baseline | HC | mHC |
|---|---|---|---|
| 3B | 1.2× | 48× | 1.5× |
| 9B | 1.3× | 287× | 1.6× |
| 27B | 1.4× | ~3000× | 1.6× |
Critical observation: HC signal gain grows exponentially with model size (48→287→3000), while mHC remains bounded (~1.5-1.6×). This is why HC diverges at scale.
Training Stability
Relative to mHC, the paper shows that HC exhibits an unexpected loss surge around step 12k, closely correlated with gradient-norm instability:
┌─────────────────────────────────────────────────────────────┐
│ TRAINING LOSS CURVES │
├─────────────────────────────────────────────────────────────┤
│ │
│ Loss │
│ │ │
│ │ HC Loss Spike │
│ │ ╱╲ │
│ │ ╱ ╲ │
│ │ HC ────╱ ╲────× (DIVERGED) │
│ │ │
│ │ mHC ─────────────────────────────── (Stable) │
│ │ │
│ │ Baseline ────────────────────────── │
│ │ │
│ └────────────────────────────────────────────────── Steps │
│ 0 5k 10k 12k 15k 20k │
│ │
│ HC diverges at 12k steps; mHC trains smoothly throughout │
│ │
└─────────────────────────────────────────────────────────────┘
Complete Benchmark Results (27B Model)
Table 2 from paper - Eight diverse benchmarks:
| Benchmark | Baseline | HC | mHC | Δ (mHC vs HC) |
|---|---|---|---|---|
| BBH | 43.8 | 48.9 | 51.0 | +2.1 |
| DROP | 47.0 | 51.6 | 53.9 | +2.3 |
| GSM8K | 46.7 | 50.2 | 53.8 | +3.6 |
| HellaSwag | 72.1 | 74.3 | 75.8 | +1.5 |
| MATH | 24.3 | 27.1 | 29.4 | +2.3 |
| MMLU | 59.0 | 61.8 | 63.4 | +1.6 |
| PIQA | 78.2 | 79.5 | 80.3 | +0.8 |
| TriviaQA | 58.4 | 60.2 | 61.9 | +1.7 |
Key observations:
- mHC consistently outperforms both baseline and HC across all 8 benchmarks
- Largest improvements on reasoning tasks: BBH (+2.1%), DROP (+2.3%), GSM8K (+3.6%)
- Training loss: mHC achieves 0.021 lower loss than baseline
Ablation Study: Component Analysis
Table 1 from paper - Which component matters most?
| Configuration | BBH | GSM8K | Stable? |
|---|---|---|---|
| Baseline (no HC) | 43.8 | 46.7 | ✓ |
| H_pre + H_post only (no H_res) | 44.1 | 47.2 | ✓ |
| H_res only (no H_pre/H_post) | 49.8 | 52.1 | ✗ (unstable) |
| Full HC (unconstrained) | 48.9 | 50.2 | ✗ (diverges) |
| Full mHC (constrained H_res) | 51.0 | 53.8 | ✓ |
Critical finding: H_res is the primary driver of performance gains. When they removed H_res (keeping only H_pre and H_post), performance dropped dramatically. However, unconstrained H_res causes instability. mHC solves this by constraining H_res to be doubly stochastic.
From the paper: "The ablation studies prove that H_res is the most critical component—the highlight of the process is when features from different depths get to interact and swap information."
Scaling Behavior Across Model Sizes
Performance advantages maintained across all scales:
| Model | BBH (mHC vs Baseline) | Stable? |
|---|---|---|
| 3B | +5.8 | ✓ |
| 9B | +6.5 | ✓ |
| 27B | +7.2 | ✓ |
mHC shows consistent scaling—improvements don't attenuate at larger scales.
Gradient Norm Analysis
The paper shows gradient norm variance is significantly reduced with mHC:
| Metric | HC | mHC |
|---|---|---|
| Gradient norm variance | High (spikes) | Low (stable) |
| Max gradient magnitude | Unbounded | Bounded |
| Loss curve smoothness | Spikes at 12k | Smooth throughout |
Part VII: Mathematical Deep Dive
Why Doubly Stochastic Works
Let's prove why constraining to doubly stochastic matrices bounds signal magnitude.
Lemma 1 (Spectral Norm Bound): For any doubly stochastic matrix H:
||H||₂ ≤ 1
Proof sketch:
- Since all entries are non-negative and every row sums to 1, ||H||_∞ = 1; columns summing to 1 give ||H||₁ = 1
- By the standard bound ||H||₂ ≤ √(||H||₁ · ||H||_∞), the spectral norm is at most 1
- (The bound is tight: the all-ones vector satisfies H · 1 = 1, so ||H||₂ = 1 exactly) ∎
Corollary (Bounded Propagation): For any vector x:
||H · x||₂ ≤ ||H||₂ · ||x||₂ ≤ ||x||₂
This means mixing cannot amplify the signal—only redistribute it.
The Birkhoff-von Neumann Theorem
The Birkhoff Polytope has beautiful structure:
Theorem (Birkhoff-von Neumann): Every doubly stochastic matrix is a convex combination of permutation matrices.
H = Σᵢ λᵢ Pᵢ where:
- Pᵢ are permutation matrices
- λᵢ ≥ 0 and Σᵢ λᵢ = 1
This means mHC mixing can be interpreted as a soft permutation of streams—a weighted average of all possible reorderings.
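A tiny numerical illustration of the "soft permutation" view (the permutations and weights here are arbitrary; the theorem guarantees the reverse direction too, i.e. every doubly stochastic matrix decomposes this way):

import numpy as np

P1 = np.eye(4)                         # identity permutation
P2 = np.eye(4)[[1, 2, 3, 0]]           # cyclic shift
P3 = np.eye(4)[[3, 2, 1, 0]]           # reversal
weights = [0.5, 0.3, 0.2]              # non-negative, sum to 1

H = weights[0] * P1 + weights[1] * P2 + weights[2] * P3
print(H.sum(axis=0), H.sum(axis=1))            # all ones: H is doubly stochastic
print(np.linalg.norm(H, ord=2) <= 1 + 1e-12)   # True: soft permutations cannot amplify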
Sinkhorn-Knopp Convergence
Theorem (Sinkhorn-Knopp Convergence): For any positive matrix M, the Sinkhorn-Knopp iterations converge to a unique doubly stochastic matrix at rate O(exp(-k)) where k is the iteration count.
The convergence is exponentially fast, which is why 20 iterations suffice for practical purposes.
Part VIII: Implications and Future Directions
Expected Use in Future DeepSeek Models
The mHC paper was co-authored by DeepSeek CEO Liang Wenfeng, signaling strategic importance. Industry analysts expect:
- DeepSeek R2 (expected Q1 2026): Likely first production model with mHC
- DeepSeek V4 (expected 2026): Full integration with MLA and other innovations
- Open-source release: Following DeepSeek's pattern of open-sourcing innovations
From analysis: "mHC addresses a fundamental bottleneck. Combined with DeepSeek's other innovations (MLA, DeepSeekMoE), this could enable training of models significantly larger than current frontier."
Comparison with Other Approaches
| Approach | Mechanism | Overhead | Scalability |
|---|---|---|---|
| Standard Residual | Single stream | 0% | ✓ Limited capacity |
| Hyper-Connections | Multiple streams, unconstrained | ~5% | ✗ Unstable |
| mHC | Multiple streams, Birkhoff constraint | ~6.7% | ✓ Stable |
| Mixture of Experts | Sparse activation | Variable | ✓ Different dimension |
mHC is orthogonal to MoE—they solve different problems:
- MoE: Sparse activation for compute efficiency
- mHC: Wider residual stream for information flow
They can be combined (and likely will be in future DeepSeek models).
Open Questions
- Optimal expansion rate n? The paper uses n=4, but optimal value may depend on model size
- Interaction with other architectures? How does mHC interact with MLA, SwiGLU, etc.?
- Vision and multimodal? Does mHC benefit non-language modalities?
- Inference optimization? Can Sinkhorn be precomputed for inference?
Conclusion
mHC represents a fundamental advance in neural network architecture design. By recognizing that the residual connection bottleneck could be solved with a geometric constraint rather than architectural complexity, DeepSeek has opened a new path for scaling deep networks.
Key takeaways:
- The problem was structural: Hyper-Connections fail not due to tuning but due to unbounded eigenvalues causing signal explosion
- The solution is elegant: Project mixing matrices onto the Birkhoff Polytope using the classical Sinkhorn-Knopp algorithm (1967)
- The overhead is minimal: 6.7% training time for 4× wider residual stream and consistent benchmark improvements
- The implications are significant: Enables stable training at scales where previous approaches diverge
- A 1967 algorithm saves 2025 AI: Sometimes the best solutions come from revisiting classical mathematics
From the paper: "mHC provides tangible performance improvements and superior scalability, laying the groundwork for more capable foundation models."
Watch for mHC in DeepSeek R2 and V4—this architectural innovation may be a key enabler of the next generation of frontier models.
Related Articles
Transformer Architecture: A Complete Deep Dive
A comprehensive exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
Mixture of Experts: Scaling LLMs Beyond Dense Models
A comprehensive deep dive into Mixture of Experts (MoE) architecture—how models like Mixtral and GPT-4 achieve massive capacity without proportional compute costs. Understand routing mechanisms, expert specialization, load balancing, and why MoE represents the future of LLM scaling.
Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.
Distributed Training: How to Train 70B+ Parameter Models
A comprehensive deep dive into distributed training—how to train models that don't fit on a single GPU. Understand data parallelism, tensor parallelism, pipeline parallelism, ZeRO optimization, and the engineering behind training frontier LLMs.