
State-Space Models: Mamba, Jamba, and the Post-Transformer Era

A comprehensive guide to state-space models (SSMs) including Mamba and Jamba architectures that challenge transformer dominance with linear-time complexity, efficient long-context processing, and hybrid designs combining the best of both worlds.


The transformer architecture has dominated natural language processing since 2017, but its quadratic attention complexity creates fundamental scaling limitations. State-space models (SSMs) offer a compelling alternative with linear-time complexity and constant memory requirements during inference. This guide explores the mathematical foundations, architectural innovations, and practical implications of SSMs—from the foundational S4 model to Mamba's selective mechanisms and AI21's hybrid Jamba architecture.

## The Transformer Bottleneck

Before understanding why state-space models matter, we need to appreciate the fundamental limitation they address. Transformers compute attention over all pairs of tokens in a sequence, resulting in $O(n^2)$ time and memory complexity, where $n$ is the sequence length. This quadratic scaling creates severe practical constraints.

For a 128K token context window, the attention mechanism must compute and store approximately 16 billion attention weights per layer. Double the context to 256K tokens, and you need four times the compute and memory. This isn't just a theoretical concern—it directly limits what applications can be built. Real-time processing of long documents, efficient inference on edge devices, and cost-effective deployment at scale all suffer from this fundamental architectural choice.
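A quick back-of-the-envelope sketch makes the scaling concrete (illustrative only: real systems use multiple heads and kernels like FlashAttention that avoid materializing the full matrix, so these are upper-bound figures):

```python
def attention_matrix_cost(n_tokens: int, bytes_per_weight: int = 2):
    """Count of pairwise attention weights per layer (one head shown) and their
    size in GB if the full n×n matrix were naively materialized in fp16."""
    weights = n_tokens ** 2
    return weights, weights * bytes_per_weight / 1e9

for n in (128_000, 256_000):
    count, gb = attention_matrix_cost(n)
    print(f"{n:>7} tokens: {count / 1e9:5.1f}B weights/layer, ~{gb:,.0f} GB in fp16")

# 128_000 tokens:  16.4B weights/layer, ~33 GB in fp16
# 256_000 tokens:  65.5B weights/layer, ~131 GB in fp16  (4× the compute and memory)
```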

The attention mechanism also lacks an inherent notion of recurrence or state. Each forward pass recomputes relationships from scratch, with no persistent memory of what came before. While this enables powerful parallel training, it means transformers must explicitly attend to all relevant context at inference time, even when processing sequential data where recurrence would be more natural.

State-space models address both limitations by reformulating sequence modeling as a continuous dynamical system that can be discretized and computed efficiently. The result is linear-time complexity with respect to sequence length and a compressed hidden state that persists across time steps.

## Mathematical Foundations of State-Space Models

State-space models originate from control theory, where they describe how a system's internal state evolves over time in response to inputs. The continuous-time formulation defines a linear time-invariant (LTI) system with the following equations:

$$\frac{dh(t)}{dt} = Ah(t) + Bx(t)$$

$$y(t) = Ch(t) + Dx(t)$$

Here, $x(t)$ is the input signal, $h(t)$ is the hidden state, and $y(t)$ is the output. The matrices $A$, $B$, $C$, and $D$ are learnable parameters that define how the system evolves. Matrix $A$ (the state matrix) is particularly important—it controls how the hidden state transitions over time, determining what information is retained or forgotten.

To use this formulation with discrete sequences like text, we must discretize the continuous system. Using the zero-order hold (ZOH) discretization with step size $\Delta$, we obtain:

$$\bar{A} = \exp(\Delta A)$$

$$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$

The discretized system then operates as a recurrence:

$$h_k = \bar{A}h_{k-1} + \bar{B}x_k$$

$$y_k = Ch_k$$

This recurrence can be unrolled into a convolution, which is the key insight enabling efficient parallel training. If we expand the recurrence:

$$y_k = C\bar{A}^k\bar{B}x_0 + C\bar{A}^{k-1}\bar{B}x_1 + \dots + C\bar{B}x_k$$

This is a convolution with kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \dots)$. During training, we can compute this convolution efficiently using the FFT in $O(n \log n)$ time. During inference, we use the recurrent form, processing one token at a time in $O(1)$ time per step with $O(d)$ memory for the hidden state, where $d$ is the state dimension.
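The equivalence of the two views is easy to verify numerically. Below is a minimal NumPy sketch (single input/output channel, small state; all dimensions illustrative) that applies the ZOH discretization above, then computes the output once with the recurrence and once with the unrolled convolution kernel. A real implementation would evaluate the convolution with the FFT rather than the direct sum used here.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for the ZOH discretization

rng = np.random.default_rng(0)
N, L, dt = 8, 64, 0.1                       # state size, sequence length, step size Δ

A = -np.diag(rng.uniform(0.5, 1.5, N))      # stable (decaying) state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)                      # input sequence

# Zero-order hold discretization: A_bar = exp(ΔA), B_bar = (ΔA)^-1 (exp(ΔA) - I) ΔB
A_bar = expm(dt * A)
B_bar = np.linalg.solve(dt * A, A_bar - np.eye(N)) @ (dt * B)

# Recurrent view: O(1) per step, used at inference time
h, y_rec = np.zeros((N, 1)), np.empty(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec[k] = (C @ h).item()

# Convolutional view: precompute kernel K_i = C A_bar^i B_bar, then convolve (training)
K = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item() for i in range(L)])
y_conv = np.array([np.dot(K[:k + 1][::-1], x[:k + 1]) for k in range(L)])

assert np.allclose(y_rec, y_conv)           # both views produce identical outputs
```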

This dual view—convolutional for training, recurrent for inference—gives SSMs their unique efficiency profile. But the basic LTI formulation has a critical limitation: the matrices $A$, $B$, $C$, $D$ are fixed, meaning the system processes all inputs identically regardless of content.

## S4: Structured State Spaces

The S4 (Structured State Space) model, introduced by Gu et al. in 2022, made SSMs practical for deep learning by addressing a key challenge: how to parameterize and initialize the state matrix $A$ to enable learning long-range dependencies.

The fundamental problem is that naive random initialization of $A$ leads to either exploding or vanishing gradients over long sequences. S4 solves this by constraining $A$ to a specific structured form based on the HiPPO (High-order Polynomial Projection Operator) theory.

HiPPO provides a principled way to compress a continuous signal's history into a fixed-size state vector. The HiPPO-LegS (Legendre scaled) matrix is defined as:

$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$

This initialization ensures the state vector optimally approximates the input history using Legendre polynomials. More importantly, it provides stable gradient flow over extremely long sequences—S4 can model dependencies spanning tens of thousands of time steps.

S4 also introduces efficient computation of the SSM convolution kernel. Rather than explicitly computing $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, ...)$, which would require $O(n)$ matrix-vector products, S4 uses a diagonal plus low-rank decomposition of $A$ that enables kernel computation in $O(n \log n)$ time via FFT.

The resulting architecture stacks multiple SSM layers with nonlinearities between them, similar to a transformer but with SSM blocks replacing attention. S4 achieved breakthrough results on the Long Range Arena benchmark, which tests sequence models on tasks requiring understanding of very long contexts (up to 16K tokens).

However, S4 retains the LTI limitation—its parameters are fixed regardless of input content. This means S4 cannot perform content-based reasoning that selectively attends to relevant information based on what appears in the input.

## Mamba: Selective State Spaces

Mamba, introduced by Gu and Dao in December 2023, represents the breakthrough that made SSMs competitive with transformers on language modeling. The key innovation is making the SSM parameters input-dependent, enabling selective, content-aware processing.

### The Selectivity Mechanism

In standard SSMs, the matrices $B$, $C$, and the discretization step $\Delta$ are fixed. Mamba makes them functions of the input:

$$B_t = \text{Linear}_B(x_t)$$

$$C_t = \text{Linear}_C(x_t)$$

$$\Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))$$

This seemingly simple change has profound implications. With input-dependent parameters, the model can:

1. **Selectively remember**: By modulating $\Delta$, the model controls how much new information overwrites the existing state. A large $\Delta$ allows more input influence; a small $\Delta$ preserves the existing state.
2. **Selectively filter**: Input-dependent $B$ controls what aspects of the current input enter the state. The model learns to gate irrelevant information.
3. **Selectively output**: Input-dependent $C$ controls what aspects of the state are read out. The model learns to extract relevant information from compressed history.

This selectivity enables Mamba to perform tasks that require content-based reasoning, like selective copying (copying only certain tokens based on their content) and induction heads (recognizing and continuing patterns). These capabilities were previously thought to require attention.
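To make the selection mechanism concrete, here is a minimal sequential reference sketch in PyTorch, assuming a diagonal $A$ and simple full-rank projections for $B_t$, $C_t$, and $\Delta_t$ (the real Mamba layer uses a low-rank $\Delta$ projection and a fused GPU kernel; all names and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def selective_scan_reference(x, A, W_B, W_C, W_dt):
    """Sequential reference for a selective SSM (illustrative, not the fused kernel).

    x:    (batch, length, D)  input after the expansion projection
    A:    (D, N)              per-channel diagonal state matrix (negative entries)
    W_B:  (D, N)              makes B_t a function of x_t
    W_C:  (D, N)              makes C_t a function of x_t
    W_dt: (D, D)              makes the step size Delta_t a function of x_t
    """
    batch, length, D = x.shape
    B_t = x @ W_B                                # (batch, length, N): what enters the state
    C_t = x @ W_C                                # (batch, length, N): what is read out
    delta = F.softplus(x @ W_dt)                 # (batch, length, D): positive step sizes

    h = x.new_zeros(batch, D, A.shape[-1])       # fixed-size compressed state
    outputs = []
    for t in range(length):
        # Per-step discretization: A_bar = exp(Δ_t A); B_bar x_t ≈ Δ_t B_t x_t
        A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A)                      # (batch, D, N)
        Bx = delta[:, t].unsqueeze(-1) * B_t[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = A_bar * h + Bx                       # selective recurrence
        outputs.append((h * C_t[:, t].unsqueeze(1)).sum(-1))                  # (batch, D)
    return torch.stack(outputs, dim=1)           # (batch, length, D)

D, N = 64, 16
y = selective_scan_reference(
    torch.randn(2, 128, D),                      # batch of 2, sequence length 128
    -torch.rand(D, N),                           # negative A → decaying memory
    0.1 * torch.randn(D, N), 0.1 * torch.randn(D, N), 0.1 * torch.randn(D, D),
)
print(y.shape)                                   # torch.Size([2, 128, 64])
```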
### Hardware-Efficient Implementation

Making parameters input-dependent breaks the convolution view that enabled efficient S4 training. The system is no longer time-invariant, so we cannot precompute a global convolution kernel. Naively, this would require computing the recurrence step-by-step, losing parallelism.

Mamba solves this with a hardware-aware parallel scan algorithm. The key insight is that while we cannot use FFT-based convolution, the recurrence has associative structure that enables parallel prefix computation. The recurrence

$$h_t = \bar{A}_th_{t-1} + \bar{B}_tx_t$$

can be written as a linear operator and composed associatively:

$$\begin{pmatrix} h_t \\ 1 \end{pmatrix} = \begin{pmatrix} \bar{A}_t & \bar{B}_tx_t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} h_{t-1} \\ 1 \end{pmatrix}$$

Because matrix multiplication is associative, these 2×2 matrices can be regrouped freely, enabling parallel prefix (scan) computation. On modern GPUs, this runs in $O(\log n)$ parallel time with $O(n)$ work total.

Mamba further optimizes by fusing the discretization, recurrence, and output computation into a single CUDA kernel that minimizes memory bandwidth. The implementation avoids materializing the full $(B, L, D, N)$ tensor of SSM states, instead computing results directly in registers and SRAM.

The result is remarkable efficiency: Mamba achieves 5× higher throughput than comparably-sized transformers during training and 3× faster inference for long sequences. The memory footprint scales linearly with sequence length rather than quadratically.

### Architecture Design

A Mamba block consists of:

1. **Linear projection** expanding the input dimension
2. **1D convolution** for local context (typically kernel size 4)
3. **Selective SSM** with input-dependent parameters
4. **Gated output** combining SSM output with a parallel pathway
5. **Linear projection** back to model dimension

```
Input (B, L, D)
      │
      ├──────────────────────┐
      │                      │
      ▼                      ▼
Linear (D → E)         Linear (D → E)
      │                      │
      ▼                      │
Conv1D (k=4)                 │
      │                      │
      ▼                      │
    SiLU                     │
      │                      │
      ▼                      │
SSM (selective)              │
      │                      │
      ▼                      ▼
      └────────► × ◄────── SiLU
                 │
                 ▼
          Linear (E → D)
                 │
                 ▼
         Output (B, L, D)
```

The expansion factor $E/D$ is typically 2, similar to transformer feed-forward networks. The gated pathway provides a residual-like connection that helps gradient flow. Mamba stacks these blocks with normalization (typically RMSNorm) and residual connections, following the pre-norm transformer pattern. The resulting architecture matches transformer perplexity on language modeling while being substantially more efficient at long sequences.
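A compact PyTorch sketch of this block layout, reusing the `selective_scan_reference` from the earlier snippet; module names, the simplified $\Delta$ projection, and the initialization are assumptions of this sketch rather than the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Illustrative Mamba-style block: expand → causal conv → selective SSM → gate → project."""

    def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)            # main path + gate path
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depthwise, causal via slicing
        self.A = nn.Parameter(-torch.rand(d_inner, d_state))      # negative diagonal state matrix
        self.W_B = nn.Parameter(0.1 * torch.randn(d_inner, d_state))
        self.W_C = nn.Parameter(0.1 * torch.randn(d_inner, d_state))
        self.W_dt = nn.Parameter(0.1 * torch.randn(d_inner, d_inner))
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                         # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)                # each (batch, length, d_inner)
        u = self.conv1d(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        u = F.silu(u)
        y = selective_scan_reference(u, self.A, self.W_B, self.W_C, self.W_dt)
        return self.out_proj(y * F.silu(gate))                    # gated output, back to d_model

block = MambaBlockSketch(d_model=128)
print(block(torch.randn(2, 64, 128)).shape)                       # torch.Size([2, 64, 128])
```

In a full model, each such block would be wrapped with RMSNorm and a residual connection, as described above.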
### Benchmark Performance

On standard language modeling benchmarks, Mamba-3B matches the perplexity of transformers with twice the parameters. The efficiency gains compound at longer sequences:

| Sequence Length | Transformer Attention | Mamba SSM |
|-----------------|----------------------|-----------|
| 2K tokens | 1.0× | 1.0× |
| 8K tokens | 4.0× slower | 1.1× |
| 32K tokens | 16.0× slower | 1.2× |
| 128K tokens | 64.0× slower | 1.3× |

For inference specifically, Mamba achieves constant-time generation regardless of context length (after the initial prompt processing), while transformer autoregressive generation requires attending to all previous tokens at each step.

## Mamba-2: Simplified and Faster

Mamba-2, released in mid-2024, refines the architecture with theoretical insights that enable even more efficient computation. The key contribution is recognizing that selective SSMs can be viewed as a form of structured linear attention.

### State Space Duality

Mamba-2 introduces the State Space Duality (SSD) framework, showing that SSMs and attention are two views of the same underlying computation. The selective SSM with scalar state dimension $N=1$ is equivalent to a specific form of linear attention:

$$y = \text{SSM}(A, B, C)(x) = \text{Attention}(Q, K, V)$$

Where the attention is "linear" (no softmax) and uses a specific masking pattern determined by the cumulative product of $A$ values.

This duality has practical implications. It means we can use attention-style algorithms (like chunked computation with quadratic attention within chunks) when they're faster, while maintaining the SSM recurrence for inference. Mamba-2 implements a chunked algorithm that:

1. Divides the sequence into chunks of size 64-256
2. Within each chunk, uses a quadratic (attention-like) computation
3. Between chunks, uses the SSM recurrence to propagate state

This hybrid approach achieves 2-8× speedup over Mamba-1 on GPUs by better utilizing tensor cores, which are optimized for matrix-matrix multiplication (the within-chunk computation) rather than the scan operations in Mamba-1.

### Simplified Architecture

Mamba-2 also simplifies the block design by:

1. Removing the separate gated pathway
2. Using grouped-value heads (similar to grouped-query attention)
3. Increasing the state dimension to compensate

The original Mamba (the core "S6" layer) is exactly a selective SSM with diagonal structure. The SSD layer of Mamba-2 makes one small modification: it restricts the diagonal $A$ to a scalar-times-identity structure, meaning the diagonal elements of $A$ must all be the same value. This simplification enables the efficient chunked algorithm while maintaining expressiveness.

The resulting architecture is easier to implement and faster to execute while maintaining quality. Mamba-2 models match Transformer++ (a strong transformer baseline with RoPE, SwiGLU, and RMSNorm) on language modeling. One notable benefit is the development of a Mamba equivalent of multi-head attention (MHA), where a Mamba block can be split into multiple "Mamba heads" akin to the attention heads in transformers, enabling even more efficiency through tensor parallelism.

## Mamba-3: The Next Evolution (2025)

Mamba-3, presented at NeurIPS 2025, represents the latest evolution of the selective state-space architecture. Guided by an inference-first perspective, Mamba-3 introduces three core methodological improvements:

### Key Innovations

**More Expressive Recurrence**: Mamba-3 extends the recurrence formulation to capture richer temporal dynamics. The state update rule now incorporates:

$$h_t = f(A_t, h_{t-1}) + g(B_t, x_t, x_{t-1})$$

Where $f$ and $g$ are learnable functions that enable more complex state transitions than the linear updates in Mamba-1/2.

**Complex State Update Rule**: Mamba-3 introduces complex-valued state representations that enable richer state tracking. The complex formulation allows the model to represent oscillatory patterns and phase relationships that are difficult with purely real-valued states:

$$h_t \in \mathbb{C}^d, \quad A_t \in \mathbb{C}^{d \times d}$$

**Multi-Input Multi-Output (MIMO) Formulation**: Rather than processing each dimension independently, Mamba-3 allows cross-dimensional interactions in the state-space computation, resulting in a stronger model that better exploits hardware parallelism during decoding.

### Performance Improvements

Together with architectural refinements, Mamba-3 achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. As one analysis puts it: "Mamba-3 is the first linear sequence model that looks like a serious long-term alternative to transformers rather than a clever curiosity."

Mamba-3 particularly excels at tasks that stressed earlier Mamba versions, including in-context learning and complex reasoning, while maintaining the linear-time inference that makes SSMs attractive.
### Known Limitations

Research has also identified fundamental limitations of the Mamba architecture that persist even in Mamba-3:

**Asymmetry Bias**: A NeurIPS 2025 Spotlight paper reveals that Mamba's nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns. Tasks requiring symmetric pattern matching remain challenging.

**Long-Context Degradation**: While Mamba handles long sequences efficiently, research indicates it generally underperforms compared to transformers in tasks involving long-context *understanding* (as opposed to just processing). This has prompted development of alternative models like DeciMamba, LongMamba, and ReMamba that specifically address this limitation.

**Copy and Retrieval Tasks**: Pure SSMs still struggle with tasks requiring exact copying or retrieval from specific positions, where attention's ability to directly reference past tokens provides an advantage.

## Jamba: Hybrid Transformer-Mamba Architecture

While Mamba demonstrates that pure SSMs can match transformers, practical deployment often benefits from combining both architectures. AI21 Labs' Jamba, released in March 2024 and substantially updated through 2025, represents the most successful hybrid approach.

### Design Philosophy

Jamba's core insight is that attention and SSMs have complementary strengths:

**Attention excels at:**
- In-context learning (copying and using information from the prompt)
- Precise positional reasoning
- Tasks requiring comparison across distant positions
- Complex reasoning over structured inputs

**SSMs excel at:**
- Efficient long-context processing
- Linear-time inference
- Low memory footprint
- Smooth degradation with sequence length

By combining both mechanisms, Jamba aims to capture attention's expressiveness while maintaining SSM efficiency.

### Architecture

Jamba uses a block structure that interleaves attention and Mamba layers:

```
Jamba Architecture (52B parameters, 12B active)
├── Block 1
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   └── Attention Layer (with MoE)
├── Block 2
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   └── Attention Layer (with MoE)
├── Block 3
│   └── ... (same pattern)
└── Block 4
    └── ... (same pattern)
```

The ratio is 7:1 Mamba to attention layers, meaning 87.5% of layers are SSM-based. Attention layers use grouped-query attention (GQA) for efficiency.
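As a toy illustration of this interleaving schedule (purely a sketch following the diagram above, not AI21's code; the layer count and naming are assumptions):

```python
def jamba_like_schedule(n_layers: int = 32, attn_period: int = 8) -> list[str]:
    """Mixer type per layer: Mamba everywhere, attention (with MoE) every attn_period-th layer."""
    return ["attention+MoE" if (i + 1) % attn_period == 0 else "mamba" for i in range(n_layers)]

schedule = jamba_like_schedule()
print(schedule[:8])       # 7 Mamba layers followed by one attention layer
print(f"attention share: {schedule.count('attention+MoE') / len(schedule):.1%}")   # 12.5%
```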
### Mixture of Experts Integration

Jamba incorporates Mixture of Experts (MoE) in alternating layers, applying MoE to the MLP component. The configuration uses:

- 16 experts per MoE layer
- Top-2 expert selection per token
- MoE applied every other layer

With 52B total parameters but only 12B active per forward pass, Jamba achieves the capacity of a much larger model while maintaining efficient inference. The MoE layers are placed strategically at attention layers, where the additional capacity has the most impact.

### Memory and Throughput

The hybrid design dramatically reduces memory requirements for long contexts. For a 256K token context:

| Model | KV Cache Size | Peak Memory |
|-------|---------------|-------------|
| Llama-2 70B | 256 GB | 320 GB |
| Mixtral 8x7B | 128 GB | 180 GB |
| Jamba 52B | 32 GB | 80 GB |

Jamba's KV cache is 8× smaller than Llama-2 because only 12.5% of layers require caching attention states. The Mamba layers maintain a fixed-size state regardless of sequence length.

Throughput benefits are equally dramatic. On a single A100 80GB GPU, Jamba processes 3× more tokens per second than Mixtral at 128K context length, despite having more parameters.

### Jamba 1.5 and Beyond

AI21 has continued scaling the Jamba architecture through 2025:

**Jamba 1.5 Large** (September 2024):
- 398B total parameters, 94B active
- 72 layers with the same 7:1 Mamba-to-attention ratio
- 256K context window
- State-of-the-art on NVIDIA's RULER long-context benchmark
- Improved instruction following and reasoning

**Jamba Mini 1.7** (August 2025):
- Compact version optimized for deployment
- 256K context—one of the largest among open-source models
- Efficient enough for single-GPU inference
- Strong performance on long-document tasks

**Jamba Reasoning 3B** (October 2025):
- Designed for edge deployment on consumer hardware
- 2-4× faster inference than comparable models
- ~10× memory reduction vs traditional transformers
- Runs on laptops and smartphones
- Optimized for function calling and multi-document QA

### When Hybrid Beats Pure

Empirical results show hybrid models outperform both pure transformers and pure SSMs on many tasks:

1. **Long-context QA**: Jamba maintains performance at 256K while transformer performance degrades beyond 64K
2. **Needle-in-haystack**: Jamba achieves near-perfect recall across the full context, while transformers struggle in the middle
3. **Multi-document reasoning**: The combination of attention (for precise reference) and SSM (for efficient context) excels
4. **Inference cost**: At 128K+ contexts, Jamba's cost per token is 3-5× lower than equivalent transformers

The optimal ratio of attention to SSM layers appears task-dependent. More attention helps reasoning-heavy tasks; more SSM helps throughput-sensitive deployments.

## Beyond Jamba: The Evolving Hybrid Landscape

The success of Jamba has sparked broader exploration of hybrid architectures:

### Bamba

IBM's Bamba combines Mamba-2 with attention using a different ratio and training recipe. Bamba 9B achieves competitive results with training-efficient mixing of the two mechanisms.

### Zamba

Zyphra's Zamba uses a more aggressive SSM-heavy design with attention only in early layers, hypothesizing that attention is most valuable for building initial representations.

### Griffin and RecurrentGemma

Google's approach (used in RecurrentGemma) interleaves local attention windows with recurrent layers, achieving linear complexity with attention-like expressiveness within windows.
### RWKV-7 "Goose" and RWKV-8 "Heron"

RWKV represents another major branch of sub-quadratic sequence modeling, achieving constant memory and constant inference time per token:

**RWKV-7 "Goose" (March 2025):**
- Introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates
- Features a relaxed value replacement rule enabling better state tracking
- Can recognize all regular languages (exceeds transformer capabilities under standard complexity conjectures)
- The 2.9B parameter model achieves new 3B state-of-the-art on multilingual tasks
- Trained on 3.1 trillion tokens across multiple languages
- Described as "the strongest linear-time & constant-space (no kv-cache) & attention-free & 100% RNN architecture"

**RWKV-8 "Heron" Preview (May 2025):**
- First feature: DeepEmbed—achieves MoE-like reasoning performance without consuming VRAM
- Enables truly sparse large models on edge devices
- Represents next evolution of efficient sequence modeling

### Industry Implications

NVIDIA is reportedly in advanced talks to acquire AI21 Labs for $2-3 billion, signaling major industry interest in SSM and hybrid architectures. The acquisition would give NVIDIA direct ownership of Jamba's technology and team.

This interest reflects a broader recognition that the future of efficient LLMs likely involves moving beyond pure transformers. The specific architecture—whether SSM, hybrid, or something new—remains an open research question, but the direction is clear.

## Detailed Benchmark Comparisons

Understanding exactly when and why SSMs outperform transformers requires examining benchmark results across different dimensions.

### Language Modeling Perplexity

On standard language modeling benchmarks (The Pile, WikiText-103), Mamba models match transformer perplexity at equivalent compute:

| Model | Parameters | Pile PPL | WikiText-103 PPL | Training FLOPs |
|-------|------------|----------|------------------|----------------|
| GPT-3 (curie) | 6.7B | 5.8 | 15.2 | 1.3e22 |
| Mamba | 2.8B | 5.8 | 14.9 | 5.2e21 |
| Mamba-2 | 2.7B | 5.6 | 14.5 | 4.8e21 |
| Jamba | 12B active | 5.2 | 13.8 | 8.1e21 |

Mamba achieves similar perplexity with roughly 2.5× fewer parameters and 2.5× less training compute. This efficiency comes from the linear complexity enabling longer effective context during training.

### Long-Context Performance

The advantage of SSMs becomes dramatic at longer contexts. On the RULER benchmark (testing long-context understanding):

| Context Length | Llama-2-70B | Mixtral-8x7B | Jamba 1.5 Large |
|----------------|-------------|--------------|-----------------|
| 4K tokens | 92.3% | 91.8% | 93.1% |
| 16K tokens | 88.1% | 87.5% | 92.8% |
| 64K tokens | 71.2% | 73.8% | 91.5% |
| 128K tokens | 52.4% | 58.9% | 90.2% |
| 256K tokens | — | — | 88.7% |

Transformers degrade significantly beyond 64K tokens due to the "lost in the middle" phenomenon where attention becomes diluted. Jamba's hybrid design maintains strong performance across the full context.
### Needle-in-Haystack Results

The needle-in-haystack test places a critical piece of information at various positions in a long context and tests retrieval:

```
Jamba 1.5 at 256K context:
Position    Accuracy
──────────────────────
Start       99.8%
25%         99.5%
50%         99.1%
75%         99.3%
End         99.7%

Llama-2-70B at 64K context:
Position    Accuracy
──────────────────────
Start       98.2%
25%         91.3%
50%         78.4%  ← "Lost in middle"
75%         89.7%
End         97.1%
```

The attention mechanism in pure transformers struggles with information in the middle of long contexts. Jamba's SSM layers maintain uniform access across all positions.

### Inference Throughput

Real-world inference measurements on A100 80GB:

| Model | Prompt (tok/s) | Generation (tok/s) | Memory @ 128K |
|-------|----------------|--------------------|---------------|
| Llama-2-70B | 2,100 | 18 | 160GB+ (OOM) |
| Mixtral-8x7B | 3,800 | 42 | 95GB |
| Jamba 52B | 8,200 | 85 | 45GB |
| Mamba-2.8B | 15,000 | 320 | 8GB |

Mamba's constant-time generation (after prompt processing) provides consistent throughput regardless of context length, while transformers slow down linearly as context grows.

### Task-Specific Performance

Different architectures excel at different task types:

**Tasks where transformers excel:**
- Complex multi-hop reasoning (HotpotQA): Transformer +5-8%
- Precise positional tasks (sorting, reversal): Transformer +10-15%
- In-context learning with many examples: Transformer +3-5%

**Tasks where SSMs excel:**
- Long document summarization: SSM +8-12%
- Streaming data processing: SSM +20-30%
- Memory-constrained deployment: SSM fits where transformer can't

**Tasks where hybrid excels:**
- Long-context QA requiring both recall and reasoning: Jamba best overall
- Multi-document tasks: Hybrid +5-10% over both pure architectures

## Implementation Deep Dive

### Memory Layout and Data Flow

Understanding the memory layout helps optimize SSM deployments:

```
Mamba Block Memory Flow:
────────────────────────
Input tensor: [batch, seq_len, d_model]
        │
        ▼
Linear projection: [batch, seq_len, d_inner]
(d_inner = 2 * d_model typically)
        │
        ┌───────────────┴───────────────┐
        │                               │
        ▼                               ▼
  Conv1D branch                    Gate branch
  [batch, seq_len, d_inner]        [batch, seq_len, d_inner]
        │                               │
        ▼                               │
  SSM params (B, C, Δ)                  │
  generated per-position                │
        │                               │
        ▼                               │
  Selective scan                        │
  [batch, seq_len, d_inner]             │
        │                               │
        └───────────► × ◄───────────────┘
                      │
                      ▼
             Linear projection
             [batch, seq_len, d_model]
```

### State Caching for Inference

During autoregressive generation, SSMs maintain a compressed state rather than a full KV cache:

**Transformer KV cache:**
- Size: $O(n \times d \times L)$ where $n$ = context length, $d$ = dimension, $L$ = layers
- At 128K context with 70B model: ~80GB KV cache alone

**SSM state:**
- Size: $O(d \times N \times L)$ where $N$ = state dimension (typically 16-64)
- Constant regardless of context length
- At 128K context with 2.8B Mamba: ~50MB state

This dramatic difference enables SSMs to handle much longer contexts on the same hardware.
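The asymptotic difference is easy to see with a small calculation (a sketch with illustrative dimensions only; actual figures depend on precision, grouped-query layout, and the exact model configuration):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, d_kv: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors cached for every token: grows linearly with context length."""
    return 2 * n_ctx * n_layers * d_kv * bytes_per_elem

def ssm_state_bytes(n_layers: int, d_inner: int, d_state: int, bytes_per_elem: int = 2) -> int:
    """Recurrent SSM state: one fixed-size tensor per layer, independent of context length."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Illustrative dimensions only (not any specific released checkpoint):
print(f"KV cache @ 128K ctx: {kv_cache_bytes(131_072, 80, 1024) / 1e9:.0f} GB")
print(f"SSM state (any ctx): {ssm_state_bytes(64, 5_120, 16) / 1e6:.0f} MB")
```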
### Selective Scan Implementation Details

The selective scan is the core computational primitive. Here's how it achieves efficiency:

**Naive implementation** (quadratic): For each output position $k$, compute the full recurrence from $h_0$ to $h_k$. This requires $O(n^2)$ operations.

**Parallel scan implementation** (log-linear): The key insight is that the recurrence can be expressed as a matrix multiplication:

$$\begin{pmatrix} h_k \\ 1 \end{pmatrix} = M_k M_{k-1} \cdots M_1 \begin{pmatrix} h_0 \\ 1 \end{pmatrix}$$

Where $M_i = \begin{pmatrix} \bar{A}_i & \bar{B}_i x_i \\ 0 & 1 \end{pmatrix}$

Matrix multiplication is associative, so we can compute prefix products in parallel:

```
Step 1: Compute M_1, M_2, M_3, M_4, ...            (all parallel)
Step 2: Compute M_1·M_2, M_3·M_4, ...              (n/2 parallel multiplications)
Step 3: Compute (M_1·M_2)·(M_3·M_4), ...           (n/4 parallel multiplications)
...
Total: O(log n) sequential steps with O(n) total work
```

**Memory optimization:** Rather than materializing intermediate results, the implementation fuses the scan with input/output operations:

1. Load input chunk to shared memory
2. Compute SSM parameters (B, C, Δ) in registers
3. Perform scan within shared memory
4. Write output directly to global memory

This achieves near-optimal memory bandwidth utilization.

### Numerical Stability Considerations

The discretization step requires computing $\exp(\Delta A)$, which can be numerically unstable for large $\Delta$ or $A$ values. Mamba uses several techniques:

**Softplus for Δ**: The discretization step size is computed as $\Delta = \text{softplus}(\text{Linear}(x))$, ensuring $\Delta > 0$ and providing smooth gradients.

**Diagonal A constraint**: By restricting $A$ to be diagonal, the matrix exponential becomes elementwise: $\exp(\Delta A) = \text{diag}(\exp(\Delta a_1), ..., \exp(\Delta a_d))$

**Log-space computation**: For very long sequences, intermediate values are computed in log-space to prevent overflow.

## Practical Considerations for Adoption

### When to Consider SSMs/Hybrids

**Strong fit:**
- Long-context applications (100K+ tokens)
- Real-time inference requirements
- Edge deployment with memory constraints
- High-throughput batch processing
- Streaming/continuous input processing
- Applications where inference cost dominates (high-volume serving)

**Less clear fit:**
- Tasks requiring precise positional reasoning
- Complex in-context learning with many examples
- Existing transformer-optimized infrastructure investment
- Tasks where context length is always short (<4K tokens)
- Applications requiring extensive ecosystem tooling

### Decision Framework

Use this framework to decide between architectures:

```
                            Context Length
                       Short (<8K)   Long (>32K)
                     ┌─────────────┬─────────────┐
Reasoning       High │ Transformer │   Hybrid    │
Complexity           │ (GPT-4,     │   (Jamba)   │
                     │  Claude)    │             │
                     ├─────────────┼─────────────┤
                Low  │   Either    │     SSM     │
                     │  (depends   │   (Mamba)   │
                     │   on cost)  │             │
                     └─────────────┴─────────────┘
```

### Framework Support

As of early 2025, SSM support is maturing across the ecosystem:

**Training frameworks:**
- **Mamba**: Official PyTorch implementation with custom CUDA kernels
- **HuggingFace Transformers**: Native Mamba and Jamba support
- **JAX/Flax**: Community implementations available

**Inference frameworks:**
- **vLLM**: PagedAttention extended for hybrid models, Jamba support
- **TensorRT-LLM**: Mamba and Jamba optimization paths
- **llama.cpp**: Experimental Mamba support
- **ONNX**: Export support for deployment

**Cloud platforms:**
- **AI21 Studio**: Native Jamba API access
- **HuggingFace Inference Endpoints**: Jamba and Mamba deployment
- **AWS Bedrock**: Jamba models available
- **Google Vertex AI**: Jamba 1.5 family

### Training Considerations

Training SSMs requires different considerations than transformers:

**Initialization:**
- HiPPO-based initialization is crucial for long-range learning
- Random initialization leads to vanishing/exploding gradients over long sequences
- Use the official initialization schemes from the Mamba codebase

**Learning rates:**
- SSM parameters (A, B, C, D) often benefit from lower learning rates (0.1-0.5× the base LR)
- The discretization step Δ is particularly sensitive
- Consider using parameter groups with different learning rates

**Sequence length curriculum:**
- Starting with shorter sequences and gradually increasing helps stability
- Typical curriculum: 2K → 8K → 32K → 128K tokens
- Each stage should have sufficient training steps for convergence

**Hardware requirements:**
- Mamba's custom CUDA kernels require compute capability 8.0+ (Ampere or newer)
- Training Mamba-2.8B requires ~40GB GPU memory
- Training Jamba requires multi-GPU setup (8× A100 minimum for full model)

**Batch size considerations:**
- SSMs benefit from larger batch sizes due to better parallelization of the scan
- Effective batch size of 1M+ tokens recommended for stable training

### Fine-tuning Existing Models

Both Mamba and Jamba support standard fine-tuning approaches:

**Full fine-tuning:**
- Works but expensive (same compute as transformer fine-tuning)
- All SSM parameters are trainable
- Risk of catastrophic forgetting similar to transformers

**LoRA (Low-Rank Adaptation):**
- Adapts efficiently, typically on linear projections
- Apply LoRA to the input/output projections and SSM parameter projections
- Rank 8-64 typically sufficient
- 10-100× more parameter efficient than full fine-tuning

**QLoRA:**
- Enables fine-tuning on consumer GPUs
- Quantize base model to 4-bit, train LoRA adapters in FP16
- Mamba-2.8B can be fine-tuned on 16GB GPU with QLoRA

**Hybrid-specific considerations:**
- For Jamba, can apply LoRA selectively to attention or SSM components
- Attention-only LoRA useful for tasks requiring improved reasoning
- SSM-only LoRA useful for tasks requiring improved long-context handling
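For example, a hedged sketch of LoRA fine-tuning a Mamba checkpoint with HuggingFace PEFT; the checkpoint id and target module names are assumptions for illustration and should be checked against the model card and the model's printed module tree:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "state-spaces/mamba-2.8b-hf"          # assumed HF-format Mamba checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                        # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed names of the SSM input/output and parameter projections in this checkpoint
    target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # typically well under 1% of all parameters
```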
### Deployment Patterns

**Single-GPU deployment:**
- Mamba models up to 7B fit on consumer GPUs (RTX 3090/4090)
- Jamba Mini fits on single A100 80GB
- Use quantization (INT8/INT4) for larger models

**Multi-GPU deployment:**
- Tensor parallelism works differently for SSMs (split state dimension)
- Pipeline parallelism straightforward (split layers)
- Jamba 52B typically deployed on 4× A100 with tensor parallelism

**Edge deployment:**
- Jamba Reasoning 3B designed for laptops/smartphones
- ONNX export enables deployment on diverse hardware
- Consider quantization to INT4 for memory-constrained devices

**Streaming applications:**
- SSMs naturally support streaming (process tokens as they arrive)
- No need to re-process full context like transformers with KV cache
- State can be checkpointed and resumed

## The Future of Sequence Modeling

State-space models represent a fundamental rethinking of sequence modeling, not just an incremental improvement. The key insights—continuous dynamics, selective state evolution, and efficient parallel algorithms—open new possibilities beyond current architectures.

### Research Directions

**Scaling laws**: How do SSMs scale compared to transformers? Early evidence suggests favorable scaling, but systematic study at frontier scale is ongoing.

**Theoretical understanding**: Why do selective SSMs work? The connection to linear attention provides some insight, but a complete theory of what computations SSMs can and cannot perform remains elusive.

**Architectural search**: The optimal combination of attention, SSM, MoE, and other mechanisms is likely task-dependent. Automated architecture search may discover better configurations than human intuition.

**Hardware co-design**: SSMs have different computational patterns than transformers. Custom hardware optimized for scan operations could provide further speedups.

### Practical Outlook

For practitioners, the message is clear: state-space models and hybrid architectures are production-ready for appropriate use cases. The 2-3× inference speedup and dramatically reduced memory for long contexts translate directly to cost savings and new capabilities.

The transformer isn't going away—attention remains the most expressive sequence modeling mechanism we have. But the pure transformer's dominance is ending. The future is hybrid, combining mechanisms based on their complementary strengths rather than dogmatic commitment to any single architecture.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
