
State-Space Models: Mamba, Jamba, and the Post-Transformer Era

A comprehensive guide to state-space models (SSMs) including Mamba and Jamba architectures that challenge transformer dominance with linear-time complexity, efficient long-context processing, and hybrid designs combining the best of both worlds.


The transformer architecture has dominated natural language processing since 2017, but its quadratic attention complexity creates fundamental scaling limitations. State-space models (SSMs) offer a compelling alternative with linear-time complexity and constant memory requirements during inference. This guide explores the mathematical foundations, architectural innovations, and practical implications of SSMs—from the foundational S4 model to Mamba's selective mechanisms and AI21's hybrid Jamba architecture.

## The Transformer Bottleneck

Before understanding why state-space models matter, we need to appreciate the fundamental limitation they address. Transformers compute attention over all pairs of tokens in a sequence, resulting in $O(n^2)$ time and memory complexity, where $n$ is the sequence length. This quadratic scaling creates severe practical constraints.

For a 128K token context window, the attention mechanism must compute and store approximately 16 billion attention weights per layer. Double the context to 256K tokens, and you need four times the compute and memory. This isn't just a theoretical concern—it directly limits what applications can be built. Real-time processing of long documents, efficient inference on edge devices, and cost-effective deployment at scale all suffer from this fundamental architectural choice.
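A quick back-of-the-envelope sketch makes the scaling concrete (illustrative only: real systems use multiple heads and kernels like FlashAttention that avoid materializing the full matrix, so these are upper-bound figures):

```python
def attention_matrix_cost(n_tokens: int, bytes_per_weight: int = 2):
    """Count of pairwise attention weights per layer (one head shown) and their
    size in GB if the full n×n matrix were naively materialized in fp16."""
    weights = n_tokens ** 2
    return weights, weights * bytes_per_weight / 1e9

for n in (128_000, 256_000):
    count, gb = attention_matrix_cost(n)
    print(f"{n:>7} tokens: {count / 1e9:5.1f}B weights/layer, ~{gb:,.0f} GB in fp16")

# 128_000 tokens:  16.4B weights/layer, ~33 GB in fp16
# 256_000 tokens:  65.5B weights/layer, ~131 GB in fp16  (4× the compute and memory)
```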

The attention mechanism also lacks an inherent notion of recurrence or state. Each forward pass recomputes relationships from scratch, with no persistent memory of what came before. While this enables powerful parallel training, it means transformers must explicitly attend to all relevant context at inference time, even when processing sequential data where recurrence would be more natural.

State-space models address both limitations by reformulating sequence modeling as a continuous dynamical system that can be discretized and computed efficiently. The result is linear-time complexity with respect to sequence length and a compressed hidden state that persists across time steps.

## Mathematical Foundations of State-Space Models

State-space models originate from control theory, where they describe how a system's internal state evolves over time in response to inputs. The continuous-time formulation defines a linear time-invariant (LTI) system with the following equations:

$$\frac{dh(t)}{dt} = Ah(t) + Bx(t)$$

$$y(t) = Ch(t) + Dx(t)$$

Here, $x(t)$ is the input signal, $h(t)$ is the hidden state, and $y(t)$ is the output. The matrices $A$, $B$, $C$, and $D$ are learnable parameters that define how the system evolves. Matrix $A$ (the state matrix) is particularly important—it controls how the hidden state transitions over time, determining what information is retained or forgotten.

To use this formulation with discrete sequences like text, we must discretize the continuous system. Using the zero-order hold (ZOH) discretization with step size $\Delta$, we obtain:

$$\bar{A} = \exp(\Delta A)$$

$$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$

The discretized system then operates as a recurrence:

$$h_k = \bar{A}h_{k-1} + \bar{B}x_k$$

$$y_k = Ch_k$$

This recurrence can be unrolled into a convolution, which is the key insight enabling efficient parallel training. If we expand the recurrence:

$$y_k = C\bar{A}^k\bar{B}x_0 + C\bar{A}^{k-1}\bar{B}x_1 + \dots + C\bar{B}x_k$$

This is a convolution with kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \dots)$. During training, we can compute this convolution efficiently using the FFT in $O(n \log n)$ time. During inference, we use the recurrent form, processing one token at a time in $O(1)$ time per step with $O(d)$ memory for the hidden state, where $d$ is the state dimension.
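The equivalence of the two views is easy to verify numerically. Below is a minimal NumPy sketch (single input/output channel, small state; all dimensions illustrative) that applies the ZOH discretization above, then computes the output once with the recurrence and once with the unrolled convolution kernel. A real implementation would evaluate the convolution with the FFT rather than the direct sum used here.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for the ZOH discretization

rng = np.random.default_rng(0)
N, L, dt = 8, 64, 0.1                       # state size, sequence length, step size Δ

A = -np.diag(rng.uniform(0.5, 1.5, N))      # stable (decaying) state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)                      # input sequence

# Zero-order hold discretization: A_bar = exp(ΔA), B_bar = (ΔA)^-1 (exp(ΔA) - I) ΔB
A_bar = expm(dt * A)
B_bar = np.linalg.solve(dt * A, A_bar - np.eye(N)) @ (dt * B)

# Recurrent view: O(1) per step, used at inference time
h, y_rec = np.zeros((N, 1)), np.empty(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec[k] = (C @ h).item()

# Convolutional view: precompute kernel K_i = C A_bar^i B_bar, then convolve (training)
K = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item() for i in range(L)])
y_conv = np.array([np.dot(K[:k + 1][::-1], x[:k + 1]) for k in range(L)])

assert np.allclose(y_rec, y_conv)           # both views produce identical outputs
```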

This dual view—convolutional for training, recurrent for inference—gives SSMs their unique efficiency profile. But the basic LTI formulation has a critical limitation: the matrices $A$, $B$, $C$, $D$ are fixed, meaning the system processes all inputs identically regardless of content.

## S4: Structured State Spaces

The S4 (Structured State Space) model, introduced by Gu et al. in 2022, made SSMs practical for deep learning by addressing a key challenge: how to parameterize and initialize the state matrix $A$ to enable learning long-range dependencies.

The fundamental problem is that naive random initialization of $A$ leads to either exploding or vanishing gradients over long sequences. S4 solves this by constraining $A$ to a specific structured form based on the HiPPO (High-order Polynomial Projection Operator) theory.

HiPPO provides a principled way to compress a continuous signal's history into a fixed-size state vector. The HiPPO-LegS (Legendre scaled) matrix is defined as:

$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$

This initialization ensures the state vector optimally approximates the input history using Legendre polynomials. More importantly, it provides stable gradient flow over extremely long sequences—S4 can model dependencies spanning tens of thousands of time steps.

S4 also introduces efficient computation of the SSM convolution kernel. Rather than explicitly computing $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, ...)$, which would require $O(n)$ matrix-vector products, S4 uses a diagonal plus low-rank decomposition of $A$ that enables kernel computation in $O(n \log n)$ time via FFT.

The resulting architecture stacks multiple SSM layers with nonlinearities between them, similar to a transformer but with SSM blocks replacing attention. S4 achieved breakthrough results on the Long Range Arena benchmark, which tests sequence models on tasks requiring understanding of very long contexts (up to 16K tokens).

However, S4 retains the LTI limitation—its parameters are fixed regardless of input content. This means S4 cannot perform content-based reasoning that selectively attends to relevant information based on what appears in the input.

## Mamba: Selective State Spaces

Mamba, introduced by Gu and Dao in December 2023, represents the breakthrough that made SSMs competitive with transformers on language modeling. The key innovation is making the SSM parameters input-dependent, enabling selective, content-aware processing.

### The Selectivity Mechanism

In standard SSMs, the matrices $B$, $C$, and the discretization step $\Delta$ are fixed. Mamba makes them functions of the input:

$$B_t = \text{Linear}_B(x_t)$$

$$C_t = \text{Linear}_C(x_t)$$

$$\Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))$$

This seemingly simple change has profound implications. With input-dependent parameters, the model can:

1. **Selectively remember**: By modulating $\Delta$, the model controls how much new information overwrites the existing state. A large $\Delta$ allows more input influence; a small $\Delta$ preserves the existing state.
2. **Selectively filter**: Input-dependent $B$ controls what aspects of the current input enter the state. The model learns to gate irrelevant information.
3. **Selectively output**: Input-dependent $C$ controls what aspects of the state are read out. The model learns to extract relevant information from compressed history.

This selectivity enables Mamba to perform tasks that require content-based reasoning, like selective copying (copying only certain tokens based on their content) and induction heads (recognizing and continuing patterns). These capabilities were previously thought to require attention.
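To make the selection mechanism concrete, here is a minimal sequential reference sketch in PyTorch, assuming a diagonal $A$ and simple full-rank projections for $B_t$, $C_t$, and $\Delta_t$ (the real Mamba layer uses a low-rank $\Delta$ projection and a fused GPU kernel; all names and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def selective_scan_reference(x, A, W_B, W_C, W_dt):
    """Sequential reference for a selective SSM (illustrative, not the fused kernel).

    x:    (batch, length, D)  input after the expansion projection
    A:    (D, N)              per-channel diagonal state matrix (negative entries)
    W_B:  (D, N)              makes B_t a function of x_t
    W_C:  (D, N)              makes C_t a function of x_t
    W_dt: (D, D)              makes the step size Delta_t a function of x_t
    """
    batch, length, D = x.shape
    B_t = x @ W_B                                # (batch, length, N): what enters the state
    C_t = x @ W_C                                # (batch, length, N): what is read out
    delta = F.softplus(x @ W_dt)                 # (batch, length, D): positive step sizes

    h = x.new_zeros(batch, D, A.shape[-1])       # fixed-size compressed state
    outputs = []
    for t in range(length):
        # Per-step discretization: A_bar = exp(Δ_t A); B_bar x_t ≈ Δ_t B_t x_t
        A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A)                      # (batch, D, N)
        Bx = delta[:, t].unsqueeze(-1) * B_t[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = A_bar * h + Bx                       # selective recurrence
        outputs.append((h * C_t[:, t].unsqueeze(1)).sum(-1))                  # (batch, D)
    return torch.stack(outputs, dim=1)           # (batch, length, D)

D, N = 64, 16
y = selective_scan_reference(
    torch.randn(2, 128, D),                      # batch of 2, sequence length 128
    -torch.rand(D, N),                           # negative A → decaying memory
    0.1 * torch.randn(D, N), 0.1 * torch.randn(D, N), 0.1 * torch.randn(D, D),
)
print(y.shape)                                   # torch.Size([2, 128, 64])
```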
### Hardware-Efficient Implementation

Making parameters input-dependent breaks the convolution view that enabled efficient S4 training. The system is no longer time-invariant, so we cannot precompute a global convolution kernel. Naively, this would require computing the recurrence step-by-step, losing parallelism.

Mamba solves this with a hardware-aware parallel scan algorithm. The key insight is that while we cannot use FFT-based convolution, the recurrence has associative structure that enables parallel prefix computation. The recurrence

$$h_t = \bar{A}_th_{t-1} + \bar{B}_tx_t$$

can be written as a linear operator and composed associatively:

$$\begin{pmatrix} h_t \\ 1 \end{pmatrix} = \begin{pmatrix} \bar{A}_t & \bar{B}_tx_t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} h_{t-1} \\ 1 \end{pmatrix}$$

Because matrix multiplication is associative, these 2×2 matrices can be regrouped freely, enabling parallel prefix (scan) computation. On modern GPUs, this runs in $O(\log n)$ parallel time with $O(n)$ work total.

Mamba further optimizes by fusing the discretization, recurrence, and output computation into a single CUDA kernel that minimizes memory bandwidth. The implementation avoids materializing the full $(B, L, D, N)$ tensor of SSM states, instead computing results directly in registers and SRAM.

The result is remarkable efficiency: Mamba achieves 5× higher throughput than comparably-sized transformers during training and 3× faster inference for long sequences. The memory footprint scales linearly with sequence length rather than quadratically.

### Architecture Design

A Mamba block consists of:

1. **Linear projection** expanding the input dimension
2. **1D convolution** for local context (typically kernel size 4)
3. **Selective SSM** with input-dependent parameters
4. **Gated output** combining SSM output with a parallel pathway
5. **Linear projection** back to model dimension

```
Input (B, L, D)
      │
      ├──────────────────────┐
      │                      │
      ▼                      ▼
Linear (D → E)         Linear (D → E)
      │                      │
      ▼                      │
Conv1D (k=4)                 │
      │                      │
      ▼                      │
    SiLU                     │
      │                      │
      ▼                      │
SSM (selective)              │
      │                      │
      ▼                      ▼
      └────────► × ◄────── SiLU
                 │
                 ▼
          Linear (E → D)
                 │
                 ▼
         Output (B, L, D)
```

The expansion factor $E/D$ is typically 2, similar to transformer feed-forward networks. The gated pathway provides a residual-like connection that helps gradient flow. Mamba stacks these blocks with normalization (typically RMSNorm) and residual connections, following the pre-norm transformer pattern. The resulting architecture matches transformer perplexity on language modeling while being substantially more efficient at long sequences.
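A compact PyTorch sketch of this block layout, reusing the `selective_scan_reference` from the earlier snippet; module names, the simplified $\Delta$ projection, and the initialization are assumptions of this sketch rather than the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Illustrative Mamba-style block: expand → causal conv → selective SSM → gate → project."""

    def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)            # main path + gate path
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depthwise, causal via slicing
        self.A = nn.Parameter(-torch.rand(d_inner, d_state))      # negative diagonal state matrix
        self.W_B = nn.Parameter(0.1 * torch.randn(d_inner, d_state))
        self.W_C = nn.Parameter(0.1 * torch.randn(d_inner, d_state))
        self.W_dt = nn.Parameter(0.1 * torch.randn(d_inner, d_inner))
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                         # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)                # each (batch, length, d_inner)
        u = self.conv1d(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        u = F.silu(u)
        y = selective_scan_reference(u, self.A, self.W_B, self.W_C, self.W_dt)
        return self.out_proj(y * F.silu(gate))                    # gated output, back to d_model

block = MambaBlockSketch(d_model=128)
print(block(torch.randn(2, 64, 128)).shape)                       # torch.Size([2, 64, 128])
```

In a full model, each such block would be wrapped with RMSNorm and a residual connection, as described above.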
### Benchmark Performance

On standard language modeling benchmarks, Mamba-3B matches the perplexity of transformers with twice the parameters. The efficiency gains compound at longer sequences:

| Sequence Length | Transformer Attention | Mamba SSM |
|-----------------|----------------------|-----------|
| 2K tokens | 1.0× | 1.0× |
| 8K tokens | 4.0× slower | 1.1× |
| 32K tokens | 16.0× slower | 1.2× |
| 128K tokens | 64.0× slower | 1.3× |

For inference specifically, Mamba achieves constant-time generation regardless of context length (after the initial prompt processing), while transformer autoregressive generation requires attending to all previous tokens at each step.

## Mamba-2: Simplified and Faster

Mamba-2, released in mid-2024, refines the architecture with theoretical insights that enable even more efficient computation. The key contribution is recognizing that selective SSMs can be viewed as a form of structured linear attention.

### State Space Duality

Mamba-2 introduces the State Space Duality (SSD) framework, showing that SSMs and attention are two views of the same underlying computation. The selective SSM with scalar state dimension $N=1$ is equivalent to a specific form of linear attention:

$$y = \text{SSM}(A, B, C)(x) = \text{Attention}(Q, K, V)$$

Where the attention is "linear" (no softmax) and uses a specific masking pattern determined by the cumulative product of $A$ values.

This duality has practical implications. It means we can use attention-style algorithms (like chunked computation with quadratic attention within chunks) when they're faster, while maintaining the SSM recurrence for inference. Mamba-2 implements a chunked algorithm that:

1. Divides the sequence into chunks of size 64-256
2. Within each chunk, uses a quadratic (attention-like) computation
3. Between chunks, uses the SSM recurrence to propagate state

This hybrid approach achieves 2-8× speedup over Mamba-1 on GPUs by better utilizing tensor cores, which are optimized for matrix-matrix multiplication (the within-chunk computation) rather than the scan operations in Mamba-1.

### Simplified Architecture

Mamba-2 also simplifies the block design by:

1. Removing the separate gated pathway
2. Using grouped-value heads (similar to grouped-query attention)
3. Increasing the state dimension to compensate

The original Mamba (the core "S6" layer) is exactly a selective SSM with diagonal structure. The SSD layer of Mamba-2 makes one small modification: it restricts the diagonal $A$ to a scalar-times-identity structure, meaning the diagonal elements of $A$ must all be the same value. This simplification enables the efficient chunked algorithm while maintaining expressiveness.

The resulting architecture is easier to implement and faster to execute while maintaining quality. Mamba-2 models match Transformer++ (a strong transformer baseline with RoPE, SwiGLU, and RMSNorm) on language modeling. One notable benefit is the development of a Mamba equivalent of multi-head attention (MHA), where a Mamba block can be split into multiple "Mamba heads" akin to the attention heads in transformers, enabling even more efficiency through tensor parallelism.

## Mamba-3: The Next Evolution (2025)

Mamba-3, presented at NeurIPS 2025, represents the latest evolution of the selective state-space architecture. Guided by an inference-first perspective, Mamba-3 introduces three core methodological improvements:

### Key Innovations

**More Expressive Recurrence**: Mamba-3 extends the recurrence formulation to capture richer temporal dynamics. The state update rule now incorporates:

$$h_t = f(A_t, h_{t-1}) + g(B_t, x_t, x_{t-1})$$

Where $f$ and $g$ are learnable functions that enable more complex state transitions than the linear updates in Mamba-1/2.

**Complex State Update Rule**: Mamba-3 introduces complex-valued state representations that enable richer state tracking. The complex formulation allows the model to represent oscillatory patterns and phase relationships that are difficult with purely real-valued states:

$$h_t \in \mathbb{C}^d, \quad A_t \in \mathbb{C}^{d \times d}$$

**Multi-Input Multi-Output (MIMO) Formulation**: Rather than processing each dimension independently, Mamba-3 allows cross-dimensional interactions in the state-space computation, resulting in a stronger model that better exploits hardware parallelism during decoding.

### Performance Improvements

Together with architectural refinements, Mamba-3 achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. As one analysis puts it: "Mamba-3 is the first linear sequence model that looks like a serious long-term alternative to transformers rather than a clever curiosity."

Mamba-3 particularly excels at tasks that stressed earlier Mamba versions, including in-context learning and complex reasoning, while maintaining the linear-time inference that makes SSMs attractive.
### Known Limitations

Research has also identified fundamental limitations of the Mamba architecture that persist even in Mamba-3:

**Asymmetry Bias**: A NeurIPS 2025 Spotlight paper reveals that Mamba's nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns. Tasks requiring symmetric pattern matching remain challenging.

**Long-Context Degradation**: While Mamba handles long sequences efficiently, research indicates it generally underperforms compared to transformers in tasks involving long-context *understanding* (as opposed to just processing). This has prompted development of alternative models like DeciMamba, LongMamba, and ReMamba that specifically address this limitation.

**Copy and Retrieval Tasks**: Pure SSMs still struggle with tasks requiring exact copying or retrieval from specific positions, where attention's ability to directly reference past tokens provides an advantage.

## Jamba: Hybrid Transformer-Mamba Architecture

While Mamba demonstrates that pure SSMs can match transformers, practical deployment often benefits from combining both architectures. AI21 Labs' Jamba, released in March 2024 and substantially updated through 2025, represents the most successful hybrid approach.

### Design Philosophy

Jamba's core insight is that attention and SSMs have complementary strengths:

**Attention excels at:**
- In-context learning (copying and using information from the prompt)
- Precise positional reasoning
- Tasks requiring comparison across distant positions
- Complex reasoning over structured inputs

**SSMs excel at:**
- Efficient long-context processing
- Linear-time inference
- Low memory footprint
- Smooth degradation with sequence length

By combining both mechanisms, Jamba aims to capture attention's expressiveness while maintaining SSM efficiency.

### Architecture

Jamba uses a block structure that interleaves attention and Mamba layers:

```
Jamba Architecture (52B parameters, 12B active)
├── Block 1
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   └── Attention Layer (with MoE)
├── Block 2
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   ├── Mamba Layer
│   └── Attention Layer (with MoE)
├── Block 3
│   └── ... (same pattern)
└── Block 4
    └── ... (same pattern)
```

The ratio is 7:1 Mamba to attention layers, meaning 87.5% of layers are SSM-based. Attention layers use grouped-query attention (GQA) for efficiency.
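As a toy illustration of this interleaving schedule (purely a sketch following the diagram above, not AI21's code; the layer count and naming are assumptions):

```python
def jamba_like_schedule(n_layers: int = 32, attn_period: int = 8) -> list[str]:
    """Mixer type per layer: Mamba everywhere, attention (with MoE) every attn_period-th layer."""
    return ["attention+MoE" if (i + 1) % attn_period == 0 else "mamba" for i in range(n_layers)]

schedule = jamba_like_schedule()
print(schedule[:8])       # 7 Mamba layers followed by one attention layer
print(f"attention share: {schedule.count('attention+MoE') / len(schedule):.1%}")   # 12.5%
```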
### Mixture of Experts Integration

Jamba incorporates Mixture of Experts (MoE) in alternating layers, applying MoE to the MLP component. The configuration uses:

- 16 experts per MoE layer
- Top-2 expert selection per token
- MoE applied every other layer

With 52B total parameters but only 12B active per forward pass, Jamba achieves the capacity of a much larger model while maintaining efficient inference. The MoE layers are placed strategically at attention layers, where the additional capacity has the most impact.

### Memory and Throughput

The hybrid design dramatically reduces memory requirements for long contexts. For a 256K token context:

| Model | KV Cache Size | Peak Memory |
|-------|---------------|-------------|
| Llama-2 70B | 256 GB | 320 GB |
| Mixtral 8x7B | 128 GB | 180 GB |
| Jamba 52B | 32 GB | 80 GB |

Jamba's KV cache is 8× smaller than Llama-2 because only 12.5% of layers require caching attention states. The Mamba layers maintain a fixed-size state regardless of sequence length.

Throughput benefits are equally dramatic. On a single A100 80GB GPU, Jamba processes 3× more tokens per second than Mixtral at 128K context length, despite having more parameters.

### Jamba 1.5 and Beyond

AI21 has continued scaling the Jamba architecture through 2025:

**Jamba 1.5 Large** (September 2024):
- 398B total parameters, 94B active
- 72 layers with the same 7:1 Mamba-to-attention ratio
- 256K context window
- State-of-the-art on NVIDIA's RULER long-context benchmark
- Improved instruction following and reasoning

**Jamba Mini 1.7** (August 2025):
- Compact version optimized for deployment
- 256K context—one of the largest among open-source models
- Efficient enough for single-GPU inference
- Strong performance on long-document tasks

**Jamba Reasoning 3B** (October 2025):
- Designed for edge deployment on consumer hardware
- 2-4× faster inference than comparable models
- ~10× memory reduction vs traditional transformers
- Runs on laptops and smartphones
- Optimized for function calling and multi-document QA

### When Hybrid Beats Pure

Empirical results show hybrid models outperform both pure transformers and pure SSMs on many tasks:

1. **Long-context QA**: Jamba maintains performance at 256K while transformer performance degrades beyond 64K
2. **Needle-in-haystack**: Jamba achieves near-perfect recall across the full context, while transformers struggle in the middle
3. **Multi-document reasoning**: The combination of attention (for precise reference) and SSM (for efficient context) excels
4. **Inference cost**: At 128K+ contexts, Jamba's cost per token is 3-5× lower than equivalent transformers

The optimal ratio of attention to SSM layers appears task-dependent. More attention helps reasoning-heavy tasks; more SSM helps throughput-sensitive deployments.

## Beyond Jamba: The Evolving Hybrid Landscape

The success of Jamba has sparked broader exploration of hybrid architectures:

### Bamba

IBM's Bamba combines Mamba-2 with attention using a different ratio and training recipe. Bamba 9B achieves competitive results with training-efficient mixing of the two mechanisms.

### Zamba

Zyphra's Zamba uses a more aggressive SSM-heavy design with attention only in early layers, hypothesizing that attention is most valuable for building initial representations.

### Griffin and RecurrentGemma

Google's approach (used in RecurrentGemma) interleaves local attention windows with recurrent layers, achieving linear complexity with attention-like expressiveness within windows.
### RWKV-7 "Goose" and RWKV-8 "Heron"

RWKV represents another major branch of sub-quadratic sequence modeling, achieving constant memory and constant inference time per token:

**RWKV-7 "Goose" (March 2025):**
- Introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates
- Features a relaxed value replacement rule enabling better state tracking
- Can recognize all regular languages (exceeds transformer capabilities under standard complexity conjectures)
- The 2.9B parameter model achieves new 3B state-of-the-art on multilingual tasks
- Trained on 3.1 trillion tokens across multiple languages
- Described as "the strongest linear-time & constant-space (no kv-cache) & attention-free & 100% RNN architecture"

**RWKV-8 "Heron" Preview (May 2025):**
- First feature: DeepEmbed—achieves MoE-like reasoning performance without consuming VRAM
- Enables truly sparse large models on edge devices
- Represents next evolution of efficient sequence modeling

### Industry Implications

NVIDIA is reportedly in advanced talks to acquire AI21 Labs for $2-3 billion, signaling major industry interest in SSM and hybrid architectures. The acquisition would give NVIDIA direct ownership of Jamba's technology and team.

This interest reflects a broader recognition that the future of efficient LLMs likely involves moving beyond pure transformers. The specific architecture—whether SSM, hybrid, or something new—remains an open research question, but the direction is clear.

## Detailed Benchmark Comparisons

Understanding exactly when and why SSMs outperform transformers requires examining benchmark results across different dimensions.

### Language Modeling Perplexity

On standard language modeling benchmarks (The Pile, WikiText-103), Mamba models match transformer perplexity at equivalent compute:

| Model | Parameters | Pile PPL | WikiText-103 PPL | Training FLOPs |
|-------|------------|----------|------------------|----------------|
| GPT-3 (curie) | 6.7B | 5.8 | 15.2 | 1.3e22 |
| Mamba | 2.8B | 5.8 | 14.9 | 5.2e21 |
| Mamba-2 | 2.7B | 5.6 | 14.5 | 4.8e21 |
| Jamba | 12B active | 5.2 | 13.8 | 8.1e21 |

Mamba achieves similar perplexity with roughly 2.5× fewer parameters and 2.5× less training compute. This efficiency comes from the linear complexity enabling longer effective context during training.

### Long-Context Performance

The advantage of SSMs becomes dramatic at longer contexts. On the RULER benchmark (testing long-context understanding):

| Context Length | Llama-2-70B | Mixtral-8x7B | Jamba 1.5 Large |
|----------------|-------------|--------------|-----------------|
| 4K tokens | 92.3% | 91.8% | 93.1% |
| 16K tokens | 88.1% | 87.5% | 92.8% |
| 64K tokens | 71.2% | 73.8% | 91.5% |
| 128K tokens | 52.4% | 58.9% | 90.2% |
| 256K tokens | — | — | 88.7% |

Transformers degrade significantly beyond 64K tokens due to the "lost in the middle" phenomenon where attention becomes diluted. Jamba's hybrid design maintains strong performance across the full context.
### Needle-in-Haystack Results

The needle-in-haystack test places a critical piece of information at various positions in a long context and tests retrieval:

```
Jamba 1.5 at 256K context:
Position    Accuracy
──────────────────────
Start       99.8%
25%         99.5%
50%         99.1%
75%         99.3%
End         99.7%

Llama-2-70B at 64K context:
Position    Accuracy
──────────────────────
Start       98.2%
25%         91.3%
50%         78.4%  ← "Lost in middle"
75%         89.7%
End         97.1%
```

The attention mechanism in pure transformers struggles with information in the middle of long contexts. Jamba's SSM layers maintain uniform access across all positions.

### Inference Throughput

Real-world inference measurements on A100 80GB:

| Model | Prompt (tok/s) | Generation (tok/s) | Memory @ 128K |
|-------|----------------|--------------------|---------------|
| Llama-2-70B | 2,100 | 18 | 160GB+ (OOM) |
| Mixtral-8x7B | 3,800 | 42 | 95GB |
| Jamba 52B | 8,200 | 85 | 45GB |
| Mamba-2.8B | 15,000 | 320 | 8GB |

Mamba's constant-time generation (after prompt processing) provides consistent throughput regardless of context length, while transformers slow down linearly as context grows.

### Task-Specific Performance

Different architectures excel at different task types:

**Tasks where transformers excel:**
- Complex multi-hop reasoning (HotpotQA): Transformer +5-8%
- Precise positional tasks (sorting, reversal): Transformer +10-15%
- In-context learning with many examples: Transformer +3-5%

**Tasks where SSMs excel:**
- Long document summarization: SSM +8-12%
- Streaming data processing: SSM +20-30%
- Memory-constrained deployment: SSM fits where transformer can't

**Tasks where hybrid excels:**
- Long-context QA requiring both recall and reasoning: Jamba best overall
- Multi-document tasks: Hybrid +5-10% over both pure architectures

## Implementation Deep Dive

### Memory Layout and Data Flow

Understanding the memory layout helps optimize SSM deployments:

```
Mamba Block Memory Flow:
────────────────────────
Input tensor: [batch, seq_len, d_model]
        │
        ▼
Linear projection: [batch, seq_len, d_inner]
(d_inner = 2 * d_model typically)
        │
        ┌───────────────┴───────────────┐
        │                               │
        ▼                               ▼
  Conv1D branch                    Gate branch
  [batch, seq_len, d_inner]        [batch, seq_len, d_inner]
        │                               │
        ▼                               │
  SSM params (B, C, Δ)                  │
  generated per-position                │
        │                               │
        ▼                               │
  Selective scan                        │
  [batch, seq_len, d_inner]             │
        │                               │
        └───────────► × ◄───────────────┘
                      │
                      ▼
             Linear projection
             [batch, seq_len, d_model]
```

### State Caching for Inference

During autoregressive generation, SSMs maintain a compressed state rather than a full KV cache:

**Transformer KV cache:**
- Size: $O(n \times d \times L)$ where $n$ = context length, $d$ = dimension, $L$ = layers
- At 128K context with 70B model: ~80GB KV cache alone

**SSM state:**
- Size: $O(d \times N \times L)$ where $N$ = state dimension (typically 16-64)
- Constant regardless of context length
- At 128K context with 2.8B Mamba: ~50MB state

This dramatic difference enables SSMs to handle much longer contexts on the same hardware.
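The asymptotic difference is easy to see with a small calculation (a sketch with illustrative dimensions only; actual figures depend on precision, grouped-query layout, and the exact model configuration):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, d_kv: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors cached for every token: grows linearly with context length."""
    return 2 * n_ctx * n_layers * d_kv * bytes_per_elem

def ssm_state_bytes(n_layers: int, d_inner: int, d_state: int, bytes_per_elem: int = 2) -> int:
    """Recurrent SSM state: one fixed-size tensor per layer, independent of context length."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Illustrative dimensions only (not any specific released checkpoint):
print(f"KV cache @ 128K ctx: {kv_cache_bytes(131_072, 80, 1024) / 1e9:.0f} GB")
print(f"SSM state (any ctx): {ssm_state_bytes(64, 5_120, 16) / 1e6:.0f} MB")
```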
### Selective Scan Implementation Details

The selective scan is the core computational primitive. Here's how it achieves efficiency:

**Naive implementation** (quadratic): For each output position $k$, compute the full recurrence from $h_0$ to $h_k$. This requires $O(n^2)$ operations.

**Parallel scan implementation** (log-linear): The key insight is that the recurrence can be expressed as a matrix multiplication:

$$\begin{pmatrix} h_k \\ 1 \end{pmatrix} = M_k M_{k-1} \cdots M_1 \begin{pmatrix} h_0 \\ 1 \end{pmatrix}$$

Where $M_i = \begin{pmatrix} \bar{A}_i & \bar{B}_i x_i \\ 0 & 1 \end{pmatrix}$

Matrix multiplication is associative, so we can compute prefix products in parallel:

```
Step 1: Compute M_1, M_2, M_3, M_4, ...            (all parallel)
Step 2: Compute M_1·M_2, M_3·M_4, ...              (n/2 parallel multiplications)
Step 3: Compute (M_1·M_2)·(M_3·M_4), ...           (n/4 parallel multiplications)
...
Total: O(log n) sequential steps with O(n) total work
```

**Memory optimization:** Rather than materializing intermediate results, the implementation fuses the scan with input/output operations:

1. Load input chunk to shared memory
2. Compute SSM parameters (B, C, Δ) in registers
3. Perform scan within shared memory
4. Write output directly to global memory

This achieves near-optimal memory bandwidth utilization.

### Numerical Stability Considerations

The discretization step requires computing $\exp(\Delta A)$, which can be numerically unstable for large $\Delta$ or $A$ values. Mamba uses several techniques:

**Softplus for Δ**: The discretization step size is computed as $\Delta = \text{softplus}(\text{Linear}(x))$, ensuring $\Delta > 0$ and providing smooth gradients.

**Diagonal A constraint**: By restricting $A$ to be diagonal, the matrix exponential becomes elementwise: $\exp(\Delta A) = \text{diag}(\exp(\Delta a_1), ..., \exp(\Delta a_d))$

**Log-space computation**: For very long sequences, intermediate values are computed in log-space to prevent overflow.

## Practical Considerations for Adoption

### When to Consider SSMs/Hybrids

**Strong fit:**
- Long-context applications (100K+ tokens)
- Real-time inference requirements
- Edge deployment with memory constraints
- High-throughput batch processing
- Streaming/continuous input processing
- Applications where inference cost dominates (high-volume serving)

**Less clear fit:**
- Tasks requiring precise positional reasoning
- Complex in-context learning with many examples
- Existing transformer-optimized infrastructure investment
- Tasks where context length is always short (<4K tokens)
- Applications requiring extensive ecosystem tooling

### Decision Framework

Use this framework to decide between architectures:

```
                            Context Length
                       Short (<8K)   Long (>32K)
                     ┌─────────────┬─────────────┐
Reasoning       High │ Transformer │   Hybrid    │
Complexity           │ (GPT-4,     │   (Jamba)   │
                     │  Claude)    │             │
                     ├─────────────┼─────────────┤
                Low  │   Either    │     SSM     │
                     │  (depends   │   (Mamba)   │
                     │   on cost)  │             │
                     └─────────────┴─────────────┘
```

### Framework Support

As of early 2025, SSM support is maturing across the ecosystem:

**Training frameworks:**
- **Mamba**: Official PyTorch implementation with custom CUDA kernels
- **HuggingFace Transformers**: Native Mamba and Jamba support
- **JAX/Flax**: Community implementations available

**Inference frameworks:**
- **vLLM**: PagedAttention extended for hybrid models, Jamba support
- **TensorRT-LLM**: Mamba and Jamba optimization paths
- **llama.cpp**: Experimental Mamba support
- **ONNX**: Export support for deployment

**Cloud platforms:**
- **AI21 Studio**: Native Jamba API access
- **HuggingFace Inference Endpoints**: Jamba and Mamba deployment
- **AWS Bedrock**: Jamba models available
- **Google Vertex AI**: Jamba 1.5 family

### Training Considerations

Training SSMs requires different considerations than transformers:

**Initialization:**
- HiPPO-based initialization is crucial for long-range learning
- Random initialization leads to vanishing/exploding gradients over long sequences
- Use the official initialization schemes from the Mamba codebase

**Learning rates:**
- SSM parameters (A, B, C, D) often benefit from lower learning rates (0.1-0.5× the base LR)
- The discretization step Δ is particularly sensitive
- Consider using parameter groups with different learning rates

**Sequence length curriculum:**
- Starting with shorter sequences and gradually increasing helps stability
- Typical curriculum: 2K → 8K → 32K → 128K tokens
- Each stage should have sufficient training steps for convergence

**Hardware requirements:**
- Mamba's custom CUDA kernels require compute capability 8.0+ (Ampere or newer)
- Training Mamba-2.8B requires ~40GB GPU memory
- Training Jamba requires multi-GPU setup (8× A100 minimum for full model)

**Batch size considerations:**
- SSMs benefit from larger batch sizes due to better parallelization of the scan
- Effective batch size of 1M+ tokens recommended for stable training

### Fine-tuning Existing Models

Both Mamba and Jamba support standard fine-tuning approaches:

**Full fine-tuning:**
- Works but expensive (same compute as transformer fine-tuning)
- All SSM parameters are trainable
- Risk of catastrophic forgetting similar to transformers

**LoRA (Low-Rank Adaptation):**
- Adapts efficiently, typically on linear projections
- Apply LoRA to the input/output projections and SSM parameter projections
- Rank 8-64 typically sufficient
- 10-100× more parameter efficient than full fine-tuning

**QLoRA:**
- Enables fine-tuning on consumer GPUs
- Quantize base model to 4-bit, train LoRA adapters in FP16
- Mamba-2.8B can be fine-tuned on 16GB GPU with QLoRA

**Hybrid-specific considerations:**
- For Jamba, can apply LoRA selectively to attention or SSM components
- Attention-only LoRA useful for tasks requiring improved reasoning
- SSM-only LoRA useful for tasks requiring improved long-context handling
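For example, a hedged sketch of LoRA fine-tuning a Mamba checkpoint with HuggingFace PEFT; the checkpoint id and target module names are assumptions for illustration and should be checked against the model card and the model's printed module tree:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "state-spaces/mamba-2.8b-hf"          # assumed HF-format Mamba checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                        # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed names of the SSM input/output and parameter projections in this checkpoint
    target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # typically well under 1% of all parameters
```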
### Deployment Patterns

**Single-GPU deployment:**
- Mamba models up to 7B fit on consumer GPUs (RTX 3090/4090)
- Jamba Mini fits on single A100 80GB
- Use quantization (INT8/INT4) for larger models

**Multi-GPU deployment:**
- Tensor parallelism works differently for SSMs (split state dimension)
- Pipeline parallelism straightforward (split layers)
- Jamba 52B typically deployed on 4× A100 with tensor parallelism

**Edge deployment:**
- Jamba Reasoning 3B designed for laptops/smartphones
- ONNX export enables deployment on diverse hardware
- Consider quantization to INT4 for memory-constrained devices

**Streaming applications:**
- SSMs naturally support streaming (process tokens as they arrive)
- No need to re-process full context like transformers with KV cache
- State can be checkpointed and resumed

## The Future of Sequence Modeling

State-space models represent a fundamental rethinking of sequence modeling, not just an incremental improvement. The key insights—continuous dynamics, selective state evolution, and efficient parallel algorithms—open new possibilities beyond current architectures.

### Research Directions

**Scaling laws**: How do SSMs scale compared to transformers? Early evidence suggests favorable scaling, but systematic study at frontier scale is ongoing.

**Theoretical understanding**: Why do selective SSMs work? The connection to linear attention provides some insight, but a complete theory of what computations SSMs can and cannot perform remains elusive.

**Architectural search**: The optimal combination of attention, SSM, MoE, and other mechanisms is likely task-dependent. Automated architecture search may discover better configurations than human intuition.

**Hardware co-design**: SSMs have different computational patterns than transformers. Custom hardware optimized for scan operations could provide further speedups.

### Practical Outlook

For practitioners, the message is clear: state-space models and hybrid architectures are production-ready for appropriate use cases. The 2-3× inference speedup and dramatically reduced memory for long contexts translate directly to cost savings and new capabilities.

The transformer isn't going away—attention remains the most expressive sequence modeling mechanism we have. But the pure transformer's dominance is ending. The future is hybrid, combining mechanisms based on their complementary strengths rather than dogmatic commitment to any single architecture.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
