Speculative Decoding: Accelerating LLM Inference Without Sacrificing Quality
A comprehensive guide to speculative decoding techniques that accelerate LLM inference by 2-4× while maintaining exact output quality, covering draft models, EAGLE, Medusa, and production deployment strategies.
Large language models generate text one token at a time, with each token requiring a full forward pass through billions of parameters. This autoregressive bottleneck means that generating a 500-token response requires 500 sequential forward passes, regardless of available compute. Speculative decoding breaks this limitation by predicting multiple tokens ahead and verifying them in parallel, achieving 2-4× speedups while mathematically guaranteeing identical outputs to standard decoding.
The Autoregressive Bottleneck
Understanding why speculative decoding works requires first understanding why standard LLM inference is slow. When a transformer generates text, it produces one token at a time through a sequential process:
- The model processes all previous tokens (prompt + generated so far)
- The final layer outputs a probability distribution over the vocabulary
- A token is sampled from this distribution
- The sampled token is appended, and the process repeats
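In code, the loop is roughly the following (a minimal sketch with a Hugging Face-style causal LM; the checkpoint name is a placeholder, and the KV cache that real decoders reuse is omitted for clarity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

input_ids = tokenizer("Speculative decoding is", return_tensors="pt").input_ids

for _ in range(100):                                   # one full forward pass per token
    logits = model(input_ids).logits[:, -1, :]         # distribution over the vocabulary
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```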
Each step requires a complete forward pass through the model. For a 70B parameter model on an A100 GPU, a single forward pass takes approximately 30-50ms. Generating 100 tokens therefore takes 3-5 seconds—not because of computational limits, but because the sequential nature prevents parallelization.
The bottleneck is memory bandwidth, not compute. Modern GPUs can perform far more operations per second than they can load model weights from memory. A 70B model requires loading 140GB of weights (in FP16) for each forward pass. Even the A100's 2TB/s memory bandwidth limits throughput to roughly 14 forward passes per second for this weight movement alone.
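The arithmetic behind that bound:

$$\frac{140\ \text{GB per forward pass}}{2\ \text{TB/s}} \approx 70\ \text{ms per pass} \;\Rightarrow\; \approx 14\ \text{forward passes per second}$$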
This creates a paradox: GPUs sit idle waiting for memory transfers while the model generates tokens one by one. The theoretical throughput based on compute (FLOPS) is 10-100× higher than actual achieved throughput. Speculative decoding exploits this gap by doing useful work during what would otherwise be idle time.
The Core Insight
Speculative decoding is based on a simple observation: verifying whether a sequence of tokens is correct is much cheaper than generating them. If we can cheaply predict multiple future tokens, we can verify them all at once with the expensive model, accepting correct predictions and rejecting incorrect ones.
The process works as follows:
- Draft phase: A fast mechanism proposes candidate tokens
- Verification phase: The target model processes all candidates in parallel
- Acceptance phase: Correct predictions are accepted; the first incorrect prediction triggers regeneration from that point
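Structurally, one iteration looks like the following sketch, where `draft_model` and `target_model` stand in for any callables returning logits of shape [batch, seq, vocab]; the acceptance check is shown as a greedy match for brevity, with the lossless sampling rule covered in the next section:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=5):
    """One draft-verify-accept iteration (schematic)."""
    # 1. Draft phase: propose k candidate tokens autoregressively with the cheap model
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verification phase: one target forward pass scores all candidates in parallel
    target_logits = target_model(draft_ids).logits

    # 3. Acceptance phase: keep the longest prefix the target agrees with
    #    (greedy check for brevity; sampled decoding uses the rejection rule below)
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        proposed = draft_ids[:, n_prompt + i]
        target_choice = target_logits[:, n_prompt + i - 1, :].argmax(-1)
        if not torch.equal(proposed, target_choice):
            accepted = torch.cat([accepted, target_choice.unsqueeze(-1)], dim=-1)
            break
        accepted = torch.cat([accepted, proposed.unsqueeze(-1)], dim=-1)
    return accepted
```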
The key mathematical property is that this process is lossless—the output distribution is identical to standard autoregressive decoding. This isn't an approximation or quality tradeoff; speculative decoding produces exactly the same outputs, just faster.
Why Verification is Cheap
When the target model verifies proposed tokens, it processes them as a single batch rather than sequential passes. The KV cache from verified tokens is computed once and reused, and the model's attention can process all positions in parallel.
For a sequence of length $n$ with $K$ proposed tokens, verification requires a single forward pass with attention over $n + K$ positions. In contrast, sequential generation would require $K$ separate forward passes, each with growing attention complexity. The parallel verification amortizes the fixed costs (weight loading, kernel launches) across multiple tokens.
The speedup potential is bounded by the draft mechanism's accuracy. If the draft correctly predicts all $K$ tokens, we get up to a $(K{+}1)\times$ speedup for that step (the $K$ accepted tokens plus one sampled during verification). If predictions are wrong, we waste some verification compute but still make progress. The expected number of tokens per verification step depends on the acceptance rate $\alpha$ and speculation depth $K$:

$$\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

For $\alpha = 0.7$ and $K = 4$, this gives approximately 2.8 tokens per verification step, i.e. roughly a 2.8× speedup when the draft's cost is small relative to verification.
Speculative Sampling: The Theoretical Foundation
The acceptance/rejection mechanism must be carefully designed to preserve the target model's distribution. Naive acceptance (accept if draft matches greedy decoding) would bias outputs toward the draft model's preferences. Instead, speculative decoding uses a rejection sampling scheme.
Let $q(x)$ be the draft model's probability for token $x$ and $p(x)$ be the target model's probability. The acceptance probability for a drafted token $x$ is:

$$P(\text{accept } x) = \min\!\left(1, \frac{p(x)}{q(x)}\right)$$

If the token is rejected, we sample a replacement from a residual distribution:

$$p'(x) = \frac{\max\big(0,\; p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\; p(x') - q(x')\big)}$$

This scheme ensures the final token distribution equals the target model's distribution exactly. When $q(x) > p(x)$ (the draft overestimates), we sometimes reject even tokens the target finds plausible, compensating for oversampling. When $q(x) \le p(x)$ (the draft underestimates), we always accept; the residual distribution restores the probability mass the draft under-sampled.

The beauty of this scheme is that it requires only the probability values, not any architectural constraints on the draft model. Any mechanism that produces probability estimates can serve as a draft, as long as we can compute both $p(x)$ and $q(x)$ for verification.
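In code, the per-token accept/reject decision looks like this (a sketch operating on the two probability vectors for a single position):

```python
import torch

def accept_or_resample(p, q, drafted_token, generator=None):
    """Speculative sampling decision for one drafted token.
    p, q: 1-D tensors holding the target's and draft's probability
    distributions over the vocabulary at this position."""
    ratio = p[drafted_token] / q[drafted_token]
    if torch.rand((), generator=generator) < ratio:      # accept with prob min(1, p/q)
        return drafted_token, True

    # Rejected: sample a replacement from the normalized residual max(0, p - q)
    residual = torch.clamp(p - q, min=0)
    residual = residual / residual.sum()
    replacement = torch.multinomial(residual, num_samples=1).item()
    return replacement, False
```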
Temperature and Top-p Considerations
Real deployments use sampling with temperature and nucleus (top-p) filtering. Speculative decoding extends naturally to these settings:
Temperature: Both draft and target probabilities are temperature-scaled before computing acceptance ratios. The scheme remains valid because temperature is applied consistently.
Top-p/Top-k: Filtering is applied to the target distribution before computing acceptance. If the drafted token falls outside the target's nucleus, it's automatically rejected.
Beam search: Speculative decoding can be combined with beam search by running multiple draft sequences in parallel and verifying all beams together.
The key constraint is that draft and target must use identical sampling parameters. Mismatched temperatures would break the theoretical guarantees.
Draft Model Approaches
The draft mechanism is the critical component determining speedup. An ideal draft is fast, accurate, and produces well-calibrated probabilities. Several approaches have been developed:
Independent Draft Models
The simplest approach uses a separate, smaller model from the same family. For example, using Llama-7B to draft for Llama-70B:
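A concrete way to express this pairing is through a serving framework that supports draft-model speculation. The sketch below uses vLLM's offline `LLM` API; the speculative-decoding arguments have changed across vLLM releases, so treat the configuration keys as illustrative rather than exact:

```python
from vllm import LLM, SamplingParams

# Illustrative configuration: a 7B draft paired with a 70B target from the same family.
# Speculative-decoding key names differ between vLLM versions; check the docs
# for the release you deploy.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",           # target model
    speculative_config={
        "model": "meta-llama/Llama-2-7b-hf",     # draft model
        "num_speculative_tokens": 5,             # speculation depth K
    },
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```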
Advantages:
- Simple implementation
- Draft model can be optimized independently
- Works with any target model
Disadvantages:
- Requires loading two separate models
- Draft accuracy limited by capacity gap
- No sharing of target model's representations
Empirically, same-family draft models achieve 60-70% acceptance rates, enabling 1.5-2× speedups. The draft model should be 10-20× smaller than the target for the compute tradeoff to be favorable.
Self-Speculative Decoding
Rather than a separate model, self-speculative methods use the target model itself with early exit or layer skipping:
Early Exit: Run only the first N layers of the target model for drafting. A prediction head on layer N outputs draft probabilities. This achieves 70-80% acceptance rates with minimal overhead.
Layer Skipping: Skip every other layer during drafting, effectively halving the compute per draft token. The missing layers' contributions are approximated through residual connections.
Advantages: No additional model weights; the draft naturally aligns with the target.
Disadvantages: Requires architecture modifications; draft speed is limited by the target's structure.
Knowledge Distillation Drafts
Train a small draft model specifically to mimic the target:
- Generate data from the target model
- Train the draft to match target outputs (not just ground truth)
- The draft learns the target's distribution, not just the training distribution
Distilled drafts achieve higher acceptance rates (75-85%) because they're explicitly trained to predict what the target would produce. The cost is additional training and the need to retrain when the target changes.
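One common distillation objective, sketched below under the assumption that draft and target share a tokenizer so their vocabulary logits align position by position:

```python
import torch
import torch.nn.functional as F

def distillation_loss(draft_logits, target_logits, temperature=1.0):
    """KL(target || draft) over matching token positions.
    Both logits tensors have shape [batch, seq_len, vocab] and come from
    running the two models over text sampled from the target model."""
    t = temperature
    target_probs = F.softmax(target_logits / t, dim=-1)
    draft_log_probs = F.log_softmax(draft_logits / t, dim=-1)
    # Penalizes the draft wherever it under-covers tokens the target assigns mass to.
    return F.kl_div(draft_log_probs, target_probs, reduction="batchmean") * (t * t)
```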
EAGLE: State-of-the-Art Speculative Decoding
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) represents the current state of the art for speculative decoding. Developed through three iterations (EAGLE-1, EAGLE-2, EAGLE-3), it achieves 2-3× speedups across diverse tasks while maintaining output equivalence.
Core Innovation
EAGLE's key insight is that the target model's internal representations contain rich information for predicting future tokens. Rather than training a separate draft model, EAGLE attaches a lightweight prediction head to the target model's hidden states.
The architecture works as follows:
- After each target model forward pass, extract the second-to-last layer's hidden states
- Feed these states through a small autoregressive head (typically 1-2 transformer layers)
- The head predicts the next token's hidden state, which is decoded to a token
- Repeat to generate multiple draft tokens
This approach leverages the target model's own feature representations, achieving much higher acceptance rates than independent draft models. EAGLE-1 reported 80%+ acceptance rates on many benchmarks.
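The sketch below shows the general shape of such a draft head, assuming the input is the target's hidden feature fused with the embedding of the token just sampled; it is a structural illustration, not the official EAGLE code:

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Illustrative EAGLE-style draft head: predicts the target's next hidden
    feature from the current feature plus the sampled token's embedding."""
    def __init__(self, hidden_size: int, num_layers: int = 1, num_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)  # [feature; token embedding] -> feature
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers)

    def forward(self, features: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
        # features:         [batch, seq, hidden] hidden states from the target model
        # token_embeddings: [batch, seq, hidden] embeddings of the sampled tokens
        x = self.fuse(torch.cat([features, token_embeddings], dim=-1))
        seq_len = x.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        return self.blocks(x, mask=causal)  # predicted features for the next positions
```

At inference time, each predicted feature is passed through the target model's frozen LM head to obtain a draft token, and that token's embedding feeds the next draft step.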
EAGLE-2: Dynamic Draft Trees
EAGLE-2 extends the original with dynamic draft tree construction. Rather than generating a single sequence of draft tokens, EAGLE-2 generates a tree of possibilities:
```
        [root]
       /      \
     [a]      [b]
    / | \      |
[c] [d] [e]   [f]
```
Each path through the tree represents a possible continuation. The target model verifies all paths in a single batched forward pass using a carefully constructed attention mask.
The tree structure is built dynamically based on the draft model's confidence:
- High confidence → single path (save compute)
- Low confidence → branch to explore alternatives
- Medium confidence → limited branching
This adaptive strategy achieves better speedups than fixed-width speculation by focusing compute where uncertainty is highest. EAGLE-2 achieves approximately 2× the speed of Medusa and 2.3× the speed of Lookahead decoding.
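The branching policy can be as simple as thresholds on the draft's top-token probability. The heuristic below is illustrative; EAGLE-2 itself ranks candidate nodes by estimated acceptance probability rather than using fixed cutoffs:

```python
def branch_width(top_prob: float, max_width: int = 3) -> int:
    """Choose how many candidate children to expand at a draft-tree node,
    based on the draft's confidence in its top prediction.
    Thresholds are illustrative starting points, not EAGLE-2's actual values."""
    if top_prob >= 0.9:        # high confidence: single path, save compute
        return 1
    if top_prob >= 0.5:        # medium confidence: limited branching
        return 2
    return max_width           # low confidence: explore alternatives
```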
EAGLE-3: Production Optimization (NeurIPS 2025)
EAGLE-3, presented at NeurIPS 2025, focuses on production deployment with several key innovations:
Multi-Level Feature Fusion: EAGLE-3 removes the feature prediction constraint from earlier versions. Instead of relying solely on top-layer features, it replaces them with a fusion of low-, mid-, and high-level semantic features. This provides richer context for draft prediction:

$$h_{\text{fused}} = f\big(h_{\ell_{\text{low}}},\; h_{\ell_{\text{mid}}},\; h_{\ell_{\text{high}}}\big)$$

where $\ell_{\text{low}}$, $\ell_{\text{mid}}$, and $\ell_{\text{high}}$ index early, middle, and late layers respectively.
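A minimal sketch of such a fusion, assuming the three hidden states are simply concatenated and projected back to the model dimension (the actual EAGLE-3 operator is defined by its reference implementation):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate hidden states from an early, middle, and late layer and
    project back to the model dimension. Illustrative only."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, h_low, h_mid, h_high):
        return self.proj(torch.cat([h_low, h_mid, h_high], dim=-1))
```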
Training-Time Testing: EAGLE-3 simulates the speculative decoding process during training, allowing the draft head to learn from realistic inference-time conditions rather than just teacher forcing.
Performance Improvements: EAGLE-3 achieves 3.0×-6.5× speedup compared to vanilla autoregressive generation, representing a 20-40% improvement over EAGLE-2.
Disaggregated Serving: EAGLE-3 supports separating the draft and target models across different GPUs or even different machines. This enables:
- Dedicated draft GPU running continuously
- Target GPU handling verification batches
- Pipelined execution hiding draft latency
Framework Integration: Official support for major inference frameworks:
- vLLM 0.8.5+: Native EAGLE-1 and EAGLE-3 support with CUDA graphs (v0.9.1+), plus speculative decoding metrics for production monitoring
- TensorRT-LLM: Optimized kernels and disaggregated serving for Llama 4 Maverick
- SGLang: SpecForge for accelerated EAGLE-3 training
Mixed Precision: Draft head runs in FP8 or INT8 while target maintains FP16/BF16, reducing draft overhead further.
According to Spec-Bench (a comprehensive benchmark for speculative decoding), EAGLE-3 currently achieves the best speedups across different model sizes and tasks.
Medusa: Parallel Draft Heads
Medusa takes a different approach: instead of autoregressive draft generation, it predicts multiple future positions simultaneously using parallel heads.
Architecture
Medusa adds $K$ prediction heads to the target model, each predicting a different future position (the base LM head still predicts position $t+1$):
- Head 1: predicts the token at position $t+2$
- Head 2: predicts the token at position $t+3$
- ...
- Head $K$: predicts the token at position $t+K+1$
Each head is a small MLP that takes the last hidden state and outputs vocabulary logits. The heads are trained jointly with the (frozen) target model to predict future tokens given current context.
Tree-Based Verification
Since each head makes independent predictions, we can consider multiple candidates per position. With top-3 predictions from each of 4 heads, we have $3^4 = 81$ possible sequences. Rather than verifying all 81, Medusa constructs a tree:
```
Position 1: [A, B, C]
Position 2: [D, E] given A, [F] given B, [G, H] given C
Position 3: ...
```
The tree is constructed to maximize expected acceptance while minimizing verification cost. Medusa uses a sparse attention mask to verify all tree paths in a single forward pass.
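The mask construction itself is straightforward: each tree node may attend to the committed prefix, its ancestors, and itself. A hypothetical helper (not taken from the Medusa codebase):

```python
import torch

def tree_attention_mask(parents):
    """Build a boolean attention mask for tree-structured draft verification.
    parents[i] is the index of node i's parent within the tree, or -1 for a
    node hanging directly off the committed prefix. Node i may attend to
    itself and all of its ancestors; attention to the committed prefix is
    handled by the usual causal mask."""
    n = len(parents)
    allowed = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking ancestors
            allowed[i, j] = True
            j = parents[j]
    return allowed

# Example: nodes a, b hang off the prefix; c, d are children of a
mask = tree_attention_mask([-1, -1, 0, 0])   # indices: 0=a, 1=b, 2=c, 3=d
```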
Comparison with EAGLE
| Aspect | EAGLE | Medusa |
|---|---|---|
| Draft Generation | Autoregressive | Parallel |
| Architecture | Feature extraction + AR head | Multiple independent heads |
| Training | On target model features | On target model predictions |
| Acceptance Rate | ~80% | ~60% |
| Draft Overhead | Higher | Lower |
| Speedup | 2-3× | 1.5-2× |
| Lossless | Yes | Configurable |
EAGLE achieves higher acceptance rates because its autoregressive draft can condition on previously drafted tokens. Medusa's parallel heads are faster but each prediction is independent, leading to lower accuracy.
Medusa offers a speed-quality tradeoff: with standard rejection sampling it maintains exact output equivalence, while its relaxed "typical acceptance" scheme trades this for higher speed by accepting plausible-but-not-identical tokens. The relaxed mode is suitable when exact reproducibility isn't required.
Lookahead Decoding
Lookahead decoding, developed by researchers at UC Berkeley, takes yet another approach: using n-gram patterns from the prompt to predict future tokens.
Jacobi Iteration Perspective
Lookahead frames autoregressive (greedy) generation as solving a fixed-point problem: find tokens $y_1, \dots, y_m$ such that

$$y_i = \operatorname*{argmax}_{y}\; p\big(y \mid y_{1:i-1},\, x\big) \quad \text{for all } i,$$

where $x$ is the prompt. This system can be solved with Jacobi iteration: each step refines the guesses for all future tokens in parallel, conditioned on the current guess. The iteration converges when the model's predictions match the current hypothesis.
N-gram Cache
The practical implementation builds a cache of n-gram patterns observed in the input. When generating, it looks up n-grams matching recent context and uses their continuations as draft candidates.
For example, if the prompt contains "the quick brown fox" and we've generated "the quick", we hypothesize "brown fox" as the continuation based on the cached pattern.
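A toy version of the cache is just a dictionary keyed on recent tokens. The sketch below uses whitespace-split words for readability; real implementations operate on token IDs:

```python
from collections import defaultdict

def build_ngram_cache(tokens, n=2, span=2):
    """Map every n-gram in the prompt to the continuations observed after it."""
    cache = defaultdict(list)
    for i in range(len(tokens) - n - span + 1):
        key = tuple(tokens[i : i + n])
        cache[key].append(tokens[i + n : i + n + span])
    return cache

prompt = "the quick brown fox jumps over the lazy dog".split()
cache = build_ngram_cache(prompt)

generated = "the quick".split()
draft = cache.get(tuple(generated[-2:]), [[]])[0]
print(draft)  # ['brown', 'fox'] -- proposed as draft tokens for verification
```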
This approach works surprisingly well for:
- Code (repetitive patterns)
- Structured text (JSON, XML)
- Text with recurring phrases
It works less well for:
- Creative writing
- Conversations with novel content
- Highly varied text
Comparison
Lookahead is training-free—it doesn't require fine-tuning any components. This makes it easy to apply to any model. However, its speedups are more variable, depending heavily on input characteristics. EAGLE and Medusa provide more consistent improvements across diverse inputs.
Comprehensive Method Comparison
Understanding the tradeoffs between different speculative decoding approaches helps choose the right method for your use case.
Feature Comparison Matrix
| Feature | Independent Draft | Self-Speculative | EAGLE-3 | Medusa | Lookahead |
|---|---|---|---|---|---|
| Training required | Optional distillation | Architecture mod | Draft head training | Head training | None |
| Additional parameters | Full draft model | None | ~1B | ~0.5B | None |
| Acceptance rate | 60-70% | 70-80% | 80-90% | 55-65% | Variable |
| Typical speedup | 1.5-2× | 1.8-2.2× | 3-6.5× | 1.5-2× | 1.3-2× |
| Memory overhead | 10-50% | <5% | 5-10% | 5-10% | <1% |
| Best for | Any target | Single model | Maximum speed | Simplicity | Code/structured |
Cost-Benefit Analysis
Understanding the economics of speculative decoding:
Computation cost per generated token:
Let $c_d$ = draft model forward pass cost, $c_t$ = target model forward pass cost, $K$ = speculation depth, $\alpha$ = acceptance rate.

Standard decoding cost per token: $c_t$

Speculative decoding cost per token:

$$\frac{K \cdot c_d + c_t}{\mathbb{E}[\text{tokens per step}]} \quad \text{with} \quad \mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Speculative decoding is beneficial when this is less than $c_t$, which simplifies to:

$$\frac{c_d}{c_t} < \frac{\mathbb{E}[\text{tokens per step}] - 1}{K}$$

For $K = 4$ and $\alpha = 0.8$: the draft can cost up to roughly 60% of the target per forward pass and still provide benefit.
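These expressions are easy to sanity-check numerically with a small helper (plain Python, no dependencies):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens accepted per verification step: (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def max_draft_cost_ratio(alpha: float, k: int) -> float:
    """Largest c_d / c_t for which speculation still beats standard decoding."""
    return (expected_tokens(alpha, k) - 1) / k

print(expected_tokens(0.8, 4))        # ~3.36 tokens per verification step
print(max_draft_cost_ratio(0.8, 4))   # ~0.59 -> draft may cost ~60% of the target
```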
Latency Breakdown
Understanding where time is spent helps optimization:
```
Standard Autoregressive Generation (100 tokens):
├── Weight loading: 65% (memory bound)
├── Attention computation: 20%
├── FFN computation: 10%
└── Other (sampling, etc.): 5%
Total: ~3 seconds for 70B model

Speculative Decoding (100 tokens, 80% acceptance):
├── Draft generation (20 iterations × 5 tokens): 15%
├── Verification forward passes: 50%
├── Acceptance/rejection logic: 2%
├── Weight loading (amortized): 30%
└── Other: 3%
Total: ~1.2 seconds for 70B model (2.5× speedup)
```
The key insight: speculative decoding amortizes the weight loading cost (the dominant factor) across multiple tokens.
Specialized Techniques
Beyond the main approaches, several specialized techniques address specific scenarios:
Speculative Decoding for Mixture of Experts
MoE models like Mixtral activate only a subset of experts per token. Speculative decoding interacts interestingly with this:
Challenge: Draft and target might route to different experts, causing divergence
Solution: Route-aware drafting that conditions on expected expert selection
Utility-Driven Approach (Saxena et al., 2025):
- Predicts which experts will be activated based on draft tokens
- Adjusts speculation depth based on expected routing overlap
- Achieves 15-20% higher acceptance rates than naive speculation on MoE models
Speculative Decoding for Long Contexts
Long contexts present unique challenges for speculative decoding:
KV Cache Management:
- Both draft and target models need KV caches
- For 128K context: draft cache (if transformer-based) can be significant
- Solution: Use SSM-based draft models (constant state size)
Attention Cost:
- Verification attention cost grows with context length
- For very long contexts, verification cost can dominate
- Solution: Use chunked verification or sliding window attention in draft
Draft Quality Degradation:
- Draft accuracy may decrease with longer contexts
- Earlier tokens harder to predict based on compressed state
- Solution: Dynamic speculation depth based on context length
CTC-Based Drafting
Connectionist Temporal Classification (CTC), typically used for speech recognition, has been adapted for speculative decoding. A CTC model predicts multiple tokens in parallel, naturally producing variable-length outputs:
Mechanism:
- CTC outputs a sequence with possible blanks
- Collapse repeated tokens and remove blanks to get draft
- Variable-length output naturally handles different generation speeds
Advantage: Very fast draft generation (single forward pass for multiple tokens).
Disadvantage: CTC's conditional independence assumption limits accuracy.
This approach is particularly effective for tasks with predictable structure, like code completion or form filling.
Multi-Model Speculation
Recent work explores using multiple draft models simultaneously:
Ensemble Drafting:
- Run several small models in parallel
- Combine their predictions (voting, ensemble)
- Verify the consensus predictions
Cascaded Drafting:
- Very fast model drafts first
- Medium model refines high-uncertainty positions
- Target model verifies final draft
Advantages:
- Improved draft accuracy
- Robustness to individual model failures
- Can combine specialized drafts (e.g., code expert + language expert)
Disadvantages:
- Increased complexity
- Higher memory requirements
- Coordination overhead
Retrieval-Augmented Speculative Decoding (RASD)
RASD (March 2025) uses retrieved examples to improve draft quality:
Mechanism:
- Retrieve similar examples from a database based on current context
- Use retrieved continuations to inform draft
- Combine retrieval-based and model-based drafts
Benefits:
- Higher acceptance rates for domain-specific applications
- Can leverage existing retrieval infrastructure
- Particularly effective for repetitive domains (customer support, documentation)
Considerations:
- Requires maintaining and updating retrieval database
- Additional latency from retrieval
- Privacy implications of retrieved content
Real-World Performance Analysis
Case Study: Code Generation
Code generation is often cited as the best use case for speculative decoding due to predictable patterns:
GitHub Copilot-style completion:
- High repetition (variable names, function patterns)
- Structured syntax
- EAGLE-3 achieves 4-6× speedup on code tasks
Benchmark results (HumanEval, MBPP):
| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
|---|---|---|---|
| Llama-70B baseline | 45 | 1.0× | — |
| + Llama-7B draft | 28 | 1.6× | 68% |
| + EAGLE-3 | 12 | 3.8× | 84% |
| + Medusa | 22 | 2.0× | 61% |
Case Study: Conversational AI
Conversational tasks are more challenging due to unpredictability:
Chat/Assistant workloads:
- More diverse vocabulary
- Context-dependent responses
- Lower acceptance rates
MT-Bench results:
| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
|---|---|---|---|
| Vicuna-33B baseline | 38 | 1.0× | — |
| + Vicuna-7B draft | 26 | 1.5× | 62% |
| + EAGLE-3 | 16 | 2.4× | 76% |
Case Study: Document Summarization
Summarization involves processing long inputs and generating condensed outputs:
Characteristics:
- Long context (10K+ tokens)
- Moderate acceptance rates
- Benefits from hybrid approaches
CNN/DailyMail results:
| Model Setup | Time per summary | Speedup |
|---|---|---|
| GPT-3.5 baseline | 8.2s | 1.0× |
| + Speculative | 4.1s | 2.0× |
| Llama-70B baseline | 12.5s | 1.0× |
| + EAGLE-3 | 4.8s | 2.6× |
Production Deployment
Deploying speculative decoding in production requires careful consideration of system architecture, resource allocation, and failure handling.
Framework Support
Major inference frameworks now support speculative decoding:
vLLM offers:
- Draft model speculation with configurable speculation depth
- Medusa and EAGLE integration
- PagedAttention for efficient KV cache management during speculation
- Automatic speculation depth tuning
TensorRT-LLM provides:
- Optimized CUDA kernels for draft model inference
- Fused verification operations
- Support for disaggregated draft/target serving
- INT8/FP8 draft models with FP16 targets
SGLang includes:
- EAGLE-3 native support
- Speculative decoding with RadixAttention
- SpecForge for accelerated speculation training
System Architecture
Production deployments typically use one of two architectures:
Co-located: Draft and target models on the same GPU
- Simpler deployment
- Memory contention between models
- Works well when draft is small (<10% of target)
Disaggregated: Draft and target on separate GPUs/machines
- Better resource utilization
- Additional network latency
- Preferred for large-scale deployments
The disaggregated approach enables sophisticated pipelining:
```
Time →
Draft GPU:  [Draft 1][Draft 2][Draft 3][Draft 4]...
                ↓        ↓        ↓        ↓
Target GPU:     [Verify 1][Verify 2][Verify 3]...
```
The target GPU is kept fully utilized, with draft tokens always ready when needed.
Batching Considerations
Speculative decoding complicates batching because different requests may accept different numbers of tokens:
- Request A: Accepts 5/5 speculated tokens
- Request B: Accepts 2/5 speculated tokens
Naive batching would process both to the minimum (2), wasting Request A's accepted tokens. Solutions include:
- Selective batching: Only batch requests with similar expected acceptance rates
- Padding: Accept variable-length outputs with padding
- Request routing: Send high-acceptance requests to the speculative path, others to standard decoding
vLLM's implementation handles this transparently, but custom deployments should consider the impact on throughput.
Monitoring and Tuning
Key metrics for speculative decoding:
- Acceptance rate: Fraction of draft tokens accepted (target: >70%)
- Effective speedup: Wall-clock time vs. standard decoding (target: >2×)
- Speculation overhead: Time spent on rejected tokens
- Memory utilization: Draft model + target model + speculation buffers
Tuning recommendations:
- Speculation depth (K): Start with 4-5, increase if acceptance rate is high
- Draft model size: 10-20× smaller than target for optimal tradeoff
- Tree width (for EAGLE-2): Wider trees for uncertain prompts
- Batch size: Smaller batches benefit more from speculation
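Acting on the first recommendation, adjusting speculation depth from the measured acceptance rate, can be as simple as a feedback rule like the following sketch (thresholds are illustrative, not taken from any framework):

```python
def tune_speculation_depth(current_k: int, acceptance_rate: float,
                           k_min: int = 2, k_max: int = 8) -> int:
    """Simple feedback rule applied between batches: deepen speculation when
    drafts are usually accepted, shorten it when verification keeps rejecting."""
    if acceptance_rate > 0.85:
        return min(current_k + 1, k_max)
    if acceptance_rate < 0.6:
        return max(current_k - 1, k_min)
    return current_k
```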
Failure Handling
Speculative decoding adds failure modes:
- Draft model crashes: Fall back to standard decoding
- Verification timeout: Accept the partially verified sequence
- Memory exhaustion: Reduce speculation depth dynamically
- Consistency errors: Log and investigate; these should never happen with a correct implementation
Production systems should monitor for these failures and have automatic fallback paths.
Benchmarking Speculative Decoding
Spec-Bench, introduced alongside EAGLE, provides standardized evaluation:
Tasks
- MT-bench: Multi-turn conversation
- HumanEval: Code generation
- GSM8K: Mathematical reasoning
- Alpaca: Instruction following
- CNN/DailyMail: Summarization
Metrics
- Wallclock speedup: End-to-end time vs. standard decoding
- Acceptance rate: Fraction of draft tokens accepted
- Token efficiency: Tokens generated per forward pass
- Memory overhead: Additional memory vs. standard decoding
Results (EAGLE-3 vs. Baselines)
| Model | Standard | Medusa | Lookahead | EAGLE-3 |
|---|---|---|---|---|
| Llama-2-7B | 1.0× | 1.8× | 1.5× | 2.4× |
| Llama-2-13B | 1.0× | 1.7× | 1.4× | 2.3× |
| Llama-2-70B | 1.0× | 1.6× | 1.3× | 2.1× |
| Vicuna-7B | 1.0× | 1.9× | 1.6× | 2.5× |
| Code Llama-7B | 1.0× | 2.1× | 1.8× | 2.8× |
Code generation benefits most because code has predictable patterns that draft models capture well. Conversational tasks benefit less due to higher unpredictability.
Advanced Topics
Speculative Decoding Theory
Recent theoretical work has characterized speculative decoding's optimality:
Token efficiency bound: For acceptance rate $\alpha$ and speculation depth $K$, the expected number of tokens per iteration is:

$$\mathbb{E}[\text{tokens per iteration}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Optimal speculation depth: Given draft cost $c_d$ and target cost $c_t$ per forward pass, the optimal depth $K^*$ minimizes the expected cost per generated token:

$$K^* = \operatorname*{argmin}_{K}\; \frac{K \cdot c_d + c_t}{\big(1 - \alpha^{K+1}\big)\,/\,(1 - \alpha)}$$

For typical values ($\alpha \approx 0.7\text{–}0.8$, $c_d/c_t \approx 0.1$), the optimal $K$ is 4-6.

Theoretical speedup limit: As $\alpha \to 1$, the expected tokens per iteration approach $K + 1$, so the speedup approaches $(K{+}1)\times$. The practical limit is determined by draft model accuracy.
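Under this cost model the optimum is easy to locate numerically (the values below are illustrative):

```python
def cost_per_token(k: int, alpha: float, draft_cost: float, target_cost: float = 1.0) -> float:
    """Expected cost per generated token under the simple cost model above."""
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (k * draft_cost + target_cost) / expected

alpha, draft_cost = 0.8, 0.1
best_k = min(range(1, 13), key=lambda k: cost_per_token(k, alpha, draft_cost))
print(best_k)  # 6 for these settings, consistent with the 4-6 range above
```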
Speculative Decoding for Long Contexts
Long-context generation (10K+ tokens) presents unique challenges:
- KV cache growth: Both draft and target caches grow with context
- Draft accuracy decay: Drafts may become less accurate as context grows
- Verification cost: Each verification processes the full context
Solutions:
- Sliding window speculation: Only use recent context for drafting
- Hierarchical drafts: Different draft strategies for different context ranges
- Cached verification: Reuse KV cache across speculative iterations
Training Better Draft Models
Improving draft accuracy directly increases speedup. Approaches include:
- Online distillation: Continuously update the draft based on rejection patterns
- Rejection-aware training: Weight the training loss by acceptance probability
- Multi-task drafts: Train the draft on diverse prompts matching the deployment distribution
- Reinforcement learning: Optimize the draft for expected accepted tokens, not just accuracy
2025 Research Developments
Several new speculative decoding techniques have emerged in 2025:
Speculators v0.3.0 (December 2025): End-to-end training support for Eagle3 draft models with seamless vLLM integration. Includes offline data generation, single- and multi-layer draft model training, and support for both MoE and non-MoE verifiers.
SpecBundle & SpecForge v0.2 (December 2025): Collaboration between LMSYS, Ant, Meituan, Nex-AGI, and EigenAI releasing production-grade EAGLE-3 checkpoints trained on large-scale datasets. The Llama 4 Maverick draft model achieves 2.18× speedup on MT-Bench.
Fuzzy Speculative Decoding (February 2025): Relaxes the strict acceptance criteria to allow "close enough" matches, trading minimal quality degradation for significant speedup gains in specific domains.
DuoDecoding (March 2025): Hardware-aware heterogeneous speculative decoding that optimizes draft and target placement across different accelerator types (GPU + NPU, multi-GPU configurations).
RASD - Retrieval-Augmented Speculative Decoding (March 2025): Uses retrieved examples to improve draft quality for domain-specific applications, achieving higher acceptance rates when relevant context is available.
Falcon (Gao et al., 2025): Faster and parallel inference through enhanced semi-autoregressive drafting and custom-designed decoding trees.
Utility-Driven Speculative Decoding for MoE (Saxena et al., 2025): Optimizes speculation strategies specifically for Mixture-of-Experts models, accounting for expert routing overhead.
Production Framework Status (December 2025)
vLLM 0.9.1:
- Native EAGLE-1 and EAGLE-3 support (since v0.8.5)
- CUDA graphs for Eagle 1+3 reducing kernel launch overhead
- Speculative decoding metrics: draft acceptance rate, per-position acceptance, mean acceptance length
- Up to 2.5× inference speedup across diverse scenarios
SGLang with SpecForge:
- Tight integration for training-to-deployment pipeline
- Production benchmarks on H100: 1.81× throughput at batch size 2, 1.38× at batch size 64
- Draft head overhead: ~0.25B parameters for 8B model, ~1B for 70B model
Future Directions
Active research areas include:
- Adaptive speculation: Dynamically adjusting strategy based on input characteristics
- Hardware-aware speculation: Designing drafts for specific GPU architectures
- Speculation for fine-tuning: Using speculation during training, not just inference
- Cross-model speculation: Using one model family's draft for another family's target