
Speculative Decoding: Accelerating LLM Inference Without Sacrificing Quality

A comprehensive guide to speculative decoding techniques that accelerate LLM inference by 2-4× while maintaining exact output quality, covering draft models, EAGLE, Medusa, and production deployment strategies.


Large language models generate text one token at a time, with each token requiring a full forward pass through billions of parameters. This autoregressive bottleneck means that generating a 500-token response requires 500 sequential forward passes, regardless of available compute. Speculative decoding breaks this limitation by predicting multiple tokens ahead and verifying them in parallel, achieving 2-4× speedups while mathematically guaranteeing identical outputs to standard decoding.

The Autoregressive Bottleneck

Understanding why speculative decoding works requires first understanding why standard LLM inference is slow. When a transformer generates text, it produces one token at a time through a sequential process:

  1. The model processes all previous tokens (prompt + generated so far)
  2. The final layer outputs a probability distribution over the vocabulary
  3. A token is sampled from this distribution
  4. The sampled token is appended, and the process repeats

Each step requires a complete forward pass through the model. For a 70B parameter model on an A100 GPU, a single forward pass takes approximately 30-50ms. Generating 100 tokens therefore takes 3-5 seconds—not because of computational limits, but because the sequential nature prevents parallelization.

The bottleneck is memory bandwidth, not compute. Modern GPUs can perform far more operations per second than they can load model weights from memory. A 70B model requires loading 140GB of weights (in FP16) for each forward pass. Even the A100's 2TB/s memory bandwidth limits throughput to roughly 14 forward passes per second for this weight movement alone.

This creates a paradox: GPUs sit idle waiting for memory transfers while the model generates tokens one by one. The theoretical throughput based on compute (FLOPS) is 10-100× higher than actual achieved throughput. Speculative decoding exploits this gap by doing useful work during what would otherwise be idle time.

The Core Insight

Speculative decoding is based on a simple observation: verifying whether a sequence of tokens is correct is much cheaper than generating them. If we can cheaply predict multiple future tokens, we can verify them all at once with the expensive model, accepting correct predictions and rejecting incorrect ones.

The process works as follows:

  1. Draft phase: A fast mechanism proposes K candidate tokens
  2. Verification phase: The target model processes all K candidates in parallel
  3. Acceptance phase: Correct predictions are accepted; the first incorrect prediction triggers regeneration from that point

The key mathematical property is that this process is lossless—the output distribution is identical to standard autoregressive decoding. This isn't an approximation or quality tradeoff; speculative decoding produces exactly the same outputs, just faster.
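
In code, one draft-and-verify iteration looks roughly like the sketch below. This is a greedy-decoding illustration only (the distribution-preserving accept/reject rule is covered later), and it assumes Hugging Face-style models whose outputs expose a .logits tensor; the function name is ours.

Code
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One draft-then-verify iteration with greedy decoding (illustrative sketch).

    tokens: [1, seq_len] token ids. Both models are assumed to return an object
    with .logits of shape [1, seq_len, vocab_size].
    """
    prompt_len = tokens.shape[1]

    # 1. Draft phase: the cheap model proposes k tokens autoregressively.
    drafted = tokens
    for _ in range(k):
        next_tok = draft_model(drafted).logits[:, -1].argmax(-1, keepdim=True)
        drafted = torch.cat([drafted, next_tok], dim=-1)

    # 2. Verification phase: a single target forward pass scores all k drafted positions.
    target_logits = target_model(drafted).logits
    target_preds = target_logits[:, prompt_len - 1 : -1].argmax(-1)  # target's choice at each drafted slot
    draft_tokens = drafted[:, prompt_len:]

    # 3. Acceptance phase: keep drafted tokens up to the first mismatch, then append
    #    the target's own token there, so every step makes at least one token of progress.
    matches = (target_preds == draft_tokens).long()[0]
    n_accept = int(matches.cumprod(0).sum())
    bonus = target_logits[:, prompt_len + n_accept - 1].argmax(-1, keepdim=True)
    return torch.cat([drafted[:, : prompt_len + n_accept], bonus], dim=-1)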

Why Verification is Cheap

When the target model verifies K proposed tokens, it processes them as a single batch rather than K sequential passes. The KV cache from verified tokens is computed once and reused, and the model's attention can process all positions in parallel.

For a sequence of length N with K proposed tokens, verification requires attention computation over N+K positions. In contrast, sequential generation would require K separate forward passes, each with growing attention complexity. The parallel verification amortizes the fixed costs (weight loading, kernel launches) across multiple tokens.

The speedup potential is bounded by the draft mechanism's accuracy. If the draft correctly predicts all K tokens, we get K× speedup. If predictions are wrong, we waste some verification compute but still make progress. The expected speedup depends on the acceptance rate α:

\text{Expected tokens per step} = \frac{1 - \alpha^{K+1}}{1 - \alpha}

For α = 0.7 and K = 4, this gives approximately 2.8 tokens per verification step—a 2.8× speedup if draft and verification costs are balanced.
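
A two-line helper reproduces that number from the formula above:

Code
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected accepted tokens per verification step for acceptance rate alpha and depth k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_tokens_per_step(0.7, 4), 2))  # 2.77 — the ~2.8 figure quoted above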

Speculative Sampling: The Theoretical Foundation

The acceptance/rejection mechanism must be carefully designed to preserve the target model's distribution. Naive acceptance (accept if draft matches greedy decoding) would bias outputs toward the draft model's preferences. Instead, speculative decoding uses a rejection sampling scheme.

Let q(x) be the draft model's probability for token x and p(x) be the target model's probability. The acceptance probability for a drafted token is:

\text{accept}(x) = \min\left(1, \frac{p(x)}{q(x)}\right)

If the token is rejected, we sample from a residual distribution:

p_{\text{residual}}(x) = \text{normalize}\left(\max(0, p(x) - q(x))\right)

This scheme ensures the final token distribution equals the target model's distribution exactly. When q(x) ≥ p(x) (draft overestimates), we sometimes reject even when draft and target agree, compensating for oversampling. When q(x) < p(x) (draft underestimates), we always accept but might resample from the residual.

The beauty of this scheme is that it requires only the probability values, not any architectural constraints on the draft model. Any mechanism that produces probability estimates can serve as a draft, as long as we can compute both p(x) and q(x) for verification.
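
A minimal NumPy sketch of the accept/resample rule, assuming p and q are the target and draft probability vectors over the vocabulary at the current position (the function name is ours):

Code
import numpy as np

def accept_or_resample(drafted_token: int, p: np.ndarray, q: np.ndarray,
                       rng: np.random.Generator):
    """Speculative sampling for one drafted token.

    Accept with probability min(1, p/q); on rejection, resample from the
    residual distribution max(0, p - q), renormalized. Returns (token, accepted).
    """
    if rng.random() < min(1.0, p[drafted_token] / q[drafted_token]):
        return drafted_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False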

Temperature and Top-p Considerations

Real deployments use sampling with temperature and nucleus (top-p) filtering. Speculative decoding extends naturally to these settings:

Temperature: Both draft and target probabilities are temperature-scaled before computing acceptance ratios. The scheme remains valid because temperature is applied consistently.

Top-p/Top-k: Filtering is applied to the target distribution before computing acceptance. If the drafted token falls outside the target's nucleus, it's automatically rejected.

Beam search: Speculative decoding can be combined with beam search by running multiple draft sequences in parallel and verifying all beams together.

The key constraint is that draft and target must use identical sampling parameters. Mismatched temperatures would break the theoretical guarantees.

Draft Model Approaches

The draft mechanism is the critical component determining speedup. An ideal draft is fast, accurate, and produces well-calibrated probabilities. Several approaches have been developed:

Independent Draft Models

The simplest approach uses a separate, smaller model from the same family. For example, using Llama-7B to draft for Llama-70B:

Advantages:

  • Simple implementation
  • Draft model can be optimized independently
  • Works with any target model

Disadvantages:

  • Requires loading two separate models
  • Draft accuracy limited by capacity gap
  • No sharing of target model's representations

Empirically, same-family draft models achieve 60-70% acceptance rates, enabling 1.5-2× speedups. The draft model should be 10-20× smaller than the target for the compute tradeoff to be favorable.
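
Hugging Face transformers exposes this pattern as assisted generation: pass the smaller same-family model via assistant_model and the draft/verify loop runs under the hood. A sketch (the model IDs are illustrative and gated; any pair sharing a tokenizer works):

Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-70b-hf"   # expensive target model
draft_id = "meta-llama/Llama-2-7b-hf"     # cheap same-family draft

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)
# assistant_model turns on draft-and-verify generation; outputs match standard decoding.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))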

Self-Speculative Decoding

Rather than a separate model, self-speculative methods use the target model itself with early exit or layer skipping:

Early Exit: Run only the first N layers of the target model for drafting. A prediction head on layer N outputs draft probabilities. This achieves 70-80% acceptance rates with minimal overhead.

Layer Skipping: Skip every other layer during drafting, effectively halving the compute per draft token. The missing layers' contributions are approximated through residual connections.

Advantage: No additional model weights; draft naturally aligns with target
Disadvantage: Requires architecture modifications; draft speed limited by target's structure

Knowledge Distillation Drafts

Train a small draft model specifically to mimic the target:

  1. Generate data from the target model
  2. Train the draft to match target outputs (not just ground truth)
  3. The draft learns the target's distribution, not just the training distribution

Distilled drafts achieve higher acceptance rates (75-85%) because they're explicitly trained to predict what the target would produce. The cost is additional training and the need to retrain when the target changes.
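
A sketch of the distillation objective, assuming target-model logits have been pre-computed for each training batch (function and variable names are ours):

Code
import torch
import torch.nn.functional as F

def distillation_loss(draft_logits: torch.Tensor, target_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(target || draft) over the vocabulary, averaged across the batch.

    Both tensors are [batch, seq_len, vocab]. The draft is trained to match
    what the target would produce, not the ground-truth next token.
    """
    target_logp = F.log_softmax(target_logits / temperature, dim=-1)
    draft_logp = F.log_softmax(draft_logits / temperature, dim=-1)
    return F.kl_div(draft_logp, target_logp, log_target=True, reduction="batchmean")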

EAGLE: State-of-the-Art Speculative Decoding

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) represents the current state of the art for speculative decoding. Developed through three iterations (EAGLE-1, EAGLE-2, EAGLE-3), it achieves 2-3× speedups across diverse tasks while maintaining output equivalence.

Core Innovation

EAGLE's key insight is that the target model's internal representations contain rich information for predicting future tokens. Rather than training a separate draft model, EAGLE attaches a lightweight prediction head to the target model's hidden states.

The architecture works as follows:

  1. After each target model forward pass, extract the second-to-last layer's hidden states
  2. Feed these states through a small autoregressive head (typically 1-2 transformer layers)
  3. The head predicts the next token's hidden state, which is decoded to a token
  4. Repeat to generate multiple draft tokens

This approach leverages the target model's own feature representations, achieving much higher acceptance rates than independent draft models. EAGLE-1 reported 80%+ acceptance rates on many benchmarks.
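
A schematic of such a head (not the official EAGLE code): fuse the target's hidden state with the embedding of the last sampled token, run it through one small transformer layer, and decode the predicted hidden state with the target's frozen LM head.

Code
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Schematic autoregressive draft head operating on target hidden states."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        # Fuse the previous hidden state with the embedding of the last sampled token.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.block = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                                batch_first=True)

    def forward(self, prev_hidden: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
        # prev_hidden, token_emb: [batch, seq, hidden] -> predicted next hidden state.
        # In practice a causal attention mask is applied, and the output is passed
        # through the target model's frozen LM head to obtain draft token logits.
        fused = self.fuse(torch.cat([prev_hidden, token_emb], dim=-1))
        return self.block(fused)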

EAGLE-2: Dynamic Draft Trees

EAGLE-2 extends the original with dynamic draft tree construction. Rather than generating a single sequence of draft tokens, EAGLE-2 generates a tree of possibilities:

Code
            [root]
           /      \
        [a]        [b]
       / | \        |
     [c][d][e]     [f]

Each path through the tree represents a possible continuation. The target model verifies all paths in a single batched forward pass using a carefully constructed attention mask.

The tree structure is built dynamically based on the draft model's confidence:

  • High confidence → single path (save compute)
  • Low confidence → branch to explore alternatives
  • Medium confidence → limited branching

This adaptive strategy achieves better speedups than fixed-width speculation by focusing compute where uncertainty is highest. EAGLE-2 achieves approximately 2× the speed of Medusa and 2.3× the speed of Lookahead decoding.

EAGLE-3: Production Optimization (NeurIPS 2025)

EAGLE-3, presented at NeurIPS 2025, focuses on production deployment with several key innovations:

Multi-Level Feature Fusion: EAGLE-3 removes the feature prediction constraint from earlier versions. Instead of relying solely on top-layer features, it replaces them with a fusion of low-, mid-, and high-level semantic features. This provides richer context for draft prediction:

h_{\text{draft}} = \text{Fuse}(h^{(l_1)}, h^{(l_2)}, h^{(l_3)})

Where l_1, l_2, l_3 represent early, middle, and late layers respectively.

Training-Time Testing: EAGLE-3 simulates the speculative decoding process during training, allowing the draft head to learn from realistic inference-time conditions rather than just teacher forcing.

Performance Improvements: EAGLE-3 achieves 3.0×-6.5× speedup compared to vanilla autoregressive generation, representing a 20-40% improvement over EAGLE-2.

Disaggregated Serving: EAGLE-3 supports separating the draft and target models across different GPUs or even different machines. This enables:

  • Dedicated draft GPU running continuously
  • Target GPU handling verification batches
  • Pipelined execution hiding draft latency

Framework Integration: Official support for major inference frameworks:

  • vLLM 0.8.5+: Native EAGLE-1 and EAGLE-3 support with CUDA graphs (v0.9.1+)
  • TensorRT-LLM: Optimized kernels and disaggregated serving for Llama 4 Maverick
  • SGLang: SpecForge for accelerated EAGLE-3 training
  • Speculative decoding metrics for production monitoring

Mixed Precision: Draft head runs in FP8 or INT8 while target maintains FP16/BF16, reducing draft overhead further.

According to Spec-Bench (a comprehensive benchmark for speculative decoding), EAGLE-3 currently achieves the best speedups across different model sizes and tasks.

Medusa: Parallel Draft Heads

Medusa takes a different approach: instead of autoregressive draft generation, it predicts multiple future positions simultaneously using parallel heads.

Architecture

Medusa adds K prediction heads to the target model, each predicting a different future position:

  • Head 1: Predicts token at position t+1
  • Head 2: Predicts token at position t+2
  • ...
  • Head K: Predicts token at position t+K

Each head is a small MLP that takes the last hidden state and outputs vocabulary logits. The heads are trained jointly with the (frozen) target model to predict future tokens given current context.

Tree-Based Verification

Since each head makes independent predictions, we can consider multiple candidates per position. With top-3 predictions from each of 4 heads, we have 3^4 = 81 possible sequences. Rather than verifying all 81, Medusa constructs a tree:

Code
Position 1: [A, B, C]
Position 2: [D, E] given A, [F] given B, [G, H] given C
Position 3: ...

The tree is constructed to maximize expected acceptance while minimizing verification cost. Medusa uses a sparse attention mask to verify all tree paths in a single forward pass.
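
A small sketch of building such a mask from parent pointers, using the tree shown above: each node may attend to itself and its ancestors, and in practice this mask is combined with the ordinary causal mask over the prefix.

Code
import torch

def tree_attention_mask(parents) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True iff node i may attend to node j
    (j is node i itself or one of its ancestors). parents[i] = -1 marks a root."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Nodes: 0=A, 1=B, 2=C (roots); 3=D, 4=E (children of A); 5=F (child of B); 6=G, 7=H (children of C)
parents = [-1, -1, -1, 0, 0, 1, 2, 2]
print(tree_attention_mask(parents).int())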

Comparison with EAGLE

| Aspect | EAGLE | Medusa |
| --- | --- | --- |
| Draft generation | Autoregressive | Parallel |
| Architecture | Feature extraction + AR head | Multiple independent heads |
| Training | On target model features | On target model predictions |
| Acceptance rate | ~80% | ~60% |
| Draft overhead | Higher | Lower |
| Speedup | 2-3× | 1.5-2× |
| Lossless | Yes | Configurable |

EAGLE achieves higher acceptance rates because its autoregressive draft can condition on previously drafted tokens. Medusa's parallel heads are faster but each prediction is independent, leading to lower accuracy.

Medusa offers a speed-quality tradeoff: "Medusa-1" maintains exact output equivalence using rejection sampling, while "Medusa-2" relaxes this for higher speed by accepting likely-but-not-identical tokens. The relaxed mode is suitable when exact reproducibility isn't required.

Lookahead Decoding

Lookahead decoding, developed by researchers at UC Berkeley, takes yet another approach: using n-gram patterns from the prompt to predict future tokens.

Jacobi Iteration Perspective

Lookahead frames autoregressive generation as solving a fixed-point equation:

x_{t+1:t+K} = f(x_{1:t}, x_{t+1:t+K-1})

This can be solved iteratively using Jacobi iteration, where each step refines the guess for future tokens based on the current guess. The iteration converges when the model's predictions match the current hypothesis.

N-gram Cache

The practical implementation builds a cache of n-gram patterns observed in the input. When generating, it looks up n-grams matching recent context and uses their continuations as draft candidates.

For example, if the prompt contains "the quick brown fox" and we've generated "the quick", we hypothesize "brown fox" as the continuation based on the cached pattern.
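
A toy version of the cache (illustrative only; real implementations such as Lookahead build and update the cache incrementally and always verify candidates with the target model):

Code
from collections import defaultdict

def build_ngram_cache(tokens, n=2, span=2):
    """Map each n-gram to the continuations of length `span` that followed it."""
    cache = defaultdict(list)
    for i in range(len(tokens) - n - span + 1):
        cache[tuple(tokens[i:i + n])].append(tokens[i + n:i + n + span])
    return cache

prompt = "the quick brown fox jumps over the lazy dog".split()
cache = build_ngram_cache(prompt)

generated = ["the", "quick"]
candidates = cache.get(tuple(generated[-2:]), [])
print(candidates[0] if candidates else None)  # ['brown', 'fox'] — draft to be verified by the target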

This approach works surprisingly well for:

  • Code (repetitive patterns)
  • Structured text (JSON, XML)
  • Text with recurring phrases

It works less well for:

  • Creative writing
  • Conversations with novel content
  • Highly varied text

Comparison

Lookahead is training-free—it doesn't require fine-tuning any components. This makes it easy to apply to any model. However, its speedups are more variable, depending heavily on input characteristics. EAGLE and Medusa provide more consistent improvements across diverse inputs.

Comprehensive Method Comparison

Understanding the tradeoffs between different speculative decoding approaches helps choose the right method for your use case.

Feature Comparison Matrix

| Feature | Independent Draft | Self-Speculative | EAGLE-3 | Medusa | Lookahead |
| --- | --- | --- | --- | --- | --- |
| Training required | Optional distillation | Architecture mod | Draft head training | Head training | None |
| Additional parameters | Full draft model | None | ~1B | ~0.5B | None |
| Acceptance rate | 60-70% | 70-80% | 80-90% | 55-65% | Variable |
| Typical speedup | 1.5-2× | 1.8-2.2× | 3-6.5× | 1.5-2× | 1.3-2× |
| Memory overhead | 10-50% | <5% | 5-10% | 5-10% | <1% |
| Best for | Any target | Single model | Maximum speed | Simplicity | Code/structured |

Cost-Benefit Analysis

Understanding the economics of speculative decoding:

Computation cost per generated token:

Let C_d = draft model forward pass cost, C_t = target model forward pass cost, K = speculation depth, α = acceptance rate.

Standard decoding cost per token: C_t

Speculative decoding cost per token: C_{spec} = \frac{K \cdot C_d + C_t}{E[\text{accepted tokens}]} = \frac{K \cdot C_d + C_t}{\frac{1 - \alpha^{K+1}}{1 - \alpha}}

Speculative decoding is beneficial when C_{spec} < C_t, which simplifies to:

\frac{C_d}{C_t} < \frac{1 - \alpha^{K+1}}{K(1-\alpha)} - \frac{1}{K}

For α = 0.8 and K = 5, the draft can cost up to roughly 54% as much as the target per forward pass and still provide a benefit.
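
The break-even point is easy to check numerically with the expression above:

Code
def breakeven_draft_cost_ratio(alpha: float, k: int) -> float:
    """Largest C_d / C_t for which speculative decoding still reduces cost per token."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (expected_tokens - 1) / k   # equivalent to the inequality above

print(round(breakeven_draft_cost_ratio(0.8, 5), 2))  # 0.54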

Latency Breakdown

Understanding where time is spent helps optimization:

Code
Standard Autoregressive Generation (100 tokens):
├── Weight loading: 65% (memory bound)
├── Attention computation: 20%
├── FFN computation: 10%
└── Other (sampling, etc.): 5%
Total: ~3 seconds for 70B model

Speculative Decoding (100 tokens, 80% acceptance):
├── Draft generation (20 iterations × 5 tokens): 15%
├── Verification forward passes: 50%
├── Acceptance/rejection logic: 2%
├── Weight loading (amortized): 30%
└── Other: 3%
Total: ~1.2 seconds for 70B model (2.5× speedup)

The key insight: speculative decoding amortizes the weight loading cost (the dominant factor) across multiple tokens.

Specialized Techniques

Beyond the main approaches, several specialized techniques address specific scenarios:

Speculative Decoding for Mixture of Experts

MoE models like Mixtral activate only a subset of experts per token. Speculative decoding interacts interestingly with this:

Challenge: Draft and target might route to different experts, causing divergence

Solution: Route-aware drafting that conditions on expected expert selection

Utility-Driven Approach (Saxena et al., 2025):

  • Predicts which experts will be activated based on draft tokens
  • Adjusts speculation depth based on expected routing overlap
  • Achieves 15-20% higher acceptance rates than naive speculation on MoE models

Speculative Decoding for Long Contexts

Long contexts present unique challenges for speculative decoding:

KV Cache Management:

  • Both draft and target models need KV caches
  • For 128K context: draft cache (if transformer-based) can be significant
  • Solution: Use SSM-based draft models (constant state size)

Attention Cost:

  • Verification attention cost grows with context length
  • For very long contexts, verification cost can dominate
  • Solution: Use chunked verification or sliding window attention in draft

Draft Quality Degradation:

  • Draft accuracy may decrease with longer contexts
  • Earlier tokens harder to predict based on compressed state
  • Solution: Dynamic speculation depth based on context length

CTC-Based Drafting

Connectionist Temporal Classification (CTC), typically used for speech recognition, has been adapted for speculative decoding. A CTC model predicts multiple tokens in parallel, naturally producing variable-length outputs:

Mechanism:

  • CTC outputs a sequence with possible blanks
  • Collapse repeated tokens and remove blanks to get draft
  • Variable-length output naturally handles different generation speeds

Advantage: Very fast draft generation (single forward pass for multiple tokens)
Disadvantage: CTC's conditional independence assumption limits accuracy

This approach is particularly effective for tasks with predictable structure, like code completion or form filling.

Multi-Model Speculation

Recent work explores using multiple draft models simultaneously:

Ensemble Drafting:

  1. Run several small models in parallel
  2. Combine their predictions (voting, ensemble)
  3. Verify the consensus predictions

Cascaded Drafting:

  1. Very fast model drafts first
  2. Medium model refines high-uncertainty positions
  3. Target model verifies final draft

Advantages:

  • Improved draft accuracy
  • Robustness to individual model failures
  • Can combine specialized drafts (e.g., code expert + language expert)

Disadvantages:

  • Increased complexity
  • Higher memory requirements
  • Coordination overhead

Retrieval-Augmented Speculative Decoding (RASD)

RASD (March 2025) uses retrieved examples to improve draft quality:

Mechanism:

  1. Retrieve similar examples from a database based on current context
  2. Use retrieved continuations to inform draft
  3. Combine retrieval-based and model-based drafts

Benefits:

  • Higher acceptance rates for domain-specific applications
  • Can leverage existing retrieval infrastructure
  • Particularly effective for repetitive domains (customer support, documentation)

Considerations:

  • Requires maintaining and updating retrieval database
  • Additional latency from retrieval
  • Privacy implications of retrieved content

Real-World Performance Analysis

Case Study: Code Generation

Code generation is often cited as the best use case for speculative decoding due to predictable patterns:

GitHub Copilot-style completion:

  • High repetition (variable names, function patterns)
  • Structured syntax
  • EAGLE-3 achieves 4-6× speedup on code tasks

Benchmark results (HumanEval, MBPP):

| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
| --- | --- | --- | --- |
| Llama-70B baseline | 45 | 1.0× | |
| + Llama-7B draft | 28 | 1.6× | 68% |
| + EAGLE-3 | 12 | 3.8× | 84% |
| + Medusa | 22 | 2.0× | 61% |

Case Study: Conversational AI

Conversational tasks are more challenging due to unpredictability:

Chat/Assistant workloads:

  • More diverse vocabulary
  • Context-dependent responses
  • Lower acceptance rates

MT-Bench results:

| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
| --- | --- | --- | --- |
| Vicuna-33B baseline | 38 | 1.0× | |
| + Vicuna-7B draft | 26 | 1.5× | 62% |
| + EAGLE-3 | 16 | 2.4× | 76% |

Case Study: Document Summarization

Summarization involves processing long inputs and generating condensed outputs:

Characteristics:

  • Long context (10K+ tokens)
  • Moderate acceptance rates
  • Benefits from hybrid approaches

CNN/DailyMail results:

| Model Setup | Time per summary | Speedup |
| --- | --- | --- |
| GPT-3.5 baseline | 8.2s | 1.0× |
| + Speculative | 4.1s | 2.0× |
| Llama-70B baseline | 12.5s | 1.0× |
| + EAGLE-3 | 4.8s | 2.6× |

Production Deployment

Deploying speculative decoding in production requires careful consideration of system architecture, resource allocation, and failure handling.

Framework Support

Major inference frameworks now support speculative decoding:

vLLM offers:

  • Draft model speculation with configurable speculation depth
  • Medusa and EAGLE integration
  • PagedAttention for efficient KV cache management during speculation
  • Automatic speculation depth tuning

TensorRT-LLM provides:

  • Optimized CUDA kernels for draft model inference
  • Fused verification operations
  • Support for disaggregated draft/target serving
  • INT8/FP8 draft models with FP16 targets

SGLang includes:

  • EAGLE-3 native support
  • Speculative decoding with RadixAttention
  • SpecForge for accelerated speculation training

System Architecture

Production deployments typically use one of two architectures:

Co-located: Draft and target models on the same GPU

  • Simpler deployment
  • Memory contention between models
  • Works well when draft is small (<10% of target)

Disaggregated: Draft and target on separate GPUs/machines

  • Better resource utilization
  • Additional network latency
  • Preferred for large-scale deployments

The disaggregated approach enables sophisticated pipelining:

Code
Time →

Draft GPU:   [Draft 1][Draft 2][Draft 3][Draft 4]...
              ↓         ↓         ↓         ↓
Target GPU:   [Verify 1][Verify 2][Verify 3]...

The target GPU is kept fully utilized, with draft tokens always ready when needed.

Batching Considerations

Speculative decoding complicates batching because different requests may accept different numbers of tokens:

  • Request A: Accepts 5/5 speculated tokens
  • Request B: Accepts 2/5 speculated tokens

Naive batching would process both to the minimum (2), wasting Request A's accepted tokens. Solutions include:

Selective batching: Only batch requests with similar expected acceptance rates
Padding: Accept variable-length outputs with padding
Request routing: Send high-acceptance requests to speculative path, others to standard decoding

vLLM's implementation handles this transparently, but custom deployments should consider the impact on throughput.

Monitoring and Tuning

Key metrics for speculative decoding:

  1. Acceptance rate: Fraction of draft tokens accepted (target: >70%)
  2. Effective speedup: Wall-clock time vs. standard decoding (target: >2×)
  3. Speculation overhead: Time spent on rejected tokens
  4. Memory utilization: Draft model + target model + speculation buffers

Tuning recommendations:

  • Speculation depth (K): Start with 4-5, increase if acceptance rate is high
  • Draft model size: 10-20× smaller than target for optimal tradeoff
  • Tree width (for EAGLE-2): Wider trees for uncertain prompts
  • Batch size: Smaller batches benefit more from speculation

Failure Handling

Speculative decoding adds failure modes:

Draft model crashes: Fall back to standard decoding
Verification timeout: Accept partially verified sequence
Memory exhaustion: Reduce speculation depth dynamically
Consistency errors: Log and investigate; these should never happen with a correct implementation

Production systems should monitor for these failures and have automatic fallback paths.

Benchmarking Speculative Decoding

Spec-Bench, introduced alongside EAGLE, provides standardized evaluation:

Tasks

  • MT-bench: Multi-turn conversation
  • HumanEval: Code generation
  • GSM8K: Mathematical reasoning
  • Alpaca: Instruction following
  • CNN/DailyMail: Summarization

Metrics

  • Wallclock speedup: End-to-end time vs. standard decoding
  • Acceptance rate: Fraction of draft tokens accepted
  • Token efficiency: Tokens generated per forward pass
  • Memory overhead: Additional memory vs. standard decoding

Results (EAGLE-3 vs. Baselines)

| Model | Standard | Medusa | Lookahead | EAGLE-3 |
| --- | --- | --- | --- | --- |
| Llama-2-7B | 1.0× | 1.8× | 1.5× | 2.4× |
| Llama-2-13B | 1.0× | 1.7× | 1.4× | 2.3× |
| Llama-2-70B | 1.0× | 1.6× | 1.3× | 2.1× |
| Vicuna-7B | 1.0× | 1.9× | 1.6× | 2.5× |
| Code Llama-7B | 1.0× | 2.1× | 1.8× | 2.8× |

Code generation benefits most because code has predictable patterns that drafts model well. Conversational tasks benefit less due to higher unpredictability.

Advanced Topics

Speculative Decoding Theory

Recent theoretical work has characterized speculative decoding's optimality:

Token efficiency bound: For acceptance rate α and speculation depth K, the expected tokens per iteration is:

E[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}

Optimal speculation depth: Given draft cost c_d and target cost c_t, the optimal K satisfies:

K^* = \arg\max_K \frac{E[\text{tokens}]}{c_d \cdot K + c_t}

For typical values (α = 0.7, c_d/c_t = 0.1), optimal K is 4-6.
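
A short sweep implements this argmax directly (costs normalized so that c_t = 1):

Code
def best_speculation_depth(alpha: float, cost_ratio: float, max_k: int = 16) -> int:
    """K maximizing expected accepted tokens per unit cost; cost_ratio = c_d / c_t."""
    def tokens_per_cost(k: int) -> float:
        expected = (1 - alpha ** (k + 1)) / (1 - alpha)
        return expected / (cost_ratio * k + 1.0)
    return max(range(1, max_k + 1), key=tokens_per_cost)

print(best_speculation_depth(alpha=0.7, cost_ratio=0.1))  # 4, consistent with the 4-6 range above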

Theoretical speedup limit: As α → 1, speedup approaches K. The practical limit is determined by draft model accuracy.

Speculative Decoding for Long Contexts

Long-context generation (10K+ tokens) presents unique challenges:

  1. KV cache growth: Both draft and target caches grow with context
  2. Draft accuracy decay: Drafts may become less accurate as context grows
  3. Verification cost: Each verification processes the full context

Solutions:

  • Sliding window speculation: Only use recent context for drafting
  • Hierarchical drafts: Different draft strategies for different context ranges
  • Cached verification: Reuse KV cache across speculative iterations

Training Better Draft Models

Improving draft accuracy directly increases speedup. Approaches include:

Online distillation: Continuously update draft based on rejection patterns
Rejection-aware training: Weight training loss by acceptance probability
Multi-task drafts: Train draft on diverse prompts matching deployment distribution
Reinforcement learning: Optimize draft for expected accepted tokens, not just accuracy

2025 Research Developments

Several new speculative decoding techniques have emerged in 2025:

Speculators v0.3.0 (December 2025): End-to-end training support for Eagle3 draft models with seamless vLLM integration. Includes offline data generation, single- and multi-layer draft model training, and support for both MoE and non-MoE verifiers.

SpecBundle & SpecForge v0.2 (December 2025): Collaboration between LMSYS, Ant, Meituan, Nex-AGI, and EigenAI releasing production-grade EAGLE-3 checkpoints trained on large-scale datasets. The Llama 4 Maverick draft model achieves 2.18× speedup on MT-Bench.

Fuzzy Speculative Decoding (February 2025): Relaxes the strict acceptance criteria to allow "close enough" matches, trading minimal quality degradation for significant speedup gains in specific domains.

DuoDecoding (March 2025): Hardware-aware heterogeneous speculative decoding that optimizes draft and target placement across different accelerator types (GPU + NPU, multi-GPU configurations).

RASD - Retrieval-Augmented Speculative Decoding (March 2025): Uses retrieved examples to improve draft quality for domain-specific applications, achieving higher acceptance rates when relevant context is available.

Falcon (Gao et al., 2025): Faster and parallel inference through enhanced semi-autoregressive drafting and custom-designed decoding trees.

Utility-Driven Speculative Decoding for MoE (Saxena et al., 2025): Optimizes speculation strategies specifically for Mixture-of-Experts models, accounting for expert routing overhead.

Production Framework Status (December 2025)

vLLM 0.9.1:

  • Native EAGLE-1 and EAGLE-3 support (since v0.8.5)
  • CUDA graphs for Eagle 1+3 reducing kernel launch overhead
  • Speculative decoding metrics: draft acceptance rate, per-position acceptance, mean acceptance length
  • Up to 2.5× inference speedup across diverse scenarios

SGLang with SpecForge:

  • Tight integration for training-to-deployment pipeline
  • Production benchmarks on H100: 1.81× throughput at batch size 2, 1.38× at batch size 64
  • Draft head overhead: ~0.25B parameters for 8B model, ~1B for 70B model

Future Directions

Active research areas include:

Adaptive speculation: Dynamically adjusting strategy based on input characteristics
Hardware-aware speculation: Designing drafts for specific GPU architectures
Speculation for fine-tuning: Using speculation during training, not just inference
Cross-model speculation: Using one model family's draft for another family's target

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
