Speculative Decoding: Accelerating LLM Inference Without Sacrificing Quality
A comprehensive guide to speculative decoding techniques that accelerate LLM inference by 2-4× while maintaining exact output quality, covering draft models, EAGLE, Medusa, and production deployment strategies.
Large language models generate text one token at a time, with each token requiring a full forward pass through billions of parameters. This autoregressive bottleneck means that generating a 500-token response requires 500 sequential forward passes, regardless of available compute. Speculative decoding breaks this limitation by predicting multiple tokens ahead and verifying them in parallel, achieving 2-4× speedups while mathematically guaranteeing identical outputs to standard decoding.
The Autoregressive Bottleneck
Understanding why speculative decoding works requires first understanding why standard LLM inference is slow. When a transformer generates text, it produces one token at a time through a sequential process:
- The model processes all previous tokens (prompt + generated so far)
- The final layer outputs a probability distribution over the vocabulary
- A token is sampled from this distribution
- The sampled token is appended, and the process repeats
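In code, the loop is roughly the following (a minimal sketch with a Hugging Face-style causal LM; the checkpoint name is a placeholder, and the KV cache that real decoders reuse is omitted for clarity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

input_ids = tokenizer("Speculative decoding is", return_tensors="pt").input_ids

for _ in range(100):                                   # one full forward pass per token
    logits = model(input_ids).logits[:, -1, :]         # distribution over the vocabulary
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```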
Each step requires a complete forward pass through the model. For a 70B parameter model on an A100 GPU, a single forward pass takes approximately 30-50ms. Generating 100 tokens therefore takes 3-5 seconds—not because of computational limits, but because the sequential nature prevents parallelization.
The bottleneck is memory bandwidth, not compute. Modern GPUs can perform far more operations per second than they can load model weights from memory. A 70B model requires loading 140GB of weights (in FP16) for each forward pass. Even the A100's 2TB/s memory bandwidth limits throughput to roughly 14 forward passes per second for this weight movement alone.
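The arithmetic behind that bound:

$$\frac{140\ \text{GB per forward pass}}{2\ \text{TB/s}} \approx 70\ \text{ms per pass} \;\Rightarrow\; \approx 14\ \text{forward passes per second}$$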
This creates a paradox: GPUs sit idle waiting for memory transfers while the model generates tokens one by one. The theoretical throughput based on compute (FLOPS) is 10-100× higher than actual achieved throughput. Speculative decoding exploits this gap by doing useful work during what would otherwise be idle time.
The Core Insight
Speculative decoding is based on a simple observation: verifying whether a sequence of tokens is correct is much cheaper than generating them. If we can cheaply predict multiple future tokens, we can verify them all at once with the expensive model, accepting correct predictions and rejecting incorrect ones.
The process works as follows:
- Draft phase: A fast mechanism proposes candidate tokens
- Verification phase: The target model processes all candidates in parallel
- Acceptance phase: Correct predictions are accepted; the first incorrect prediction triggers regeneration from that point
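Structurally, one iteration looks like the following sketch, where `draft_model` and `target_model` stand in for any callables returning logits of shape [batch, seq, vocab]; the acceptance check is shown as a greedy match for brevity, with the lossless sampling rule covered in the next section:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=5):
    """One draft-verify-accept iteration (schematic)."""
    # 1. Draft phase: propose k candidate tokens autoregressively with the cheap model
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verification phase: one target forward pass scores all candidates in parallel
    target_logits = target_model(draft_ids).logits

    # 3. Acceptance phase: keep the longest prefix the target agrees with
    #    (greedy check for brevity; sampled decoding uses the rejection rule below)
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        proposed = draft_ids[:, n_prompt + i]
        target_choice = target_logits[:, n_prompt + i - 1, :].argmax(-1)
        if not torch.equal(proposed, target_choice):
            accepted = torch.cat([accepted, target_choice.unsqueeze(-1)], dim=-1)
            break
        accepted = torch.cat([accepted, proposed.unsqueeze(-1)], dim=-1)
    return accepted
```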
The key mathematical property is that this process is lossless—the output distribution is identical to standard autoregressive decoding. This isn't an approximation or quality tradeoff; speculative decoding produces exactly the same outputs, just faster.
Why Verification is Cheap
When the target model verifies proposed tokens, it processes them as a single batch rather than sequential passes. The KV cache from verified tokens is computed once and reused, and the model's attention can process all positions in parallel.
For a sequence of length $n$ with $K$ proposed tokens, verification requires a single forward pass with attention over $n + K$ positions. In contrast, sequential generation would require $K$ separate forward passes, each with growing attention complexity. The parallel verification amortizes the fixed costs (weight loading, kernel launches) across multiple tokens.
The speedup potential is bounded by the draft mechanism's accuracy. If the draft correctly predicts all $K$ tokens, we get up to a $(K{+}1)\times$ speedup for that step (the $K$ accepted tokens plus one sampled during verification). If predictions are wrong, we waste some verification compute but still make progress. The expected number of tokens per verification step depends on the acceptance rate $\alpha$ and speculation depth $K$:

$$\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

For $\alpha = 0.7$ and $K = 4$, this gives approximately 2.8 tokens per verification step, i.e. roughly a 2.8× speedup when the draft's cost is small relative to verification.
Speculative Sampling: The Theoretical Foundation
The acceptance/rejection mechanism must be carefully designed to preserve the target model's distribution. Naive acceptance (accept if draft matches greedy decoding) would bias outputs toward the draft model's preferences. Instead, speculative decoding uses a rejection sampling scheme.
Let $q(x)$ be the draft model's probability for token $x$ and $p(x)$ be the target model's probability. The acceptance probability for a drafted token $x$ is:

$$P(\text{accept } x) = \min\!\left(1, \frac{p(x)}{q(x)}\right)$$

If the token is rejected, we sample a replacement from a residual distribution:

$$p'(x) = \frac{\max\big(0,\; p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\; p(x') - q(x')\big)}$$

This scheme ensures the final token distribution equals the target model's distribution exactly. When $q(x) > p(x)$ (the draft overestimates), we sometimes reject even tokens the target finds plausible, compensating for oversampling. When $q(x) \le p(x)$ (the draft underestimates), we always accept; the residual distribution restores the probability mass the draft under-sampled.

The beauty of this scheme is that it requires only the probability values, not any architectural constraints on the draft model. Any mechanism that produces probability estimates can serve as a draft, as long as we can compute both $p(x)$ and $q(x)$ for verification.
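In code, the per-token accept/reject decision looks like this (a sketch operating on the two probability vectors for a single position):

```python
import torch

def accept_or_resample(p, q, drafted_token, generator=None):
    """Speculative sampling decision for one drafted token.
    p, q: 1-D tensors holding the target's and draft's probability
    distributions over the vocabulary at this position."""
    ratio = p[drafted_token] / q[drafted_token]
    if torch.rand((), generator=generator) < ratio:      # accept with prob min(1, p/q)
        return drafted_token, True

    # Rejected: sample a replacement from the normalized residual max(0, p - q)
    residual = torch.clamp(p - q, min=0)
    residual = residual / residual.sum()
    replacement = torch.multinomial(residual, num_samples=1).item()
    return replacement, False
```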
Temperature and Top-p Considerations
Real deployments use sampling with temperature and nucleus (top-p) filtering. Speculative decoding extends naturally to these settings:
Temperature: Both draft and target probabilities are temperature-scaled before computing acceptance ratios. The scheme remains valid because temperature is applied consistently.
Top-p/Top-k: Filtering is applied to the target distribution before computing acceptance. If the drafted token falls outside the target's nucleus, it's automatically rejected.
Beam search: Speculative decoding can be combined with beam search by running multiple draft sequences in parallel and verifying all beams together.
The key constraint is that draft and target must use identical sampling parameters. Mismatched temperatures would break the theoretical guarantees.
Draft Model Approaches
The draft mechanism is the critical component determining speedup. An ideal draft is fast, accurate, and produces well-calibrated probabilities. Several approaches have been developed:
Independent Draft Models
The simplest approach uses a separate, smaller model from the same family. For example, using Llama-7B to draft for Llama-70B:
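A concrete way to express this pairing is through a serving framework that supports draft-model speculation. The sketch below uses vLLM's offline `LLM` API; the speculative-decoding arguments have changed across vLLM releases, so treat the configuration keys as illustrative rather than exact:

```python
from vllm import LLM, SamplingParams

# Illustrative configuration: a 7B draft paired with a 70B target from the same family.
# Speculative-decoding key names differ between vLLM versions; check the docs
# for the release you deploy.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",           # target model
    speculative_config={
        "model": "meta-llama/Llama-2-7b-hf",     # draft model
        "num_speculative_tokens": 5,             # speculation depth K
    },
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```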
Advantages:
- Simple implementation
- Draft model can be optimized independently
- Works with any target model
Disadvantages:
- Requires loading two separate models
- Draft accuracy limited by capacity gap
- No sharing of target model's representations
Empirically, same-family draft models achieve 60-70% acceptance rates, enabling 1.5-2× speedups. The draft model should be 10-20× smaller than the target for the compute tradeoff to be favorable.
Self-Speculative Decoding
Rather than a separate model, self-speculative methods use the target model itself with early exit or layer skipping:
Early Exit: Run only the first N layers of the target model for drafting. A prediction head on layer N outputs draft probabilities. This achieves 70-80% acceptance rates with minimal overhead.
Layer Skipping: Skip every other layer during drafting, effectively halving the compute per draft token. The missing layers' contributions are approximated through residual connections.
Advantages: No additional model weights; the draft naturally aligns with the target.
Disadvantages: Requires architecture modifications; draft speed is limited by the target's structure.
Knowledge Distillation Drafts
Train a small draft model specifically to mimic the target:
- Generate data from the target model
- Train the draft to match target outputs (not just ground truth)
- The draft learns the target's distribution, not just the training distribution
Distilled drafts achieve higher acceptance rates (75-85%) because they're explicitly trained to predict what the target would produce. The cost is additional training and the need to retrain when the target changes.
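One common distillation objective, sketched below under the assumption that draft and target share a tokenizer so their vocabulary logits align position by position:

```python
import torch
import torch.nn.functional as F

def distillation_loss(draft_logits, target_logits, temperature=1.0):
    """KL(target || draft) over matching token positions.
    Both logits tensors have shape [batch, seq_len, vocab] and come from
    running the two models over text sampled from the target model."""
    t = temperature
    target_probs = F.softmax(target_logits / t, dim=-1)
    draft_log_probs = F.log_softmax(draft_logits / t, dim=-1)
    # Penalizes the draft wherever it under-covers tokens the target assigns mass to.
    return F.kl_div(draft_log_probs, target_probs, reduction="batchmean") * (t * t)
```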
EAGLE: State-of-the-Art Speculative Decoding
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) represents the current state of the art for speculative decoding. Developed through three iterations (EAGLE-1, EAGLE-2, EAGLE-3), it achieves 2-3× speedups across diverse tasks while maintaining output equivalence.
Core Innovation
EAGLE's key insight is that the target model's internal representations contain rich information for predicting future tokens. Rather than training a separate draft model, EAGLE attaches a lightweight prediction head to the target model's hidden states.
The architecture works as follows:
- After each target model forward pass, extract the second-to-last layer's hidden states
- Feed these states through a small autoregressive head (typically 1-2 transformer layers)
- The head predicts the next token's hidden state, which is decoded to a token
- Repeat to generate multiple draft tokens
This approach leverages the target model's own feature representations, achieving much higher acceptance rates than independent draft models. EAGLE-1 reported 80%+ acceptance rates on many benchmarks.
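The sketch below shows the general shape of such a draft head, assuming the input is the target's hidden feature fused with the embedding of the token just sampled; it is a structural illustration, not the official EAGLE code:

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Illustrative EAGLE-style draft head: predicts the target's next hidden
    feature from the current feature plus the sampled token's embedding."""
    def __init__(self, hidden_size: int, num_layers: int = 1, num_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)  # [feature; token embedding] -> feature
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers)

    def forward(self, features: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
        # features:         [batch, seq, hidden] hidden states from the target model
        # token_embeddings: [batch, seq, hidden] embeddings of the sampled tokens
        x = self.fuse(torch.cat([features, token_embeddings], dim=-1))
        seq_len = x.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        return self.blocks(x, mask=causal)  # predicted features for the next positions
```

At inference time, each predicted feature is passed through the target model's frozen LM head to obtain a draft token, and that token's embedding feeds the next draft step.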
EAGLE-2: Dynamic Draft Trees
EAGLE-2 extends the original with dynamic draft tree construction. Rather than generating a single sequence of draft tokens, EAGLE-2 generates a tree of possibilities:
```
        [root]
       /      \
     [a]      [b]
    / | \      |
[c] [d] [e]   [f]
```
Each path through the tree represents a possible continuation. The target model verifies all paths in a single batched forward pass using a carefully constructed attention mask.
The tree structure is built dynamically based on the draft model's confidence:
- High confidence → single path (save compute)
- Low confidence → branch to explore alternatives
- Medium confidence → limited branching
This adaptive strategy achieves better speedups than fixed-width speculation by focusing compute where uncertainty is highest. EAGLE-2 achieves approximately 2× the speed of Medusa and 2.3× the speed of Lookahead decoding.
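The branching policy can be as simple as thresholds on the draft's top-token probability. The heuristic below is illustrative; EAGLE-2 itself ranks candidate nodes by estimated acceptance probability rather than using fixed cutoffs:

```python
def branch_width(top_prob: float, max_width: int = 3) -> int:
    """Choose how many candidate children to expand at a draft-tree node,
    based on the draft's confidence in its top prediction.
    Thresholds are illustrative starting points, not EAGLE-2's actual values."""
    if top_prob >= 0.9:        # high confidence: single path, save compute
        return 1
    if top_prob >= 0.5:        # medium confidence: limited branching
        return 2
    return max_width           # low confidence: explore alternatives
```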
EAGLE-3: Production Optimization (NeurIPS 2025)
EAGLE-3, presented at NeurIPS 2025, focuses on production deployment with several key innovations:
Multi-Level Feature Fusion: EAGLE-3 removes the feature prediction constraint from earlier versions. Instead of relying solely on top-layer features, it replaces them with a fusion of low-, mid-, and high-level semantic features. This provides richer context for draft prediction:

$$h_{\text{fused}} = f\big(h_{\ell_{\text{low}}},\; h_{\ell_{\text{mid}}},\; h_{\ell_{\text{high}}}\big)$$

where $\ell_{\text{low}}$, $\ell_{\text{mid}}$, and $\ell_{\text{high}}$ index early, middle, and late layers respectively.
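A minimal sketch of such a fusion, assuming the three hidden states are simply concatenated and projected back to the model dimension (the actual EAGLE-3 operator is defined by its reference implementation):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate hidden states from an early, middle, and late layer and
    project back to the model dimension. Illustrative only."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, h_low, h_mid, h_high):
        return self.proj(torch.cat([h_low, h_mid, h_high], dim=-1))
```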
Training-Time Testing: EAGLE-3 simulates the speculative decoding process during training, allowing the draft head to learn from realistic inference-time conditions rather than just teacher forcing.
Performance Improvements: EAGLE-3 achieves 3.0×-6.5× speedup compared to vanilla autoregressive generation, representing a 20-40% improvement over EAGLE-2.
Disaggregated Serving: EAGLE-3 supports separating the draft and target models across different GPUs or even different machines. This enables:
- Dedicated draft GPU running continuously
- Target GPU handling verification batches
- Pipelined execution hiding draft latency
Framework Integration: Official support for major inference frameworks:
- vLLM 0.8.5+: Native EAGLE-1 and EAGLE-3 support with CUDA graphs (v0.9.1+), plus speculative decoding metrics for production monitoring
- TensorRT-LLM: Optimized kernels and disaggregated serving for Llama 4 Maverick
- SGLang: SpecForge for accelerated EAGLE-3 training
Mixed Precision: Draft head runs in FP8 or INT8 while target maintains FP16/BF16, reducing draft overhead further.
According to Spec-Bench (a comprehensive benchmark for speculative decoding), EAGLE-3 currently achieves the best speedups across different model sizes and tasks.
Medusa: Parallel Draft Heads
Medusa takes a different approach: instead of autoregressive draft generation, it predicts multiple future positions simultaneously using parallel heads.
Architecture
Medusa adds $K$ prediction heads to the target model, each predicting a different future position (the base LM head still predicts position $t+1$):
- Head 1: predicts the token at position $t+2$
- Head 2: predicts the token at position $t+3$
- ...
- Head $K$: predicts the token at position $t+K+1$
Each head is a small MLP that takes the last hidden state and outputs vocabulary logits. The heads are trained jointly with the (frozen) target model to predict future tokens given current context.
Tree-Based Verification
Since each head makes independent predictions, we can consider multiple candidates per position. With top-3 predictions from each of 4 heads, we have $3^4 = 81$ possible sequences. Rather than verifying all 81, Medusa constructs a tree:
```
Position 1: [A, B, C]
Position 2: [D, E] given A, [F] given B, [G, H] given C
Position 3: ...
```
The tree is constructed to maximize expected acceptance while minimizing verification cost. Medusa uses a sparse attention mask to verify all tree paths in a single forward pass.
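The mask construction itself is straightforward: each tree node may attend to the committed prefix, its ancestors, and itself. A hypothetical helper (not taken from the Medusa codebase):

```python
import torch

def tree_attention_mask(parents):
    """Build a boolean attention mask for tree-structured draft verification.
    parents[i] is the index of node i's parent within the tree, or -1 for a
    node hanging directly off the committed prefix. Node i may attend to
    itself and all of its ancestors; attention to the committed prefix is
    handled by the usual causal mask."""
    n = len(parents)
    allowed = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking ancestors
            allowed[i, j] = True
            j = parents[j]
    return allowed

# Example: nodes a, b hang off the prefix; c, d are children of a
mask = tree_attention_mask([-1, -1, 0, 0])   # indices: 0=a, 1=b, 2=c, 3=d
```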
Comparison with EAGLE
| Aspect | EAGLE | Medusa |
|---|---|---|
| Draft Generation | Autoregressive | Parallel |
| Architecture | Feature extraction + AR head | Multiple independent heads |
| Training | On target model features | On target model predictions |
| Acceptance Rate | ~80% | ~60% |
| Draft Overhead | Higher | Lower |
| Speedup | 2-3× | 1.5-2× |
| Lossless | Yes | Configurable |
EAGLE achieves higher acceptance rates because its autoregressive draft can condition on previously drafted tokens. Medusa's parallel heads are faster but each prediction is independent, leading to lower accuracy.
Medusa offers a speed-quality tradeoff: with standard rejection sampling it maintains exact output equivalence, while its relaxed "typical acceptance" scheme trades this for higher speed by accepting plausible-but-not-identical tokens. The relaxed mode is suitable when exact reproducibility isn't required.
Lookahead Decoding
Lookahead decoding, developed by researchers at UC Berkeley, takes yet another approach: using n-gram patterns from the prompt to predict future tokens.
Jacobi Iteration Perspective
Lookahead frames autoregressive (greedy) generation as solving a fixed-point problem: find tokens $y_1, \dots, y_m$ such that

$$y_i = \operatorname*{argmax}_{y}\; p\big(y \mid y_{1:i-1},\, x\big) \quad \text{for all } i,$$

where $x$ is the prompt. This system can be solved with Jacobi iteration: each step refines the guesses for all future tokens in parallel, conditioned on the current guess. The iteration converges when the model's predictions match the current hypothesis.
N-gram Cache
The practical implementation builds a cache of n-gram patterns observed in the input. When generating, it looks up n-grams matching recent context and uses their continuations as draft candidates.
For example, if the prompt contains "the quick brown fox" and we've generated "the quick", we hypothesize "brown fox" as the continuation based on the cached pattern.
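A toy version of the cache is just a dictionary keyed on recent tokens. The sketch below uses whitespace-split words for readability; real implementations operate on token IDs:

```python
from collections import defaultdict

def build_ngram_cache(tokens, n=2, span=2):
    """Map every n-gram in the prompt to the continuations observed after it."""
    cache = defaultdict(list)
    for i in range(len(tokens) - n - span + 1):
        key = tuple(tokens[i : i + n])
        cache[key].append(tokens[i + n : i + n + span])
    return cache

prompt = "the quick brown fox jumps over the lazy dog".split()
cache = build_ngram_cache(prompt)

generated = "the quick".split()
draft = cache.get(tuple(generated[-2:]), [[]])[0]
print(draft)  # ['brown', 'fox'] -- proposed as draft tokens for verification
```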
This approach works surprisingly well for:
- Code (repetitive patterns)
- Structured text (JSON, XML)
- Text with recurring phrases
It works less well for:
- Creative writing
- Conversations with novel content
- Highly varied text
Comparison
Lookahead is training-free—it doesn't require fine-tuning any components. This makes it easy to apply to any model. However, its speedups are more variable, depending heavily on input characteristics. EAGLE and Medusa provide more consistent improvements across diverse inputs.
Comprehensive Method Comparison
Understanding the tradeoffs between different speculative decoding approaches helps choose the right method for your use case.
Feature Comparison Matrix
| Feature | Independent Draft | Self-Speculative | EAGLE-3 | Medusa | Lookahead |
|---|---|---|---|---|---|
| Training required | Optional distillation | Architecture mod | Draft head training | Head training | None |
| Additional parameters | Full draft model | None | ~1B | ~0.5B | None |
| Acceptance rate | 60-70% | 70-80% | 80-90% | 55-65% | Variable |
| Typical speedup | 1.5-2× | 1.8-2.2× | 3-6.5× | 1.5-2× | 1.3-2× |
| Memory overhead | 10-50% | <5% | 5-10% | 5-10% | <1% |
| Best for | Any target | Single model | Maximum speed | Simplicity | Code/structured |
Cost-Benefit Analysis
Understanding the economics of speculative decoding:
Computation cost per generated token:
Let $c_d$ = draft model forward pass cost, $c_t$ = target model forward pass cost, $K$ = speculation depth, $\alpha$ = acceptance rate.

Standard decoding cost per token: $c_t$

Speculative decoding cost per token:

$$\frac{K \cdot c_d + c_t}{\mathbb{E}[\text{tokens per step}]} \quad \text{with} \quad \mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Speculative decoding is beneficial when this is less than $c_t$, which simplifies to:

$$\frac{c_d}{c_t} < \frac{\mathbb{E}[\text{tokens per step}] - 1}{K}$$

For $K = 4$ and $\alpha = 0.8$: the draft can cost up to roughly 60% of the target per forward pass and still provide benefit.
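These expressions are easy to sanity-check numerically with a small helper (plain Python, no dependencies):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens accepted per verification step: (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def max_draft_cost_ratio(alpha: float, k: int) -> float:
    """Largest c_d / c_t for which speculation still beats standard decoding."""
    return (expected_tokens(alpha, k) - 1) / k

print(expected_tokens(0.8, 4))        # ~3.36 tokens per verification step
print(max_draft_cost_ratio(0.8, 4))   # ~0.59 -> draft may cost ~60% of the target
```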
Latency Breakdown
Understanding where time is spent helps optimization:
```
Standard Autoregressive Generation (100 tokens):
├── Weight loading: 65% (memory bound)
├── Attention computation: 20%
├── FFN computation: 10%
└── Other (sampling, etc.): 5%
Total: ~3 seconds for 70B model

Speculative Decoding (100 tokens, 80% acceptance):
├── Draft generation (20 iterations × 5 tokens): 15%
├── Verification forward passes: 50%
├── Acceptance/rejection logic: 2%
├── Weight loading (amortized): 30%
└── Other: 3%
Total: ~1.2 seconds for 70B model (2.5× speedup)
```
The key insight: speculative decoding amortizes the weight loading cost (the dominant factor) across multiple tokens.
Specialized Techniques
Beyond the main approaches, several specialized techniques address specific scenarios:
Speculative Decoding for Mixture of Experts
MoE models like Mixtral activate only a subset of experts per token. Speculative decoding interacts interestingly with this:
Challenge: Draft and target might route to different experts, causing divergence
Solution: Route-aware drafting that conditions on expected expert selection
Utility-Driven Approach (Saxena et al., 2025):
- Predicts which experts will be activated based on draft tokens
- Adjusts speculation depth based on expected routing overlap
- Achieves 15-20% higher acceptance rates than naive speculation on MoE models
Speculative Decoding for Long Contexts
Long contexts present unique challenges for speculative decoding:
KV Cache Management:
- Both draft and target models need KV caches
- For 128K context: draft cache (if transformer-based) can be significant
- Solution: Use SSM-based draft models (constant state size)
Attention Cost:
- Verification attention cost grows with context length
- For very long contexts, verification cost can dominate
- Solution: Use chunked verification or sliding window attention in draft
Draft Quality Degradation:
- Draft accuracy may decrease with longer contexts
- Earlier tokens harder to predict based on compressed state
- Solution: Dynamic speculation depth based on context length
CTC-Based Drafting
Connectionist Temporal Classification (CTC), typically used for speech recognition, has been adapted for speculative decoding. A CTC model predicts multiple tokens in parallel, naturally producing variable-length outputs:
Mechanism:
- CTC outputs a sequence with possible blanks
- Collapse repeated tokens and remove blanks to get draft
- Variable-length output naturally handles different generation speeds
Advantage: Very fast draft generation (single forward pass for multiple tokens).
Disadvantage: CTC's conditional independence assumption limits accuracy.
This approach is particularly effective for tasks with predictable structure, like code completion or form filling.
Multi-Model Speculation
Recent work explores using multiple draft models simultaneously:
Ensemble Drafting:
- Run several small models in parallel
- Combine their predictions (voting, ensemble)
- Verify the consensus predictions
Cascaded Drafting:
- Very fast model drafts first
- Medium model refines high-uncertainty positions
- Target model verifies final draft
Advantages:
- Improved draft accuracy
- Robustness to individual model failures
- Can combine specialized drafts (e.g., code expert + language expert)
Disadvantages:
- Increased complexity
- Higher memory requirements
- Coordination overhead
Retrieval-Augmented Speculative Decoding (RASD)
RASD (March 2025) uses retrieved examples to improve draft quality:
Mechanism:
- Retrieve similar examples from a database based on current context
- Use retrieved continuations to inform draft
- Combine retrieval-based and model-based drafts
Benefits:
- Higher acceptance rates for domain-specific applications
- Can leverage existing retrieval infrastructure
- Particularly effective for repetitive domains (customer support, documentation)
Considerations:
- Requires maintaining and updating retrieval database
- Additional latency from retrieval
- Privacy implications of retrieved content
Real-World Performance Analysis
Case Study: Code Generation
Code generation is often cited as the best use case for speculative decoding due to predictable patterns:
GitHub Copilot-style completion:
- High repetition (variable names, function patterns)
- Structured syntax
- EAGLE-3 achieves 4-6× speedup on code tasks
Benchmark results (HumanEval, MBPP):
| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
|---|---|---|---|
| Llama-70B baseline | 45 | 1.0× | — |
| + Llama-7B draft | 28 | 1.6× | 68% |
| + EAGLE-3 | 12 | 3.8× | 84% |
| + Medusa | 22 | 2.0× | 61% |
Case Study: Conversational AI
Conversational tasks are more challenging due to unpredictability:
Chat/Assistant workloads:
- More diverse vocabulary
- Context-dependent responses
- Lower acceptance rates
MT-Bench results:
| Model Setup | Latency (ms/token) | Speedup | Acceptance Rate |
|---|---|---|---|
| Vicuna-33B baseline | 38 | 1.0× | — |
| + Vicuna-7B draft | 26 | 1.5× | 62% |
| + EAGLE-3 | 16 | 2.4× | 76% |
Case Study: Document Summarization
Summarization involves processing long inputs and generating condensed outputs:
Characteristics:
- Long context (10K+ tokens)
- Moderate acceptance rates
- Benefits from hybrid approaches
CNN/DailyMail results:
| Model Setup | Time per summary | Speedup |
|---|---|---|
| GPT-3.5 baseline | 8.2s | 1.0× |
| + Speculative | 4.1s | 2.0× |
| Llama-70B baseline | 12.5s | 1.0× |
| + EAGLE-3 | 4.8s | 2.6× |
Production Deployment
Deploying speculative decoding in production requires careful consideration of system architecture, resource allocation, and failure handling.
Framework Support
Major inference frameworks now support speculative decoding:
vLLM offers:
- Draft model speculation with configurable speculation depth
- Medusa and EAGLE integration
- PagedAttention for efficient KV cache management during speculation
- Automatic speculation depth tuning
TensorRT-LLM provides:
- Optimized CUDA kernels for draft model inference
- Fused verification operations
- Support for disaggregated draft/target serving
- INT8/FP8 draft models with FP16 targets
SGLang includes:
- EAGLE-3 native support
- Speculative decoding with RadixAttention
- SpecForge for accelerated speculation training
System Architecture
Production deployments typically use one of two architectures:
Co-located: Draft and target models on the same GPU
- Simpler deployment
- Memory contention between models
- Works well when draft is small (<10% of target)
Disaggregated: Draft and target on separate GPUs/machines
- Better resource utilization
- Additional network latency
- Preferred for large-scale deployments
The disaggregated approach enables sophisticated pipelining:
```
Time →
Draft GPU:  [Draft 1][Draft 2][Draft 3][Draft 4]...
                ↓        ↓        ↓        ↓
Target GPU:     [Verify 1][Verify 2][Verify 3]...
```
The target GPU is kept fully utilized, with draft tokens always ready when needed.
Batching Considerations
Speculative decoding complicates batching because different requests may accept different numbers of tokens:
- Request A: Accepts 5/5 speculated tokens
- Request B: Accepts 2/5 speculated tokens
Naive batching would process both to the minimum (2), wasting Request A's accepted tokens. Solutions include:
- Selective batching: Only batch requests with similar expected acceptance rates
- Padding: Accept variable-length outputs with padding
- Request routing: Send high-acceptance requests to the speculative path, others to standard decoding
vLLM's implementation handles this transparently, but custom deployments should consider the impact on throughput.
Monitoring and Tuning
Key metrics for speculative decoding:
- Acceptance rate: Fraction of draft tokens accepted (target: >70%)
- Effective speedup: Wall-clock time vs. standard decoding (target: >2×)
- Speculation overhead: Time spent on rejected tokens
- Memory utilization: Draft model + target model + speculation buffers
Tuning recommendations:
- Speculation depth (K): Start with 4-5, increase if acceptance rate is high
- Draft model size: 10-20× smaller than target for optimal tradeoff
- Tree width (for EAGLE-2): Wider trees for uncertain prompts
- Batch size: Smaller batches benefit more from speculation
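Acting on the first recommendation, adjusting speculation depth from the measured acceptance rate, can be as simple as a feedback rule like the following sketch (thresholds are illustrative, not taken from any framework):

```python
def tune_speculation_depth(current_k: int, acceptance_rate: float,
                           k_min: int = 2, k_max: int = 8) -> int:
    """Simple feedback rule applied between batches: deepen speculation when
    drafts are usually accepted, shorten it when verification keeps rejecting."""
    if acceptance_rate > 0.85:
        return min(current_k + 1, k_max)
    if acceptance_rate < 0.6:
        return max(current_k - 1, k_min)
    return current_k
```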
Failure Handling
Speculative decoding adds failure modes:
- Draft model crashes: Fall back to standard decoding
- Verification timeout: Accept the partially verified sequence
- Memory exhaustion: Reduce speculation depth dynamically
- Consistency errors: Log and investigate; these should never happen with a correct implementation
Production systems should monitor for these failures and have automatic fallback paths.
Benchmarking Speculative Decoding
Spec-Bench, introduced alongside EAGLE, provides standardized evaluation:
Tasks
- MT-bench: Multi-turn conversation
- HumanEval: Code generation
- GSM8K: Mathematical reasoning
- Alpaca: Instruction following
- CNN/DailyMail: Summarization
Metrics
- Wallclock speedup: End-to-end time vs. standard decoding
- Acceptance rate: Fraction of draft tokens accepted
- Token efficiency: Tokens generated per forward pass
- Memory overhead: Additional memory vs. standard decoding
Results (EAGLE-3 vs. Baselines)
| Model | Standard | Medusa | Lookahead | EAGLE-3 |
|---|---|---|---|---|
| Llama-2-7B | 1.0× | 1.8× | 1.5× | 2.4× |
| Llama-2-13B | 1.0× | 1.7× | 1.4× | 2.3× |
| Llama-2-70B | 1.0× | 1.6× | 1.3× | 2.1× |
| Vicuna-7B | 1.0× | 1.9× | 1.6× | 2.5× |
| Code Llama-7B | 1.0× | 2.1× | 1.8× | 2.8× |
Code generation benefits most because code has predictable patterns that draft models capture well. Conversational tasks benefit less due to higher unpredictability.
Advanced Topics
Speculative Decoding Theory
Recent theoretical work has characterized speculative decoding's optimality:
Token efficiency bound: For acceptance rate $\alpha$ and speculation depth $K$, the expected number of tokens per iteration is:

$$\mathbb{E}[\text{tokens per iteration}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Optimal speculation depth: Given draft cost $c_d$ and target cost $c_t$ per forward pass, the optimal depth $K^*$ minimizes the expected cost per generated token:

$$K^* = \operatorname*{argmin}_{K}\; \frac{K \cdot c_d + c_t}{\big(1 - \alpha^{K+1}\big)\,/\,(1 - \alpha)}$$

For typical values ($\alpha \approx 0.7\text{–}0.8$, $c_d/c_t \approx 0.1$), the optimal $K$ is 4-6.

Theoretical speedup limit: As $\alpha \to 1$, the expected tokens per iteration approach $K + 1$, so the speedup approaches $(K{+}1)\times$. The practical limit is determined by draft model accuracy.
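Under this cost model the optimum is easy to locate numerically (the values below are illustrative):

```python
def cost_per_token(k: int, alpha: float, draft_cost: float, target_cost: float = 1.0) -> float:
    """Expected cost per generated token under the simple cost model above."""
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (k * draft_cost + target_cost) / expected

alpha, draft_cost = 0.8, 0.1
best_k = min(range(1, 13), key=lambda k: cost_per_token(k, alpha, draft_cost))
print(best_k)  # 6 for these settings, consistent with the 4-6 range above
```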
Speculative Decoding for Long Contexts
Long-context generation (10K+ tokens) presents unique challenges:
- KV cache growth: Both draft and target caches grow with context
- Draft accuracy decay: Drafts may become less accurate as context grows
- Verification cost: Each verification processes the full context
Solutions:
- Sliding window speculation: Only use recent context for drafting
- Hierarchical drafts: Different draft strategies for different context ranges
- Cached verification: Reuse KV cache across speculative iterations
Training Better Draft Models
Improving draft accuracy directly increases speedup. Approaches include:
- Online distillation: Continuously update the draft based on rejection patterns
- Rejection-aware training: Weight the training loss by acceptance probability
- Multi-task drafts: Train the draft on diverse prompts matching the deployment distribution
- Reinforcement learning: Optimize the draft for expected accepted tokens, not just accuracy
2025 Research Developments
Several new speculative decoding techniques have emerged in 2025:
Speculators v0.3.0 (December 2025): End-to-end training support for Eagle3 draft models with seamless vLLM integration. Includes offline data generation, single- and multi-layer draft model training, and support for both MoE and non-MoE verifiers.
SpecBundle & SpecForge v0.2 (December 2025): Collaboration between LMSYS, Ant, Meituan, Nex-AGI, and EigenAI releasing production-grade EAGLE-3 checkpoints trained on large-scale datasets. The Llama 4 Maverick draft model achieves 2.18× speedup on MT-Bench.
Fuzzy Speculative Decoding (February 2025): Relaxes the strict acceptance criteria to allow "close enough" matches, trading minimal quality degradation for significant speedup gains in specific domains.
DuoDecoding (March 2025): Hardware-aware heterogeneous speculative decoding that optimizes draft and target placement across different accelerator types (GPU + NPU, multi-GPU configurations).
RASD - Retrieval-Augmented Speculative Decoding (March 2025): Uses retrieved examples to improve draft quality for domain-specific applications, achieving higher acceptance rates when relevant context is available.
Falcon (Gao et al., 2025): Faster and parallel inference through enhanced semi-autoregressive drafting and custom-designed decoding trees.
Utility-Driven Speculative Decoding for MoE (Saxena et al., 2025): Optimizes speculation strategies specifically for Mixture-of-Experts models, accounting for expert routing overhead.
Production Framework Status (December 2025)
vLLM 0.9.1:
- Native EAGLE-1 and EAGLE-3 support (since v0.8.5)
- CUDA graphs for Eagle 1+3 reducing kernel launch overhead
- Speculative decoding metrics: draft acceptance rate, per-position acceptance, mean acceptance length
- Up to 2.5× inference speedup across diverse scenarios
SGLang with SpecForge:
- Tight integration for training-to-deployment pipeline
- Production benchmarks on H100: 1.81× throughput at batch size 2, 1.38× at batch size 64
- Draft head overhead: ~0.25B parameters for 8B model, ~1B for 70B model
Future Directions
Active research areas include:
- Adaptive speculation: Dynamically adjusting strategy based on input characteristics
- Hardware-aware speculation: Designing drafts for specific GPU architectures
- Speculation for fine-tuning: Using speculation during training, not just inference
- Cross-model speculation: Using one model family's draft for another family's target