LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
The Inference Challenge
Running LLMs in production is expensive and slow. A 70B parameter model requires ~140GB of GPU memory in FP16, costs dollars per thousand queries, and adds seconds of latency. Inference optimization isn't optional—it's essential for any serious deployment.
This post covers the techniques that production teams use to make LLM inference fast, efficient, and affordable: quantization, attention optimization, batching strategies, and deployment frameworks.
Quantization: Reducing Precision for Speed
Quantization reduces the precision of model weights and/or activations, dramatically reducing memory usage and often improving speed.
Why quantization works at all: Neural networks are surprisingly robust to reduced precision. Weights learned during training contain far more precision than the network needs to function correctly—FP32 captures 7 decimal places of precision, but the actual "information" in each weight is much coarser. This redundancy means we can round weights to fewer bits without significantly changing the network's outputs. The key insight: quantization errors often cancel out across millions of operations, rather than accumulating.
The memory-bandwidth bottleneck: LLM inference is "memory-bound"—GPUs spend more time waiting for data to load from memory than actually computing. Quantization helps twice: first, it shrinks the weights, so fewer bytes stream from slow GPU memory (HBM) on every decoding step; second, it reduces KV cache size, so attention loads faster. Because per-token latency is dominated by this data movement, halving the bytes per parameter roughly halves decode time, even if each operation is theoretically less precise.
Trade-offs by use case: Production deployments optimize for throughput (tokens/dollar), not accuracy. A 2% quality loss with 3x cost reduction is often worthwhile—you can afford more queries, more retries, or longer contexts. Research deployments may prioritize quality. Interactive applications need low latency, so speculative decoding plus quantization often beats high-precision single-model inference.
Understanding Precision Formats
| Format | Bits | Memory | Use Case |
|---|---|---|---|
| FP32 | 32 | 4 bytes/param | Training (legacy) |
| FP16/BF16 | 16 | 2 bytes/param | Training, inference |
| FP8 | 8 | 1 byte/param | H100 inference |
| INT8 | 8 | 1 byte/param | Production inference |
| INT4 | 4 | 0.5 bytes/param | Memory-constrained |
Memory impact is dramatic. From research: "A 7-billion parameter model in FP32 format would require approximately 28 GB of RAM. Quantized to INT4, that same model could fit into roughly 3.5 GB."
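This arithmetic is easy to sanity-check; a minimal sketch (weights only—KV cache and activation overhead not included):

```python
# Weights-only memory at different precisions (ignores KV cache and activations).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:10s}  7B: {weight_memory_gb(7e9, fmt):6.1f} GB   70B: {weight_memory_gb(70e9, fmt):6.1f} GB")
# FP32        7B:   28.0 GB   70B:  280.0 GB
# INT4        7B:    3.5 GB   70B:   35.0 GB
```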
Quantization Methods
Post-Training Quantization (PTQ): Quantize after training, no additional training needed.
Quantization-Aware Training (QAT): Train with quantization in the loop—better accuracy but requires training resources.
Weight-Only Quantization: Quantize only weights, keep activations in higher precision. Generally preserves accuracy better.
GPTQ: Layer-Wise Quantization
GPTQ quantizes each transformer layer independently to minimize cumulative error:
- Process one layer at a time
- Quantize weights while minimizing MSE vs. full-precision output
- Use Hessian updates to compensate for quantization error
- Use mixed INT4/FP16 precision: weights in INT4, activations in FP16
From research: "GPTQ quantizes each layer individually, minimizing the mean squared error (MSE) between quantized and full-precision weights."
AWQ: Activation-Aware Weight Quantization
AWQ improves on GPTQ by considering activation patterns:
Key insight: "A small fraction (less than 1%) of weights have a large impact on the output. These 'salient weights' are kept in high precision (e.g., FP16), while the rest are quantized to INT3 or INT4."
AWQ identifies critical weights based on activation magnitude and preserves them, reducing accuracy loss.
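To make the salient-weight idea concrete, here is a toy sketch—not AWQ's actual implementation—that ranks input channels by average activation magnitude over calibration data and flags the top ~1% to keep in FP16:

```python
import torch

def find_salient_channels(activations: torch.Tensor, keep_fraction: float = 0.01) -> torch.Tensor:
    """Toy version of AWQ's saliency criterion.

    activations: [num_calibration_tokens, hidden_dim] inputs to a linear layer.
    Returns a boolean mask over input channels; True = keep in FP16, False = quantize.
    """
    channel_magnitude = activations.abs().mean(dim=0)           # average |x| per input channel
    k = max(1, int(keep_fraction * channel_magnitude.numel()))  # ~1% of channels
    salient = torch.zeros_like(channel_magnitude, dtype=torch.bool)
    salient[channel_magnitude.topk(k).indices] = True
    return salient

# Example: 4096-dim hidden size, 10k calibration tokens
calib = torch.randn(10_000, 4096)
mask = find_salient_channels(calib)
print(f"{mask.sum().item()} of {mask.numel()} channels kept in FP16")
```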
Comparing Methods
Research findings from over half a million evaluations:
| Method | Speed | Accuracy Retention | Best For |
|---|---|---|---|
| FP16 | Baseline | ~100% | When memory allows |
| INT8 | 1.5-2x | ~99% | Production default |
| GPTQ INT4 | 2-3x | 95-98% | Memory-constrained |
| AWQ INT4 | 2-3x | 96-99% | Quality-sensitive INT4 |
"AWQ generally shows less accuracy degradation compared to GPTQ" for weight-only quantization, especially for 70B+ models.
However, benchmark-specific results vary: "In benchmarks like IFEval, AWQ underperforms relative to GPTQ-INT4 even though both use 4-bit quantization."
Practical Recommendations
From NVIDIA's optimization guide and research:
- Start with INT8: "INT8 is the default choice for most applications with 1.8x speedup and negligible quality loss."
- Move to INT4 only when needed: "Move to INT4 only when latency requirements demand it."
- AWQ for quality-sensitive applications: "AWQ provides better quality than standard INT4 at similar speeds."
- Q5_K_M sweet spot: "The sweet spot appears to be Q5_K_M or Q8_0, where models retain ~95–99% of the original performance."
- Test on your tasks: Accuracy loss varies significantly by task and model size.
Attention Optimization
Attention is the bottleneck. Standard attention has O(n²) memory and compute complexity in sequence length.
Why attention dominates inference cost: In a transformer, each layer has attention and MLP (feedforward) components. MLP is O(n) in sequence length—each token processes independently. But attention compares every token to every other token, creating an N×N matrix. At 8K context, that's 64 million pairwise comparisons per layer. At 128K context, it's 16 billion. The quadratic scaling makes long-context inference fundamentally expensive.
The memory hierarchy problem: Modern GPUs have a pyramid of memory: registers (fastest, tiny), SRAM (~20MB on H100, fast), HBM (~80GB on H100, slow). Standard attention computes the full N×N matrix and stores it in HBM to later multiply by values. But reading/writing HBM is slow—10x slower than SRAM. FlashAttention's insight is that the full matrix never needs to exist; we can compute attention in tiles that fit in SRAM.
FlashAttention: The Game Changer
FlashAttention revolutionized LLM inference by optimizing memory access patterns:
The Problem: Standard attention materializes the full N×N attention matrix in GPU memory.
The Solution: Compute attention in tiles, never materializing the full matrix.
From the FlashAttention paper: "Memory savings are proportional to sequence length—since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."
How It Works: "Tiling means that blocks of inputs are loaded from HBM (GPU memory) to SRAM (fast cache), attention is performed with respect to that block, and the output is updated in HBM. By not writing the large intermediate attention matrices to HBM, the amount of memory reads/writes is reduced, which brings 2-4x wallclock time speedup."
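A minimal PyTorch sketch of the tiling idea (one head, no masking, and obviously not the fused CUDA kernel) shows that the full N×N matrix never needs to exist:

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Single-head attention computed block-by-block over keys/values.

    Never materializes the full [n, n] score matrix; keeps a running
    (max, sum, weighted-value) accumulator — the online-softmax trick
    that FlashAttention fuses into one GPU kernel.
    q, k, v: [n, d] tensors.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]           # load one K/V tile ("SRAM")
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                # [n, block] partial scores

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)     # rescale the previous accumulator
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# Sanity check against the naive quadratic-memory version
q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

FlashAttention additionally tiles the queries and keeps the accumulators in registers/SRAM inside a single fused kernel; the sketch only captures the online-softmax bookkeeping.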
FlashAttention-2 and 3
FlashAttention-2 improvements:
- 2x faster than FlashAttention-1
- Supports head dimension up to 256
- Native support for Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
FlashAttention-3:
- Optimized for Hopper GPUs (H100)
- Requires CUDA 12.3+, with 12.8 recommended
- Best performance on latest hardware
MQA and GQA
Reduce KV cache memory by sharing key-value heads:
Multi-Query Attention (MQA): All query heads share a single key-value head.
Grouped-Query Attention (GQA): Query heads grouped, each group shares key-value.
From research: "These are variants of attention where multiple heads of query attend to the same head of key and value, in order to reduce the size of KV cache during inference."
KV Cache Optimization
The KV cache stores key-value states from previous tokens, enabling autoregressive generation without recomputation.
Understanding why KV cache exists: LLMs generate tokens one at a time. To generate token 100, the model computes attention over tokens 1-99. To generate token 101, it computes attention over tokens 1-100. Without caching, you'd recompute the key and value projections for tokens 1-99 when generating token 100, then recompute them again (plus token 100) for token 101. KV cache stores these projections: compute once, reuse forever. This converts O(n²) per-token work to O(n).
Why KV cache becomes the memory bottleneck: Paradoxically, the optimization that makes generation fast creates a new bottleneck. Each layer stores keys and values for every token in the sequence. For a 70B-class model with 80 layers and 64 attention heads of dimension 128 (and no KV-head sharing), each token adds 2 (keys and values) × 80 × 64 × 128 × 2 bytes (FP16) ≈ 2.6MB to the cache. At 8K context, that's roughly 21GB per sequence. With batch size 32, the KV cache alone needs close to 700GB—more than exists on any single GPU.
The batch size tradeoff: More concurrent requests mean higher throughput (better GPU utilization), but each request needs its own KV cache. Most production systems are memory-limited, not compute-limited. PagedAttention and KV cache quantization exist precisely to fit more concurrent requests in memory, increasing throughput without buying more GPUs.
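The arithmetic above is worth wiring into a helper; this sketch assumes an FP16 cache and reuses the same illustrative dimensions:

```python
def kv_cache_gb(seq_len, batch_size, n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_elem=2):
    """KV cache size in GB: 2 (keys + values) x layers x kv_heads x head_dim x bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

print(kv_cache_gb(seq_len=8192, batch_size=1))    # ~21 GB per sequence without KV-head sharing
print(kv_cache_gb(seq_len=8192, batch_size=32))   # ~687 GB — why paging and GQA matter
print(kv_cache_gb(8192, 32, n_kv_heads=8))        # ~86 GB with 8 KV heads (GQA)
```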
The Problem
KV cache grows with:
- Sequence length
- Number of layers
- Number of KV heads
- Hidden dimension
For long sequences with large batch sizes, KV cache can exceed model weights in memory usage.
PagedAttention
vLLM's PagedAttention treats GPU memory like OS virtual memory:
"PagedAttention splits memory into small reusable pages, cutting memory waste by up to 90%."
Benefits:
- Non-contiguous storage: KV cache blocks stored anywhere in GPU memory
- Dynamic allocation: Memory allocated only as sequences grow
- Memory sharing: Identical prompt prefixes share KV cache blocks
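A toy block-table sketch of the idea (plain Python, not vLLM's implementation): logical blocks of 16 tokens map to physical blocks that can live anywhere in GPU memory and are returned to a free list when a sequence finishes:

```python
BLOCK_SIZE = 16  # tokens per KV block

class Allocator:
    """Pool of fixed-size physical KV blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def allocate(self):
        return self.free.pop()            # any free block — non-contiguous is fine
    def release(self, blocks):
        self.free.extend(blocks)          # freed blocks are immediately reusable

class BlockTable:
    """Toy page table: maps a sequence's logical blocks to physical block ids."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []         # index = logical block number

    def append_token(self, position):
        if position % BLOCK_SIZE == 0:    # current block is full -> allocate a new one
            self.physical_blocks.append(self.allocator.allocate())

alloc = Allocator(num_blocks=1024)
seq = BlockTable(alloc)
for pos in range(40):                      # a 40-token sequence needs ceil(40/16) = 3 blocks
    seq.append_token(pos)
print(seq.physical_blocks)                 # e.g. [1023, 1022, 1021] — scattered, not contiguous
```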
KV Cache Compression
Techniques to reduce KV cache memory:
- Quantization: Store KV cache in INT8 or INT4
- Pruning: Remove less important cached states
- Sliding window: Only cache recent tokens (for models that support it)
Research highlight: "TurboAttention is a novel unified technique for enabling quantized execution of attention along with a cooperative KV cache compression mechanism which reduces latency, memory footprint, with negligible accuracy loss."
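As a minimal illustration of the first technique, here is a symmetric per-tensor INT8 round-trip for a cached tensor (production implementations typically use per-channel or per-head scales):

```python
import torch

def quantize_kv_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8: store int8 values plus a single FP16 scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()

k_cache = torch.randn(8, 4096, 128)          # [kv_heads, tokens, head_dim] for one layer
q_cache, scale = quantize_kv_int8(k_cache)   # 2 bytes/element -> 1 byte/element
error = (dequantize_kv(q_cache, scale) - k_cache).abs().mean()
print(f"mean abs reconstruction error: {error.item():.4f}")
```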
Batching Strategies
Static Batching (Bad)
Wait for batch to fill, process together, wait for all to complete.
Problem: Variable output lengths cause GPU underutilization. Fast completions wait for slow ones.
Continuous Batching (Good)
"Continuous batching dynamically mixes new requests with ongoing ones so your GPU is never idle."
How it works (a simplified scheduler loop is sketched after this list):
- Process batch for one iteration
- Remove completed sequences
- Add new sequences to fill batch
- Repeat
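A deliberately simplified version of that loop, with a stand-in `model_step` in place of a real batched forward pass:

```python
import random
from collections import deque

def model_step(batch):
    """Stand-in for one forward pass: every active sequence emits one token."""
    for seq in batch:
        seq["generated"] += 1

def finished(seq):
    return seq["generated"] >= seq["target_len"]      # real systems check EOS / max tokens

waiting = deque({"id": i, "generated": 0, "target_len": random.randint(4, 64)} for i in range(100))
running, max_batch = [], 8

while waiting or running:
    # Fill free slots with waiting requests (iteration-level scheduling)
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    model_step(running)                               # one decode iteration for the whole batch

    # Evict finished sequences immediately — their slots are reused next iteration
    running = [seq for seq in running if not finished(seq)]
```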
Results: "By leveraging vLLM, users can achieve 23x LLM inference throughput while reducing p50 latency."
Iteration-Level Scheduling
Advanced frameworks schedule at the iteration level:
- Preempt low-priority requests
- Prioritize latency-sensitive queries
- Balance throughput and latency
Speculative Decoding
Speculative decoding uses a small "draft" model to generate candidate tokens, then the large "target" model verifies them in parallel.
The key insight: LLM inference is bottlenecked by sequential token generation, not parallel computation. Generating 100 tokens requires 100 forward passes through the model—each waiting for the previous token. But a forward pass that processes 1 token versus 10 tokens takes nearly the same time (GPUs are parallel). Speculative decoding exploits this: have a fast model guess multiple tokens, then verify all guesses in one forward pass of the slow model.
Why verification is cheap: Computing "what token would the model generate?" requires autoregressive sampling—compute probabilities, sample one, feed it back, repeat. But computing "what probability does the model assign to these specific tokens?" is a single parallel forward pass. You provide the tokens; the model computes all their probabilities simultaneously. This asymmetry between generation (sequential) and verification (parallel) is what makes speculative decoding work.
The acceptance criterion: Not all guesses will match what the target model would have generated. Speculative decoding accepts a guess if the target model would have produced it with similar probability. When a guess is rejected, we discard it and all subsequent guesses (since they were conditioned on the rejected token). The math ensures the output distribution exactly matches what the target model would have produced alone—speculation is lossless, just faster.
How It Works
- Draft model generates k candidate tokens quickly
- Target model verifies all k tokens in one forward pass
- Accept verified tokens, reject and regenerate from first mismatch
- Repeat
This works because verification (parallel) is faster than generation (sequential) for large models.
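A sketch of the verification step under the standard accept/reject rule, with toy distributions standing in for real draft and target models:

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One verification step of speculative sampling.

    draft_probs:  [k, vocab] — draft model's distribution at each drafted position
    target_probs: [k, vocab] — target model's distribution at the same positions
                  (obtained in ONE parallel forward pass over the drafted tokens)
    draft_tokens: [k]        — the tokens the draft model actually sampled
    Returns the accepted prefix, plus a corrected token if a draft was rejected.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):    # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # Rejected: resample from the leftover distribution max(0, p - q), renormalized,
            # and discard everything drafted after the rejection point.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return accepted
    # All k drafts accepted (the full algorithm also samples a bonus token
    # from the target's next-position distribution).
    return accepted

k, vocab = 4, 32_000
draft_probs  = torch.softmax(torch.randn(k, vocab), dim=-1)
target_probs = torch.softmax(torch.randn(k, vocab), dim=-1)
draft_tokens = torch.multinomial(draft_probs, 1).squeeze(-1)
print(speculative_step(draft_probs, target_probs, draft_tokens))
```

The accept probability min(1, p/q) and the residual resampling are what make the procedure lossless: accepted tokens are distributed exactly as if the target model had sampled them itself.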
Performance Gains
Research on AMD MI300X: "vLLM achieves up to 2.31x speedup when enabled with speculative decoding."
Production example: "In production systems handling customer service queries, speculative decoding with a 2B draft model and 13B target model reduced per-token latency from 42ms to 18ms—a 2.3x improvement."
Finding the Sweet Spot
Research findings: "Finding the 'sweet spot' of k=1 yielded consistent gains of 20%–54% lower cost per token across all profiles vs baseline."
Optimal k depends on:
- Draft model quality (higher acceptance rate → higher k)
- Target model size (larger → more benefit from parallelism)
- Hardware characteristics
Combining with Quantization
"The most impressive results came from combining N-gram speculative decoding with FP8 quantization."
However, there are challenges: "While quantization alone can accelerate LLMs under the right hardware and precision settings, combining it with speculative decoding is non-trivial. On GPUs like the NVIDIA L4, the added overhead from quantization often cancels out the benefits of speculation."
Deployment Frameworks
vLLM
vLLM is the most popular open-source LLM serving framework:
Key Features:
- PagedAttention for memory efficiency
- Continuous batching
- Optimized CUDA kernels
- Speculative decoding support
- Quantization support (FP8, AWQ, GPTQ)
Performance: "vLLM improves throughput by 2–4× over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences."
Best For: "Flexibility and fast integration with Hugging Face models"
TensorRT-LLM
NVIDIA's optimized inference library:
Key Features:
- Custom attention kernels
- Inflight batching and paged KV caching
- Quantization down to FP4 and INT4
- Speculative decoding
- Tight integration with Triton Inference Server
Performance: "On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests."
Best For: "Maximum performance when deep in the NVIDIA ecosystem"
Comparison
| Aspect | vLLM | TensorRT-LLM |
|---|---|---|
| Hardware | Any CUDA GPU | NVIDIA enterprise GPUs |
| Ease of Use | Easier | Steeper learning curve |
| Performance | Excellent | Maximum on NVIDIA |
| Model Support | Broad | NVIDIA-optimized models |
"In practice, many dev teams mix these systems—for example using TensorRT-LLM for high volume proprietary chat, and vLLM or LMDeploy for experimental and open model workloads."
Other Frameworks
- LMDeploy: Chinese-developed, excellent performance
- Text Generation Inference (TGI): Hugging Face's production server
- llama.cpp: CPU inference, great for edge deployment
Putting It All Together
Optimization Stack
Layer optimizations for maximum impact:
Layer 1: Model Selection
└── Choose appropriate model size for task
Layer 2: Quantization
└── INT8 default, INT4 if memory-constrained
Layer 3: Attention Optimization
└── FlashAttention-2/3, MQA/GQA
Layer 4: KV Cache Management
└── PagedAttention, quantized KV cache
Layer 5: Batching
└── Continuous batching
Layer 6: Speculative Decoding
└── If latency-critical and good draft model available
Layer 7: Hardware Optimization
└── Framework-specific (TensorRT-LLM on NVIDIA)
Example Configuration
Production deployment for a 70B model on 2x H100:
```python
# vLLM configuration
from vllm import LLM, AsyncLLMEngine, AsyncEngineArgs

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",  # point at an AWQ-quantized checkpoint
    tensor_parallel_size=2,
    quantization="awq",              # INT4 to fit in memory
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,      # KV cache sharing
    max_model_len=8192,
)

# Serve with continuous batching (the async engine takes the same arguments)
engine_args = AsyncEngineArgs(model="meta-llama/Llama-3-70B-Instruct",
                              tensor_parallel_size=2, quantization="awq",
                              gpu_memory_utilization=0.9, max_model_len=8192)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
Cost-Performance Tradeoffs
| Configuration | Throughput | Latency | Memory | Quality |
|---|---|---|---|---|
| FP16, no optimization | 1x | 1x | 140GB | 100% |
| INT8 + FlashAttention | 2x | 0.7x | 70GB | ~99% |
| INT4 + PagedAttention | 3x | 0.6x | 40GB | ~97% |
| INT4 + Speculative | 4x | 0.4x | 45GB | ~97% |
Monitoring and Profiling
Key Metrics
| Metric | Target | Notes |
|---|---|---|
| Time to First Token (TTFT) | < 500ms | User experience critical |
| Inter-Token Latency | < 50ms | Streaming smoothness |
| Throughput (tok/s) | Maximize | Cost efficiency |
| GPU Utilization | > 80% | Resource efficiency |
| Memory Usage | < 90% | Headroom for spikes |
Profiling Tools
- NVIDIA Nsight: Detailed GPU profiling
- PyTorch Profiler: Model-level insights
- vLLM metrics: Built-in Prometheus metrics
- TensorRT-LLM profiler: Framework-specific
Conclusion
LLM inference optimization is a multi-layered discipline. The key techniques:
- Quantization: INT8 default, INT4 when needed, AWQ for quality
- FlashAttention: Essential for any production deployment
- PagedAttention: Dramatic memory efficiency gains
- Continuous batching: 10-20x throughput improvement
- Speculative decoding: 2-3x latency reduction when applicable
Start with vLLM for flexibility, graduate to TensorRT-LLM for maximum NVIDIA performance. Layer optimizations based on your constraints: memory-limited → aggressive quantization; latency-limited → speculative decoding; throughput-limited → continuous batching optimization.
The field is evolving rapidly. Stay current with framework releases and new optimization techniques.