
LLM Inference Optimization: From Quantization to Speculative Decoding

A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.


The Inference Challenge

Running LLMs in production is expensive and slow. A 70B parameter model requires ~140GB of GPU memory in FP16, costs dollars per thousand queries, and adds seconds of latency. Inference optimization isn't optional—it's essential for any serious deployment.

This post covers the techniques that production teams use to make LLM inference fast, efficient, and affordable: quantization, attention optimization, batching strategies, and deployment frameworks.

Quantization: Reducing Precision for Speed

Quantization reduces the precision of model weights and/or activations, dramatically reducing memory usage and often improving speed.

Why quantization works at all: Neural networks are surprisingly robust to reduced precision. Weights learned during training contain far more precision than the network needs to function correctly—FP32 captures 7 decimal places of precision, but the actual "information" in each weight is much coarser. This redundancy means we can round weights to fewer bits without significantly changing the network's outputs. The key insight: quantization errors often cancel out across millions of operations, rather than accumulating.

The memory-bandwidth bottleneck: LLM inference is "memory-bound"—GPUs spend more time waiting for data to load from memory than actually computing. Quantization helps twice: first, it reduces model size so weights load faster; second, it reduces KV cache size so attention computation loads faster. The less data that has to stream from slow HBM for each generated token, the faster the model runs, even if each operation is nominally less precise.

Trade-offs by use case: Production deployments optimize for throughput (tokens/dollar), not accuracy. A 2% quality loss with 3x cost reduction is often worthwhile—you can afford more queries, more retries, or longer contexts. Research deployments may prioritize quality. Interactive applications need low latency, so speculative decoding plus quantization often beats high-precision single-model inference.

Understanding Precision Formats

| Format | Bits | Memory | Use Case |
|--------|------|--------|----------|
| FP32 | 32 | 4 bytes/param | Training (legacy) |
| FP16/BF16 | 16 | 2 bytes/param | Training, inference |
| FP8 | 8 | 1 byte/param | H100 inference |
| INT8 | 8 | 1 byte/param | Production inference |
| INT4 | 4 | 0.5 bytes/param | Memory-constrained |

Memory impact is dramatic. From research: "A 7-billion parameter model in FP32 format would require approximately 28 GB of RAM. Quantized to INT4, that same model could fit into roughly 3.5 GB."
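
That arithmetic is easy to verify. A minimal sketch in plain Python (no dependencies) that estimates weight memory from parameter count and bytes per parameter:

Python
# Rough weight-memory estimate: parameter count x bytes per parameter.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B in {fmt}: ~{weight_memory_gb(7e9, bytes_per_param):.1f} GB")
# 7B in FP32: ~28.0 GB ... 7B in INT4: ~3.5 GB (before any KV cache or overhead)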

Quantization Methods

Post-Training Quantization (PTQ): Quantize after training, no additional training needed.

Quantization-Aware Training (QAT): Train with quantization in the loop—better accuracy but requires training resources.

Weight-Only Quantization: Quantize only weights, keep activations in higher precision. Generally preserves accuracy better.
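
To make the weight-only PTQ idea concrete, here is a minimal sketch of symmetric, per-tensor INT8 weight quantization in PyTorch. It is deliberately simplified: production methods use per-channel or per-group scales and calibration data.

Python
import torch

def quantize_weight_int8(w: torch.Tensor):
    # Symmetric per-tensor scale: map the largest |weight| to 127.
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Weights are expanded back at matmul time; activations stay in FP16/FP32.
    return w_int8.float() * scale

w = torch.randn(4096, 4096)
w_q, scale = quantize_weight_int8(w)
print((w - dequantize(w_q, scale)).abs().mean())  # small mean quantization error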

GPTQ: Layer-Wise Quantization

GPTQ quantizes each transformer layer independently to minimize cumulative error:

  1. Process one layer at a time
  2. Quantize weights while minimizing MSE vs. full-precision output
  3. Use Hessian updates to compensate for quantization error
  4. Keep mixed INT4-FP16 precision: weights stored as INT4, activations kept in FP16

From research: "GPTQ quantizes each layer individually, minimizing the mean squared error (MSE) between quantized and full-precision weights."

AWQ: Activation-Aware Weight Quantization

AWQ improves on GPTQ by considering activation patterns:

Key insight: "A small fraction (less than 1%) of weights have a large impact on the output. These 'salient weights' are kept in high precision (e.g., FP16), while the rest are quantized to INT3 or INT4."

AWQ identifies critical weights based on activation magnitude and preserves them, reducing accuracy loss.
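
A toy illustration of that "salient channel" idea: rank input channels by mean activation magnitude on calibration data and split off the top ~1%. This is not the actual AWQ procedure, which avoids mixed-precision kernels by folding a per-channel scale into the weights instead.

Python
import torch

def split_salient_channels(weight, calib_acts, keep_frac=0.01):
    # weight: (out_features, in_features); calib_acts: (n_samples, in_features).
    # Input channels that see large activations on calibration data matter most.
    importance = calib_acts.abs().mean(dim=0)                 # (in_features,)
    k = max(1, int(keep_frac * weight.shape[1]))
    keep = torch.zeros(weight.shape[1], dtype=torch.bool)
    keep[importance.topk(k).indices] = True
    return weight[:, keep], weight[:, ~keep]                  # FP16 part, part to quantize

w = torch.randn(4096, 4096)
acts = torch.randn(512, 4096)                                 # calibration activations
w_keep_fp16, w_to_int4 = split_salient_channels(w, acts)
print(w_keep_fp16.shape, w_to_int4.shape)                     # ~1% vs ~99% of columns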

Comparing Methods

Research findings from over half a million evaluations:

| Method | Speed | Accuracy Retention | Best For |
|--------|-------|--------------------|----------|
| FP16 | Baseline | ~100% | When memory allows |
| INT8 | 1.5-2x | ~99% | Production default |
| GPTQ INT4 | 2-3x | 95-98% | Memory-constrained |
| AWQ INT4 | 2-3x | 96-99% | Quality-sensitive INT4 |

"AWQ generally shows less accuracy degradation compared to GPTQ" for weight-only quantization, especially for 70B+ models.

However, benchmark-specific results vary: "In benchmarks like IFEval, AWQ underperforms relative to GPTQ-INT4 even though both use 4-bit quantization."

Practical Recommendations

From NVIDIA's optimization guide and research:

  1. Start with INT8: "INT8 is the default choice for most applications with 1.8x speedup and negligible quality loss."

  2. Move to INT4 only when needed: "Move to INT4 only when latency requirements demand it."

  3. AWQ for quality-sensitive applications: "AWQ provides better quality than standard INT4 at similar speeds."

  4. Q5_K_M sweet spot: "The sweet spot appears to be Q5_K_M or Q8_0, where models retain ~95–99% of the original performance."

  5. Test on your tasks: Accuracy loss varies significantly by task and model size.

Attention Optimization

Attention is the bottleneck. Standard attention has O(n²) memory and compute complexity in sequence length.

Why attention dominates inference cost: In a transformer, each layer has attention and MLP (feedforward) components. The MLP cost grows linearly with sequence length because each token is processed independently. But attention compares every token to every other token, creating an N×N matrix. At 8K context, that's 64 million pairwise comparisons per layer. At 128K context, it's 16 billion. The quadratic scaling makes long-context inference fundamentally expensive.

The memory hierarchy problem: Modern GPUs have a pyramid of memory: registers (fastest, tiny), SRAM (~20MB on H100, fast), HBM (~80GB on H100, slow). Standard attention computes the full N×N matrix and stores it in HBM to later multiply by values. But reading/writing HBM is slow—10x slower than SRAM. FlashAttention's insight is that the full matrix never needs to exist; we can compute attention in tiles that fit in SRAM.

FlashAttention: The Game Changer

FlashAttention revolutionized LLM inference by optimizing memory access patterns:

The Problem: Standard attention materializes the full N×N attention matrix in GPU memory.

The Solution: Compute attention in tiles, never materializing the full matrix.

From the FlashAttention paper: "Memory savings are proportional to sequence length—since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."

How It Works: "Tiling means that blocks of inputs are loaded from HBM (GPU memory) to SRAM (fast cache), attention is performed with respect to that block, and the output is updated in HBM. By not writing the large intermediate attention matrices to HBM, the amount of memory reads/writes is reduced, which brings 2-4x wallclock time speedup."
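
In PyTorch you rarely call FlashAttention kernels directly: `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused FlashAttention-style kernel when the dtype, shapes, and hardware allow it. A minimal sketch, assuming PyTorch 2.3+ and a CUDA GPU:

Python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in FP16 on GPU: the layout fused kernels expect.
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")

# Restrict dispatch to the FlashAttention backend; the 4096 x 4096 score matrix
# is computed tile by tile in SRAM and never materialized in HBM.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)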

FlashAttention-2 and 3

FlashAttention-2 improvements:

  • 2x faster than FlashAttention-1
  • Supports head dimension up to 256
  • Native support for Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)

FlashAttention-3:

  • Optimized for Hopper GPUs (H100)
  • Requires CUDA 12.3+, with 12.8 recommended
  • Best performance on latest hardware

MQA and GQA

Reduce KV cache memory by sharing key-value heads:

Multi-Query Attention (MQA): All query heads share a single key-value head.

Grouped-Query Attention (GQA): Query heads grouped, each group shares key-value.

From research: "These are variants of attention where multiple heads of query attend to the same head of key and value, in order to reduce the size of KV cache during inference."
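
A minimal sketch of the bookkeeping, following the common open-source pattern of repeating each KV head across its query group before the attention call:

Python
import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 1024
q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)   # KV cache is 4x smaller than full MHA
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Each KV head serves n_q_heads // n_kv_heads query heads (GQA); MQA is n_kv_heads = 1.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)            # (1, 32, seq, head_dim)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)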

KV Cache Optimization

The KV cache stores key-value states from previous tokens, enabling autoregressive generation without recomputation.

Understanding why KV cache exists: LLMs generate tokens one at a time. To generate token 100, the model computes attention over tokens 1-99. To generate token 101, it computes attention over tokens 1-100. Without caching, you'd recompute the key and value projections for tokens 1-99 when generating token 100, then recompute them again (plus token 100) for token 101. KV cache stores these projections: compute once, reuse forever. This converts O(n²) per-token work to O(n).

Why KV cache becomes the memory bottleneck: Paradoxically, the optimization that makes generation fast creates a new bottleneck. Each layer stores keys and values for every token in the sequence. For a 70B-class model with 80 layers and 64 attention heads of dimension 128 (full multi-head attention, no KV-head sharing), each token adds 80 × 64 × 128 × 2 (keys and values) × 2 bytes ≈ 2.6 MB to the cache. At 8K context, that is roughly 21 GB per sequence. With batch size 32, the KV cache alone approaches 700 GB, more than exists on any single GPU.

The batch size tradeoff: More concurrent requests mean higher throughput (better GPU utilization), but each request needs its own KV cache. Most production systems are memory-limited, not compute-limited. PagedAttention and KV cache quantization exist precisely to fit more concurrent requests in memory, increasing throughput without buying more GPUs.
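
This capacity math is worth keeping in a helper; a sketch (plug in your model's actual layer count, KV-head count, head dimension, and cache dtype):

Python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # x2 for keys and values; bytes_per_elem=2 assumes an FP16/BF16 cache.
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
    return per_token * seq_len * batch / 1e9

# Full multi-head cache (64 KV heads): ~2.6 MB/token -> ~690 GB at 8K context, batch 32.
print(kv_cache_gb(80, 64, 128, 8192, 32))
# The same model with GQA (8 KV heads) needs ~86 GB, an 8x reduction.
print(kv_cache_gb(80, 8, 128, 8192, 32))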

The Problem

KV cache grows with:

  • Sequence length
  • Number of layers
  • Number of KV heads
  • Hidden dimension

For long sequences with large batch sizes, KV cache can exceed model weights in memory usage.

PagedAttention

vLLM's PagedAttention treats GPU memory like OS virtual memory:

"PagedAttention splits memory into small reusable pages, cutting memory waste by up to 90%."

Benefits:

  • Non-contiguous storage: KV cache blocks stored anywhere in GPU memory
  • Dynamic allocation: Memory allocated only as sequences grow
  • Memory sharing: Identical prompt prefixes share KV cache blocks
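
A toy sketch of the idea (not vLLM's implementation): each sequence keeps a small table mapping logical cache blocks to physical blocks drawn from a shared pool, so allocation happens on demand and blocks need not be contiguous.

Python
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockTable:
    """Toy page table: logical block index -> physical block id in a shared pool."""
    def __init__(self, free_blocks: list):
        self.free = free_blocks          # shared free list of "physical" GPU blocks
        self.blocks = []                 # this sequence's physical block ids

    def grow_to(self, num_tokens: int):
        # Allocate a new physical block only when the current one fills up.
        while len(self.blocks) * BLOCK_SIZE < num_tokens:
            self.blocks.append(self.free.pop())

pool = list(range(1024))                 # all physical blocks in GPU memory
seq_a, seq_b = BlockTable(pool), BlockTable(pool)
seq_a.grow_to(40)                        # 40 tokens -> 3 blocks, allocated on demand
seq_b.grow_to(10)                        # 10 tokens -> 1 block
print(seq_a.blocks, seq_b.blocks)        # physical ids need not be contiguous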

KV Cache Compression

Techniques to reduce KV cache memory:

  • Quantization: Store KV cache in INT8 or INT4
  • Pruning: Remove less important cached states
  • Sliding window: Only cache recent tokens (for models that support it)

Research highlight: "TurboAttention is a novel unified technique for enabling quantized execution of attention along with a cooperative KV cache compression mechanism which reduces latency, memory footprint, with negligible accuracy loss."

Batching Strategies

Static Batching (Bad)

Wait for batch to fill, process together, wait for all to complete.

Problem: Variable output lengths cause GPU underutilization. Fast completions wait for slow ones.

Continuous Batching (Good)

"Continuous batching dynamically mixes new requests with ongoing ones so your GPU is never idle."

How it works:

  1. Process batch for one iteration
  2. Remove completed sequences
  3. Add new sequences to fill batch
  4. Repeat
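
A simplified scheduler loop capturing the idea (a sketch, not any particular framework's code; `decode_step` stands in for one batched forward pass and is assumed to return the sequences that finished):

Python
from collections import deque

def serve(waiting: deque, decode_step, max_batch: int):
    """Toy continuous batching: refill the running batch on every iteration."""
    running = []
    while running or waiting:
        # Top up the batch with new requests before each decode iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        finished = decode_step(running)   # one forward pass for the whole batch
        # Completed sequences leave immediately; their slots are reused next step.
        running = [seq for seq in running if seq not in finished]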

Results: "By leveraging vLLM, users can achieve 23x LLM inference throughput while reducing p50 latency."

Iteration-Level Scheduling

Advanced frameworks schedule at the iteration level:

  • Preempt low-priority requests
  • Prioritize latency-sensitive queries
  • Balance throughput and latency

Speculative Decoding

Speculative decoding uses a small "draft" model to generate candidate tokens, then the large "target" model verifies them in parallel.

The key insight: LLM inference is bottlenecked by sequential token generation, not parallel computation. Generating 100 tokens requires 100 forward passes through the model—each waiting for the previous token. But a forward pass that processes 1 token versus 10 tokens takes nearly the same time (GPUs are parallel). Speculative decoding exploits this: have a fast model guess multiple tokens, then verify all guesses in one forward pass of the slow model.

Why verification is cheap: Computing "what token would the model generate?" requires autoregressive sampling—compute probabilities, sample one, feed it back, repeat. But computing "what probability does the model assign to these specific tokens?" is a single parallel forward pass. You provide the tokens; the model computes all their probabilities simultaneously. This asymmetry between generation (sequential) and verification (parallel) is what makes speculative decoding work.

The acceptance criterion: Not all guesses will match what the target model would have generated. Speculative decoding accepts a guess if the target model would have produced it with similar probability. When a guess is rejected, we discard it and all subsequent guesses (since they were conditioned on the rejected token). The math ensures the output distribution exactly matches what the target model would have produced alone—speculation is lossless, just faster.

How It Works

  1. Draft model generates k candidate tokens quickly
  2. Target model verifies all k tokens in one forward pass
  3. Accept verified tokens, reject and regenerate from first mismatch
  4. Repeat

This works because verification (parallel) is faster than generation (sequential) for large models.
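
A greedy-verification sketch of one speculative step, assuming Hugging Face-style causal LMs that return `.logits` of shape (batch, seq, vocab). The real algorithm replaces the exact-match test with a probabilistic acceptance rule so that the output distribution of the target model is preserved exactly:

Python
import torch

@torch.no_grad()
def speculative_step(draft, target, input_ids, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2. Target model scores all k proposals in ONE forward pass.
    logits = target(draft_ids).logits
    # The target's choice for position i+1 comes from the logits at position i.
    target_choice = logits[:, input_ids.shape[1] - 1:-1].argmax(-1)
    proposed = draft_ids[:, input_ids.shape[1]:]

    # 3. Accept the longest prefix on which draft and target agree (greedy variant).
    matches = (proposed == target_choice).long()[0]
    n_accept = int(matches.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    # On a mismatch, the target's own token at that position comes for free
    # (if every proposal matched, no extra bonus token is taken here, for simplicity).
    bonus = target_choice[:, n_accept:n_accept + 1]
    return torch.cat([input_ids, accepted, bonus], dim=-1)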

Performance Gains

Research on AMD MI300X: "vLLM achieves up to 2.31x speedup when enabled with speculative decoding."

Production example: "In production systems handling customer service queries, speculative decoding with a 2B draft model and 13B target model reduced per-token latency from 42ms to 18ms—a 2.3x improvement."

Finding the Sweet Spot

Research findings: "Finding the 'sweet spot' of k=1 yielded consistent gains of 20%–54% lower cost per token across all profiles vs baseline."

Optimal k depends on:

  • Draft model quality (higher acceptance rate → higher k)
  • Target model size (larger → more benefit from parallelism)
  • Hardware characteristics

Combining with Quantization

"The most impressive results came from combining N-gram speculative decoding with FP8 quantization."

However, there are challenges: "While quantization alone can accelerate LLMs under the right hardware and precision settings, combining it with speculative decoding is non-trivial. On GPUs like the NVIDIA L4, the added overhead from quantization often cancels out the benefits of speculation."

Deployment Frameworks

vLLM

vLLM is the most popular open-source LLM serving framework:

Key Features:

  • PagedAttention for memory efficiency
  • Continuous batching
  • Optimized CUDA kernels
  • Speculative decoding support
  • Quantization support (FP8, AWQ, GPTQ)

Performance: "vLLM improves throughput by 2–4× over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences."

Best For: "Flexibility and fast integration with Hugging Face models"

TensorRT-LLM

NVIDIA's optimized inference library:

Key Features:

  • Custom attention kernels
  • Inflight batching and paged KV caching
  • Quantization down to FP4 and INT4
  • Speculative decoding
  • Tight integration with Triton Inference Server

Performance: "On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests."

Best For: "Maximum performance when deep in the NVIDIA ecosystem"

Comparison

| Aspect | vLLM | TensorRT-LLM |
|--------|------|--------------|
| Hardware | Any CUDA GPU | NVIDIA enterprise GPUs |
| Ease of Use | Easier | Steeper learning curve |
| Performance | Excellent | Maximum on NVIDIA |
| Model Support | Broad | NVIDIA-optimized models |

"In practice, many dev teams mix these systems—for example using TensorRT-LLM for high volume proprietary chat, and vLLM or LMDeploy for experimental and open model workloads."

Other Frameworks

  • LMDeploy: developed by the InternLM team; excellent performance
  • Text Generation Inference (TGI): Hugging Face's production server
  • llama.cpp: CPU inference, great for edge deployment

Putting It All Together

Optimization Stack

Layer optimizations for maximum impact:

Code
Layer 1: Model Selection
         └── Choose appropriate model size for task

Layer 2: Quantization
         └── INT8 default, INT4 if memory-constrained

Layer 3: Attention Optimization
         └── FlashAttention-2/3, MQA/GQA

Layer 4: KV Cache Management
         └── PagedAttention, quantized KV cache

Layer 5: Batching
         └── Continuous batching

Layer 6: Speculative Decoding
         └── If latency-critical and good draft model available

Layer 7: Hardware Optimization
         └── Framework-specific (TensorRT-LLM on NVIDIA)

Example Configuration

Production deployment for a 70B model on 2x H100:

Python
# vLLM configuration (assumes an AWQ-quantized checkpoint of the model)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    tensor_parallel_size=2,
    quantization="awq",           # INT4 weights to fit in memory
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,   # KV cache sharing across common prompt prefixes
    max_model_len=8192,
)

# For online serving with continuous batching, build the async engine from
# equivalent engine arguments instead of the offline LLM class.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(model="meta-llama/Llama-3-70B-Instruct",
                              tensor_parallel_size=2, quantization="awq",
                              enable_prefix_caching=True, max_model_len=8192)
engine = AsyncLLMEngine.from_engine_args(engine_args)

Cost-Performance Tradeoffs

| Configuration | Throughput | Latency | Memory | Quality |
|---------------|------------|---------|--------|---------|
| FP16, no optimization | 1x | 1x | 140 GB | 100% |
| INT8 + FlashAttention | 2x | 0.7x | 70 GB | ~99% |
| INT4 + PagedAttention | 3x | 0.6x | 40 GB | ~97% |
| INT4 + Speculative | 4x | 0.4x | 45 GB | ~97% |

Monitoring and Profiling

Key Metrics

| Metric | Target | Notes |
|--------|--------|-------|
| Time to First Token (TTFT) | < 500 ms | User experience critical |
| Inter-Token Latency | < 50 ms | Streaming smoothness |
| Throughput (tok/s) | Maximize | Cost efficiency |
| GPU Utilization | > 80% | Resource efficiency |
| Memory Usage | < 90% | Headroom for spikes |
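
TTFT and inter-token latency are easy to derive from per-token arrival timestamps in a streaming client; a minimal sketch:

Python
def latency_metrics(request_start: float, token_times: list) -> dict:
    """Compute TTFT, mean inter-token latency, and throughput from wall-clock times."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "throughput_tok_s": len(token_times) / (token_times[-1] - request_start),
    }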

Profiling Tools

  • NVIDIA Nsight: Detailed GPU profiling
  • PyTorch Profiler: Model-level insights
  • vLLM metrics: Built-in Prometheus metrics
  • TensorRT-LLM profiler: Framework-specific

Conclusion

LLM inference optimization is a multi-layered discipline. The key techniques:

  1. Quantization: INT8 default, INT4 when needed, AWQ for quality
  2. FlashAttention: Essential for any production deployment
  3. PagedAttention: Dramatic memory efficiency gains
  4. Continuous batching: 10-20x throughput improvement
  5. Speculative decoding: 2-3x latency reduction when applicable

Start with vLLM for flexibility, graduate to TensorRT-LLM for maximum NVIDIA performance. Layer optimizations based on your constraints: memory-limited → aggressive quantization; latency-limited → speculative decoding; throughput-limited → continuous batching optimization.

The field is evolving rapidly. Stay current with framework releases and new optimization techniques.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
