LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
The Inference Challenge
Running LLMs in production is expensive and slow. A 70B parameter model requires ~140GB of GPU memory in FP16, costs dollars per thousand queries, and adds seconds of latency. Inference optimization isn't optional—it's essential for any serious deployment.
This post covers the techniques that production teams use to make LLM inference fast, efficient, and affordable: quantization, attention optimization, batching strategies, and deployment frameworks.
Quantization: Reducing Precision for Speed
Quantization reduces the precision of model weights and/or activations, dramatically reducing memory usage and often improving speed.
Why quantization works at all: Neural networks are surprisingly robust to reduced precision. Weights learned during training contain far more precision than the network needs to function correctly—FP32 captures 7 decimal places of precision, but the actual "information" in each weight is much coarser. This redundancy means we can round weights to fewer bits without significantly changing the network's outputs. The key insight: quantization errors often cancel out across millions of operations, rather than accumulating.
The memory-bandwidth bottleneck: LLM inference is "memory-bound"—GPUs spend more time waiting for data to load from memory than actually computing. Quantization helps twice: first, it shrinks the weights, so fewer bytes stream from slow GPU memory (HBM) on every decoding step; second, it reduces KV cache size, so attention loads faster. Because per-token latency is dominated by this data movement, halving the bytes per parameter roughly halves decode time, even if each operation is theoretically less precise.
Trade-offs by use case: Production deployments optimize for throughput (tokens/dollar), not accuracy. A 2% quality loss with 3x cost reduction is often worthwhile—you can afford more queries, more retries, or longer contexts. Research deployments may prioritize quality. Interactive applications need low latency, so speculative decoding plus quantization often beats high-precision single-model inference.
Understanding Precision Formats
| Format | Bits | Memory | Use Case |
|---|---|---|---|
| FP32 | 32 | 4 bytes/param | Training (legacy) |
| FP16/BF16 | 16 | 2 bytes/param | Training, inference |
| FP8 | 8 | 1 byte/param | H100 inference |
| INT8 | 8 | 1 byte/param | Production inference |
| INT4 | 4 | 0.5 bytes/param | Memory-constrained |
Memory impact is dramatic. From research: "A 7-billion parameter model in FP32 format would require approximately 28 GB of RAM. Quantized to INT4, that same model could fit into roughly 3.5 GB."
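This arithmetic is easy to sanity-check; a minimal sketch (weights only—KV cache and activation overhead not included):

```python
# Weights-only memory at different precisions (ignores KV cache and activations).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:10s}  7B: {weight_memory_gb(7e9, fmt):6.1f} GB   70B: {weight_memory_gb(70e9, fmt):6.1f} GB")
# FP32        7B:   28.0 GB   70B:  280.0 GB
# INT4        7B:    3.5 GB   70B:   35.0 GB
```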
Quantization Methods
Post-Training Quantization (PTQ): Quantize after training, no additional training needed.
Quantization-Aware Training (QAT): Train with quantization in the loop—better accuracy but requires training resources.
Weight-Only Quantization: Quantize only weights, keep activations in higher precision. Generally preserves accuracy better.
GPTQ: Layer-Wise Quantization
GPTQ quantizes each transformer layer independently to minimize cumulative error:
- Process one layer at a time
- Quantize weights while minimizing MSE vs. full-precision output
- Use Hessian updates to compensate for quantization error
- Use mixed INT4/FP16 precision: weights in INT4, activations in FP16
From research: "GPTQ quantizes each layer individually, minimizing the mean squared error (MSE) between quantized and full-precision weights."
AWQ: Activation-Aware Weight Quantization
AWQ improves on GPTQ by considering activation patterns:
Key insight: "A small fraction (less than 1%) of weights have a large impact on the output. These 'salient weights' are kept in high precision (e.g., FP16), while the rest are quantized to INT3 or INT4."
AWQ identifies critical weights based on activation magnitude and preserves them, reducing accuracy loss.
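To make the salient-weight idea concrete, here is a toy sketch—not AWQ's actual implementation—that ranks input channels by average activation magnitude over calibration data and flags the top ~1% to keep in FP16:

```python
import torch

def find_salient_channels(activations: torch.Tensor, keep_fraction: float = 0.01) -> torch.Tensor:
    """Toy version of AWQ's saliency criterion.

    activations: [num_calibration_tokens, hidden_dim] inputs to a linear layer.
    Returns a boolean mask over input channels; True = keep in FP16, False = quantize.
    """
    channel_magnitude = activations.abs().mean(dim=0)           # average |x| per input channel
    k = max(1, int(keep_fraction * channel_magnitude.numel()))  # ~1% of channels
    salient = torch.zeros_like(channel_magnitude, dtype=torch.bool)
    salient[channel_magnitude.topk(k).indices] = True
    return salient

# Example: 4096-dim hidden size, 10k calibration tokens
calib = torch.randn(10_000, 4096)
mask = find_salient_channels(calib)
print(f"{mask.sum().item()} of {mask.numel()} channels kept in FP16")
```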
Comparing Methods
Research findings from over half a million evaluations:
| Method | Speed | Accuracy Retention | Best For |
|---|---|---|---|
| FP16 | Baseline | ~100% | When memory allows |
| INT8 | 1.5-2x | ~99% | Production default |
| GPTQ INT4 | 2-3x | 95-98% | Memory-constrained |
| AWQ INT4 | 2-3x | 96-99% | Quality-sensitive INT4 |
"AWQ generally shows less accuracy degradation compared to GPTQ" for weight-only quantization, especially for 70B+ models.
However, benchmark-specific results vary: "In benchmarks like IFEval, AWQ underperforms relative to GPTQ-INT4 even though both use 4-bit quantization."
Practical Recommendations
From NVIDIA's optimization guide and research:
- Start with INT8: "INT8 is the default choice for most applications with 1.8x speedup and negligible quality loss."
- Move to INT4 only when needed: "Move to INT4 only when latency requirements demand it."
- AWQ for quality-sensitive applications: "AWQ provides better quality than standard INT4 at similar speeds."
- Q5_K_M sweet spot: "The sweet spot appears to be Q5_K_M or Q8_0, where models retain ~95–99% of the original performance."
- Test on your tasks: Accuracy loss varies significantly by task and model size.
Attention Optimization
Attention is the bottleneck. Standard attention has O(n²) memory and compute complexity in sequence length.
Why attention dominates inference cost: In a transformer, each layer has attention and MLP (feedforward) components. MLP is O(n) in sequence length—each token processes independently. But attention compares every token to every other token, creating an N×N matrix. At 8K context, that's 64 million pairwise comparisons per layer. At 128K context, it's 16 billion. The quadratic scaling makes long-context inference fundamentally expensive.
The memory hierarchy problem: Modern GPUs have a pyramid of memory: registers (fastest, tiny), SRAM (~20MB on H100, fast), HBM (~80GB on H100, slow). Standard attention computes the full N×N matrix and stores it in HBM to later multiply by values. But reading/writing HBM is slow—10x slower than SRAM. FlashAttention's insight is that the full matrix never needs to exist; we can compute attention in tiles that fit in SRAM.
FlashAttention: The Game Changer
FlashAttention revolutionized LLM inference by optimizing memory access patterns:
The Problem: Standard attention materializes the full N×N attention matrix in GPU memory.
The Solution: Compute attention in tiles, never materializing the full matrix.
From the FlashAttention paper: "Memory savings are proportional to sequence length—since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."
How It Works: "Tiling means that blocks of inputs are loaded from HBM (GPU memory) to SRAM (fast cache), attention is performed with respect to that block, and the output is updated in HBM. By not writing the large intermediate attention matrices to HBM, the amount of memory reads/writes is reduced, which brings 2-4x wallclock time speedup."
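A minimal PyTorch sketch of the tiling idea (one head, no masking, and obviously not the fused CUDA kernel) shows that the full N×N matrix never needs to exist:

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Single-head attention computed block-by-block over keys/values.

    Never materializes the full [n, n] score matrix; keeps a running
    (max, sum, weighted-value) accumulator — the online-softmax trick
    that FlashAttention fuses into one GPU kernel.
    q, k, v: [n, d] tensors.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]           # load one K/V tile ("SRAM")
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                # [n, block] partial scores

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)     # rescale the previous accumulator
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# Sanity check against the naive quadratic-memory version
q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

FlashAttention additionally tiles the queries and keeps the accumulators in registers/SRAM inside a single fused kernel; the sketch only captures the online-softmax bookkeeping.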
FlashAttention-2 and 3
FlashAttention-2 improvements:
- 2x faster than FlashAttention-1
- Supports head dimension up to 256
- Native support for Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
FlashAttention-3:
- Optimized for Hopper GPUs (H100)
- Requires CUDA 12.3+, with 12.8 recommended
- Best performance on latest hardware
MQA and GQA
Reduce KV cache memory by sharing key-value heads:
Multi-Query Attention (MQA): All query heads share a single key-value head.
Grouped-Query Attention (GQA): Query heads grouped, each group shares key-value.
From research: "These are variants of attention where multiple heads of query attend to the same head of key and value, in order to reduce the size of KV cache during inference."
KV Cache Optimization
The KV cache stores key-value states from previous tokens, enabling autoregressive generation without recomputation.
Understanding why KV cache exists: LLMs generate tokens one at a time. To generate token 100, the model computes attention over tokens 1-99. To generate token 101, it computes attention over tokens 1-100. Without caching, you'd recompute the key and value projections for tokens 1-99 when generating token 100, then recompute them again (plus token 100) for token 101. KV cache stores these projections: compute once, reuse forever. This converts O(n²) per-token work to O(n).
Why KV cache becomes the memory bottleneck: Paradoxically, the optimization that makes generation fast creates a new bottleneck. Each layer stores keys and values for every token in the sequence. For a 70B-class model with 80 layers and 64 attention heads of dimension 128 (and no KV-head sharing), each token adds 2 (keys and values) × 80 × 64 × 128 × 2 bytes (FP16) ≈ 2.6MB to the cache. At 8K context, that's roughly 21GB per sequence. With batch size 32, the KV cache alone needs close to 700GB—more than exists on any single GPU.
The batch size tradeoff: More concurrent requests mean higher throughput (better GPU utilization), but each request needs its own KV cache. Most production systems are memory-limited, not compute-limited. PagedAttention and KV cache quantization exist precisely to fit more concurrent requests in memory, increasing throughput without buying more GPUs.
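The arithmetic above is worth wiring into a helper; this sketch assumes an FP16 cache and reuses the same illustrative dimensions:

```python
def kv_cache_gb(seq_len, batch_size, n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_elem=2):
    """KV cache size in GB: 2 (keys + values) x layers x kv_heads x head_dim x bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

print(kv_cache_gb(seq_len=8192, batch_size=1))    # ~21 GB per sequence without KV-head sharing
print(kv_cache_gb(seq_len=8192, batch_size=32))   # ~687 GB — why paging and GQA matter
print(kv_cache_gb(8192, 32, n_kv_heads=8))        # ~86 GB with 8 KV heads (GQA)
```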
The Problem
KV cache grows with:
- Sequence length
- Number of layers
- Number of KV heads
- Hidden dimension
For long sequences with large batch sizes, KV cache can exceed model weights in memory usage.
PagedAttention
vLLM's PagedAttention treats GPU memory like OS virtual memory:
"PagedAttention splits memory into small reusable pages, cutting memory waste by up to 90%."
Benefits:
- Non-contiguous storage: KV cache blocks stored anywhere in GPU memory
- Dynamic allocation: Memory allocated only as sequences grow
- Memory sharing: Identical prompt prefixes share KV cache blocks
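A toy block-table sketch of the idea (plain Python, not vLLM's implementation): logical blocks of 16 tokens map to physical blocks that can live anywhere in GPU memory and are returned to a free list when a sequence finishes:

```python
BLOCK_SIZE = 16  # tokens per KV block

class Allocator:
    """Pool of fixed-size physical KV blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def allocate(self):
        return self.free.pop()            # any free block — non-contiguous is fine
    def release(self, blocks):
        self.free.extend(blocks)          # freed blocks are immediately reusable

class BlockTable:
    """Toy page table: maps a sequence's logical blocks to physical block ids."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []         # index = logical block number

    def append_token(self, position):
        if position % BLOCK_SIZE == 0:    # current block is full -> allocate a new one
            self.physical_blocks.append(self.allocator.allocate())

alloc = Allocator(num_blocks=1024)
seq = BlockTable(alloc)
for pos in range(40):                      # a 40-token sequence needs ceil(40/16) = 3 blocks
    seq.append_token(pos)
print(seq.physical_blocks)                 # e.g. [1023, 1022, 1021] — scattered, not contiguous
```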
KV Cache Compression
Techniques to reduce KV cache memory:
- Quantization: Store KV cache in INT8 or INT4
- Pruning: Remove less important cached states
- Sliding window: Only cache recent tokens (for models that support it)
Research highlight: "TurboAttention is a novel unified technique for enabling quantized execution of attention along with a cooperative KV cache compression mechanism which reduces latency, memory footprint, with negligible accuracy loss."
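As a minimal illustration of the first technique, here is a symmetric per-tensor INT8 round-trip for a cached tensor (production implementations typically use per-channel or per-head scales):

```python
import torch

def quantize_kv_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8: store int8 values plus a single FP16 scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()

k_cache = torch.randn(8, 4096, 128)          # [kv_heads, tokens, head_dim] for one layer
q_cache, scale = quantize_kv_int8(k_cache)   # 2 bytes/element -> 1 byte/element
error = (dequantize_kv(q_cache, scale) - k_cache).abs().mean()
print(f"mean abs reconstruction error: {error.item():.4f}")
```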
Batching Strategies
Static Batching (Bad)
Wait for batch to fill, process together, wait for all to complete.
Problem: Variable output lengths cause GPU underutilization. Fast completions wait for slow ones.
Continuous Batching (Good)
"Continuous batching dynamically mixes new requests with ongoing ones so your GPU is never idle."
How it works (a simplified scheduler loop is sketched after this list):
- Process batch for one iteration
- Remove completed sequences
- Add new sequences to fill batch
- Repeat
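A deliberately simplified version of that loop, with a stand-in `model_step` in place of a real batched forward pass:

```python
import random
from collections import deque

def model_step(batch):
    """Stand-in for one forward pass: every active sequence emits one token."""
    for seq in batch:
        seq["generated"] += 1

def finished(seq):
    return seq["generated"] >= seq["target_len"]      # real systems check EOS / max tokens

waiting = deque({"id": i, "generated": 0, "target_len": random.randint(4, 64)} for i in range(100))
running, max_batch = [], 8

while waiting or running:
    # Fill free slots with waiting requests (iteration-level scheduling)
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    model_step(running)                               # one decode iteration for the whole batch

    # Evict finished sequences immediately — their slots are reused next iteration
    running = [seq for seq in running if not finished(seq)]
```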
Results: "By leveraging vLLM, users can achieve 23x LLM inference throughput while reducing p50 latency."
Iteration-Level Scheduling
Advanced frameworks schedule at the iteration level:
- Preempt low-priority requests
- Prioritize latency-sensitive queries
- Balance throughput and latency
Speculative Decoding
Speculative decoding uses a small "draft" model to generate candidate tokens, then the large "target" model verifies them in parallel.
The key insight: LLM inference is bottlenecked by sequential token generation, not parallel computation. Generating 100 tokens requires 100 forward passes through the model—each waiting for the previous token. But a forward pass that processes 1 token versus 10 tokens takes nearly the same time (GPUs are parallel). Speculative decoding exploits this: have a fast model guess multiple tokens, then verify all guesses in one forward pass of the slow model.
Why verification is cheap: Computing "what token would the model generate?" requires autoregressive sampling—compute probabilities, sample one, feed it back, repeat. But computing "what probability does the model assign to these specific tokens?" is a single parallel forward pass. You provide the tokens; the model computes all their probabilities simultaneously. This asymmetry between generation (sequential) and verification (parallel) is what makes speculative decoding work.
The acceptance criterion: Not all guesses will match what the target model would have generated. Speculative decoding accepts a guess if the target model would have produced it with similar probability. When a guess is rejected, we discard it and all subsequent guesses (since they were conditioned on the rejected token). The math ensures the output distribution exactly matches what the target model would have produced alone—speculation is lossless, just faster.
How It Works
- Draft model generates k candidate tokens quickly
- Target model verifies all k tokens in one forward pass
- Accept verified tokens, reject and regenerate from first mismatch
- Repeat
This works because verification (parallel) is faster than generation (sequential) for large models.
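A sketch of the verification step under the standard accept/reject rule, with toy distributions standing in for real draft and target models:

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One verification step of speculative sampling.

    draft_probs:  [k, vocab] — draft model's distribution at each drafted position
    target_probs: [k, vocab] — target model's distribution at the same positions
                  (obtained in ONE parallel forward pass over the drafted tokens)
    draft_tokens: [k]        — the tokens the draft model actually sampled
    Returns the accepted prefix, plus a corrected token if a draft was rejected.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):    # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # Rejected: resample from the leftover distribution max(0, p - q), renormalized,
            # and discard everything drafted after the rejection point.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return accepted
    # All k drafts accepted (the full algorithm also samples a bonus token
    # from the target's next-position distribution).
    return accepted

k, vocab = 4, 32_000
draft_probs  = torch.softmax(torch.randn(k, vocab), dim=-1)
target_probs = torch.softmax(torch.randn(k, vocab), dim=-1)
draft_tokens = torch.multinomial(draft_probs, 1).squeeze(-1)
print(speculative_step(draft_probs, target_probs, draft_tokens))
```

The accept probability min(1, p/q) and the residual resampling are what make the procedure lossless: accepted tokens are distributed exactly as if the target model had sampled them itself.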
Performance Gains
Research on AMD MI300X: "vLLM achieves up to 2.31x speedup when enabled with speculative decoding."
Production example: "In production systems handling customer service queries, speculative decoding with a 2B draft model and 13B target model reduced per-token latency from 42ms to 18ms—a 2.3x improvement."
Finding the Sweet Spot
Research findings: "Finding the 'sweet spot' of k=1 yielded consistent gains of 20%–54% lower cost per token across all profiles vs baseline."
Optimal k depends on:
- Draft model quality (higher acceptance rate → higher k)
- Target model size (larger → more benefit from parallelism)
- Hardware characteristics
Combining with Quantization
"The most impressive results came from combining N-gram speculative decoding with FP8 quantization."
However, there are challenges: "While quantization alone can accelerate LLMs under the right hardware and precision settings, combining it with speculative decoding is non-trivial. On GPUs like the NVIDIA L4, the added overhead from quantization often cancels out the benefits of speculation."
Deployment Frameworks
vLLM
vLLM is the most popular open-source LLM serving framework:
Key Features:
- PagedAttention for memory efficiency
- Continuous batching
- Optimized CUDA kernels
- Speculative decoding support
- Quantization support (FP8, AWQ, GPTQ)
Performance: "vLLM improves throughput by 2–4× over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences."
Best For: "Flexibility and fast integration with Hugging Face models"
TensorRT-LLM
NVIDIA's optimized inference library:
Key Features:
- Custom attention kernels
- Inflight batching and paged KV caching
- Quantization down to FP4 and INT4
- Speculative decoding
- Tight integration with Triton Inference Server
Performance: "On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests."
Best For: "Maximum performance when deep in the NVIDIA ecosystem"
Comparison
| Aspect | vLLM | TensorRT-LLM |
|---|---|---|
| Hardware | Any CUDA GPU | NVIDIA enterprise GPUs |
| Ease of Use | Easier | Steeper learning curve |
| Performance | Excellent | Maximum on NVIDIA |
| Model Support | Broad | NVIDIA-optimized models |
"In practice, many dev teams mix these systems—for example using TensorRT-LLM for high volume proprietary chat, and vLLM or LMDeploy for experimental and open model workloads."
Other Frameworks
- LMDeploy: Chinese-developed, excellent performance
- Text Generation Inference (TGI): Hugging Face's production server
- llama.cpp: CPU inference, great for edge deployment
Putting It All Together
Optimization Stack
Layer optimizations for maximum impact:
Layer 1: Model Selection
└── Choose appropriate model size for task
Layer 2: Quantization
└── INT8 default, INT4 if memory-constrained
Layer 3: Attention Optimization
└── FlashAttention-2/3, MQA/GQA
Layer 4: KV Cache Management
└── PagedAttention, quantized KV cache
Layer 5: Batching
└── Continuous batching
Layer 6: Speculative Decoding
└── If latency-critical and good draft model available
Layer 7: Hardware Optimization
└── Framework-specific (TensorRT-LLM on NVIDIA)
Example Configuration
Production deployment for a 70B model on 2x H100:
```python
# vLLM configuration
from vllm import LLM, AsyncLLMEngine, AsyncEngineArgs

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",  # point at an AWQ-quantized checkpoint
    tensor_parallel_size=2,
    quantization="awq",              # INT4 to fit in memory
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,      # KV cache sharing
    max_model_len=8192,
)

# Serve with continuous batching (the async engine takes the same arguments)
engine_args = AsyncEngineArgs(model="meta-llama/Llama-3-70B-Instruct",
                              tensor_parallel_size=2, quantization="awq",
                              gpu_memory_utilization=0.9, max_model_len=8192)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
Cost-Performance Tradeoffs
| Configuration | Throughput | Latency | Memory | Quality |
|---|---|---|---|---|
| FP16, no optimization | 1x | 1x | 140GB | 100% |
| INT8 + FlashAttention | 2x | 0.7x | 70GB | ~99% |
| INT4 + PagedAttention | 3x | 0.6x | 40GB | ~97% |
| INT4 + Speculative | 4x | 0.4x | 45GB | ~97% |
Monitoring and Profiling
Key Metrics
| Metric | Target | Notes |
|---|---|---|
| Time to First Token (TTFT) | < 500ms | User experience critical |
| Inter-Token Latency | < 50ms | Streaming smoothness |
| Throughput (tok/s) | Maximize | Cost efficiency |
| GPU Utilization | > 80% | Resource efficiency |
| Memory Usage | < 90% | Headroom for spikes |
Profiling Tools
- NVIDIA Nsight: Detailed GPU profiling
- PyTorch Profiler: Model-level insights
- vLLM metrics: Built-in Prometheus metrics
- TensorRT-LLM profiler: Framework-specific
Conclusion
LLM inference optimization is a multi-layered discipline. The key techniques:
- Quantization: INT8 default, INT4 when needed, AWQ for quality
- FlashAttention: Essential for any production deployment
- PagedAttention: Dramatic memory efficiency gains
- Continuous batching: 10-20x throughput improvement
- Speculative decoding: 2-3x latency reduction when applicable
Start with vLLM for flexibility, graduate to TensorRT-LLM for maximum NVIDIA performance. Layer optimizations based on your constraints: memory-limited → aggressive quantization; latency-limited → speculative decoding; throughput-limited → continuous batching optimization.
The field is evolving rapidly. Stay current with framework releases and new optimization techniques.