vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
Introduction
vLLM has become the de facto standard for open-source LLM inference. Originally developed at UC Berkeley's Sky Computing Lab, it introduced PagedAttention—a breakthrough that treats GPU memory like virtual memory, cutting memory waste by up to 90%. Today, vLLM powers production deployments serving billions of tokens daily at companies from startups to enterprises.
This guide goes beyond the basics. We'll cover vLLM's architecture, production configuration, Kubernetes deployment, monitoring, and the hard-won lessons from running vLLM at scale.
Why vLLM?
Before diving in, let's understand why vLLM dominates:
| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Ease of Setup | ★★★★★ | ★★☆☆☆ | ★★★★☆ |
| Hardware Support | NVIDIA, AMD, Intel, TPU | NVIDIA only | NVIDIA, AMD |
| Model Support | 100+ architectures | 25-40 models | 60+ models |
| Throughput | Excellent | Maximum on NVIDIA | Excellent |
| Community | Largest, most active | NVIDIA-backed | Growing fast |
vLLM wins on flexibility and ease of use. If you're deep in the NVIDIA ecosystem and need every last percent of performance, TensorRT-LLM may be worth the complexity. For most teams, vLLM is the right choice.
Architecture Deep Dive
Understanding vLLM's internals helps you configure and debug it effectively.
PagedAttention: The Core Innovation
Standard LLM serving pre-allocates contiguous memory for the maximum possible sequence length. This wastes enormous GPU memory—most sequences don't reach maximum length.
PagedAttention solves this by treating KV cache like virtual memory:
Traditional Allocation:
┌─────────────────────────────────────────────────┐
│ Sequence 1 KV Cache (pre-allocated max length) │
│ [used][used][used][████ wasted space ████████] │
├─────────────────────────────────────────────────┤
│ Sequence 2 KV Cache (pre-allocated max length) │
│ [used][████████ wasted space ████████████████] │
└─────────────────────────────────────────────────┘
PagedAttention:
┌────────┬────────┬────────┬────────┬────────┐
│ Seq1-0 │ Seq2-0 │ Seq1-1 │ Seq1-2 │ Seq2-1 │ <- Blocks allocated on demand
└────────┴────────┴────────┴────────┴────────┘
↑ Non-contiguous, dynamically allocated blocks
Key benefits:
- Non-contiguous storage: KV cache blocks can be anywhere in GPU memory
- Dynamic allocation: Memory allocated only as sequences grow
- Memory sharing: Identical prompt prefixes share KV cache blocks
- Near-zero waste: Eliminates internal fragmentation
Result: 2-4x higher throughput on the same hardware compared to naive implementations.
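To make the bookkeeping concrete, here is a toy sketch (not vLLM's actual implementation) of how paged allocation works: physical blocks come from a shared free pool on demand, each sequence keeps a block table mapping its logical blocks to physical ones, and a forked sequence shares its parent's prefix blocks via reference counts.
# Toy illustration of paged KV-cache bookkeeping; not vLLM's real code.
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block IDs
        self.refcount = [0] * num_blocks     # sequences referencing each block
        self.tables = {}                     # seq_id -> list of physical block IDs

    def ensure_capacity(self, seq_id, num_tokens):
        """Grow a sequence's block table only when it crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceil division: blocks required
        while len(table) < needed:
            block = self.free.pop()            # any free block works: non-contiguous
            self.refcount[block] = 1
            table.append(block)

    def fork(self, parent_id, child_id):
        """Share a prefix: the child references the parent's existing blocks."""
        self.tables[child_id] = list(self.tables[parent_id])
        for block in self.tables[parent_id]:
            self.refcount[block] += 1

manager = BlockManager(num_blocks=1024)
manager.ensure_capacity("seq1", num_tokens=40)  # allocates 3 blocks, lazily
manager.fork("seq1", "seq2")                    # seq2 reuses seq1's prefix blocks
print(manager.tables)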
The Scheduler
vLLM's scheduler implements continuous batching with iteration-level scheduling:
- Prefill phase: New requests have their prompts processed in parallel
- Decode phase: Tokens generated one at a time, autoregressively
- Iteration-level scheduling: After each iteration, the scheduler can:
  - Add new requests to the batch
  - Remove completed sequences
  - Preempt low-priority requests if memory pressure is high
This keeps GPU utilization high even with variable-length outputs.
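A heavily simplified sketch of that loop (ignoring token budgets, priorities, chunked prefill, and preemption); the request fields and limits below are invented for illustration:
# Toy continuous-batching loop; vLLM's real scheduler is far more involved.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_len: int       # tokens to prefill
    max_new_tokens: int   # decode budget
    generated: int = 0
    prefilled: bool = False

waiting = deque([Request("a", 512, 4), Request("b", 128, 2), Request("c", 64, 3)])
running = []
MAX_RUNNING = 2  # stand-in for "enough free KV-cache blocks"

while waiting or running:
    # Iteration-level scheduling: admit new requests whenever there is room.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())

    # One model iteration: prefill newly admitted requests, decode one token for the rest.
    for req in running:
        if not req.prefilled:
            req.prefilled = True   # prompt processed in parallel
        else:
            req.generated += 1     # one autoregressive decode step

    # Retire finished sequences so their slots (and KV blocks) free up immediately.
    finished = [r for r in running if r.generated >= r.max_new_tokens]
    running = [r for r in running if r.generated < r.max_new_tokens]
    for r in finished:
        print(f"finished {r.rid} after {r.generated} tokens")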
Chunked Prefill
For long prompts (32K+ tokens), a single prefill can stall token generation for every other request in the batch. Chunked prefill splits the prompt into smaller chunks and interleaves them with decode steps:
Without Chunked Prefill:
[========= 50K token prefill blocks all decode =========][decode][decode]
With Chunked Prefill:
[chunk1][decode][chunk2][decode][chunk3][decode][chunk4][decode]
Enable with --enable-chunked-prefill. Essential for long-context models.
Prefix Caching
When multiple requests share the same prefix (system prompt, few-shot examples), vLLM can reuse the computed KV cache:
# These requests share the same system prompt
# vLLM computes KV cache for system prompt once, reuses for all
Request 1: [System Prompt] + "What is 2+2?"
Request 2: [System Prompt] + "Explain quantum computing"
Request 3: [System Prompt] + "Write a poem about AI"
↑
KV cache computed once, shared across requests
Enable with --enable-prefix-caching. Can dramatically improve throughput for chat applications.
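In practice this just means sending requests with an identical system prompt to a server started with that flag; a minimal client-side sketch using the OpenAI-compatible API (the URL, model name, and prompts below are placeholders for your deployment):
from openai import OpenAI

# Assumes a server started with: vllm serve <model> --enable-prefix-caching
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = "You are a concise assistant for an internal engineering wiki."  # shared prefix

for question in ["What is 2+2?", "Explain quantum computing", "Write a poem about AI"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use your deployed model
        messages=[
            {"role": "system", "content": SYSTEM},  # identical prefix -> KV cache reused
            {"role": "user", "content": question},
        ],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)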
Configuration Guide
Configuration is where most teams either unlock vLLM's full potential or leave performance on the table. The defaults are sensible for quick experiments, but production deployments require tuning based on your specific hardware, model, and workload characteristics.
The core trade-off is memory vs. throughput: allocating more GPU memory to KV cache enables larger batches and higher throughput, but leaves less margin for spikes and increases OOM risk. The right balance depends on your latency requirements and traffic patterns.
Essential Parameters
These parameters control memory allocation, parallelism, and optimization features. Start with these values, then tune based on your monitoring data.
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",

    # Memory configuration
    gpu_memory_utilization=0.90,   # Use 90% of GPU memory
    max_model_len=8192,            # Maximum sequence length

    # Parallelism
    tensor_parallel_size=2,        # Split across 2 GPUs

    # Performance optimizations
    enable_prefix_caching=True,    # Reuse KV cache for shared prefixes
    enable_chunked_prefill=True,   # Don't block decode during long prefills

    # Quantization (optional)
    quantization="awq",            # AWQ INT4; requires an AWQ-quantized checkpoint
)
Understanding each parameter:
- gpu_memory_utilization=0.90: Pre-allocate 90% of GPU memory for KV cache and model weights. The remaining 10% is headroom for PyTorch/CUDA overhead. Increase to 0.95 for maximum throughput; decrease to 0.80 if you see occasional OOM errors. Monitor vllm:gpu_cache_usage_perc to tune.
- max_model_len=8192: Maximum tokens (prompt + response) for any request. Directly impacts KV cache size—higher values mean fewer concurrent requests. Set to your actual maximum need, not the model's theoretical limit. A 70B model at max_model_len=32K uses ~4x the KV cache memory as max_model_len=8K.
- tensor_parallel_size=2: Split model weights and KV cache across 2 GPUs. Required when the model doesn't fit on one GPU; also useful to increase KV cache capacity. Requires NVLink for best performance. Use pipeline parallelism (pipeline_parallel_size) for cross-node or non-NVLink setups.
- enable_prefix_caching=True: Reuse computed KV cache for identical prompt prefixes. Essential for chat applications where system prompts repeat. Can reduce compute by 50%+ for long system prompts.
- enable_chunked_prefill=True: Process long prompts in chunks rather than all at once. Prevents long prompts from blocking token generation for other requests. Essential for models with 32K+ context.
- quantization="awq": 4-bit quantization reduces memory 4x with ~2-3% quality loss. Enables larger models or more concurrent requests. FP8 is better quality but requires H100/Ada GPUs.
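The configuration block above imports SamplingParams but never uses it; continuing from that llm object, generation looks like this (the prompt is illustrative):
# Continuing from the llm object configured above
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(
    ["Summarize the benefits of PagedAttention in two sentences."],  # illustrative prompt
    sampling,
)
for output in outputs:
    print(output.outputs[0].text)  # first completion for this prompt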
Parameter Deep Dive
gpu_memory_utilization
Controls the fraction of GPU memory vLLM may use; whatever remains after model weights and activation workspace is pre-allocated as KV cache.
| Value | Use Case |
|---|---|
| 0.80 | Conservative, leaves headroom for spikes |
| 0.90 | Production default, good balance |
| 0.95 | Aggressive, maximum throughput |
| 0.98+ | Not recommended, OOM risk |
If you encounter frequent preemptions, increase this value. If you hit OOM errors, decrease it.
max_model_len
Maximum sequence length (prompt + generation). Directly impacts memory usage.
Memory impact formula (approximate):
KV cache bytes ≈ 2 × n_layers × n_kv_heads × d_head × s × b × bytes_per_element
where n_layers = layers, n_kv_heads = KV heads, d_head = head dim, s = max sequence length, b = batch size, and the leading 2 accounts for storing both keys and values.
For a 70B model with 80 layers, 8 KV heads, head_dim 128, at FP16 (2 bytes per element):
- max_model_len=4096: ~1.3GB per sequence
- max_model_len=8192: ~2.7GB per sequence
- max_model_len=32768: ~10.7GB per sequence
Set this to the maximum you actually need, not the model's maximum capability.
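A quick sanity check of the numbers above, just evaluating the formula with FP16 (2 bytes per element):
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # factor of 2: keys and values are both cached
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

for seq_len in (4096, 8192, 32768):
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"max_model_len={seq_len}: ~{gb:.1f} GB per sequence")
# max_model_len=4096: ~1.3 GB per sequence
# max_model_len=8192: ~2.7 GB per sequence
# max_model_len=32768: ~10.7 GB per sequence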
tensor_parallel_size
Splits model across multiple GPUs. Each GPU holds a shard of the weights and KV cache.
# Single GPU (if model fits)
tensor_parallel_size=1
# Model too large for one GPU, or need more KV cache space
tensor_parallel_size=2 # Requires 2 GPUs
tensor_parallel_size=4 # Requires 4 GPUs
tensor_parallel_size=8 # Requires 8 GPUs
When to use pipeline parallelism instead:
- GPUs without NVLink (e.g., L40S, consumer GPUs)
- Cross-node inference
# Pipeline parallelism for non-NVLink GPUs
pipeline_parallel_size=2
Batching Parameters
# Maximum sequences in a batch
max_num_seqs=256 # Default, good for most cases
# Maximum tokens per iteration (prefill + decode)
max_num_batched_tokens=8192 # Increase for higher throughput on large GPUs
For throughput optimization on large GPUs (A100, H100):
max_num_batched_tokens=16384 # or higher
Quantization Options
vLLM supports multiple quantization methods:
| Method | Bits | Memory Reduction | Quality Impact | Best For |
|---|---|---|---|---|
| None (FP16/BF16) | 16 | Baseline | None | Quality-critical |
| FP8 | 8 | 2x | Minimal | H100/Ada GPUs |
| AWQ | 4 | 4x | Small (~2-3%) | Memory-constrained |
| GPTQ | 4 | 4x | Small (~2-5%) | Pre-quantized models |
| INT8 | 8 | 2x | Minimal | General production |
# FP8 on H100 (recommended for H100s)
llm = LLM(model="...", quantization="fp8")
# AWQ for memory savings
llm = LLM(model="...", quantization="awq")
# Load pre-quantized GPTQ model
llm = LLM(model="TheBloke/Llama-2-70B-GPTQ")
Speculative Decoding
Speculative decoding trades compute for latency. Instead of generating one token at a time with the slow large model, a fast draft model generates several candidate tokens, then the large model verifies them in parallel. Since verification is parallelizable (unlike autoregressive generation), you get multiple tokens from one large-model forward pass.
The technique works because similar models have similar token distributions—the draft model often guesses correctly, and when it doesn't, the large model corrects it with minimal wasted work.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,  # Number of tokens to speculate
)
Expected speedup: 1.5-2.5x for latency-sensitive applications. Works best when:
- Draft model is fast (1-3B parameters)
- Draft and target models have similar token distributions
- Tasks have predictable outputs (code, structured data)
Production Deployment
Docker Deployment
The simplest production deployment:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
For multi-GPU:
docker run --runtime nvidia --gpus '"device=0,1"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90
Kubernetes Deployment
Kubernetes deployment requires careful resource planning. vLLM needs exclusive access to GPUs, persistent storage for model weights (to avoid re-downloading on restarts), and proper health checks. The key insight: model loading is slow (minutes for large models), so optimize for pod stability over rapid scaling.
This deployment uses a PersistentVolumeClaim for model weights—new pods mount the same cached weights instead of downloading. This is crucial for scale-up speed.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
Key deployment decisions:
- nodeSelector: Ensures pods land on GPU nodes. Use more specific selectors (e.g., nvidia.com/gpu.product: "A100-SXM4-80GB") for heterogeneous clusters.
- PersistentVolumeClaim: Shared model storage is critical. Without it, each pod downloads the full model on startup, which takes minutes for 70B models. Use ReadWriteMany storage (NFS, EFS, GCS FUSE) for multi-replica deployments.
- replicas: 1: Start with one replica and use an HPA to scale. LLM serving is GPU-bound, so scaling horizontally requires proportionally more GPUs.
- Shared memory: Docker's --ipc=host flag has no direct equivalent in this manifest. If you hit shared-memory errors (common with tensor parallelism), mount an emptyDir volume with medium: Memory at /dev/shm to increase the available SHM.
Horizontal Pod Autoscaling
Scale based on request queue length:
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"
For HPA to work, you need:
- Prometheus scraping vLLM metrics
- Prometheus Adapter exposing metrics to Kubernetes
- ReadWriteMany storage for model weights (new pods share the same PVC)
Using ReadWriteMany is crucial—new pods don't need to re-download model weights, reducing scale-up time from minutes to seconds.
KEDA Autoscaling
For more sophisticated autoscaling:
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"
      query: sum(vllm_num_requests_waiting)
Monitoring and Observability
Prometheus Metrics
vLLM exposes metrics at /metrics by default. Key metrics to monitor:
Throughput Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:prompt_tokens_total | Prefill tokens processed | Track trend |
| vllm:generation_tokens_total | Generated tokens | Track trend |
| vllm:request_success_total | Successful requests | Track trend |
Latency Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:time_to_first_token_seconds | Time to first token (TTFT) | p95 > 2s |
| vllm:time_per_output_token_seconds | Inter-token latency | p95 > 100ms |
| vllm:e2e_request_latency_seconds | End-to-end latency | p95 > 10s |
Resource Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:num_requests_running | Currently processing | Capacity planning |
| vllm:num_requests_waiting | Queue depth | > 50 (scale up) |
| vllm:gpu_cache_usage_perc | KV cache utilization | > 95% |
| vllm:cpu_prefix_cache_hit_rate | Prefix cache effectiveness | Track trend |
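To spot-check these metrics without a full Prometheus stack, you can scrape the endpoint directly; a minimal sketch assuming a server on localhost:8000 and the requests and prometheus-client packages:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumes a vLLM server on localhost:8000; adjust the URL for your deployment.
METRICS_URL = "http://localhost:8000/metrics"
WATCH = {
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
}

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCH:
            print(f"{sample.name}{sample.labels} = {sample.value}")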
Grafana Dashboard
Essential panels for your vLLM dashboard:
- Request Rate: rate(vllm:request_success_total[5m])
- Queue Depth: vllm:num_requests_waiting
- TTFT p95: histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))
- Throughput: rate(vllm:generation_tokens_total[5m])
- GPU Cache Usage: vllm:gpu_cache_usage_perc
- Error Rate: rate(vllm:request_failure_total[5m]) / rate(vllm:request_success_total[5m])
Alerting Rules
# prometheus-rules.yaml
groups:
- name: vllm-alerts
  rules:
  - alert: HighQueueDepth
    expr: vllm:num_requests_waiting > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue is building up"
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "vLLM p95 latency exceeds 10s"
  - alert: HighCacheUsage
    expr: vllm:gpu_cache_usage_perc > 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM GPU cache near capacity"
Troubleshooting
Out of Memory (OOM) Errors
Symptoms: CUDA OOM errors, process crashes
Solutions:
- Reduce gpu_memory_utilization (try 0.85 or 0.80)
- Reduce max_model_len
- Use quantization (AWQ, GPTQ, FP8)
- Increase tensor_parallel_size (add more GPUs)
- Enable chunked prefill for long sequences
# OOM-safe configuration
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    quantization="awq",          # requires an AWQ-quantized checkpoint
    tensor_parallel_size=2,
    enable_chunked_prefill=True,
)
High Latency
Symptoms: TTFT > 1s, slow token generation
Diagnosis:
- Check queue depth (vllm:num_requests_waiting)
- Check GPU cache usage (vllm:gpu_cache_usage_perc)
- Check if prefill is blocking decode (enable chunked prefill)
Solutions:
- Scale horizontally (add more replicas)
- Enable prefix caching for repeated prefixes
- Reduce max_model_len if sequences are shorter
- Use speculative decoding for latency-sensitive workloads
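To confirm from the client side whether TTFT or inter-token latency is the problem, stream a completion and time the chunks; a rough sketch against the OpenAI-compatible endpoint (URL and model name are placeholders, and chunks only approximate tokens):
import time
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server; adjust URL and model for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use your deployed model
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1                      # each content chunk is roughly one token
        if first_token_at is None:
            first_token_at = time.perf_counter()

total = time.perf_counter() - start
ttft = (first_token_at or start) - start
itl_ms = (total - ttft) / max(chunks - 1, 1) * 1000
print(f"TTFT: {ttft:.3f}s  total: {total:.3f}s  approx ITL: {itl_ms:.1f}ms over {chunks} chunks")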
Low Throughput
Symptoms: GPU utilization < 80%, tokens/s below expected
Diagnosis:
- Check batch size (vllm:num_requests_running)
- Check for preemptions in logs
- Verify continuous batching is working
Solutions:
- Increase max_num_batched_tokens
- Increase gpu_memory_utilization
- Ensure enough concurrent requests to fill batches
- Check for network bottlenecks in distributed setups
Preemption Issues
Symptoms: Requests being preempted, retried
Solutions:
- Increase gpu_memory_utilization
- Reduce max_model_len
- Use quantization to free up KV cache space
- Scale horizontally to reduce per-instance load
Performance Benchmarking
Benchmarking Setup
Use the official vLLM benchmark scripts:
# Throughput benchmark
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--num-prompts 1000
# Latency benchmark
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--batch-size 1
Expected Performance
Rough benchmarks on common hardware (Llama 3.1 8B, 512 input / 128 output):
| GPU | Throughput (tok/s) | TTFT (ms) | ITL (ms) |
|---|---|---|---|
| A100 80GB | 4,000-5,000 | 50-100 | 15-25 |
| H100 80GB | 6,000-8,000 | 30-60 | 10-15 |
| L40S 48GB | 2,500-3,500 | 80-150 | 25-40 |
| A10G 24GB | 1,500-2,000 | 100-200 | 30-50 |
Performance varies significantly with:
- Quantization method
- Batch size and concurrency
- Sequence lengths
- Model architecture
Always benchmark on your specific workload.
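For a quick concurrency sweep without the official scripts, something like the following works against the OpenAI-compatible endpoint (URL, model name, and prompts are placeholders; the repository benchmarks above remain the more rigorous option):
import asyncio
import time
from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server; adjust URL and model for your deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

async def one_request(prompt):
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def run(concurrency, total_requests):
    sem = asyncio.Semaphore(concurrency)

    async def limited(i):
        async with sem:
            return await one_request(f"Write a haiku about request {i}.")

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(limited(i) for i in range(total_requests))))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency}: {total_requests / elapsed:.1f} req/s, "
          f"{tokens / elapsed:.1f} generated tok/s")

async def main():
    for concurrency in (1, 8, 32, 64):
        await run(concurrency, total_requests=128)

asyncio.run(main())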
vLLM vs Alternatives
vLLM vs SGLang
SGLang excels at structured generation and prefix-heavy workloads:
| Aspect | vLLM | SGLang |
|---|---|---|
| Raw throughput | Excellent | Excellent |
| Prefix reuse | Good (prefix caching) | Best (RadixAttention) |
| Structured output | Basic | Excellent (grammar support) |
| Community size | Largest | Growing fast |
| Production maturity | Battle-tested | Newer |
Choose SGLang if: Heavy agent/tool use, lots of prefix reuse, need constrained decoding.
Choose vLLM if: General-purpose serving, maximum compatibility, largest community.
vLLM vs TensorRT-LLM
TensorRT-LLM offers maximum performance on NVIDIA hardware:
| Aspect | vLLM | TensorRT-LLM |
|---|---|---|
| Setup complexity | Low | High |
| Performance | Excellent | Maximum on NVIDIA |
| Hardware support | Broad | NVIDIA only |
| Model support | 100+ | 25-40 |
| Update frequency | Weekly | Monthly |
Choose TensorRT-LLM if: All-NVIDIA stack, need absolute maximum performance, have engineering resources.
Choose vLLM if: Need flexibility, broad model support, faster iteration.
Best Practices Summary
- Start with sensible defaults: gpu_memory_utilization=0.90, enable prefix caching
- Right-size max_model_len: Set to the actual maximum needed, not the model maximum
- Enable chunked prefill: Essential for long-context workloads
- Monitor queue depth: Scale horizontally when consistently > 10-20
- Use ReadWriteMany storage: Critical for fast horizontal scaling
- Benchmark your workload: Don't rely on generic benchmarks
- Set up alerting: Queue depth, latency, cache usage
- Consider quantization: AWQ or FP8 for significant memory savings with minimal quality loss
Conclusion
vLLM has earned its position as the leading open-source LLM serving framework through a combination of performance, flexibility, and ease of use. PagedAttention, continuous batching, and the constant stream of optimizations make it the right choice for most production deployments.
Start with Docker for development, graduate to Kubernetes for production. Monitor religiously, and scale horizontally when needed. The techniques in this guide will help you serve millions of requests reliably.