
vLLM in Production: The Complete Guide to High-Performance LLM Serving

A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.


Introduction

vLLM has become the de facto standard for open-source LLM inference. Originally developed at UC Berkeley's Sky Computing Lab, it introduced PagedAttention—a breakthrough that treats GPU memory like virtual memory, cutting memory waste by up to 90%. Today, vLLM powers production deployments serving billions of tokens daily at companies from startups to enterprises.

This guide goes beyond the basics. We'll cover vLLM's architecture, production configuration, Kubernetes deployment, monitoring, and the hard-won lessons from running vLLM at scale.

Why vLLM?

Before diving in, let's understand why vLLM dominates:

| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Ease of Setup | ★★★★★ | ★★☆☆☆ | ★★★★☆ |
| Hardware Support | NVIDIA, AMD, Intel, TPU | NVIDIA only | NVIDIA, AMD |
| Model Support | 100+ architectures | 25-40 models | 60+ models |
| Throughput | Excellent | Maximum on NVIDIA | Excellent |
| Community | Largest, most active | NVIDIA-backed | Growing fast |

vLLM wins on flexibility and ease of use. If you're deep in the NVIDIA ecosystem and need every last percent of performance, TensorRT-LLM may be worth the complexity. For most teams, vLLM is the right choice.

Architecture Deep Dive

Understanding vLLM's internals helps you configure and debug it effectively.

PagedAttention: The Core Innovation

Standard LLM serving pre-allocates contiguous memory for the maximum possible sequence length. This wastes enormous GPU memory—most sequences don't reach maximum length.

PagedAttention solves this by treating KV cache like virtual memory:

Code
Traditional Allocation:
┌─────────────────────────────────────────────────┐
│ Sequence 1 KV Cache (pre-allocated max length)  │
│ [used][used][used][████ wasted space ████████]  │
├─────────────────────────────────────────────────┤
│ Sequence 2 KV Cache (pre-allocated max length)  │
│ [used][████████ wasted space ████████████████]  │
└─────────────────────────────────────────────────┘

PagedAttention:
┌────────┬────────┬────────┬────────┬────────┐
│ Seq1-0 │ Seq2-0 │ Seq1-1 │ Seq1-2 │ Seq2-1 │  <- Blocks allocated on demand
└────────┴────────┴────────┴────────┴────────┘
     ↑ Non-contiguous, dynamically allocated blocks

Key benefits:

  • Non-contiguous storage: KV cache blocks can be anywhere in GPU memory
  • Dynamic allocation: Memory allocated only as sequences grow
  • Memory sharing: Identical prompt prefixes share KV cache blocks
  • Near-zero waste: Eliminates internal fragmentation

Result: 2-4x higher throughput on the same hardware compared to naive implementations.
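
A toy sketch of the bookkeeping helps make this concrete (illustrative only—vLLM's real block manager is far more involved): each sequence keeps a block table mapping logical blocks to physical blocks drawn from a shared free pool, so memory is claimed one block at a time as the sequence grows.

Python
# Toy sketch of paged KV-cache allocation (illustrative only, not vLLM's code).
# Each sequence holds a block table mapping logical blocks -> physical blocks,
# and physical blocks are claimed on demand from a shared free pool.
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

class ToyBlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}

    def reserve(self, seq_id: str, total_tokens: int) -> None:
        """Grab new blocks only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-total_tokens // BLOCK_SIZE)  # ceil division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache full: the scheduler would preempt here")
            table.append(self.free_blocks.pop())

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = ToyBlockManager(num_physical_blocks=8)
mgr.reserve("seq1", total_tokens=20)  # needs 2 blocks
mgr.reserve("seq2", total_tokens=5)   # needs 1 block, interleaved with seq1's
print(mgr.block_tables)               # non-contiguous, allocated only as needed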

The Scheduler

vLLM's scheduler implements continuous batching with iteration-level scheduling:

  1. Prefill phase: New requests have their prompts processed in parallel
  2. Decode phase: Tokens generated one at a time, autoregressively
  3. Iteration-level scheduling: After each iteration, scheduler can:
    • Add new requests to the batch
    • Remove completed sequences
    • Preempt low-priority requests if memory pressure is high

This keeps GPU utilization high even with variable-length outputs.
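
In pseudocode, the loop looks roughly like this (a simplification—the real scheduler also tracks the KV-cache budget, priorities, and preemption; model_step here is a stand-in for one batched forward pass):

Python
# Simplified sketch of continuous batching with iteration-level scheduling.
# model_step() stands in for one forward pass that emits one token per sequence.
from collections import deque

def serving_loop(waiting: deque, model_step, max_num_seqs: int = 256):
    running: list = []
    while waiting or running:
        # Iteration-level scheduling: admit new requests before every step,
        # not only when the whole batch finishes.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        # One decode iteration for the entire batch.
        finished = model_step(running)
        # Completed sequences leave immediately, freeing their KV-cache blocks.
        running = [seq for seq in running if seq not in finished]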

Chunked Prefill

For long prompts (32K+ tokens), a single prefill can stall decoding for every other in-flight request. Chunked prefill splits the prompt into chunks:

Code
Without Chunked Prefill:
[========= 50K token prefill blocks all decode =========][decode][decode]

With Chunked Prefill:
[chunk1][decode][chunk2][decode][chunk3][decode][chunk4][decode]

Enable with --enable-chunked-prefill. Essential for long-context models.

Prefix Caching

When multiple requests share the same prefix (system prompt, few-shot examples), vLLM can reuse the computed KV cache:

Code
# These requests share the same system prompt
# vLLM computes KV cache for system prompt once, reuses for all

Request 1: [System Prompt] + "What is 2+2?"
Request 2: [System Prompt] + "Explain quantum computing"
Request 3: [System Prompt] + "Write a poem about AI"
           ↑
           KV cache computed once, shared across requests

Enable with --enable-prefix-caching. Can dramatically improve throughput for chat applications.
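
A minimal runnable version of the same idea with the offline API (the model choice here is just an example):

Python
from vllm import LLM, SamplingParams

# One shared system prompt: with prefix caching on, its KV cache is computed
# for the first request and reused by every request with the same prefix.
SYSTEM = "You are a helpful assistant. Answer concisely.\n\n"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

prompts = [
    SYSTEM + "What is 2+2?",
    SYSTEM + "Explain quantum computing.",
    SYSTEM + "Write a poem about AI.",
]
for output in llm.generate(prompts, SamplingParams(max_tokens=128)):
    print(output.outputs[0].text)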

Configuration Guide

Configuration is where most teams either unlock vLLM's full potential or leave performance on the table. The defaults are sensible for quick experiments, but production deployments require tuning based on your specific hardware, model, and workload characteristics.

The core trade-off is memory vs. throughput: allocating more GPU memory to KV cache enables larger batches and higher throughput, but leaves less margin for spikes and increases OOM risk. The right balance depends on your latency requirements and traffic patterns.

Essential Parameters

These parameters control memory allocation, parallelism, and optimization features. Start with these values, then tune based on your monitoring data.

Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",

    # Memory configuration
    gpu_memory_utilization=0.90,      # Use 90% of GPU memory
    max_model_len=8192,               # Maximum sequence length

    # Parallelism
    tensor_parallel_size=2,           # Split across 2 GPUs

    # Performance optimizations
    enable_prefix_caching=True,       # Reuse KV cache for shared prefixes
    enable_chunked_prefill=True,      # Don't block decode during long prefills

    # Quantization (optional; requires a checkpoint quantized with the chosen method)
    quantization="awq",               # Use AWQ INT4 quantization
)

Understanding each parameter:

  • gpu_memory_utilization=0.90: Pre-allocate 90% of GPU memory for KV cache and model weights. The remaining 10% is headroom for PyTorch/CUDA overhead. Increase to 0.95 for maximum throughput; decrease to 0.80 if you see occasional OOM errors. Monitor vllm:gpu_cache_usage_perc to tune.

  • max_model_len=8192: Maximum tokens (prompt + response) for any request. Directly impacts KV cache size—higher values mean fewer concurrent requests. Set to your actual maximum need, not the model's theoretical limit. A 70B model at max_model_len=32K uses ~4x the KV cache memory as max_model_len=8K.

  • tensor_parallel_size=2: Split model weights and KV cache across 2 GPUs. Required when model doesn't fit on one GPU; also useful to increase KV cache capacity. Requires NVLink for best performance. Use pipeline parallelism (pipeline_parallel_size) for cross-node or non-NVLink setups.

  • enable_prefix_caching=True: Reuse computed KV cache for identical prompt prefixes. Essential for chat applications where system prompts repeat. Can reduce compute by 50%+ for long system prompts.

  • enable_chunked_prefill=True: Process long prompts in chunks rather than all-at-once. Prevents long prompts from blocking token generation for other requests. Essential for models with 32K+ context.

  • quantization="awq": 4-bit quantization reduces memory 4x with ~2-3% quality loss. Enables larger models or more concurrent requests. FP8 is better quality but requires H100/Ada GPUs.
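
Assuming the llm object above loaded successfully, generation then uses the SamplingParams class imported earlier. A minimal usage example (the prompt is illustrative):

Python
# Minimal generation call using the llm configured above.
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of PagedAttention."], sampling)
print(outputs[0].outputs[0].text)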

Parameter Deep Dive

gpu_memory_utilization

Controls how much GPU memory vLLM pre-allocates for KV cache.

| Value | Use Case |
|---|---|
| 0.80 | Conservative, leaves headroom for spikes |
| 0.90 | Production default, good balance |
| 0.95 | Aggressive, maximum throughput |
| 0.98+ | Not recommended, OOM risk |

If you encounter frequent preemptions, increase this value. If you hit OOM errors, decrease it.
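
One way to ground that decision is to watch the live metrics. A quick check against the metrics endpoint (assuming the server runs at the default localhost:8000):

Python
# Quick check of KV-cache pressure and queue depth from the /metrics endpoint.
# Assumes the OpenAI-compatible server is running on localhost:8000.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith(("vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting")):
        print(line)
# Cache usage pinned near 1.0 with a growing queue -> raise gpu_memory_utilization
# (or scale out); OOM crashes -> lower it.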

max_model_len

Maximum sequence length (prompt + generation). Directly impacts memory usage.

Memory impact formula (approximate):

KV Cache = 2 × L × H_kv × d_h × S_max × B × bytes

where L = number of layers, H_kv = number of KV heads, d_h = head dimension, S_max = max sequence length, B = batch size, and bytes = 2 for FP16/BF16.

For a 70B model with 80 layers, 8 KV heads, head_dim 128, at FP16:

  • max_model_len=4096: ~1.3GB per sequence
  • max_model_len=8192: ~2.7GB per sequence
  • max_model_len=32768: ~10.7GB per sequence

Set this to the maximum you actually need, not the model's maximum capability.
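
The arithmetic is easy to sanity-check; a short script applying the formula above:

Python
# Worked example of the KV-cache formula for Llama-3.1-70B-class dimensions
# (80 layers, 8 KV heads, head_dim 128, FP16).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

for seq_len in (4096, 8192, 32768):
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"max_model_len={seq_len}: ~{gb:.1f} GB per sequence")
# -> ~1.3 GB, ~2.7 GB, ~10.7 GB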

tensor_parallel_size

Splits model across multiple GPUs. Each GPU holds a shard of the weights and KV cache.

Python
# Single GPU (if model fits)
tensor_parallel_size=1

# Model too large for one GPU, or need more KV cache space
tensor_parallel_size=2  # Requires 2 GPUs
tensor_parallel_size=4  # Requires 4 GPUs
tensor_parallel_size=8  # Requires 8 GPUs

When to use pipeline parallelism instead:

  • GPUs without NVLink (e.g., L40S, consumer GPUs)
  • Cross-node inference
Python
# Pipeline parallelism for non-NVLink GPUs
pipeline_parallel_size=2

Batching Parameters

Python
# Maximum sequences in a batch
max_num_seqs=256              # Default, good for most cases

# Maximum tokens per iteration (prefill + decode)
max_num_batched_tokens=8192   # Increase for higher throughput on large GPUs

For throughput optimization on large GPUs (A100, H100):

Python
max_num_batched_tokens=16384  # or higher

Quantization Options

vLLM supports multiple quantization methods:

| Method | Bits | Memory Reduction | Quality Impact | Best For |
|---|---|---|---|---|
| None (FP16/BF16) | 16 | Baseline | None | Quality-critical |
| FP8 | 8 | 2x | Minimal | H100/Ada GPUs |
| AWQ | 4 | 4x | Small (~2-3%) | Memory-constrained |
| GPTQ | 4 | 4x | Small (~2-5%) | Pre-quantized models |
| INT8 | 8 | 2x | Minimal | General production |
Python
# FP8 on H100 (recommended for H100s)
llm = LLM(model="...", quantization="fp8")

# AWQ for memory savings
llm = LLM(model="...", quantization="awq")

# Load pre-quantized GPTQ model
llm = LLM(model="TheBloke/Llama-2-70B-GPTQ")

Speculative Decoding

Speculative decoding trades compute for latency. Instead of generating one token at a time with the slow large model, a fast draft model generates several candidate tokens, then the large model verifies them in parallel. Since verification is parallelizable (unlike autoregressive generation), you get multiple tokens from one large-model forward pass.

The technique works because similar models have similar token distributions—the draft model often guesses correctly, and when it doesn't, the large model corrects it with minimal wasted work.

Python
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,  # Number of tokens to speculate
)

Expected speedup: 1.5-2.5x for latency-sensitive applications. Works best when:

  • Draft model is fast (1-3B parameters)
  • Draft and target models have similar token distributions
  • Tasks have predictable outputs (code, structured data)

Production Deployment

Docker Deployment

The simplest production deployment:

Bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192

For multi-GPU:

Bash
docker run --runtime nvidia --gpus '"device=0,1"' \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90

Kubernetes Deployment

Kubernetes deployment requires careful resource planning. vLLM needs exclusive access to GPUs, persistent storage for model weights (to avoid re-downloading on restarts), and proper health checks. The key insight: model loading is slow (minutes for large models), so optimize for pod stability over rapid scaling.

This deployment uses a PersistentVolumeClaim for model weights—new pods mount the same cached weights instead of downloading. This is crucial for scale-up speed.

YAML
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Key deployment decisions:

  • nodeSelector: Ensures pods land on GPU nodes. Use more specific selectors (nvidia.com/gpu.product: "A100-SXM4-80GB") for heterogeneous clusters.

  • PersistentVolumeClaim: Shared model storage is critical. Without it, each pod downloads the full model on startup—minutes for 70B models. Use ReadWriteMany storage (NFS, EFS, GCS FUSE) for multi-replica deployments.

  • replicas: 1: Start with one replica, use HPA to scale. LLM serving is GPU-bound, so scaling horizontally requires proportionally more GPUs.

  • ipc=host equivalent: Docker's --ipc=host gives the container ample shared memory, which NCCL needs for tensor parallelism. In Kubernetes, set hostIPC: true or mount a memory-backed emptyDir at /dev/shm if you hit shared-memory errors.

Horizontal Pod Autoscaling

Scale based on request queue length:

YAML
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"

For HPA to work, you need:

  1. Prometheus scraping vLLM metrics
  2. Prometheus Adapter exposing metrics to Kubernetes
  3. ReadWriteMany storage for model weights (new pods share the same PVC)

Using ReadWriteMany is crucial—new pods mount the already-downloaded weights instead of re-downloading them, which cuts minutes off scale-up time.

KEDA Autoscaling

For more sophisticated autoscaling:

YAML
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"
      query: sum(vllm_num_requests_waiting)

Monitoring and Observability

Prometheus Metrics

vLLM exposes metrics at /metrics by default. Key metrics to monitor:

Throughput Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:prompt_tokens_total | Prefill tokens processed | Track trend |
| vllm:generation_tokens_total | Generated tokens | Track trend |
| vllm:request_success_total | Successful requests | Track trend |

Latency Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:time_to_first_token_seconds | Time to first token (TTFT) | p95 > 2s |
| vllm:time_per_output_token_seconds | Inter-token latency | p95 > 100ms |
| vllm:e2e_request_latency_seconds | End-to-end latency | p95 > 10s |

Resource Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:num_requests_running | Currently processing | Capacity planning |
| vllm:num_requests_waiting | Queue depth | > 50 (scale up) |
| vllm:gpu_cache_usage_perc | KV cache utilization | > 95% |
| vllm:cpu_prefix_cache_hit_rate | Prefix cache effectiveness | Track trend |

Grafana Dashboard

Essential panels for your vLLM dashboard:

  1. Request Rate: rate(vllm:request_success_total[5m])
  2. Queue Depth: vllm:num_requests_waiting
  3. TTFT p95: histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))
  4. Throughput: rate(vllm:generation_tokens_total[5m])
  5. GPU Cache Usage: vllm:gpu_cache_usage_perc
  6. Error Rate: rate(vllm:request_failure_total[5m]) / (rate(vllm:request_success_total[5m]) + rate(vllm:request_failure_total[5m]))

Alerting Rules

YAML
# prometheus-rules.yaml
groups:
- name: vllm-alerts
  rules:
  - alert: HighQueueDepth
    expr: vllm:num_requests_waiting > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue is building up"

  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "vLLM p95 latency exceeds 10s"

  - alert: HighCacheUsage
    expr: vllm:gpu_cache_usage_perc > 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM GPU cache near capacity"

Troubleshooting

Out of Memory (OOM) Errors

Symptoms: CUDA OOM errors, process crashes

Solutions:

  1. Reduce gpu_memory_utilization (try 0.85 or 0.80)
  2. Reduce max_model_len
  3. Use quantization (AWQ, GPTQ, FP8)
  4. Increase tensor_parallel_size (add more GPUs)
  5. Enable chunked prefill for long sequences
Python
# OOM-safe configuration
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    quantization="awq",
    tensor_parallel_size=2,
    enable_chunked_prefill=True,
)

High Latency

Symptoms: TTFT > 1s, slow token generation

Diagnosis:

  1. Check queue depth (vllm:num_requests_waiting)
  2. Check GPU cache usage (vllm:gpu_cache_usage_perc)
  3. Check if prefill is blocking decode (enable chunked prefill)

Solutions:

  1. Scale horizontally (add more replicas)
  2. Enable prefix caching for repeated prefixes
  3. Reduce max_model_len if sequences are shorter
  4. Use speculative decoding for latency-sensitive workloads

Low Throughput

Symptoms: GPU utilization < 80%, tokens/s below expected

Diagnosis:

  1. Check batch size (vllm:num_requests_running)
  2. Check for preemptions in logs
  3. Verify continuous batching is working

Solutions:

  1. Increase max_num_batched_tokens
  2. Increase gpu_memory_utilization
  3. Ensure enough concurrent requests to fill batches
  4. Check for network bottlenecks in distributed setups

Preemption Issues

Symptoms: Requests being preempted, retried

Solutions:

  1. Increase gpu_memory_utilization
  2. Reduce max_model_len
  3. Use quantization to free up KV cache space
  4. Scale horizontally to reduce per-instance load

Performance Benchmarking

Benchmarking Setup

Use the official vLLM benchmark scripts:

Bash
# Throughput benchmark
python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 512 \
    --output-len 128 \
    --num-prompts 1000

# Latency benchmark
python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 512 \
    --output-len 128 \
    --batch-size 1

Expected Performance

Rough benchmarks on common hardware (Llama 3.1 8B, 512 input / 128 output):

| GPU | Throughput (tok/s) | TTFT (ms) | ITL (ms) |
|---|---|---|---|
| A100 80GB | 4,000-5,000 | 50-100 | 15-25 |
| H100 80GB | 6,000-8,000 | 30-60 | 10-15 |
| L40S 48GB | 2,500-3,500 | 80-150 | 25-40 |
| A10G 24GB | 1,500-2,000 | 100-200 | 30-50 |

Performance varies significantly with:

  • Quantization method
  • Batch size and concurrency
  • Sequence lengths
  • Model architecture

Always benchmark on your specific workload.
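
If you'd rather measure from the client side against a running server, here is a rough sketch using the OpenAI-compatible endpoint (the base URL and model name are placeholders for your deployment, and each streamed chunk is treated as roughly one token):

Python
# Rough client-side TTFT / decode-rate check against a vLLM OpenAI-compatible
# server. Base URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token is None:
            first_token = time.perf_counter()
        chunks += 1  # each chunk is approximately one token

end = time.perf_counter()
print(f"TTFT: {(first_token - start) * 1000:.0f} ms")
print(f"Decode rate: {chunks / (end - first_token):.1f} tok/s (approximate)")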

vLLM vs Alternatives

vLLM vs SGLang

SGLang excels at structured generation and prefix-heavy workloads:

| Aspect | vLLM | SGLang |
|---|---|---|
| Raw throughput | Excellent | Excellent |
| Prefix reuse | Good (prefix caching) | Best (RadixAttention) |
| Structured output | Basic | Excellent (grammar support) |
| Community size | Largest | Growing fast |
| Production maturity | Battle-tested | Newer |

Choose SGLang if: Heavy agent/tool use, lots of prefix reuse, need constrained decoding.

Choose vLLM if: General-purpose serving, maximum compatibility, largest community.

vLLM vs TensorRT-LLM

TensorRT-LLM offers maximum performance on NVIDIA hardware:

| Aspect | vLLM | TensorRT-LLM |
|---|---|---|
| Setup complexity | Low | High |
| Performance | Excellent | Maximum on NVIDIA |
| Hardware support | Broad | NVIDIA only |
| Model support | 100+ | 25-40 |
| Update frequency | Weekly | Monthly |

Choose TensorRT-LLM if: All-NVIDIA stack, need absolute maximum performance, have engineering resources.

Choose vLLM if: Need flexibility, broad model support, faster iteration.

Best Practices Summary

  1. Start with sensible defaults: gpu_memory_utilization=0.90, enable prefix caching
  2. Right-size max_model_len: Set to actual maximum needed, not model maximum
  3. Enable chunked prefill: Essential for long-context workloads
  4. Monitor queue depth: Scale horizontally when consistently > 10-20
  5. Use ReadWriteMany storage: Critical for fast horizontal scaling
  6. Benchmark your workload: Don't rely on generic benchmarks
  7. Set up alerting: Queue depth, latency, cache usage
  8. Consider quantization: AWQ or FP8 for significant memory savings with minimal quality loss

Conclusion

vLLM has earned its position as the leading open-source LLM serving framework through a combination of performance, flexibility, and ease of use. PagedAttention, continuous batching, and the constant stream of optimizations make it the right choice for most production deployments.

Start with Docker for development, graduate to Kubernetes for production. Monitor religiously, and scale horizontally when needed. The techniques in this guide will help you serve millions of requests reliably.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
