vLLM Internals: A Deep Dive into the Architecture of High-Performance LLM Inference
A comprehensive exploration of vLLM's internal architecture—examining PagedAttention implementation, block pool management, continuous batching scheduler, KV cache coordination, and the v1 engine design that powers high-throughput LLM serving.
Introduction
vLLM has emerged as the leading open-source framework for high-performance LLM inference, powering production deployments at scale across the industry. While its user-facing APIs are straightforward, the internal machinery that enables its exceptional throughput involves sophisticated memory management, intelligent scheduling, and carefully optimized data structures.
This post explores vLLM's internal architecture by examining its actual implementation. We'll trace the flow from incoming requests through the scheduler, understand how PagedAttention manages KV cache memory at the block level, and see how continuous batching maximizes GPU utilization. This is a companion to the production deployment guide, focusing specifically on how vLLM works under the hood rather than how to configure and deploy it.
High-Level Architecture Overview
vLLM's architecture separates concerns across several key components that work together to achieve high throughput. At the highest level, the system consists of an Engine that orchestrates everything, a Scheduler that decides which requests to process in each iteration, KV Cache Managers that handle memory allocation, Workers that execute model inference on GPUs, and Model Runners that prepare inputs and execute the forward pass.
The Engine Core serves as the central coordinator. It receives requests from clients, passes them to the scheduler for batching decisions, dispatches scheduled batches to workers for execution, and processes outputs to return to clients. The design follows a producer-consumer pattern where the scheduler produces batches and workers consume them, with careful attention to minimizing pipeline bubbles.
Request flow begins when a client submits a prompt. The engine wraps this in a Request object that tracks all state including token IDs, computed tokens, allocated blocks, and generation progress. The scheduler maintains queues of waiting and running requests, making decisions each iteration about which requests to advance based on available memory and fairness policies.
PagedAttention: Block-Based KV Cache Management
The core innovation that enables vLLM's efficiency is PagedAttention, which treats GPU memory for KV cache like virtual memory in operating systems. Rather than pre-allocating contiguous memory for maximum sequence length, PagedAttention allocates fixed-size blocks on demand and allows non-contiguous storage.
The KVCacheBlock Data Structure
At the lowest level, KV cache is managed through KVCacheBlock objects. Each block represents a fixed number of tokens' worth of key-value cache memory. The KVCacheBlock dataclass maintains the block ID (ranging from 0 to num_gpu_blocks minus 1), a reference count tracking how many sequences use this block, an optional block hash for prefix caching, and pointers for doubly-linked list management.
The reference count is crucial for memory management. When a block's ref_cnt reaches zero, it becomes available for reallocation. However, if prefix caching is enabled, these zero-refcount blocks aren't immediately reclaimed—they remain as eviction candidates that can be reused if another request has a matching prefix.
The block_hash field enables prefix caching. When a block is full and its content matches a cacheable prefix, it receives a hash that allows future requests with identical prefixes to reference the same block. This hash combines the content hash with a group ID to support multiple KV cache groups in architectures like MLA (Multi-head Latent Attention).
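A minimal sketch of such a block record, with field names following the description above (the real vLLM dataclass carries additional bookkeeping, so treat this as illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KVCacheBlock:
    """One fixed-size page of KV cache (a sketch, not vLLM's exact class)."""
    block_id: int                     # index into the physical cache, 0..num_gpu_blocks-1
    ref_cnt: int = 0                  # how many running sequences currently map to this block
    block_hash: Optional[int] = None  # set only when the block is full and cacheable
    # doubly-linked-list pointers used by the free-block queue
    prev_free_block: Optional["KVCacheBlock"] = None
    next_free_block: Optional["KVCacheBlock"] = None
```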
The FreeKVCacheBlockQueue
Managing free blocks efficiently is critical for performance. vLLM implements a custom FreeKVCacheBlockQueue that maintains a doubly-linked list of available blocks. This custom implementation serves a specific purpose: it allows O(1) removal of blocks from anywhere in the queue, not just the ends.
Why does this matter? When a cached block is hit by another request with the same prefix, that block needs to be removed from the free queue (since it's now in use) regardless of its position. Standard Python deques only support efficient removal from either end. The custom implementation manipulates prev_free_block and next_free_block pointers directly on KVCacheBlock objects, avoiding Python object allocation during queue operations.
The queue uses fake head and tail sentinel blocks to eliminate edge case handling. All real blocks in the queue are guaranteed to have valid prev and next pointers, simplifying the manipulation code. This attention to low-level detail reflects vLLM's overall philosophy of optimizing every operation that occurs on the critical path.
Block eviction follows LRU (Least Recently Used) policy with a twist: when freeing blocks from a completed request, blocks are freed in reverse order so that tail blocks (which contain later tokens) are evicted first. This preserves prefix caching effectiveness since prefixes are more likely to be shared across requests.
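To make the pointer manipulation concrete, here is a stripped-down free queue in the same shape, reusing the KVCacheBlock sketch above: sentinel nodes at both ends, appends at the tail, pops from the front, and O(1) removal from the middle. It illustrates the idea rather than reproducing vLLM's actual class.

```python
class FreeBlockQueue:
    """Doubly-linked free list with sentinel head/tail nodes (illustrative sketch)."""

    def __init__(self, blocks):
        self.head = KVCacheBlock(block_id=-1)   # fake sentinel
        self.tail = KVCacheBlock(block_id=-2)   # fake sentinel
        self.head.next_free_block = self.tail
        self.tail.prev_free_block = self.head
        self.num_free = 0
        for block in blocks:
            self.append(block)

    def append(self, block):
        """Freed blocks go to the back; the front holds the oldest (LRU) candidates."""
        last = self.tail.prev_free_block
        last.next_free_block = block
        block.prev_free_block = last
        block.next_free_block = self.tail
        self.tail.prev_free_block = block
        self.num_free += 1

    def popleft(self):
        """Take the LRU eviction candidate at the front for a new allocation."""
        block = self.head.next_free_block
        if block is self.tail:
            raise RuntimeError("no free KV cache blocks")
        self.remove(block)
        return block

    def remove(self, block):
        """O(1) unlink from anywhere in the queue, e.g. on a prefix-cache hit."""
        block.prev_free_block.next_free_block = block.next_free_block
        block.next_free_block.prev_free_block = block.prev_free_block
        block.prev_free_block = block.next_free_block = None
        self.num_free -= 1
```

Because allocations pop from the front, freeing a finished request's blocks in reversed order places its tail blocks ahead of its prefix blocks in the queue, so the tail blocks are reclaimed first.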
The BlockPool
The BlockPool sits above individual blocks and provides the primary interface for block allocation and caching. It maintains the pool of all KV cache blocks, the free block queue, and a hash-to-block mapping for prefix cache lookups.
When allocating new blocks, the BlockPool pops blocks from the front of the free queue. If prefix caching is enabled and the popped block has a cached hash, it must first be evicted from the cache—its hash is removed from the lookup table and the block's hash metadata is cleared. This eviction-on-allocation approach is more efficient than eagerly evicting blocks when they're freed.
The cache lookup mechanism handles duplicate blocks elegantly. Multiple blocks can have the same content hash (from different requests that happened to produce identical KV cache content). The BlockHashToBlockMap uses a union type internally: a single KVCacheBlock for the common case of one block per hash, and a dictionary mapping block_id to KVCacheBlock when duplicates exist. This avoids dictionary allocation overhead in the common case while still handling duplicates correctly.
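Combining the two sketches above, a simplified pool might allocate, cache, and lazily evict blocks roughly as follows. For brevity this collapses the union-typed BlockHashToBlockMap into a plain dictionary and omits the step that assigns hashes to newly filled blocks; it is a sketch of the behavior described here, not vLLM's BlockPool.

```python
class BlockPool:
    """Simplified pool: a free queue plus a hash -> block map (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.blocks = [KVCacheBlock(block_id=i) for i in range(num_blocks)]
        self.free_queue = FreeBlockQueue(self.blocks)
        self.cached_blocks = {}              # block_hash -> KVCacheBlock

    def num_free_blocks(self) -> int:
        """Blocks available right now, including zero-refcount eviction candidates."""
        return self.free_queue.num_free

    def get_cached_block(self, block_hash):
        """Prefix-cache hit: reuse the block and pull it out of the free queue."""
        block = self.cached_blocks.get(block_hash)
        if block is None:
            return None
        if block.ref_cnt == 0:               # it was sitting in the queue as a candidate
            self.free_queue.remove(block)
        block.ref_cnt += 1
        return block

    def allocate_block(self) -> KVCacheBlock:
        """Take a free block; lazily evict whatever stale cache entry it still carries."""
        block = self.free_queue.popleft()
        if block.block_hash is not None:
            self.cached_blocks.pop(block.block_hash, None)
            block.block_hash = None
        block.ref_cnt = 1
        return block

    def free_block(self, block: KVCacheBlock):
        """Drop one reference; at zero the block becomes an eviction candidate."""
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            self.free_queue.append(block)    # its cache entry stays until reallocation
```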
The KVCacheManager
Above the BlockPool, the KVCacheManager provides the scheduler-facing interface for memory management. It handles the translation between logical request state and physical block allocation.
The key method is allocate_slots, which allocates memory for new tokens being added to a request. This method handles several concerns: it first frees any blocks that should be removed due to sliding window attention, then calculates how many new blocks are needed considering already-computed tokens and prefix cache hits, checks if enough free blocks are available, touches cached blocks to prevent eviction, and finally allocates the required new blocks.
The allocate_slots method returns None if insufficient memory is available, signaling to the scheduler that preemption is needed. This design keeps memory pressure decisions in the scheduler while keeping memory mechanics in the cache manager.
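Stripped of sliding-window trimming, cached-block touching, and hashing, the heart of allocate_slots is block-count arithmetic plus an availability check. The sketch below assumes a request object carrying num_computed_tokens and a blocks list, and a pool like the one sketched earlier; the names and signature are illustrative, not vLLM's actual API.

```python
import math

BLOCK_SIZE = 16   # tokens per KV cache block (configurable in real deployments)

def allocate_slots(pool, request, num_new_tokens: int):
    """Allocate blocks for the request's next tokens, or return None under memory pressure."""
    total_tokens = request.num_computed_tokens + num_new_tokens
    total_blocks = math.ceil(total_tokens / BLOCK_SIZE)
    num_needed = total_blocks - len(request.blocks)
    if num_needed > pool.num_free_blocks():
        return None                          # scheduler must preempt or keep the request waiting
    new_blocks = [pool.allocate_block() for _ in range(num_needed)]
    request.blocks.extend(new_blocks)
    return new_blocks
```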
Prefix caching integration happens through the get_computed_blocks method. Given a request, it finds the longest prefix that's already cached, returning both the cached blocks and the number of computed tokens. The scheduler uses this information to skip redundant computation—if 1000 tokens of a 2000-token prompt are already cached, only 1000 tokens need prefilling.
The KVCacheCoordinator
For complex scenarios involving multiple KV cache groups (like hybrid attention patterns or MLA), the KVCacheCoordinator manages multiple SingleTypeKVCacheManagers. Each manager handles blocks for a specific cache type while the coordinator provides a unified interface.
The coordinator handles cross-group operations like finding cache hits that span multiple groups and ensuring blocks are allocated consistently across groups. This abstraction allows vLLM to support architectures like DeepSeek's MLA that require different cache formats for different attention heads.
The Continuous Batching Scheduler
vLLM's scheduler implements continuous batching, which is fundamentally different from static batching. Rather than waiting to fill a batch before processing, continuous batching can add and remove requests at every iteration, maximizing GPU utilization.
Scheduling Algorithm
The scheduler maintains two primary queues: waiting (requests awaiting processing) and running (requests currently being processed). Each scheduling step produces a SchedulerOutput describing which requests to process and how many tokens from each.
The algorithm proceeds in phases. First, it schedules running requests. For each running request, the scheduler determines how many new tokens to compute (typically one in decode phase, possibly more for chunked prefill), attempts to allocate KV cache blocks for these tokens, and if allocation fails, preempts lower-priority requests until allocation succeeds or the request itself must be preempted.
This design gives running requests priority over waiting ones: a request that has already consumed resources runs to completion rather than being starved by new arrivals.
Next, waiting requests are scheduled with remaining budget. After handling running requests, if there's remaining token budget and memory capacity, waiting requests are added to the batch. For each waiting request, the scheduler checks for prefix cache hits, calculates how many tokens need to be computed, allocates blocks for those tokens, and adds the request to the running queue.
The scheduler respects several constraints: max_num_seqs limits concurrent requests, max_num_batched_tokens limits total tokens per iteration, and available KV cache memory limits how many blocks can be allocated.
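Under heavy simplification (one decode token per running request, FCFS preemption, no chunked prefill or speculative tokens), one scheduling step has roughly this shape. Method names such as allocate_slots and get_computed_blocks mirror the interfaces described above, while free, num_computed_tokens, and prompt_token_ids are assumed attributes for the sketch; this is not vLLM's Scheduler code.

```python
from collections import deque

def schedule_step(running: list, waiting: deque, kv_manager,
                  max_num_seqs: int, token_budget: int):
    """One continuous-batching iteration: advance running requests, then admit waiting ones."""
    scheduled, preempted = [], set()

    # Phase 1: running requests get priority; decode typically adds one token each.
    for req in list(running):
        if req in preempted:
            continue
        while kv_manager.allocate_slots(req, 1) is None:
            victim = running.pop()             # FCFS: preempt the most recently admitted request
            kv_manager.free(victim)
            victim.num_computed_tokens = 0     # recomputed later (prefix cache softens the cost)
            waiting.appendleft(victim)
            preempted.add(victim)
            if victim is req:                  # the request ran out of memory for itself
                break
        if req not in preempted:
            scheduled.append((req, 1))
            token_budget -= 1

    # Phase 2: admit waiting requests with whatever budget and memory remain.
    while waiting and len(running) < max_num_seqs and token_budget > 0:
        req = waiting[0]
        _, num_cached_tokens = kv_manager.get_computed_blocks(req)
        num_new = min(len(req.prompt_token_ids) - num_cached_tokens, token_budget)
        if kv_manager.allocate_slots(req, num_new) is None:
            break                              # out of KV cache blocks; stop admitting
        waiting.popleft()
        running.append(req)
        scheduled.append((req, num_new))
        token_budget -= num_new

    return scheduled
```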
Preemption Strategy
When memory pressure requires preemption, vLLM implements a policy-based approach. With FCFS (First Come First Served) policy, the most recently added request is preempted. With priority scheduling, the lowest-priority request is preempted.
Preemption involves freeing all blocks allocated to the request and moving it back to the waiting queue with PREEMPTED status. The request's num_computed_tokens is reset to zero—unlike some systems that save partial computation, vLLM recomputes from scratch. This simplifies memory management and leverages prefix caching to recover much of the work.
The scheduler tracks preemption counts per request, which can be exposed in metrics to identify problematic workload patterns (like many very long requests competing with short requests).
Chunked Prefill Integration
For long prompts, computing all prefill tokens in one iteration would block decode for existing requests. Chunked prefill addresses this by limiting how many prefill tokens are processed per iteration.
The scheduler enforces this through long_prefill_token_threshold. If a new request would exceed this threshold, the scheduler only schedules up to the threshold tokens, leaving the rest for subsequent iterations. The request stays in the running queue but continues prefilling rather than decoding until all prompt tokens are processed.
This creates interleaving: one iteration might process 4096 tokens of a long prompt plus decode tokens for 50 short requests, then the next iteration processes another 4096 prefill tokens plus more decodes. GPU utilization stays high even with very long prompts in the mix.
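The chunking itself reduces to a min() over the remaining prompt, the global token budget, and the per-request threshold. A small sketch, with the parameter named after the option above (the real scheduler folds this into its overall budget accounting):

```python
def prefill_chunk(prompt_len: int, num_computed_tokens: int,
                  token_budget: int, long_prefill_token_threshold: int) -> int:
    """How many prompt tokens to prefill for one request this iteration."""
    remaining = prompt_len - num_computed_tokens
    return min(remaining, token_budget, long_prefill_token_threshold)

# A 16,384-token prompt with a 4096 threshold takes four iterations to finish
# prefilling, leaving budget for other requests' decode tokens each time.
assert prefill_chunk(16_384, 0, 8192, 4096) == 4096
assert prefill_chunk(16_384, 12_288, 8192, 4096) == 4096
```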
Encoder Input Scheduling
For multimodal models with encoder components (like vision encoders for images), the scheduler also manages encoder inputs. These have their own budget (max_num_encoder_input_tokens) and cache (EncoderCacheManager).
Encoder inputs must be scheduled when their corresponding decoder positions are being processed. The scheduler tracks which encoder inputs are needed for each token range and schedules them appropriately, deallocating encoder cache after the relevant decoder tokens have passed.
The V1 Engine Architecture
vLLM's v1 engine represents a significant architectural evolution focused on cleaner separation of concerns and better support for advanced features like disaggregated serving and speculative decoding.
EngineCore
The EngineCore is the central coordinator in v1. It initializes the executor (which manages workers), profiles memory to determine KV cache capacity, creates the scheduler, and drives the main step loop.
The step function follows a clean flow: it checks if there are any requests to process, calls the scheduler to get a SchedulerOutput, dispatches this to the executor for model execution, and processes outputs to update request state. The engine supports both blocking and async execution modes.
A key optimization is batch queuing for pipeline parallelism. When using multiple pipeline stages, the engine can have multiple batches in flight simultaneously to hide pipeline bubble latency. The batch_queue holds (Future, SchedulerOutput) pairs, allowing the engine to continue scheduling while previous batches execute.
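In its blocking form, the step loop is compact. The sketch below mirrors the flow just described with assumed scheduler and executor objects; the real EngineCore adds async execution and the batch queue for pipeline parallelism.

```python
def step(scheduler, executor):
    """One blocking engine iteration: schedule, execute, update."""
    if not scheduler.has_requests():
        return []                                  # nothing waiting or running
    scheduler_output = scheduler.schedule()        # which requests, how many tokens each
    model_output = executor.execute_model(scheduler_output)
    # Feed sampled tokens back so the scheduler can advance or finish requests.
    return scheduler.update_from_output(scheduler_output, model_output)
```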
The Executor and Workers
The Executor manages a pool of workers, typically one per GPU. It provides a collective_rpc interface for calling methods on all workers simultaneously and handles the distributed communication setup.
Workers are responsible for actual model execution. Each GPUWorker owns a GPUModelRunner that handles the details of preparing inputs, running the forward pass, and collecting outputs. The separation allows different worker types (GPU, CPU, TPU) while sharing common scheduling and memory management.
The GPUModelRunner
The model runner is where the rubber meets the road. It manages the model instance, KV cache tensors, attention metadata builders, and the sampling pipeline.
When executing a batch, the runner goes through several stages. First, it prepares input tensors from the scheduler output, including token IDs, position IDs, and attention metadata. The attention metadata is particularly complex—it must describe which tokens are prefill versus decode, block tables mapping logical positions to physical blocks, and any special handling for prefix caching or sliding windows.
The forward pass runs the model with these prepared inputs. For most architectures, this is a standard transformer forward pass, but the attention layers interact with the KV cache through vLLM's Attention module rather than standard PyTorch attention.
After the forward pass, sampling produces next tokens. vLLM's sampler handles various strategies (greedy, sampling with temperature, beam search) and supports advanced features like guided generation with grammar constraints.
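The three stages fit into a short outline. Every method and attribute name here (prepare_inputs, compute_logits, sampler, sampling_metadata) is an illustrative placeholder for the stages described above rather than the GPUModelRunner's real API:

```python
import torch

def execute_batch(runner, scheduler_output):
    """Illustrative outline of one model-runner iteration."""
    # 1. Flatten the scheduled requests into input tensors plus attention metadata
    #    (prefill/decode split, block tables, slot mappings).
    input_ids, positions, attn_metadata = runner.prepare_inputs(scheduler_output)

    # 2. Forward pass: attention layers read and write the paged KV cache
    #    through the metadata rather than recomputing past keys and values.
    with torch.inference_mode():
        hidden_states = runner.model(input_ids, positions, attn_metadata)
        logits = runner.model.compute_logits(hidden_states)

    # 3. Sampling turns logits into next tokens (greedy, temperature, guided, ...).
    return runner.sampler(logits, runner.sampling_metadata)
```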
Attention Layer Implementation
The Attention class wraps the model's attention computation and KV cache interaction. During forward, it receives query, key, and value tensors, stores keys and values in the KV cache, performs attention computation using a pluggable backend, and returns the attention output.
The backend selection is sophisticated. vLLM supports multiple attention backends (FlashAttention, FlashInfer, xFormers, Triton) with automatic selection based on hardware capabilities and configuration. Each backend has different performance characteristics and feature support—FlashAttention is generally fastest on NVIDIA GPUs but FlashInfer offers features like tree attention for speculative decoding.
The attention metadata passed to backends describes the batch structure: which positions are prefill versus decode, sequence lengths, block tables for KV cache lookup, and per-head information for MLA. The metadata builder constructs this from the scheduler output, a process that can be surprisingly expensive and is heavily optimized.
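Conceptually, the layer's forward pass is a cache write followed by a backend call. The sketch below uses invented backend methods (write_to_cache, attention) to show the shape of the interaction; vLLM's real Attention module and backend interfaces differ in their details:

```python
import torch

class PagedAttentionLayer(torch.nn.Module):
    """Illustrative paged-attention layer: persist new K/V, then attend over the cache."""

    def __init__(self, backend):
        super().__init__()
        self.backend = backend   # e.g. a FlashAttention- or Triton-based kernel wrapper

    def forward(self, query, key, value, kv_cache, attn_metadata):
        # Scatter this step's keys/values into their physical blocks using the
        # slot mapping derived from the block tables.
        self.backend.write_to_cache(key, value, kv_cache, attn_metadata.slot_mapping)
        # Attend: queries gather cached K/V through the block tables, with the
        # metadata distinguishing prefill from decode positions.
        return self.backend.attention(query, kv_cache, attn_metadata)
```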
MLA (Multi-Head Latent Attention) Support
vLLM supports MLA, the efficient attention variant used by DeepSeek models. MLA compresses keys and values into a latent space, reducing KV cache size dramatically while maintaining quality.
The MLAAttention class handles this differently from standard attention. Instead of separate K and V tensors, it receives compressed kv_c_normed and positional k_pe tensors. The KV cache stores these compressed representations rather than full K/V, and specialized backends (like TRITON_MLA or FlashInfer with MLA support) handle the attention computation.
Speculative Decoding Integration
vLLM's v1 engine has first-class support for speculative decoding, where a smaller draft model proposes multiple tokens that the main model verifies in parallel.
Proposer Architecture
Draft proposals are generated by proposer classes. The NgramProposer uses n-gram statistics from the context for zero-cost speculation. EagleProposer runs a small draft model that shares some layers with the main model. MedusaProposer uses additional heads on the main model to predict multiple future tokens.
The proposer interface is simple: given current context, propose up to num_speculative_tokens candidates. The scheduler tracks these proposals and includes them in the batch for verification.
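A toy version of the n-gram idea shows how cheap the proposal step can be: find an earlier occurrence of the current suffix and copy what followed it. This is a simplification of prompt-lookup-style speculation, not vLLM's NgramProposer:

```python
class NgramProposerSketch:
    """Toy n-gram proposer: if the last n tokens appeared earlier, copy what followed."""

    def __init__(self, n: int = 3, num_speculative_tokens: int = 4):
        self.n = n
        self.k = num_speculative_tokens

    def propose(self, token_ids: list[int]) -> list[int]:
        if len(token_ids) < self.n:
            return []
        suffix = token_ids[-self.n:]
        # Scan backwards for an earlier occurrence of the current n-gram suffix.
        for start in range(len(token_ids) - self.n - 1, -1, -1):
            if token_ids[start:start + self.n] == suffix:
                return token_ids[start + self.n:start + self.n + self.k]
        return []

# The suffix [5, 6, 7] occurred earlier and was followed by [8, 9], so propose those.
ctx = [5, 6, 7, 8, 9, 5, 6, 7]
assert NgramProposerSketch(n=3, num_speculative_tokens=2).propose(ctx) == [8, 9]
```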
Rejection Sampling
After the main model runs, rejection sampling determines which speculative tokens to accept. The RejectionSampler compares draft probabilities against target probabilities, accepting tokens that meet the acceptance criterion.
The output format is designed for efficiency: sampled token IDs are packed into a fixed-size tensor with invalid entries marked by a sentinel value. The parse_output method handles unpacking, identifying where each sequence's tokens end.
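The acceptance rule itself is the standard speculative-sampling criterion: accept a draft token with probability min(1, p_target / p_draft), and on the first rejection sample a replacement from the normalized residual distribution. Below is a single-sequence sketch of that rule; vLLM's RejectionSampler vectorizes it across the whole batch and packs results with a sentinel value.

```python
import torch

def rejection_sample(draft_token_ids, draft_probs, target_probs, generator=None):
    """Accept draft tokens while u * p_draft <= p_target; stop at the first rejection.

    draft_token_ids: (k,) proposed tokens; draft_probs/target_probs: (k, vocab) distributions.
    """
    accepted = []
    for i, token in enumerate(draft_token_ids.tolist()):
        p_target = target_probs[i, token].item()
        p_draft = draft_probs[i, token].item()
        u = torch.rand(1, generator=generator).item()
        if u * p_draft <= p_target:
            accepted.append(token)
        else:
            # Rejected: sample a replacement from the residual max(0, target - draft).
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return accepted
```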
Integration with Scheduling
Speculative decoding interacts with scheduling through the num_lookahead_tokens parameter. The scheduler allocates extra KV cache blocks beyond what's immediately needed to accommodate speculative tokens. If speculation is rejected, these extra blocks are simply not used on the next iteration.
The scheduler also needs to handle the case where accepted tokens vary per sequence. After verification, some sequences might accept all speculative tokens while others accept none. The scheduler tracks num_computed_tokens per request and adjusts accordingly.
Prefix Caching Deep Dive
Prefix caching is one of vLLM's most impactful optimizations for workloads with shared system prompts or few-shot examples.
Block Hashing
Blocks are cached using content-based hashing. The hash function considers token IDs in the block, the hash of the previous block (creating a chain), and any extra keys like LoRA adapter IDs or multimodal content hashes.
The chained hashing is important: it means block 5 of sequence A and block 5 of sequence B will only have the same hash if blocks 0-4 are also identical. This prevents false cache hits where blocks happen to contain the same tokens but are reached through different prefixes.
Hash computation is configurable—users can choose between fast hashing (xxhash) for most cases or cryptographic hashing (sha256) when reproducibility across processes matters.
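The chaining is easy to see with a hash function that folds in the parent block's digest. This uses sha256 for clarity and is not vLLM's exact hashing code:

```python
import hashlib
import struct

def hash_block(parent_hash: bytes, token_ids: list[int], extra_key: bytes = b"") -> bytes:
    """Chained content hash: a block's identity includes its entire prefix."""
    h = hashlib.sha256()
    h.update(parent_hash)                                 # ties this block to everything before it
    h.update(struct.pack(f"{len(token_ids)}q", *token_ids))
    h.update(extra_key)                                   # e.g. LoRA adapter ID or image hash
    return h.digest()

# Identical tokens in block 1 hash differently when block 0 differed:
root = b"\x00" * 32
a0 = hash_block(root, [1, 2, 3, 4])
b0 = hash_block(root, [9, 9, 9, 9])
assert hash_block(a0, [5, 6, 7, 8]) != hash_block(b0, [5, 6, 7, 8])
```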
Cache Hit Flow
When a new request arrives, the scheduler calls get_computed_blocks. This method iterates through the request's block hashes, looking each up in the cache. The search continues until a miss is found or the end of the prompt is reached.
For a cache hit, the existing blocks' reference counts are incremented (preventing eviction), and the number of computed tokens is returned. The scheduler then only needs to prefill the non-cached suffix.
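Reusing the simplified BlockPool from earlier, the longest-prefix lookup is a single loop that stops at the first miss (a sketch of the flow, not vLLM's get_computed_blocks):

```python
def get_computed_blocks(pool, block_hashes, block_size: int = 16):
    """Walk the request's block hashes until the first miss; return hits and token count."""
    computed_blocks = []
    for block_hash in block_hashes:
        block = pool.get_cached_block(block_hash)   # also bumps ref_cnt and unqueues the block
        if block is None:
            break                                   # the longest cached prefix ends here
        computed_blocks.append(block)
    return computed_blocks, len(computed_blocks) * block_size
```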
Eviction and Memory Pressure
Cached blocks with zero reference count are eviction candidates. They remain in the free block queue but are also in the hash-to-block mapping. When new blocks are allocated, if an eviction candidate is popped, its cache entry is removed.
This lazy eviction approach is efficient because many eviction candidates may be hit again before their memory is needed. Only when memory pressure requires new blocks are cached blocks actually evicted.
The cache can be reset entirely through reset_prefix_cache, useful when model weights change (like in RLHF) and cached KV values are no longer valid.
Distributed Execution
vLLM supports both tensor parallelism and pipeline parallelism for large models that don't fit on a single GPU.
Tensor Parallelism
With tensor parallelism, each layer's computation is split across GPUs. Attention heads are divided among workers, and all-reduce operations combine results. The scheduler runs on rank 0 and broadcasts decisions to all workers.
KV cache is distributed accordingly—each worker stores only the KV cache for its attention heads. Block allocation is coordinated so all workers allocate the same block IDs for the same sequences.
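The partitioning arithmetic is simple but worth keeping in mind when sizing deployments, since each worker stores KV cache only for its own share of the heads. A small illustrative helper:

```python
def heads_per_rank(num_attention_heads: int, tp_size: int) -> int:
    """Attention heads (and their KV cache) owned by each tensor-parallel worker."""
    assert num_attention_heads % tp_size == 0, "head count must divide evenly across ranks"
    return num_attention_heads // tp_size

# e.g. a 64-head model with tensor parallelism of 8 stores KV cache for 8 heads per GPU
assert heads_per_rank(64, 8) == 8
```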
Pipeline Parallelism
Pipeline parallelism splits model layers across workers. Worker 0 might run layers 0-15, worker 1 runs layers 16-31, and so on. Activations pass between workers as micro-batches flow through the pipeline.
The batch queue mentioned earlier is crucial for pipeline parallelism. It allows overlapping execution of multiple micro-batches, hiding the pipeline bubble overhead. The scheduler continues producing batches while earlier batches flow through the pipeline.
KV Connector for Disaggregated Serving
vLLM's KVConnector system enables disaggregated serving where prefill and decode run on different instances. The connector handles transferring computed KV cache from prefill nodes to decode nodes.
This requires coordination between scheduler and connector: the scheduler tracks which tokens were computed locally versus remotely, the connector manages async KV transfers, and requests wait in WAITING_FOR_REMOTE_KVS status until their KV cache arrives.
Performance Optimizations
Beyond the architectural innovations, vLLM includes numerous low-level optimizations.
CUDA Graphs
For decode batches with stable shapes, vLLM captures CUDA graphs to eliminate kernel launch overhead. This is particularly impactful for small batch decode where kernel launches can dominate execution time.
The graph capture handles attention metadata updates carefully—some metadata must be refreshed each iteration while graph execution itself is static. vLLM's CUDAGraphWrapper manages this complexity.
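The general pattern, shown here with plain torch.cuda.CUDAGraph rather than vLLM's CUDAGraphWrapper, is to capture one decode step over static buffers and then replay it after refreshing those buffers in place:

```python
import torch

def capture_decode_graph(model_fn, static_inputs):
    """Capture one decode step into a CUDA graph and return a replay function.

    Generic torch.cuda.CUDAGraph usage: every tensor touched inside the graph must be a
    static buffer that is updated in place between replays.
    """
    # Warm up on a side stream so lazy initialization doesn't leak into the capture.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        static_output = model_fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model_fn(*static_inputs)

    def replay(new_inputs):
        for buf, new in zip(static_inputs, new_inputs):
            buf.copy_(new)          # refresh inputs in place (new token IDs, positions, ...)
        graph.replay()              # re-run all captured kernels with no launch overhead
        return static_output

    return replay
```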
Torch Compile Integration
vLLM supports torch.compile for fusing operations and generating optimized kernels. The compilation config controls which parts of the model are compiled and what optimization level to use.
Attention operations are typically not compiled (they use specialized kernels) but surrounding linear layers and activations benefit significantly. The forward context system ensures compiled code can access runtime state like KV cache tensors.
Memory Management
Beyond block-based allocation, vLLM carefully manages GPU memory throughout. Memory profiling at startup determines safe KV cache sizes. Pinned host memory accelerates CPU-GPU transfers. Pre-allocated buffers avoid allocation during inference.
The WorkSpace class manages temporary buffers needed during execution. Rather than allocating fresh tensors each forward pass, the model runner reuses workspace buffers, reducing memory fragmentation and allocation overhead.
Observability and Debugging
Understanding what's happening inside vLLM is essential for production operation.
Metrics and Stats
The scheduler exposes SchedulerStats including queue lengths, cache hit rates, and preemption counts. These feed into Prometheus metrics for monitoring. KV cache metrics track block utilization, residency times, and eviction patterns.
The perf_metrics system tracks GPU utilization metrics like MFU (Model FLOPS Utilization), helping identify whether workloads are compute or memory bound.
Tracing
vLLM integrates with tracing systems through the tracing module. Each request can carry trace context, and important operations are instrumented with spans. This enables end-to-end latency breakdown and identifying bottlenecks.
Debug Logging
The extensive logging throughout vLLM's codebase (controlled by log level) provides visibility into scheduling decisions, cache operations, and worker execution. The dump_engine_exception function captures detailed state on failures for post-mortem debugging.