vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
Introduction
vLLM has become the de facto standard for open-source LLM inference. Originally developed at UC Berkeley's Sky Computing Lab, it introduced PagedAttention—a breakthrough that treats GPU memory like virtual memory, cutting memory waste by up to 90%. Today, vLLM powers production deployments serving billions of tokens daily at companies from startups to enterprises.
This guide goes beyond the basics. We'll cover vLLM's architecture, production configuration, Kubernetes deployment, monitoring, and the hard-won lessons from running vLLM at scale.
Why vLLM?
Before diving in, let's understand why vLLM dominates:
| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Ease of Setup | ★★★★★ | ★★☆☆☆ | ★★★★☆ |
| Hardware Support | NVIDIA, AMD, Intel, TPU | NVIDIA only | NVIDIA, AMD |
| Model Support | 100+ architectures | 25-40 models | 60+ models |
| Throughput | Excellent | Maximum on NVIDIA | Excellent |
| Community | Largest, most active | NVIDIA-backed | Growing fast |
vLLM wins on flexibility and ease of use. If you're deep in the NVIDIA ecosystem and need every last percent of performance, TensorRT-LLM may be worth the complexity. For most teams, vLLM is the right choice.
Architecture Deep Dive
Understanding vLLM's internals helps you configure and debug it effectively.
PagedAttention: The Core Innovation
Standard LLM serving pre-allocates contiguous memory for the maximum possible sequence length. This wastes enormous GPU memory—most sequences don't reach maximum length.
PagedAttention solves this by treating KV cache like virtual memory:
Traditional Allocation:
┌─────────────────────────────────────────────────┐
│ Sequence 1 KV Cache (pre-allocated max length) │
│ [used][used][used][████ wasted space ████████] │
├─────────────────────────────────────────────────┤
│ Sequence 2 KV Cache (pre-allocated max length) │
│ [used][████████ wasted space ████████████████] │
└─────────────────────────────────────────────────┘
PagedAttention:
┌────────┬────────┬────────┬────────┬────────┐
│ Seq1-0 │ Seq2-0 │ Seq1-1 │ Seq1-2 │ Seq2-1 │ <- Blocks allocated on demand
└────────┴────────┴────────┴────────┴────────┘
↑ Non-contiguous, dynamically allocated blocks
Key benefits:
- Non-contiguous storage: KV cache blocks can be anywhere in GPU memory
- Dynamic allocation: Memory allocated only as sequences grow
- Memory sharing: Identical prompt prefixes share KV cache blocks
- Near-zero waste: Eliminates internal fragmentation
Result: 2-4x higher throughput on the same hardware compared to naive implementations.
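To make the bookkeeping concrete, here is a toy sketch (not vLLM's actual implementation) of how paged allocation works: physical blocks come from a shared free pool on demand, each sequence keeps a block table mapping its logical blocks to physical ones, and a forked sequence shares its parent's prefix blocks via reference counts.
# Toy illustration of paged KV-cache bookkeeping; not vLLM's real code.
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block IDs
        self.refcount = [0] * num_blocks     # sequences referencing each block
        self.tables = {}                     # seq_id -> list of physical block IDs

    def ensure_capacity(self, seq_id, num_tokens):
        """Grow a sequence's block table only when it crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceil division: blocks required
        while len(table) < needed:
            block = self.free.pop()            # any free block works: non-contiguous
            self.refcount[block] = 1
            table.append(block)

    def fork(self, parent_id, child_id):
        """Share a prefix: the child references the parent's existing blocks."""
        self.tables[child_id] = list(self.tables[parent_id])
        for block in self.tables[parent_id]:
            self.refcount[block] += 1

manager = BlockManager(num_blocks=1024)
manager.ensure_capacity("seq1", num_tokens=40)  # allocates 3 blocks, lazily
manager.fork("seq1", "seq2")                    # seq2 reuses seq1's prefix blocks
print(manager.tables)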
The Scheduler
vLLM's scheduler implements continuous batching with iteration-level scheduling:
- Prefill phase: New requests have their prompts processed in parallel
- Decode phase: Tokens generated one at a time, autoregressively
- Iteration-level scheduling: After each iteration, the scheduler can:
  - Add new requests to the batch
  - Remove completed sequences
  - Preempt low-priority requests if memory pressure is high
This keeps GPU utilization high even with variable-length outputs.
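A heavily simplified sketch of that loop (ignoring token budgets, priorities, chunked prefill, and preemption); the request fields and limits below are invented for illustration:
# Toy continuous-batching loop; vLLM's real scheduler is far more involved.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_len: int       # tokens to prefill
    max_new_tokens: int   # decode budget
    generated: int = 0
    prefilled: bool = False

waiting = deque([Request("a", 512, 4), Request("b", 128, 2), Request("c", 64, 3)])
running = []
MAX_RUNNING = 2  # stand-in for "enough free KV-cache blocks"

while waiting or running:
    # Iteration-level scheduling: admit new requests whenever there is room.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())

    # One model iteration: prefill newly admitted requests, decode one token for the rest.
    for req in running:
        if not req.prefilled:
            req.prefilled = True   # prompt processed in parallel
        else:
            req.generated += 1     # one autoregressive decode step

    # Retire finished sequences so their slots (and KV blocks) free up immediately.
    finished = [r for r in running if r.generated >= r.max_new_tokens]
    running = [r for r in running if r.generated < r.max_new_tokens]
    for r in finished:
        print(f"finished {r.rid} after {r.generated} tokens")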
Chunked Prefill
For long prompts (32K+ tokens), a single prefill can stall token generation for every other request in the batch. Chunked prefill splits the prompt into smaller chunks and interleaves them with decode steps:
Without Chunked Prefill:
[========= 50K token prefill blocks all decode =========][decode][decode]
With Chunked Prefill:
[chunk1][decode][chunk2][decode][chunk3][decode][chunk4][decode]
Enable with --enable-chunked-prefill. Essential for long-context models.
Prefix Caching
When multiple requests share the same prefix (system prompt, few-shot examples), vLLM can reuse the computed KV cache:
# These requests share the same system prompt
# vLLM computes KV cache for system prompt once, reuses for all
Request 1: [System Prompt] + "What is 2+2?"
Request 2: [System Prompt] + "Explain quantum computing"
Request 3: [System Prompt] + "Write a poem about AI"
↑
KV cache computed once, shared across requests
Enable with --enable-prefix-caching. Can dramatically improve throughput for chat applications.
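In practice this just means sending requests with an identical system prompt to a server started with that flag; a minimal client-side sketch using the OpenAI-compatible API (the URL, model name, and prompts below are placeholders for your deployment):
from openai import OpenAI

# Assumes a server started with: vllm serve <model> --enable-prefix-caching
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = "You are a concise assistant for an internal engineering wiki."  # shared prefix

for question in ["What is 2+2?", "Explain quantum computing", "Write a poem about AI"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use your deployed model
        messages=[
            {"role": "system", "content": SYSTEM},  # identical prefix -> KV cache reused
            {"role": "user", "content": question},
        ],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)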
Configuration Guide
Configuration is where most teams either unlock vLLM's full potential or leave performance on the table. The defaults are sensible for quick experiments, but production deployments require tuning based on your specific hardware, model, and workload characteristics.
The core trade-off is memory vs. throughput: allocating more GPU memory to KV cache enables larger batches and higher throughput, but leaves less margin for spikes and increases OOM risk. The right balance depends on your latency requirements and traffic patterns.
Essential Parameters
These parameters control memory allocation, parallelism, and optimization features. Start with these values, then tune based on your monitoring data.
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",

    # Memory configuration
    gpu_memory_utilization=0.90,   # Use 90% of GPU memory
    max_model_len=8192,            # Maximum sequence length

    # Parallelism
    tensor_parallel_size=2,        # Split across 2 GPUs

    # Performance optimizations
    enable_prefix_caching=True,    # Reuse KV cache for shared prefixes
    enable_chunked_prefill=True,   # Don't block decode during long prefills

    # Quantization (optional)
    quantization="awq",            # AWQ INT4; requires an AWQ-quantized checkpoint
)
Understanding each parameter:
- gpu_memory_utilization=0.90: Pre-allocate 90% of GPU memory for KV cache and model weights. The remaining 10% is headroom for PyTorch/CUDA overhead. Increase to 0.95 for maximum throughput; decrease to 0.80 if you see occasional OOM errors. Monitor vllm:gpu_cache_usage_perc to tune.
- max_model_len=8192: Maximum tokens (prompt + response) for any request. Directly impacts KV cache size—higher values mean fewer concurrent requests. Set to your actual maximum need, not the model's theoretical limit. A 70B model at max_model_len=32K uses ~4x the KV cache memory as max_model_len=8K.
- tensor_parallel_size=2: Split model weights and KV cache across 2 GPUs. Required when the model doesn't fit on one GPU; also useful to increase KV cache capacity. Requires NVLink for best performance. Use pipeline parallelism (pipeline_parallel_size) for cross-node or non-NVLink setups.
- enable_prefix_caching=True: Reuse computed KV cache for identical prompt prefixes. Essential for chat applications where system prompts repeat. Can reduce compute by 50%+ for long system prompts.
- enable_chunked_prefill=True: Process long prompts in chunks rather than all at once. Prevents long prompts from blocking token generation for other requests. Essential for models with 32K+ context.
- quantization="awq": 4-bit quantization reduces memory 4x with ~2-3% quality loss. Enables larger models or more concurrent requests. FP8 is better quality but requires H100/Ada GPUs.
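The configuration block above imports SamplingParams but never uses it; continuing from that llm object, generation looks like this (the prompt is illustrative):
# Continuing from the llm object configured above
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(
    ["Summarize the benefits of PagedAttention in two sentences."],  # illustrative prompt
    sampling,
)
for output in outputs:
    print(output.outputs[0].text)  # first completion for this prompt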
Parameter Deep Dive
gpu_memory_utilization
Controls the fraction of GPU memory vLLM may use; whatever remains after model weights and activation workspace is pre-allocated as KV cache.
| Value | Use Case |
|---|---|
| 0.80 | Conservative, leaves headroom for spikes |
| 0.90 | Production default, good balance |
| 0.95 | Aggressive, maximum throughput |
| 0.98+ | Not recommended, OOM risk |
If you encounter frequent preemptions, increase this value. If you hit OOM errors, decrease it.
max_model_len
Maximum sequence length (prompt + generation). Directly impacts memory usage.
Memory impact formula (approximate):
KV cache bytes ≈ 2 × n_layers × n_kv_heads × d_head × s × b × bytes_per_element
where n_layers = layers, n_kv_heads = KV heads, d_head = head dim, s = max sequence length, b = batch size, and the leading 2 accounts for storing both keys and values.
For a 70B model with 80 layers, 8 KV heads, head_dim 128, at FP16 (2 bytes per element):
- max_model_len=4096: ~1.3GB per sequence
- max_model_len=8192: ~2.7GB per sequence
- max_model_len=32768: ~10.7GB per sequence
Set this to the maximum you actually need, not the model's maximum capability.
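A quick sanity check of the numbers above, just evaluating the formula with FP16 (2 bytes per element):
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # factor of 2: keys and values are both cached
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

for seq_len in (4096, 8192, 32768):
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"max_model_len={seq_len}: ~{gb:.1f} GB per sequence")
# max_model_len=4096: ~1.3 GB per sequence
# max_model_len=8192: ~2.7 GB per sequence
# max_model_len=32768: ~10.7 GB per sequence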
tensor_parallel_size
Splits model across multiple GPUs. Each GPU holds a shard of the weights and KV cache.
# Single GPU (if model fits)
tensor_parallel_size=1
# Model too large for one GPU, or need more KV cache space
tensor_parallel_size=2 # Requires 2 GPUs
tensor_parallel_size=4 # Requires 4 GPUs
tensor_parallel_size=8 # Requires 8 GPUs
When to use pipeline parallelism instead:
- GPUs without NVLink (e.g., L40S, consumer GPUs)
- Cross-node inference
# Pipeline parallelism for non-NVLink GPUs
pipeline_parallel_size=2
Batching Parameters
# Maximum sequences in a batch
max_num_seqs=256 # Default, good for most cases
# Maximum tokens per iteration (prefill + decode)
max_num_batched_tokens=8192 # Increase for higher throughput on large GPUs
For throughput optimization on large GPUs (A100, H100):
max_num_batched_tokens=16384 # or higher
Quantization Options
vLLM supports multiple quantization methods:
| Method | Bits | Memory Reduction | Quality Impact | Best For |
|---|---|---|---|---|
| None (FP16/BF16) | 16 | Baseline | None | Quality-critical |
| FP8 | 8 | 2x | Minimal | H100/Ada GPUs |
| AWQ | 4 | 4x | Small (~2-3%) | Memory-constrained |
| GPTQ | 4 | 4x | Small (~2-5%) | Pre-quantized models |
| INT8 | 8 | 2x | Minimal | General production |
# FP8 on H100 (recommended for H100s)
llm = LLM(model="...", quantization="fp8")
# AWQ for memory savings
llm = LLM(model="...", quantization="awq")
# Load pre-quantized GPTQ model
llm = LLM(model="TheBloke/Llama-2-70B-GPTQ")
Speculative Decoding
Speculative decoding trades compute for latency. Instead of generating one token at a time with the slow large model, a fast draft model generates several candidate tokens, then the large model verifies them in parallel. Since verification is parallelizable (unlike autoregressive generation), you get multiple tokens from one large-model forward pass.
The technique works because similar models have similar token distributions—the draft model often guesses correctly, and when it doesn't, the large model corrects it with minimal wasted work.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,  # Number of tokens to speculate
)
Expected speedup: 1.5-2.5x for latency-sensitive applications. Works best when:
- Draft model is fast (1-3B parameters)
- Draft and target models have similar token distributions
- Tasks have predictable outputs (code, structured data)
Production Deployment
Docker Deployment
The simplest production deployment:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
For multi-GPU:
docker run --runtime nvidia --gpus '"device=0,1"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90
Kubernetes Deployment
Kubernetes deployment requires careful resource planning. vLLM needs exclusive access to GPUs, persistent storage for model weights (to avoid re-downloading on restarts), and proper health checks. The key insight: model loading is slow (minutes for large models), so optimize for pod stability over rapid scaling.
This deployment uses a PersistentVolumeClaim for model weights—new pods mount the same cached weights instead of downloading. This is crucial for scale-up speed.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-model-len"
        - "8192"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
Key deployment decisions:
- nodeSelector: Ensures pods land on GPU nodes. Use more specific selectors (e.g., nvidia.com/gpu.product: "A100-SXM4-80GB") for heterogeneous clusters.
- PersistentVolumeClaim: Shared model storage is critical. Without it, each pod downloads the full model on startup, which takes minutes for 70B models. Use ReadWriteMany storage (NFS, EFS, GCS FUSE) for multi-replica deployments.
- replicas: 1: Start with one replica and use an HPA to scale. LLM serving is GPU-bound, so scaling horizontally requires proportionally more GPUs.
- Shared memory: Docker's --ipc=host flag has no direct equivalent in this manifest. If you hit shared-memory errors (common with tensor parallelism), mount an emptyDir volume with medium: Memory at /dev/shm to increase the available SHM.
Horizontal Pod Autoscaling
Scale based on request queue length:
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"
For HPA to work, you need:
- Prometheus scraping vLLM metrics
- Prometheus Adapter exposing metrics to Kubernetes
- ReadWriteMany storage for model weights (new pods share the same PVC)
Using ReadWriteMany is crucial—new pods don't need to re-download model weights, reducing scale-up time from minutes to seconds.
KEDA Autoscaling
For more sophisticated autoscaling:
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"
      query: sum(vllm_num_requests_waiting)
Monitoring and Observability
Prometheus Metrics
vLLM exposes metrics at /metrics by default. Key metrics to monitor:
Throughput Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:prompt_tokens_total | Prefill tokens processed | Track trend |
| vllm:generation_tokens_total | Generated tokens | Track trend |
| vllm:request_success_total | Successful requests | Track trend |
Latency Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:time_to_first_token_seconds | Time to first token (TTFT) | p95 > 2s |
| vllm:time_per_output_token_seconds | Inter-token latency | p95 > 100ms |
| vllm:e2e_request_latency_seconds | End-to-end latency | p95 > 10s |
Resource Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| vllm:num_requests_running | Currently processing | Capacity planning |
| vllm:num_requests_waiting | Queue depth | > 50 (scale up) |
| vllm:gpu_cache_usage_perc | KV cache utilization | > 95% |
| vllm:cpu_prefix_cache_hit_rate | Prefix cache effectiveness | Track trend |
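To spot-check these metrics without a full Prometheus stack, you can scrape the endpoint directly; a minimal sketch assuming a server on localhost:8000 and the requests and prometheus-client packages:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumes a vLLM server on localhost:8000; adjust the URL for your deployment.
METRICS_URL = "http://localhost:8000/metrics"
WATCH = {
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
}

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCH:
            print(f"{sample.name}{sample.labels} = {sample.value}")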
Grafana Dashboard
Essential panels for your vLLM dashboard:
- Request Rate: rate(vllm:request_success_total[5m])
- Queue Depth: vllm:num_requests_waiting
- TTFT p95: histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))
- Throughput: rate(vllm:generation_tokens_total[5m])
- GPU Cache Usage: vllm:gpu_cache_usage_perc
- Error Rate: rate(vllm:request_failure_total[5m]) / rate(vllm:request_success_total[5m])
Alerting Rules
# prometheus-rules.yaml
groups:
- name: vllm-alerts
  rules:
  - alert: HighQueueDepth
    expr: vllm:num_requests_waiting > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue is building up"
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "vLLM p95 latency exceeds 10s"
  - alert: HighCacheUsage
    expr: vllm:gpu_cache_usage_perc > 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM GPU cache near capacity"
Troubleshooting
Out of Memory (OOM) Errors
Symptoms: CUDA OOM errors, process crashes
Solutions:
- Reduce gpu_memory_utilization (try 0.85 or 0.80)
- Reduce max_model_len
- Use quantization (AWQ, GPTQ, FP8)
- Increase tensor_parallel_size (add more GPUs)
- Enable chunked prefill for long sequences
# OOM-safe configuration
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    quantization="awq",          # requires an AWQ-quantized checkpoint
    tensor_parallel_size=2,
    enable_chunked_prefill=True,
)
High Latency
Symptoms: TTFT > 1s, slow token generation
Diagnosis:
- Check queue depth (vllm:num_requests_waiting)
- Check GPU cache usage (vllm:gpu_cache_usage_perc)
- Check if prefill is blocking decode (enable chunked prefill)
Solutions:
- Scale horizontally (add more replicas)
- Enable prefix caching for repeated prefixes
- Reduce max_model_len if sequences are shorter
- Use speculative decoding for latency-sensitive workloads
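To confirm from the client side whether TTFT or inter-token latency is the problem, stream a completion and time the chunks; a rough sketch against the OpenAI-compatible endpoint (URL and model name are placeholders, and chunks only approximate tokens):
import time
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server; adjust URL and model for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use your deployed model
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1                      # each content chunk is roughly one token
        if first_token_at is None:
            first_token_at = time.perf_counter()

total = time.perf_counter() - start
ttft = (first_token_at or start) - start
itl_ms = (total - ttft) / max(chunks - 1, 1) * 1000
print(f"TTFT: {ttft:.3f}s  total: {total:.3f}s  approx ITL: {itl_ms:.1f}ms over {chunks} chunks")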
Low Throughput
Symptoms: GPU utilization < 80%, tokens/s below expected
Diagnosis:
- Check batch size (vllm:num_requests_running)
- Check for preemptions in logs
- Verify continuous batching is working
Solutions:
- Increase max_num_batched_tokens
- Increase gpu_memory_utilization
- Ensure enough concurrent requests to fill batches
- Check for network bottlenecks in distributed setups
Preemption Issues
Symptoms: Requests being preempted, retried
Solutions:
- Increase gpu_memory_utilization
- Reduce max_model_len
- Use quantization to free up KV cache space
- Scale horizontally to reduce per-instance load
Performance Benchmarking
Benchmarking Setup
Use the official vLLM benchmark scripts:
# Throughput benchmark
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--num-prompts 1000
# Latency benchmark
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--batch-size 1
Expected Performance
Rough benchmarks on common hardware (Llama 3.1 8B, 512 input / 128 output):
| GPU | Throughput (tok/s) | TTFT (ms) | ITL (ms) |
|---|---|---|---|
| A100 80GB | 4,000-5,000 | 50-100 | 15-25 |
| H100 80GB | 6,000-8,000 | 30-60 | 10-15 |
| L40S 48GB | 2,500-3,500 | 80-150 | 25-40 |
| A10G 24GB | 1,500-2,000 | 100-200 | 30-50 |
Performance varies significantly with:
- Quantization method
- Batch size and concurrency
- Sequence lengths
- Model architecture
Always benchmark on your specific workload.
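For a quick concurrency sweep without the official scripts, something like the following works against the OpenAI-compatible endpoint (URL, model name, and prompts are placeholders; the repository benchmarks above remain the more rigorous option):
import asyncio
import time
from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server; adjust URL and model for your deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

async def one_request(prompt):
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def run(concurrency, total_requests):
    sem = asyncio.Semaphore(concurrency)

    async def limited(i):
        async with sem:
            return await one_request(f"Write a haiku about request {i}.")

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(limited(i) for i in range(total_requests))))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency}: {total_requests / elapsed:.1f} req/s, "
          f"{tokens / elapsed:.1f} generated tok/s")

async def main():
    for concurrency in (1, 8, 32, 64):
        await run(concurrency, total_requests=128)

asyncio.run(main())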
vLLM vs Alternatives
vLLM vs SGLang
SGLang excels at structured generation and prefix-heavy workloads:
| Aspect | vLLM | SGLang |
|---|---|---|
| Raw throughput | Excellent | Excellent |
| Prefix reuse | Good (prefix caching) | Best (RadixAttention) |
| Structured output | Basic | Excellent (grammar support) |
| Community size | Largest | Growing fast |
| Production maturity | Battle-tested | Newer |
Choose SGLang if: Heavy agent/tool use, lots of prefix reuse, need constrained decoding.
Choose vLLM if: General-purpose serving, maximum compatibility, largest community.
vLLM vs TensorRT-LLM
TensorRT-LLM offers maximum performance on NVIDIA hardware:
| Aspect | vLLM | TensorRT-LLM |
|---|---|---|
| Setup complexity | Low | High |
| Performance | Excellent | Maximum on NVIDIA |
| Hardware support | Broad | NVIDIA only |
| Model support | 100+ | 25-40 |
| Update frequency | Weekly | Monthly |
Choose TensorRT-LLM if: All-NVIDIA stack, need absolute maximum performance, have engineering resources.
Choose vLLM if: Need flexibility, broad model support, faster iteration.
Best Practices Summary
- Start with sensible defaults: gpu_memory_utilization=0.90, enable prefix caching
- Right-size max_model_len: Set to the actual maximum needed, not the model maximum
- Enable chunked prefill: Essential for long-context workloads
- Monitor queue depth: Scale horizontally when consistently > 10-20
- Use ReadWriteMany storage: Critical for fast horizontal scaling
- Benchmark your workload: Don't rely on generic benchmarks
- Set up alerting: Queue depth, latency, cache usage
- Consider quantization: AWQ or FP8 for significant memory savings with minimal quality loss
Conclusion
vLLM has earned its position as the leading open-source LLM serving framework through a combination of performance, flexibility, and ease of use. PagedAttention, continuous batching, and the constant stream of optimizations make it the right choice for most production deployments.
Start with Docker for development, graduate to Kubernetes for production. Monitor religiously, and scale horizontally when needed. The techniques in this guide will help you serve millions of requests reliably.