Hardware Optimization for LLMs: CUDA Kernels, TPU vs GPU, and Accelerator Architecture
A comprehensive guide to hardware optimization for large language models covering GPU architecture, CUDA kernel optimization, TPU comparisons, memory hierarchies, and practical strategies for maximizing throughput on modern AI accelerators.
The performance of large language models is fundamentally constrained by hardware. A 70B parameter model requires 140GB just to store weights in FP16—more than any single consumer GPU can hold. Training requires thousands of GPUs coordinated across data centers. Inference serves billions of queries daily, with cost directly tied to hardware efficiency. Understanding accelerator architecture and optimization techniques is essential for anyone deploying LLMs at scale.
The Hardware Landscape
Before diving into optimization, we need to understand the accelerator options available and their fundamental tradeoffs.
GPUs: The Default Choice
NVIDIA GPUs dominate LLM training and inference for good reason. The CUDA ecosystem, developed over 15 years, provides mature libraries, extensive tooling, and a vast community. When something goes wrong, you can find help.
The current flagship is the H100 (Hopper architecture), with the H200 (high-bandwidth memory variant) and B100/B200 (Blackwell architecture) representing the cutting edge:
| GPU | FP16/BF16 TFLOPS | Sparse FP4 PFLOPS | Memory | Bandwidth | Power |
|---|---|---|---|---|---|
| A100 80GB | 312 | — | 80GB HBM2e | 2.0 TB/s | 400W |
| H100 80GB | 989 | — | 80GB HBM3 | 3.35 TB/s | 700W |
| H200 | 989 | — | 141GB HBM3e | 4.8 TB/s | 700W |
| B100 | 1,750 | 14 | 192GB HBM3e | 8.0 TB/s | 700W |
| B200 | 2,250 | 18 | 192GB HBM3e | 8.0 TB/s | 1000W |
B200 Performance: The B200 delivers up to 20 petaFLOPS of sparse FP4 AI compute per card. Built on TSMC's 4NP process, it packs 208 billion transistors across a dual-die design. Key specs:
- Memory: 192GB HBM3e (180GB usable in cloud)—2.4× H100 capacity
- Bandwidth: 8 TB/s memory bandwidth—2× Hopper
- Interconnect: NVLink 5 at 1.8 TB/s bidirectional
- Tensor Cores: 6th-gen with FP4, FP6, FP8, BF16, TF32 support
DGX B200: 8× B200 GPUs delivering 3× training and 15× inference performance over DGX H100.
GB200 NVL72 (Rack-Scale System): The flagship configuration connects 36 Grace CPUs and 72 Blackwell GPUs in a liquid-cooled, rack-scale design:
- 72-GPU NVLink domain acts as single massive GPU
- 130 TB/s of low-latency GPU-to-GPU communication
- 30× faster real-time trillion-parameter LLM inference vs. H100
- 10× greater performance for MoE architectures
- Up to 25× reduction in cost and energy consumption
- Each Grace Blackwell Superchip: 10 PFLOPS FP8, 372GB HBM3e
Real-World Benchmarks (December 2025):
- Training: Up to 57% faster than H100
- Inference (Gemma 27B): ~10% speedup observed
- Inference (DeepSeek 671B): On par with H100 (early software ecosystem)
- Self-hosted B200 can be up to 10× cheaper than cloud H100
Note the trend: memory bandwidth is growing faster than raw compute. This reflects the recognition that LLM workloads are memory-bound—the bottleneck is moving data, not computing on it.
TPUs: Google's Alternative
Google's Tensor Processing Units (TPUs) take a different approach: specialized hardware designed specifically for tensor operations, available exclusively through Google Cloud.
TPUs use a systolic array architecture—a grid of processing elements that data flows through rhythmically. This is more efficient than GPUs' general-purpose SMs for regular tensor operations but less flexible for arbitrary computation.
Current TPU generations:
| TPU | Peak FP8 TFLOPS | Memory | Bandwidth | ICI Bandwidth | Per-Chip Power |
|---|---|---|---|---|---|
| v4 | 275 | 32GB HBM2e | 1.2 TB/s | 600 GB/s | ~200W |
| v5e | 197 | 16GB HBM2 | 820 GB/s | 1.6 TB/s | ~150W |
| v5p | 459 | 95GB HBM2e | 2.8 TB/s | 4.8 TB/s | ~250W |
| v6 (Trillium) | 918 | 32GB HBM2e | 1.6 TB/s | 3.2 TB/s | ~200W |
| v7 (Ironwood) | 4,614 | 192GB HBM3e | 7.37 TB/s | 1.2 TB/s | ~400W |
TPU v6 (Trillium): Achieves 4.7× peak compute improvement over v5e, with 2× HBM capacity/bandwidth and 2× ICI bandwidth. Over 67% more energy-efficient than v5e. Scales to 256 TPUs per pod.
TPU v7 (Ironwood): Google's latest generation, now generally available, specifically designed for inference at scale. Architecture details:
- Each chip contains two TensorCores and four SparseCores across two chiplets
- Single primary compute die (~700mm²) on TSMC N3P with CoWoS
- ~1kW power consumption (liquid-cooled)
- Inter-Chip Interconnect (ICI) network at 9.6 Tb/s
Key specs vs. Trillium (v6e):
- 5× compute performance
- 6× HBM capacity (192GB vs 32GB)
- 4.5× HBM bandwidth (7.4 TB/s)
- 2× performance per watt
- 10× peak performance over TPU v5p
Scale and economics:
- Scales to 9,216 chips per superpod (1.77 PB shared HBM)
- 44% lower TCO than GB200 NVL72 per Google's analysis
- Anthropic committed to deploying over 1 million Ironwood chips beginning 2026
- Nearly closes the gap to the B200 on FLOPs, memory, and bandwidth, albeit reaching general availability about a year later
TPUs excel in specific scenarios:
Strengths:
- Superior performance-per-watt (29× better than CPU, competitive with GPU)
- Native bfloat16 support (Google invented the format for TPUs)
- Tight integration with JAX/XLA for automatic optimization
- Aggressive pricing for certain workloads
Weaknesses:
- Requires XLA-compatible code (no arbitrary CUDA kernels)
- Dynamic shapes and control flow are problematic
- Exclusive to Google Cloud (vendor lock-in)
- Smaller ecosystem and community
The Economics
Hardware choice has significant cost implications:
Training costs (estimated for a hypothetical 70B model):
- H100 cluster: ~$2M compute for training
- TPU v5p pod: ~$1.5M compute for equivalent training
- Savings require XLA-compatible architecture
Inference costs (per million tokens, 70B model):
- H100: $0.50-1.00
- A100: $0.80-1.50
- TPU v5e: $0.30-0.60
- H200: $0.40-0.80
TPUs can be 2-4× cheaper for inference when models fit their constraints. GPUs offer more flexibility but at higher cost.
The Hybrid Strategy
Many organizations use both:
Training: H100 clusters for flexibility in model development, rapid iteration, and debugging.
Inference: TPU v5e/v6 for production serving where models are stable and cost optimization matters.
This "follow Meta's model" approach balances research agility with production efficiency, achieving 40-50% total compute savings while maintaining development velocity.
GPU Architecture Deep Dive
To optimize for GPUs, we need to understand their architecture. Modern NVIDIA GPUs consist of:
Streaming Multiprocessors (SMs)
The GPU's fundamental compute unit. An H100 has 132 SMs, each containing:
- 128 FP32 CUDA cores
- 64 FP64 CUDA cores (double precision)
- 4 Tensor Cores (matrix acceleration)
- 256KB register file
- 256KB L1 cache / shared memory (configurable)
- Warp schedulers
SMs execute warps—groups of 32 threads that execute in lockstep (SIMT: Single Instruction, Multiple Thread). All threads in a warp execute the same instruction simultaneously. Divergent branches cause serialization and performance loss.
Tensor Cores
Tensor Cores are specialized matrix multiplication units that accelerate the operations dominating transformer workloads. They operate on small matrix tiles:
- H100: 16×16 matrices in various precisions
- Mixed precision: FP16/BF16 inputs with FP32 accumulation
- Sparse support: 2× throughput for structured sparsity
To use Tensor Cores effectively:
- Matrix dimensions must be multiples of 8 (FP16) or 16 (INT8)
- Data must be properly aligned in memory
- Libraries like cuBLAS and cuDNN handle this automatically
Tensor Core utilization is the key metric for LLM performance. Achieving 70-80% utilization on H100 is considered excellent; many naive implementations achieve only 30-40%.
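As a concrete illustration, here is a minimal PyTorch sketch of keeping a GEMM Tensor Core-friendly: zero-pad the contracted dimension up to a multiple of 8 and run the matmul under autocast. The `pad_to_multiple` helper and the shapes are illustrative, not a library API; padding the shared dimension with zeros leaves the result unchanged.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(t: torch.Tensor, dim: int, multiple: int = 8) -> torch.Tensor:
    """Zero-pad one dimension up to the next multiple of `multiple` (illustrative helper)."""
    remainder = t.shape[dim] % multiple
    if remainder == 0:
        return t
    pad = multiple - remainder
    # F.pad expects (left, right) pairs starting from the last dimension.
    pads = [0, 0] * (t.ndim - dim - 1) + [0, pad]
    return F.pad(t, pads)

x = torch.randn(1024, 4094, device="cuda")   # 4094 is not a multiple of 8
w = torch.randn(4094, 4096, device="cuda")

x_p = pad_to_multiple(x, dim=1)              # 4094 -> 4096; zeros add nothing to the dot products
w_p = pad_to_multiple(w, dim=0)

with torch.autocast("cuda", dtype=torch.bfloat16):   # BF16 inputs, FP32 accumulation
    y = x_p @ w_p                                    # now eligible for a Tensor Core GEMM
```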
Memory Hierarchy
GPU memory is hierarchical, with dramatic differences in bandwidth and latency:
| Level | Bandwidth | Latency | Size |
|---|---|---|---|
| Registers | ~20 TB/s | ~1 cycle | 256KB/SM |
| L1 / Shared Memory | ~20 TB/s | ~30 cycles | 256KB/SM |
| L2 Cache | ~5 TB/s | ~100 cycles | 50MB |
| HBM (Global) | 3.35 TB/s | ~500 cycles | 80GB |
The orders-of-magnitude gap in latency, and the substantial gap in bandwidth, between registers and HBM dominate performance considerations. Optimized kernels maximize data reuse at the upper levels of this hierarchy.
Memory Bandwidth: The Real Bottleneck
For a 70B model in FP16:
- Weight size: 140GB
- Single forward pass: Load 140GB of weights
- H100 bandwidth: 3.35 TB/s
- Theoretical minimum time: 140GB / 3.35 TB/s = 42ms
This is just for weight loading—before any computation. The actual compute (matrix multiplications) could complete in under 5ms if data were already in registers. We're spending 8× longer moving data than computing.
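The same arithmetic in a few lines of Python, plus the machine-balance figure it implies. The 2-FLOPs-per-parameter-per-token count is the usual rough estimate for a dense forward pass, and the variable names are just for this sketch.

```python
# Roofline arithmetic for one decode step of a 70B FP16 model on an H100.
params = 70e9
bytes_per_param = 2            # FP16 weights
hbm_bandwidth = 3.35e12        # bytes/s
peak_flops = 989e12            # FP16 Tensor Core FLOP/s

weight_bytes = params * bytes_per_param
memory_floor = weight_bytes / hbm_bandwidth          # ≈ 0.042 s just to stream the weights
print(f"weight streaming floor: {memory_floor * 1e3:.0f} ms per step")

# Machine balance: FLOPs the GPU can perform per byte it moves from HBM.
balance = peak_flops / hbm_bandwidth                 # ≈ 295 FLOPs/byte
# A dense forward pass does ~2 FLOPs per parameter per token, so roughly `balance`
# tokens must share each weight load before compute becomes the bottleneck.
tokens_to_saturate = balance * bytes_per_param / 2
print(f"batch needed to leave the memory-bound regime: ~{tokens_to_saturate:.0f} tokens")
```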
This is why LLM inference is memory-bound, not compute-bound. Optimization focuses on:
- Reducing memory movement (quantization, caching)
- Increasing arithmetic intensity (batching)
- Hiding memory latency (pipelining)
CUDA Kernel Optimization
CUDA kernels are the functions that execute on GPUs. For LLM workloads, key kernels include:
- Matrix multiplication (GEMM): Attention projections, feedforward layers
- Softmax: Attention score normalization
- Layer normalization: Per-layer feature normalization
- Activation functions: GELU, SiLU
- Attention: Combined QKV projection, attention, output
Kernel Fusion
Kernel fusion combines multiple operations into a single kernel, eliminating intermediate memory writes:
Unfused (naive):
y = matmul(x, W1) # Write to HBM
y = gelu(y) # Read from HBM, write back
y = matmul(y, W2) # Read from HBM
Fused:
y = fused_mlp(x, W1, W2) # One kernel, no intermediate HBM
The fused version avoids two HBM round-trips. For a 4096-dim hidden state with 8192-dim intermediate, this saves:
- 2 × 4096 × batch_size × 2 bytes per token
- At batch_size=1024: 16MB saved per layer
Across 80 layers, this is 1.3GB of avoided memory traffic per forward pass. FlashAttention achieves its speedups largely through fusion.
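A hedged sketch of the same idea without hand-writing CUDA, assuming PyTorch 2.x: torch.compile traces the function and lets the Inductor backend fuse the elementwise GELU with neighboring operations where it can. Whether the fusion actually happens depends on the backend, shapes, and compile mode; the shapes below are illustrative.

```python
import torch

def mlp(x, w1, w2):
    # Eager version: the intermediate activation makes extra HBM round-trips.
    return torch.nn.functional.gelu(x @ w1) @ w2

# torch.compile captures the graph and emits fused kernels where the backend supports it.
fused_mlp = torch.compile(mlp)

x = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
w1 = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
w2 = torch.randn(8192, 4096, device="cuda", dtype=torch.float16)

y = fused_mlp(x, w1, w2)   # first call compiles; subsequent calls reuse the fused kernels
```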
Persistent Kernels
Traditional CUDA launches a separate kernel for each operation. Kernel launch overhead (~5-10μs) accumulates across hundreds of operations per forward pass.
Persistent kernels execute the entire forward pass in a single kernel launch:
- Load model weights into shared memory (partitioned across SMs)
- Process tokens without returning to CPU
- Synchronize between layers using global memory barriers
- Output final logits
This approach achieves up to 6.7× reduction in kernel launch latency and GPU idle time. The tradeoff is implementation complexity and reduced flexibility.
Memory Access Patterns
GPUs achieve peak bandwidth only with coalesced memory access—consecutive threads accessing consecutive memory locations:
Coalesced (good): Thread 0 reads address 0, thread 1 reads address 1, ...
Strided (bad): Thread 0 reads address 0, thread 1 reads address 128, ...
Strided access wastes bandwidth because memory is transferred in chunks (cache lines). Requesting scattered addresses within a chunk loads the entire chunk but uses only part of it.
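The effect can be probed from PyTorch without writing a kernel: copying out of a transposed (strided) view forces non-coalesced reads, while a contiguous copy streams close to peak bandwidth. The `effective_bandwidth_gb_s` helper is mine, and the gap you observe depends on which copy kernel PyTorch dispatches, so treat this as a rough probe.

```python
import torch

def effective_bandwidth_gb_s(src: torch.Tensor) -> float:
    """Time a device-to-device copy and report effective bandwidth (read + write)."""
    dst = torch.empty(src.shape, device=src.device, dtype=src.dtype)  # contiguous destination
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end)
    bytes_moved = 2 * src.numel() * src.element_size()   # one read plus one write
    return bytes_moved / (ms * 1e6)

x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
print("contiguous copy:", effective_bandwidth_gb_s(x), "GB/s")       # coalesced reads
print("transposed copy:", effective_bandwidth_gb_s(x.t()), "GB/s")   # strided reads
```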
For attention operations, the key/value cache is particularly problematic. Autoregressive generation accesses K/V vectors for all previous tokens—potentially millions of scattered accesses. Optimized implementations use:
- Paged attention (vLLM): Organize K/V cache in contiguous blocks
- Chunked access: Process K/V in cache-friendly chunks
- Memory layout optimization: Store K/V in formats that enable coalesced access
Quantization-Aware Kernels
Quantized models (INT8, INT4) require specialized kernels:
Standard GEMM: FP16 × FP16 → FP16
Quantized GEMM: INT8 × INT8 → INT32 → FP16
The quantized version:
- Loads compressed weights (2-4× smaller)
- Dequantizes on-the-fly
- Computes with integer arithmetic
- Converts output to FP16
With INT4 quantization, each byte packs two weights, cutting the bandwidth needed to stream them. As a result, INT4 achieves a 2-4× speedup on memory-bound workloads despite the dequantization overhead.
Custom kernels for grouped quantization (different scales per group of weights) further optimize quality-efficiency tradeoffs.
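A toy version of the dequantize-on-the-fly path in plain PyTorch. A real INT8 kernel keeps the arithmetic in integers with INT32 accumulation and fuses dequantization into the GEMM; this sketch only captures the memory side (INT8 storage with one scale per output row), and the helper names are mine.

```python
import torch

def quantize_rowwise_int8(w: torch.Tensor):
    """Symmetric INT8 quantization with one scale per output row (toy example)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q_w: torch.Tensor, scale: torch.Tensor):
    """Load INT8 weights, dequantize on the fly, and compute in the activation dtype."""
    w = q_w.to(x.dtype) * scale.to(x.dtype)   # a real kernel fuses this into the GEMM
    return x @ w.t()

w = torch.randn(4096, 4096)        # [out_features, in_features]
x = torch.randn(8, 4096)
q_w, scale = quantize_rowwise_int8(w)
err = (int8_linear(x, q_w, scale) - x @ w.t()).abs().max().item()
print(f"stored: {q_w.numel()} bytes (INT8) vs {w.numel() * 4} bytes (FP32), max error {err:.3f}")
```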
FlashAttention: A Case Study in Optimization
FlashAttention exemplifies modern CUDA optimization, achieving 2-4× speedups through careful memory management.
The Problem
Standard attention computes:
$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
For sequence length $n$, this creates an $n \times n$ attention matrix. At 128K tokens, this matrix has roughly 16 billion elements—too large to fit in GPU memory.
Naive implementations materialize this full matrix, causing:
- Massive memory allocation ($O(n^2)$ space)
- Repeated HBM round-trips
- Memory bandwidth bottleneck
The Solution
FlashAttention computes attention in tiles, never materializing the full attention matrix:
- Divide Q, K, V into blocks that fit in shared memory
- For each Q block:
a. Load Q block to shared memory
b. For each K, V block:
- Load K, V to shared memory
- Compute partial attention (QK^T for this tile)
- Update running softmax statistics
- Accumulate output contribution
c. Write final output to HBM
The key insight is maintaining running softmax statistics (max value and sum of exponentials) that allow computing the correct softmax incrementally without seeing all attention scores simultaneously.
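The running-statistics trick is small enough to show directly. Below is a single-query, CPU-only sketch of the online softmax recurrence FlashAttention builds on; the real kernel applies the same update per tile in shared memory and registers, fused with the Tensor Core matmuls. The function name and block size are mine.

```python
import torch

def online_softmax_attention(q, k, v, block=128):
    """Streaming attention for one query: process K/V in blocks while keeping a
    running max and running sum of exponentials, so the full score vector is
    never materialized."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of scores
    l = torch.tensor(0.0)             # running sum of exp(score - m)
    acc = torch.zeros(d)              # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (kb @ q) / d ** 0.5                      # scores for this block
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)            # rescale old statistics
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q = torch.randn(64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
ref = torch.softmax((k @ q) / 64 ** 0.5, dim=0) @ v   # materializes all scores
print(torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-5))
```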
Performance Impact
FlashAttention-3 on H100 achieves:
- 85% of theoretical peak FLOPS utilization
- 1.5-2× speedup over FlashAttention-2
- Linear memory complexity ($O(n)$ instead of $O(n^2)$)
- Native support for FP8 computation
The optimization goes beyond algorithmic improvement—FlashAttention includes hand-tuned assembly for memory access patterns, warp scheduling, and Tensor Core utilization specific to each GPU architecture.
Lessons for Optimization
FlashAttention's success demonstrates key principles:
- Memory is the bottleneck: Reducing HBM accesses matters more than reducing FLOPS
- Tiling enables scale: Breaking problems into cache-sized pieces enables efficient memory use
- Numerically-equivalent alternatives exist: The same mathematical result can be computed different ways with vastly different efficiency
- Hardware specificity matters: Optimal kernels differ across GPU generations
TPU Optimization
TPU optimization differs fundamentally from CUDA. Rather than writing kernels, you write high-level code that the XLA compiler optimizes.
The XLA Paradigm
XLA (Accelerated Linear Algebra) is a domain-specific compiler for tensor operations. It:
- Takes a computational graph (from JAX, TensorFlow, or PyTorch)
- Analyzes the entire computation
- Applies optimizations (fusion, layout changes, parallelization)
- Generates efficient code for the target accelerator
This approach trades control for automation. You can't write custom TPU kernels, but XLA often finds optimizations humans would miss.
TPU-Friendly Patterns
XLA works best with:
Static shapes: Known dimensions enable aggressive optimization. Dynamic shapes force conservative code generation.
Regular tensor operations: Matrix multiplications, convolutions, and elementwise operations map directly to TPU systolic arrays.
Batch dimensions: TPUs excel at batched operations. Small batches underutilize the hardware.
No Python control flow in hot paths: Use jax.lax.cond and jax.lax.scan instead of Python if/for.
TPU-hostile patterns:
- Dynamic shapes: Variable sequence lengths, ragged tensors
- Sparse operations: TPU systolic arrays expect dense computation
- Custom operations: Anything not in XLA's operation set
- Fine-grained control flow: Many small conditional branches
JAX for TPU
JAX is the preferred framework for TPU development:
Model code (JAX)
↓
jit compilation (trace to XLA graph)
↓
XLA optimization (fusion, scheduling)
↓
TPU executable (optimized for hardware)
JAX's functional style aligns with XLA's graph-based optimization. The jax.jit decorator compiles functions, and jax.pmap handles data parallelism across TPU cores.
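A minimal JAX sketch of these patterns: a jitted function with static shapes, and jax.lax.scan in place of a Python loop so the iteration stays inside the compiled XLA graph. The toy MLP, shapes, and weight scaling are illustrative.

```python
import jax
import jax.numpy as jnp

@jax.jit   # traced once per input shape/dtype, then compiled by XLA for the accelerator
def mlp(x, w1, w2):
    return jax.nn.gelu(x @ w1) @ w2   # dense ops map cleanly onto the systolic array

def run_sequence(xs, w1, w2):
    # A Python `for` here would unroll into a huge graph at trace time;
    # jax.lax.scan keeps the loop as a single compiled construct instead.
    def step(carry, x):
        y = mlp(x, w1, w2)
        return carry + jnp.sum(y), y
    total, ys = jax.lax.scan(step, jnp.zeros(()), xs)
    return total, ys

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (16, 8, 1024))           # 16 steps, batch 8, width 1024
w1 = jax.random.normal(key, (1024, 4096)) * 0.02
w2 = jax.random.normal(key, (4096, 1024)) * 0.02
total, ys = run_sequence(xs, w1, w2)
```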
GSPMD and Parallelism
TPUs use GSPMD (General and Scalable Parallelization for ML) for distributed training:
- Annotate which tensors to shard and how
- GSPMD automatically inserts communication operations
- XLA optimizes the distributed computation
This declarative approach simplifies distributed training compared to manual GPU parallelization, but requires models to fit GSPMD's partitioning model.
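A hedged sketch of the annotation style using JAX's sharding API, which lowers to GSPMD: lay the devices out as a mesh, declare how the batch and the weight are split, and let the compiler insert the collectives. It assumes a host with at least two accelerator devices; the axis names "data" and "model" and the shapes are just labels for this example.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the visible devices as a 2D mesh: one axis for data parallelism,
# one for model (tensor) parallelism. Assumes at least 2 devices are present.
devices = np.array(jax.devices()).reshape(-1, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

x_sharding = NamedSharding(mesh, P("data", None))    # shard the batch across "data"
w_sharding = NamedSharding(mesh, P(None, "model"))   # shard weight columns across "model"

@jax.jit
def layer(x, w):
    # GSPMD partitions this matmul according to the input shardings and
    # inserts the required collectives automatically.
    return jnp.dot(x, w)

x = jax.device_put(jnp.ones((128, 1024)), x_sharding)
w = jax.device_put(jnp.ones((1024, 4096)), w_sharding)
y = layer(x, w)
print(y.sharding)   # the output carries a sharding chosen by the compiler
```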
Practical Optimization Strategies
Beyond low-level kernel optimization, several high-level strategies improve hardware utilization:
Batching for Throughput
Single-request inference vastly underutilizes hardware. A 70B model on H100:
- Single request: ~20 tokens/second
- Batch of 32: ~400 tokens/second
- Batch of 128: ~800 tokens/second
The improvement comes from amortizing weight loading across more tokens. Each batch increases arithmetic intensity (computation per byte loaded).
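A rough model of why batching helps, under the same roofline assumptions as earlier (weights streamed once per decode step, ~2 FLOPs per parameter per token). It ignores KV-cache traffic and scheduling overhead, which is why real measurements like those above come in lower; the function name is mine.

```python
def decode_tokens_per_second(batch, params=70e9, bytes_per_param=2,
                             hbm_bw=3.35e12, peak_flops=989e12):
    """Idealized decode throughput: one weight pass per step, batched matmuls."""
    memory_time = params * bytes_per_param / hbm_bw      # same for every batch size
    compute_time = 2 * params * batch / peak_flops       # grows with the batch
    return batch / max(memory_time, compute_time)

for b in (1, 32, 128, 512):
    print(f"batch {b:4d}: ~{decode_tokens_per_second(b):7.0f} tokens/s (upper bound)")
```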
Continuous batching (used by vLLM, TGI) dynamically adds/removes requests from the batch as they complete, maximizing utilization without waiting for all requests to finish.
Mixed Precision Training
Training in FP32 wastes half the available compute:
FP32 training (70B parameters, Adam):
- 280GB model weights
- 280GB gradients
- ~560GB optimizer states (two FP32 moments)
- Total: ~1.1TB
Mixed precision (BF16 weights and gradients, FP32 optimizer states):
- 140GB model weights
- 140GB gradients
- ~560-840GB optimizer states (FP32 moments, plus an FP32 master copy of the weights in many setups)
- Total: ~840GB-1.1TB
The memory savings come mainly from weights, gradients, and activations rather than optimizer state, but the compute benefit is unconditional: Tensor Cores achieve 2× throughput on FP16/BF16 versus FP32. With proper loss scaling, mixed precision is strictly better for LLM training.
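A minimal mixed-precision training step in PyTorch, using FP16 with GradScaler for loss scaling (BF16 usually skips the scaler). The toy model, shapes, and loss function are placeholders rather than anything from the article.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008), torch.nn.GELU(), torch.nn.Linear(11008, 4096)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # loss scaling guards small FP16 gradients

def train_step(x, target):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):   # matmuls run on Tensor Cores
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales, skips the step on inf/NaN
    scaler.update()
    return loss.detach()

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")
print(train_step(x, target))
```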
Quantization for Inference
Quantization reduces memory requirements and increases throughput:
| Precision | Memory | Tokens/s per request (70B, H100) | Quality Impact |
|---|---|---|---|
| FP16 | 140GB | 20 | Baseline |
| INT8 | 70GB | 35 | Minimal |
| INT4 | 35GB | 50 | Slight |
| FP8 | 70GB | 40 | Minimal |
INT4 with GPTQ or AWQ achieves 2-3× inference speedup with <1% quality loss on most benchmarks. FP8 (supported on H100+) provides a middle ground with native Tensor Core support.
KV Cache Optimization
The key-value cache grows linearly with sequence length, consuming significant memory during generation:
Standard caching:
- 70B model, 128K context: ~80GB KV cache (rough arithmetic below)
- Limits batch size severely
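A back-of-the-envelope calculator for that figure. The exact size depends heavily on the attention configuration, so the defaults below (80 layers, grouped-query attention with 8 KV heads of dimension 128, Llama-70B-style) are assumptions, with full multi-head attention shown for contrast; the function name is mine.

```python
def kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                   seq_len=128_000, batch=1, bytes_per_elem=2):
    """K and V vectors for every layer and every cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(kv_cache_bytes() / 1e9, "GB per sequence")                 # ≈ 42 GB with GQA (8 KV heads)
print(kv_cache_bytes(num_kv_heads=64) / 1e9, "GB per sequence")  # ≈ 335 GB with full MHA
```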
PagedAttention (vLLM):
- Manages KV cache in non-contiguous blocks
- Reduces memory fragmentation from 60-80% to under 4%
- Enables larger batches and longer contexts
Sliding window attention:
- Only cache recent tokens (e.g., last 4K)
- Reduces memory but loses long-range information
Cross-attention caching:
- Cache encoder outputs for encoder-decoder models
- Single encoder pass serves multiple decoder steps
Model Parallelism
Large models require distribution across GPUs:
Tensor Parallelism: Split individual operations across GPUs (see the sketch after this list)
- Matrix multiplication split column-wise or row-wise
- Requires fast interconnect (NVLink)
- Typical: 4-8 GPUs
Pipeline Parallelism: Split layers across GPUs
- Each GPU handles a subset of layers
- Micro-batching hides pipeline bubbles
- Typical: 4-16 GPUs
Data Parallelism: Replicate model, split data
- Each GPU processes different batches
- Gradients synchronized across replicas
- Scales to thousands of GPUs
Combined approaches: Production systems typically use all three:
- TP within nodes (fast NVLink)
- PP across node groups
- DP across the cluster
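A single-process sketch of column-wise tensor parallelism: chunk the weight's columns, compute each shard's output independently (each shard would live on its own GPU in a real system), and concatenate, which stands in for the all-gather a real implementation performs over NVLink. The helper name and shapes are illustrative.

```python
import torch

def column_parallel_linear(x, w, tp_ranks=4):
    """Column-parallel matmul, simulated on one device."""
    shards = torch.chunk(w, tp_ranks, dim=1)        # each rank holds 1/tp_ranks of the columns
    partials = [x @ shard for shard in shards]      # independent matmuls, one per rank
    return torch.cat(partials, dim=-1)              # stand-in for the all-gather

x = torch.randn(2, 4096)
w = torch.randn(4096, 11008)                        # e.g. the up-projection of an MLP
assert torch.allclose(column_parallel_linear(x, w), x @ w, atol=1e-4)
```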
Speculative Decoding
Speculative decoding addresses the autoregressive bottleneck:
- Small draft model proposes multiple tokens
- Large target model verifies in parallel
- Accept correct predictions, reject and regenerate otherwise
This achieves 2-3× speedup by converting sequential generation into parallel verification. See the dedicated post on speculative decoding for details.
Inference Frameworks
Production inference uses specialized frameworks that implement these optimizations:
vLLM
vLLM pioneered PagedAttention and continuous batching:
- Memory-efficient KV cache management
- Optimized CUDA kernels (FlashAttention, custom attention)
- Speculative decoding support
- Distributed inference with tensor parallelism
Best for: High-throughput serving, memory-constrained environments
TensorRT-LLM
NVIDIA's optimized inference framework:
- Kernel fusion and optimization
- INT8/FP8 quantization with custom kernels
- Multi-GPU support (TP, PP)
- Speculative decoding with EAGLE-3
Best for: Maximum performance on NVIDIA hardware, production deployments
SGLang
High-performance serving with unique features:
- RadixAttention for efficient KV cache sharing
- Constrained decoding with CUDA-accelerated FSMs
- Speculative decoding
- OpenAI-compatible API
Best for: Complex prompting patterns, structured generation
Framework Comparison (70B model, H100)
| Framework | Throughput | Latency (P50) | Memory | Ease of Use |
|---|---|---|---|---|
| vLLM | High | Low | Excellent | Good |
| TensorRT-LLM | Highest | Lowest | Good | Complex |
| SGLang | High | Low | Good | Good |
| HuggingFace TGI | Medium | Medium | Good | Excellent |
AI CUDA Engineer: LLM-Generated Kernels
A fascinating development is using LLMs to write CUDA kernels. Sakana AI's "AI CUDA Engineer" uses frontier models to automatically optimize PyTorch code:
Process:
- Input: PyTorch function
- LLM generates candidate CUDA kernels
- Kernels are compiled and benchmarked
- Best performing kernel is selected
- Iterative refinement based on profiling
Results:
- 10-100× speedups over naive PyTorch
- Competitive with hand-tuned implementations
- Discovers novel optimization strategies
This approach is particularly valuable for custom operations where hand-tuning expertise is unavailable. The LLM leverages patterns from millions of CUDA kernels in its training data.
Energy Efficiency and Sustainability
Hardware efficiency increasingly considers energy:
Performance per Watt
| Accelerator | Peak FP16 TFLOPS | Power | TFLOPS/Watt |
|---|---|---|---|
| A100 | 312 | 400W | 0.78 |
| H100 | 989 | 700W | 1.41 |
| TPU v5p | 459 | 250W | 1.84 |
| B100 | 1,750 | 700W | 2.50 |
TPUs achieve better TFLOPS/Watt through specialization—they sacrifice flexibility for efficiency. For workloads that fit TPU constraints, this translates to lower carbon footprint and operating cost.
Optimization Impact
Efficiency improvements compound:
- FlashAttention: 2× efficiency (fewer HBM accesses)
- INT4 quantization: 2× efficiency (smaller weights)
- Continuous batching: 3× efficiency (better utilization)
- Combined: 12× efficiency versus naive baseline
For a 70B model, serving can achieve:
- Naive: 20 tokens/second/GPU
- Optimized: 200+ tokens/second/GPU
This 10× improvement directly translates to 10× fewer GPUs, 10× less energy, and 10× lower cost.
Future Hardware Trends
The hardware landscape continues evolving:
Near-term (2025-2026)
NVIDIA Blackwell (B100, B200):
- 2× compute over Hopper
- Native FP4 support
- 8 TB/s HBM bandwidth
- Enhanced sparsity support
AMD MI350X:
- Competitive with H100 on paper
- Growing ROCm ecosystem
- Potential cost advantage
Intel Gaudi 3:
- Strong price/performance
- Growing software support
- Enterprise focus
Medium-term (2026-2028)
In-memory computing: Processing near or in memory to eliminate bandwidth bottleneck. IBM, Samsung, and others have research prototypes.
Photonic accelerators: Using light for computation offers fundamental efficiency advantages. Lightmatter and others are commercializing photonic chips.
Neuromorphic chips: Brain-inspired architectures with potential for sparse, event-driven computation.
Implications for LLMs
Future models will likely require:
- Hardware-aware architecture design: Models optimized for specific accelerators, not just generic transformers
- Heterogeneous deployment: Different hardware for different workloads (training vs. inference, dense vs. sparse)
- Continued software optimization: Hardware advances are meaningless without software to exploit them
Related Articles
vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
Speculative Decoding: Accelerating LLM Inference Without Sacrificing Quality
A comprehensive guide to speculative decoding techniques that accelerate LLM inference by 2-4× while maintaining exact output quality, covering draft models, EAGLE, Medusa, and production deployment strategies.
Distributed Training: How to Train 70B+ Parameter Models
A comprehensive deep dive into distributed training—how to train models that don't fit on a single GPU. Understand data parallelism, tensor parallelism, pipeline parallelism, ZeRO optimization, and the engineering behind training frontier LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
Mixture of Experts: Scaling LLMs Beyond Dense Models
A comprehensive deep dive into Mixture of Experts (MoE) architecture—how models like Mixtral and GPT-4 achieve massive capacity without proportional compute costs. Understand routing mechanisms, expert specialization, load balancing, and why MoE represents the future of LLM scaling.