Hardware Optimization for LLMs: CUDA Kernels, TPU vs GPU, and Accelerator Architecture
A comprehensive guide to hardware optimization for large language models covering GPU architecture, CUDA kernel optimization, TPU comparisons, memory hierarchies, and practical strategies for maximizing throughput on modern AI accelerators.
The performance of large language models is fundamentally constrained by hardware. A 70B parameter model requires 140GB just to store weights in FP16—more than any single consumer GPU can hold. Training requires thousands of GPUs coordinated across data centers. Inference serves billions of queries daily, with cost directly tied to hardware efficiency. Understanding accelerator architecture and optimization techniques is essential for anyone deploying LLMs at scale.
The Hardware Landscape
Before diving into optimization, we need to understand the accelerator options available and their fundamental tradeoffs.
GPUs: The Default Choice
NVIDIA GPUs dominate LLM training and inference for good reason. The CUDA ecosystem, developed over 15 years, provides mature libraries, extensive tooling, and a vast community. When something goes wrong, you can find help.
The current flagship is the H100 (Hopper architecture), with the H200 (high-bandwidth memory variant) and B100/B200 (Blackwell architecture) representing the cutting edge:
| GPU | FP16/BF16 TFLOPS | Sparse FP4 PFLOPS | Memory | Bandwidth | Power |
|---|---|---|---|---|---|
| A100 80GB | 312 | — | 80GB HBM2e | 2.0 TB/s | 400W |
| H100 80GB | 989 | — | 80GB HBM3 | 3.35 TB/s | 700W |
| H200 | 989 | — | 141GB HBM3e | 4.8 TB/s | 700W |
| B100 | 1,750 | 14 | 192GB HBM3e | 8.0 TB/s | 700W |
| B200 | 2,250 | 18 | 192GB HBM3e | 8.0 TB/s | 1000W |
B200 Performance: The B200 delivers up to 20 petaFLOPS of sparse FP4 AI compute per card. Built on TSMC's 4NP process, it packs 208 billion transistors across a dual-die design. Key specs:
- Memory: 192GB HBM3e (180GB usable in cloud)—2.4× H100 capacity
- Bandwidth: 8 TB/s memory bandwidth—2× Hopper
- Interconnect: NVLink 5 at 1.8 TB/s bidirectional
- Tensor Cores: 6th-gen with FP4, FP6, FP8, BF16, TF32 support
DGX B200: 8× B200 GPUs delivering 3× training and 15× inference performance over DGX H100.
GB200 NVL72 (Rack-Scale System): The flagship configuration connects 36 Grace CPUs and 72 Blackwell GPUs in a liquid-cooled, rack-scale design:
- 72-GPU NVLink domain acts as single massive GPU
- 130 TB/s of low-latency GPU-to-GPU communication
- 30× faster real-time trillion-parameter LLM inference vs. H100
- 10× greater performance for MoE architectures
- Up to 25× reduction in cost and energy consumption
- Each Grace Blackwell Superchip: 10 PFLOPS FP8, 372GB HBM3e
Real-World Benchmarks (December 2025):
- Training: Up to 57% faster than H100
- Inference (Gemma 27B): ~10% speedup observed
- Inference (DeepSeek 671B): On par with H100 (early software ecosystem)
- Self-hosted B200 can be up to 10× cheaper than cloud H100
Note the trend: memory bandwidth is growing faster than raw compute. This reflects the recognition that LLM workloads are memory-bound—the bottleneck is moving data, not computing on it.
TPUs: Google's Alternative
Google's Tensor Processing Units (TPUs) take a different approach: specialized hardware designed specifically for tensor operations, available exclusively through Google Cloud.
TPUs use a systolic array architecture—a grid of processing elements that data flows through rhythmically. This is more efficient than GPUs' general-purpose SMs for regular tensor operations but less flexible for arbitrary computation.
Current TPU generations:
| TPU | Peak FP8 TFLOPS | Memory | Bandwidth | ICI Bandwidth | Per-Chip Power |
|---|---|---|---|---|---|
| v4 | 275 | 32GB HBM2e | 1.2 TB/s | 600 GB/s | ~200W |
| v5e | 197 | 16GB HBM2 | 820 GB/s | 1.6 TB/s | ~150W |
| v5p | 459 | 95GB HBM2e | 2.8 TB/s | 4.8 TB/s | ~250W |
| v6 (Trillium) | 918 | 32GB HBM2e | 1.6 TB/s | 3.2 TB/s | ~200W |
| v7 (Ironwood) | 4,614 | 192GB HBM3e | 7.37 TB/s | 1.2 TB/s | ~400W |
TPU v6 (Trillium): Achieves 4.7× peak compute improvement over v5e, with 2× HBM capacity/bandwidth and 2× ICI bandwidth. Over 67% more energy-efficient than v5e. Scales to 256 TPUs per pod.
TPU v7 (Ironwood): Google's latest generation, now generally available, specifically designed for inference at scale. Architecture details:
- Each chip contains two TensorCores and four SparseCores across two chiplets
- Single primary compute die (~700mm²) on TSMC N3P with CoWoS
- ~1kW power consumption (liquid-cooled)
- Inter-Chip Interconnect (ICI) network at 9.6 Tb/s
Key specs vs. Trillium (v6e):
- 5× compute performance
- 6× HBM capacity (192GB vs 32GB)
- 4.5× HBM bandwidth (7.4 TB/s)
- 2× performance per watt
- 10× peak performance over TPU v5p
Scale and economics:
- Scales to 9,216 chips per superpod (1.77 PB shared HBM)
- 44% lower TCO than GB200 NVL72 per Google's analysis
- Anthropic committed to deploying over 1 million Ironwood chips beginning 2026
- Nearly closes the gap to the B200 on FLOPs, memory, and bandwidth, albeit reaching general availability about a year later
TPUs excel in specific scenarios:
Strengths:
- Superior performance-per-watt (29× better than CPU, competitive with GPU)
- Native bfloat16 support (Google invented the format for TPUs)
- Tight integration with JAX/XLA for automatic optimization
- Aggressive pricing for certain workloads
Weaknesses:
- Requires XLA-compatible code (no arbitrary CUDA kernels)
- Dynamic shapes and control flow are problematic
- Exclusive to Google Cloud (vendor lock-in)
- Smaller ecosystem and community
The Economics
Hardware choice has significant cost implications:
Training costs (estimated for a hypothetical 70B model):
- H100 cluster: ~$2M compute for training
- TPU v5p pod: ~$1.5M compute for equivalent training
- Savings require XLA-compatible architecture
Inference costs (per million tokens, 70B model):
- H100: $0.50-1.00
- A100: $0.80-1.50
- TPU v5e: $0.30-0.60
- H200: $0.40-0.80
TPUs can be 2-4× cheaper for inference when models fit their constraints. GPUs offer more flexibility but at higher cost.
The Hybrid Strategy
Many organizations use both:
Training: H100 clusters for flexibility in model development, rapid iteration, and debugging.
Inference: TPU v5e/v6 for production serving where models are stable and cost optimization matters.
This "follow Meta's model" approach balances research agility with production efficiency, achieving 40-50% total compute savings while maintaining development velocity.
GPU Architecture Deep Dive
To optimize for GPUs, we need to understand their architecture. Modern NVIDIA GPUs consist of:
Streaming Multiprocessors (SMs)
The GPU's fundamental compute unit. An H100 has 132 SMs, each containing:
- 128 FP32 CUDA cores
- 64 FP64 CUDA cores (double precision)
- 4 Tensor Cores (matrix acceleration)
- 256KB register file
- 256KB L1 cache / shared memory (configurable)
- Warp schedulers
SMs execute warps—groups of 32 threads that execute in lockstep (SIMT: Single Instruction, Multiple Thread). All threads in a warp execute the same instruction simultaneously. Divergent branches cause serialization and performance loss.
Tensor Cores
Tensor Cores are specialized matrix multiplication units that accelerate the operations dominating transformer workloads. They operate on small matrix tiles:
- H100: 16×16 matrices in various precisions
- Mixed precision: FP16/BF16 inputs with FP32 accumulation
- Sparse support: 2× throughput for structured sparsity
To use Tensor Cores effectively:
- Matrix dimensions must be multiples of 8 (FP16) or 16 (INT8)
- Data must be properly aligned in memory
- Libraries like cuBLAS and cuDNN handle this automatically
Tensor Core utilization is the key metric for LLM performance. Achieving 70-80% utilization on H100 is considered excellent; many naive implementations achieve only 30-40%.
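As a concrete illustration, here is a minimal PyTorch sketch of keeping a GEMM Tensor Core-friendly: zero-pad the contracted dimension up to a multiple of 8 and run the matmul under autocast. The `pad_to_multiple` helper and the shapes are illustrative, not a library API; padding the shared dimension with zeros leaves the result unchanged.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(t: torch.Tensor, dim: int, multiple: int = 8) -> torch.Tensor:
    """Zero-pad one dimension up to the next multiple of `multiple` (illustrative helper)."""
    remainder = t.shape[dim] % multiple
    if remainder == 0:
        return t
    pad = multiple - remainder
    # F.pad expects (left, right) pairs starting from the last dimension.
    pads = [0, 0] * (t.ndim - dim - 1) + [0, pad]
    return F.pad(t, pads)

x = torch.randn(1024, 4094, device="cuda")   # 4094 is not a multiple of 8
w = torch.randn(4094, 4096, device="cuda")

x_p = pad_to_multiple(x, dim=1)              # 4094 -> 4096; zeros add nothing to the dot products
w_p = pad_to_multiple(w, dim=0)

with torch.autocast("cuda", dtype=torch.bfloat16):   # BF16 inputs, FP32 accumulation
    y = x_p @ w_p                                    # now eligible for a Tensor Core GEMM
```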
Memory Hierarchy
GPU memory is hierarchical, with dramatic differences in bandwidth and latency:
| Level | Bandwidth | Latency | Size |
|---|---|---|---|
| Registers | ~20 TB/s | ~1 cycle | 256KB/SM |
| L1 / Shared Memory | ~20 TB/s | ~30 cycles | 256KB/SM |
| L2 Cache | ~5 TB/s | ~100 cycles | 50MB |
| HBM (Global) | 3.35 TB/s | ~500 cycles | 80GB |
The orders-of-magnitude gap in latency, and the substantial gap in bandwidth, between registers and HBM dominate performance considerations. Optimized kernels maximize data reuse at the upper levels of this hierarchy.
Memory Bandwidth: The Real Bottleneck
For a 70B model in FP16:
- Weight size: 140GB
- Single forward pass: Load 140GB of weights
- H100 bandwidth: 3.35 TB/s
- Theoretical minimum time: 140GB / 3.35 TB/s = 42ms
This is just for weight loading—before any computation. The actual compute (matrix multiplications) could complete in under 5ms if data were already in registers. We're spending 8× longer moving data than computing.
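The same arithmetic in a few lines of Python, plus the machine-balance figure it implies. The 2-FLOPs-per-parameter-per-token count is the usual rough estimate for a dense forward pass, and the variable names are just for this sketch.

```python
# Roofline arithmetic for one decode step of a 70B FP16 model on an H100.
params = 70e9
bytes_per_param = 2            # FP16 weights
hbm_bandwidth = 3.35e12        # bytes/s
peak_flops = 989e12            # FP16 Tensor Core FLOP/s

weight_bytes = params * bytes_per_param
memory_floor = weight_bytes / hbm_bandwidth          # ≈ 0.042 s just to stream the weights
print(f"weight streaming floor: {memory_floor * 1e3:.0f} ms per step")

# Machine balance: FLOPs the GPU can perform per byte it moves from HBM.
balance = peak_flops / hbm_bandwidth                 # ≈ 295 FLOPs/byte
# A dense forward pass does ~2 FLOPs per parameter per token, so roughly `balance`
# tokens must share each weight load before compute becomes the bottleneck.
tokens_to_saturate = balance * bytes_per_param / 2
print(f"batch needed to leave the memory-bound regime: ~{tokens_to_saturate:.0f} tokens")
```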
This is why LLM inference is memory-bound, not compute-bound. Optimization focuses on:
- Reducing memory movement (quantization, caching)
- Increasing arithmetic intensity (batching)
- Hiding memory latency (pipelining)
CUDA Kernel Optimization
CUDA kernels are the functions that execute on GPUs. For LLM workloads, key kernels include:
- Matrix multiplication (GEMM): Attention projections, feedforward layers
- Softmax: Attention score normalization
- Layer normalization: Per-layer feature normalization
- Activation functions: GELU, SiLU
- Attention: Combined QKV projection, attention, output
Kernel Fusion
Kernel fusion combines multiple operations into a single kernel, eliminating intermediate memory writes:
Unfused (naive):
y = matmul(x, W1) # Write to HBM
y = gelu(y) # Read from HBM, write back
y = matmul(y, W2) # Read from HBM
Fused:
y = fused_mlp(x, W1, W2) # One kernel, no intermediate HBM
The fused version avoids two HBM round-trips. For a 4096-dim hidden state with 8192-dim intermediate, this saves:
- 2 × 4096 × batch_size × 2 bytes per token
- At batch_size=1024: 16MB saved per layer
Across 80 layers, this is 1.3GB of avoided memory traffic per forward pass. FlashAttention achieves its speedups largely through fusion.
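A hedged sketch of the same idea without hand-writing CUDA, assuming PyTorch 2.x: torch.compile traces the function and lets the Inductor backend fuse the elementwise GELU with neighboring operations where it can. Whether the fusion actually happens depends on the backend, shapes, and compile mode; the shapes below are illustrative.

```python
import torch

def mlp(x, w1, w2):
    # Eager version: the intermediate activation makes extra HBM round-trips.
    return torch.nn.functional.gelu(x @ w1) @ w2

# torch.compile captures the graph and emits fused kernels where the backend supports it.
fused_mlp = torch.compile(mlp)

x = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
w1 = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
w2 = torch.randn(8192, 4096, device="cuda", dtype=torch.float16)

y = fused_mlp(x, w1, w2)   # first call compiles; subsequent calls reuse the fused kernels
```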
Persistent Kernels
Traditional CUDA launches a separate kernel for each operation. Kernel launch overhead (~5-10μs) accumulates across hundreds of operations per forward pass.
Persistent kernels execute the entire forward pass in a single kernel launch:
- Load model weights into shared memory (partitioned across SMs)
- Process tokens without returning to CPU
- Synchronize between layers using global memory barriers
- Output final logits
This approach achieves up to 6.7× reduction in kernel launch latency and GPU idle time. The tradeoff is implementation complexity and reduced flexibility.
Memory Access Patterns
GPUs achieve peak bandwidth only with coalesced memory access—consecutive threads accessing consecutive memory locations:
Coalesced (good): Thread 0 reads address 0, thread 1 reads address 1, ...
Strided (bad): Thread 0 reads address 0, thread 1 reads address 128, ...
Strided access wastes bandwidth because memory is transferred in chunks (cache lines). Requesting scattered addresses within a chunk loads the entire chunk but uses only part of it.
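The effect can be probed from PyTorch without writing a kernel: copying out of a transposed (strided) view forces non-coalesced reads, while a contiguous copy streams close to peak bandwidth. The `effective_bandwidth_gb_s` helper is mine, and the gap you observe depends on which copy kernel PyTorch dispatches, so treat this as a rough probe.

```python
import torch

def effective_bandwidth_gb_s(src: torch.Tensor) -> float:
    """Time a device-to-device copy and report effective bandwidth (read + write)."""
    dst = torch.empty(src.shape, device=src.device, dtype=src.dtype)  # contiguous destination
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end)
    bytes_moved = 2 * src.numel() * src.element_size()   # one read plus one write
    return bytes_moved / (ms * 1e6)

x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
print("contiguous copy:", effective_bandwidth_gb_s(x), "GB/s")       # coalesced reads
print("transposed copy:", effective_bandwidth_gb_s(x.t()), "GB/s")   # strided reads
```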
For attention operations, the key/value cache is particularly problematic. Autoregressive generation accesses K/V vectors for all previous tokens—potentially millions of scattered accesses. Optimized implementations use:
- Paged attention (vLLM): Organize K/V cache in contiguous blocks
- Chunked access: Process K/V in cache-friendly chunks
- Memory layout optimization: Store K/V in formats that enable coalesced access
Quantization-Aware Kernels
Quantized models (INT8, INT4) require specialized kernels:
Standard GEMM: FP16 × FP16 → FP16
Quantized GEMM: INT8 × INT8 → INT32 → FP16
The quantized version:
- Loads compressed weights (2-4× smaller)
- Dequantizes on-the-fly
- Computes with integer arithmetic
- Converts output to FP16
With INT4 quantization, each byte packs two weights, cutting the bandwidth needed to stream them. As a result, INT4 achieves a 2-4× speedup on memory-bound workloads despite the dequantization overhead.
Custom kernels for grouped quantization (different scales per group of weights) further optimize quality-efficiency tradeoffs.
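A toy version of the dequantize-on-the-fly path in plain PyTorch. A real INT8 kernel keeps the arithmetic in integers with INT32 accumulation and fuses dequantization into the GEMM; this sketch only captures the memory side (INT8 storage with one scale per output row), and the helper names are mine.

```python
import torch

def quantize_rowwise_int8(w: torch.Tensor):
    """Symmetric INT8 quantization with one scale per output row (toy example)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q_w: torch.Tensor, scale: torch.Tensor):
    """Load INT8 weights, dequantize on the fly, and compute in the activation dtype."""
    w = q_w.to(x.dtype) * scale.to(x.dtype)   # a real kernel fuses this into the GEMM
    return x @ w.t()

w = torch.randn(4096, 4096)        # [out_features, in_features]
x = torch.randn(8, 4096)
q_w, scale = quantize_rowwise_int8(w)
err = (int8_linear(x, q_w, scale) - x @ w.t()).abs().max().item()
print(f"stored: {q_w.numel()} bytes (INT8) vs {w.numel() * 4} bytes (FP32), max error {err:.3f}")
```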
FlashAttention: A Case Study in Optimization
FlashAttention exemplifies modern CUDA optimization, achieving 2-4× speedups through careful memory management.
The Problem
Standard attention computes:
$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
For sequence length $n$, this creates an $n \times n$ attention matrix. At 128K tokens, this matrix has roughly 16 billion elements—too large to fit in GPU memory.
Naive implementations materialize this full matrix, causing:
- Massive memory allocation ($O(n^2)$ space)
- Repeated HBM round-trips
- Memory bandwidth bottleneck
The Solution
FlashAttention computes attention in tiles, never materializing the full attention matrix:
- Divide Q, K, V into blocks that fit in shared memory
- For each Q block:
a. Load Q block to shared memory
b. For each K, V block:
- Load K, V to shared memory
- Compute partial attention (QK^T for this tile)
- Update running softmax statistics
- Accumulate output contribution
c. Write final output to HBM
The key insight is maintaining running softmax statistics (max value and sum of exponentials) that allow computing the correct softmax incrementally without seeing all attention scores simultaneously.
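The running-statistics trick is small enough to show directly. Below is a single-query, CPU-only sketch of the online softmax recurrence FlashAttention builds on; the real kernel applies the same update per tile in shared memory and registers, fused with the Tensor Core matmuls. The function name and block size are mine.

```python
import torch

def online_softmax_attention(q, k, v, block=128):
    """Streaming attention for one query: process K/V in blocks while keeping a
    running max and running sum of exponentials, so the full score vector is
    never materialized."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of scores
    l = torch.tensor(0.0)             # running sum of exp(score - m)
    acc = torch.zeros(d)              # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (kb @ q) / d ** 0.5                      # scores for this block
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)            # rescale old statistics
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q = torch.randn(64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
ref = torch.softmax((k @ q) / 64 ** 0.5, dim=0) @ v   # materializes all scores
print(torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-5))
```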
Performance Impact
FlashAttention-3 on H100 achieves:
- 85% of theoretical peak FLOPS utilization
- 1.5-2× speedup over FlashAttention-2
- Linear memory complexity ($O(n)$ instead of $O(n^2)$)
- Native support for FP8 computation
The optimization goes beyond algorithmic improvement—FlashAttention includes hand-tuned assembly for memory access patterns, warp scheduling, and Tensor Core utilization specific to each GPU architecture.
Lessons for Optimization
FlashAttention's success demonstrates key principles:
- Memory is the bottleneck: Reducing HBM accesses matters more than reducing FLOPS
- Tiling enables scale: Breaking problems into cache-sized pieces enables efficient memory use
- Numerically-equivalent alternatives exist: The same mathematical result can be computed different ways with vastly different efficiency
- Hardware specificity matters: Optimal kernels differ across GPU generations
TPU Optimization
TPU optimization differs fundamentally from CUDA. Rather than writing kernels, you write high-level code that the XLA compiler optimizes.
The XLA Paradigm
XLA (Accelerated Linear Algebra) is a domain-specific compiler for tensor operations. It:
- Takes a computational graph (from JAX, TensorFlow, or PyTorch)
- Analyzes the entire computation
- Applies optimizations (fusion, layout changes, parallelization)
- Generates efficient code for the target accelerator
This approach trades control for automation. You can't write custom TPU kernels, but XLA often finds optimizations humans would miss.
TPU-Friendly Patterns
XLA works best with:
Static shapes: Known dimensions enable aggressive optimization. Dynamic shapes force conservative code generation.
Regular tensor operations: Matrix multiplications, convolutions, and elementwise operations map directly to TPU systolic arrays.
Batch dimensions: TPUs excel at batched operations. Small batches underutilize the hardware.
No Python control flow in hot paths: Use jax.lax.cond and jax.lax.scan instead of Python if/for.
TPU-hostile patterns:
- Dynamic shapes: Variable sequence lengths, ragged tensors
- Sparse operations: TPU systolic arrays expect dense computation
- Custom operations: Anything not in XLA's operation set
- Fine-grained control flow: Many small conditional branches
JAX for TPU
JAX is the preferred framework for TPU development:
Model code (JAX)
↓
jit compilation (trace to XLA graph)
↓
XLA optimization (fusion, scheduling)
↓
TPU executable (optimized for hardware)
JAX's functional style aligns with XLA's graph-based optimization. The jax.jit decorator compiles functions, and jax.pmap handles data parallelism across TPU cores.
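A minimal JAX sketch of these patterns: a jitted function with static shapes, and jax.lax.scan in place of a Python loop so the iteration stays inside the compiled XLA graph. The toy MLP, shapes, and weight scaling are illustrative.

```python
import jax
import jax.numpy as jnp

@jax.jit   # traced once per input shape/dtype, then compiled by XLA for the accelerator
def mlp(x, w1, w2):
    return jax.nn.gelu(x @ w1) @ w2   # dense ops map cleanly onto the systolic array

def run_sequence(xs, w1, w2):
    # A Python `for` here would unroll into a huge graph at trace time;
    # jax.lax.scan keeps the loop as a single compiled construct instead.
    def step(carry, x):
        y = mlp(x, w1, w2)
        return carry + jnp.sum(y), y
    total, ys = jax.lax.scan(step, jnp.zeros(()), xs)
    return total, ys

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (16, 8, 1024))           # 16 steps, batch 8, width 1024
w1 = jax.random.normal(key, (1024, 4096)) * 0.02
w2 = jax.random.normal(key, (4096, 1024)) * 0.02
total, ys = run_sequence(xs, w1, w2)
```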
GSPMD and Parallelism
TPUs use GSPMD (General and Scalable Parallelization for ML) for distributed training:
- Annotate which tensors to shard and how
- GSPMD automatically inserts communication operations
- XLA optimizes the distributed computation
This declarative approach simplifies distributed training compared to manual GPU parallelization, but requires models to fit GSPMD's partitioning model.
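A hedged sketch of the annotation style using JAX's sharding API, which lowers to GSPMD: lay the devices out as a mesh, declare how the batch and the weight are split, and let the compiler insert the collectives. It assumes a host with at least two accelerator devices; the axis names "data" and "model" and the shapes are just labels for this example.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the visible devices as a 2D mesh: one axis for data parallelism,
# one for model (tensor) parallelism. Assumes at least 2 devices are present.
devices = np.array(jax.devices()).reshape(-1, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

x_sharding = NamedSharding(mesh, P("data", None))    # shard the batch across "data"
w_sharding = NamedSharding(mesh, P(None, "model"))   # shard weight columns across "model"

@jax.jit
def layer(x, w):
    # GSPMD partitions this matmul according to the input shardings and
    # inserts the required collectives automatically.
    return jnp.dot(x, w)

x = jax.device_put(jnp.ones((128, 1024)), x_sharding)
w = jax.device_put(jnp.ones((1024, 4096)), w_sharding)
y = layer(x, w)
print(y.sharding)   # the output carries a sharding chosen by the compiler
```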
Practical Optimization Strategies
Beyond low-level kernel optimization, several high-level strategies improve hardware utilization:
Batching for Throughput
Single-request inference vastly underutilizes hardware. A 70B model on H100:
- Single request: ~20 tokens/second
- Batch of 32: ~400 tokens/second
- Batch of 128: ~800 tokens/second
The improvement comes from amortizing weight loading across more tokens. Each batch increases arithmetic intensity (computation per byte loaded).
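A rough model of why batching helps, under the same roofline assumptions as earlier (weights streamed once per decode step, ~2 FLOPs per parameter per token). It ignores KV-cache traffic and scheduling overhead, which is why real measurements like those above come in lower; the function name is mine.

```python
def decode_tokens_per_second(batch, params=70e9, bytes_per_param=2,
                             hbm_bw=3.35e12, peak_flops=989e12):
    """Idealized decode throughput: one weight pass per step, batched matmuls."""
    memory_time = params * bytes_per_param / hbm_bw      # same for every batch size
    compute_time = 2 * params * batch / peak_flops       # grows with the batch
    return batch / max(memory_time, compute_time)

for b in (1, 32, 128, 512):
    print(f"batch {b:4d}: ~{decode_tokens_per_second(b):7.0f} tokens/s (upper bound)")
```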
Continuous batching (used by vLLM, TGI) dynamically adds/removes requests from the batch as they complete, maximizing utilization without waiting for all requests to finish.
Mixed Precision Training
Training in FP32 wastes half the available compute:
FP32 training (70B parameters, Adam):
- 280GB model weights
- 280GB gradients
- ~560GB optimizer states (two FP32 moments)
- Total: ~1.1TB
Mixed precision (BF16 weights and gradients, FP32 optimizer states):
- 140GB model weights
- 140GB gradients
- ~560-840GB optimizer states (FP32 moments, plus an FP32 master copy of the weights in many setups)
- Total: ~840GB-1.1TB
The memory savings come mainly from weights, gradients, and activations rather than optimizer state, but the compute benefit is unconditional: Tensor Cores achieve 2× throughput on FP16/BF16 versus FP32. With proper loss scaling, mixed precision is strictly better for LLM training.
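A minimal mixed-precision training step in PyTorch, using FP16 with GradScaler for loss scaling (BF16 usually skips the scaler). The toy model, shapes, and loss function are placeholders rather than anything from the article.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008), torch.nn.GELU(), torch.nn.Linear(11008, 4096)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # loss scaling guards small FP16 gradients

def train_step(x, target):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):   # matmuls run on Tensor Cores
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales, skips the step on inf/NaN
    scaler.update()
    return loss.detach()

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")
print(train_step(x, target))
```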
Quantization for Inference
Quantization reduces memory requirements and increases throughput:
| Precision | Memory | Tokens/s per request (70B, H100) | Quality Impact |
|---|---|---|---|
| FP16 | 140GB | 20 | Baseline |
| INT8 | 70GB | 35 | Minimal |
| INT4 | 35GB | 50 | Slight |
| FP8 | 70GB | 40 | Minimal |
INT4 with GPTQ or AWQ achieves 2-3× inference speedup with <1% quality loss on most benchmarks. FP8 (supported on H100+) provides a middle ground with native Tensor Core support.
KV Cache Optimization
The key-value cache grows linearly with sequence length, consuming significant memory during generation:
Standard caching:
- 70B model, 128K context: ~80GB KV cache (rough arithmetic below)
- Limits batch size severely
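A back-of-the-envelope calculator for that figure. The exact size depends heavily on the attention configuration, so the defaults below (80 layers, grouped-query attention with 8 KV heads of dimension 128, Llama-70B-style) are assumptions, with full multi-head attention shown for contrast; the function name is mine.

```python
def kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                   seq_len=128_000, batch=1, bytes_per_elem=2):
    """K and V vectors for every layer and every cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(kv_cache_bytes() / 1e9, "GB per sequence")                 # ≈ 42 GB with GQA (8 KV heads)
print(kv_cache_bytes(num_kv_heads=64) / 1e9, "GB per sequence")  # ≈ 335 GB with full MHA
```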
PagedAttention (vLLM):
- Manages KV cache in non-contiguous blocks
- Reduces memory fragmentation from 60-80% to under 4%
- Enables larger batches and longer contexts
Sliding window attention:
- Only cache recent tokens (e.g., last 4K)
- Reduces memory but loses long-range information
Cross-attention caching:
- Cache encoder outputs for encoder-decoder models
- Single encoder pass serves multiple decoder steps
Model Parallelism
Large models require distribution across GPUs:
Tensor Parallelism: Split individual operations across GPUs (see the sketch after this list)
- Matrix multiplication split column-wise or row-wise
- Requires fast interconnect (NVLink)
- Typical: 4-8 GPUs
Pipeline Parallelism: Split layers across GPUs
- Each GPU handles a subset of layers
- Micro-batching hides pipeline bubbles
- Typical: 4-16 GPUs
Data Parallelism: Replicate model, split data
- Each GPU processes different batches
- Gradients synchronized across replicas
- Scales to thousands of GPUs
Combined approaches: Production systems typically use all three:
- TP within nodes (fast NVLink)
- PP across node groups
- DP across the cluster
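A single-process sketch of column-wise tensor parallelism: chunk the weight's columns, compute each shard's output independently (each shard would live on its own GPU in a real system), and concatenate, which stands in for the all-gather a real implementation performs over NVLink. The helper name and shapes are illustrative.

```python
import torch

def column_parallel_linear(x, w, tp_ranks=4):
    """Column-parallel matmul, simulated on one device."""
    shards = torch.chunk(w, tp_ranks, dim=1)        # each rank holds 1/tp_ranks of the columns
    partials = [x @ shard for shard in shards]      # independent matmuls, one per rank
    return torch.cat(partials, dim=-1)              # stand-in for the all-gather

x = torch.randn(2, 4096)
w = torch.randn(4096, 11008)                        # e.g. the up-projection of an MLP
assert torch.allclose(column_parallel_linear(x, w), x @ w, atol=1e-4)
```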
Speculative Decoding
Speculative decoding addresses the autoregressive bottleneck:
- Small draft model proposes multiple tokens
- Large target model verifies in parallel
- Accept correct predictions, reject and regenerate otherwise
This achieves 2-3× speedup by converting sequential generation into parallel verification. See the dedicated post on speculative decoding for details.
Inference Frameworks
Production inference uses specialized frameworks that implement these optimizations:
vLLM
vLLM pioneered PagedAttention and continuous batching:
- Memory-efficient KV cache management
- Optimized CUDA kernels (FlashAttention, custom attention)
- Speculative decoding support
- Distributed inference with tensor parallelism
Best for: High-throughput serving, memory-constrained environments
TensorRT-LLM
NVIDIA's optimized inference framework:
- Kernel fusion and optimization
- INT8/FP8 quantization with custom kernels
- Multi-GPU support (TP, PP)
- Speculative decoding with EAGLE-3
Best for: Maximum performance on NVIDIA hardware, production deployments
SGLang
High-performance serving with unique features:
- RadixAttention for efficient KV cache sharing
- Constrained decoding with CUDA-accelerated FSMs
- Speculative decoding
- OpenAI-compatible API
Best for: Complex prompting patterns, structured generation
Framework Comparison (70B model, H100)
| Framework | Throughput | Latency (P50) | Memory | Ease of Use |
|---|---|---|---|---|
| vLLM | High | Low | Excellent | Good |
| TensorRT-LLM | Highest | Lowest | Good | Complex |
| SGLang | High | Low | Good | Good |
| HuggingFace TGI | Medium | Medium | Good | Excellent |
AI CUDA Engineer: LLM-Generated Kernels
A fascinating development is using LLMs to write CUDA kernels. Sakana AI's "AI CUDA Engineer" uses frontier models to automatically optimize PyTorch code:
Process:
- Input: PyTorch function
- LLM generates candidate CUDA kernels
- Kernels are compiled and benchmarked
- Best performing kernel is selected
- Iterative refinement based on profiling
Results:
- 10-100× speedups over naive PyTorch
- Competitive with hand-tuned implementations
- Discovers novel optimization strategies
This approach is particularly valuable for custom operations where hand-tuning expertise is unavailable. The LLM leverages patterns from millions of CUDA kernels in its training data.
Energy Efficiency and Sustainability
Hardware efficiency increasingly considers energy:
Performance per Watt
| Accelerator | Peak FP16 TFLOPS | Power | TFLOPS/Watt |
|---|---|---|---|
| A100 | 312 | 400W | 0.78 |
| H100 | 989 | 700W | 1.41 |
| TPU v5p | 459 | 250W | 1.84 |
| B100 | 1,750 | 700W | 2.50 |
TPUs achieve better TFLOPS/Watt through specialization—they sacrifice flexibility for efficiency. For workloads that fit TPU constraints, this translates to lower carbon footprint and operating cost.
Optimization Impact
Efficiency improvements compound:
- FlashAttention: 2× efficiency (fewer HBM accesses)
- INT4 quantization: 2× efficiency (smaller weights)
- Continuous batching: 3× efficiency (better utilization)
- Combined: 12× efficiency versus naive baseline
For a 70B model, serving can achieve:
- Naive: 20 tokens/second/GPU
- Optimized: 200+ tokens/second/GPU
This 10× improvement directly translates to 10× fewer GPUs, 10× less energy, and 10× lower cost.
Future Hardware Trends
The hardware landscape continues evolving:
Near-term (2025-2026)
NVIDIA Blackwell (B100, B200):
- 2× compute over Hopper
- Native FP4 support
- 8 TB/s HBM bandwidth
- Enhanced sparsity support
AMD MI350X:
- Competitive with H100 on paper
- Growing ROCm ecosystem
- Potential cost advantage
Intel Gaudi 3:
- Strong price/performance
- Growing software support
- Enterprise focus
Medium-term (2026-2028)
In-memory computing: Processing near or in memory to eliminate bandwidth bottleneck. IBM, Samsung, and others have research prototypes.
Photonic accelerators: Using light for computation offers fundamental efficiency advantages. Lightmatter and others are commercializing photonic chips.
Neuromorphic chips: Brain-inspired architectures with potential for sparse, event-driven computation.
Implications for LLMs
Future models will likely require:
- Hardware-aware architecture design: Models optimized for specific accelerators, not just generic transformers
- Heterogeneous deployment: Different hardware for different workloads (training vs. inference, dense vs. sparse)
- Continued software optimization: Hardware advances are meaningless without software to exploit them
Related Articles
vLLM in Production: The Complete Guide to High-Performance LLM Serving
A comprehensive guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
Speculative Decoding: Accelerating LLM Inference Without Sacrificing Quality
A comprehensive guide to speculative decoding techniques that accelerate LLM inference by 2-4× while maintaining exact output quality, covering draft models, EAGLE, Medusa, and production deployment strategies.
Distributed Training: How to Train 70B+ Parameter Models
A comprehensive deep dive into distributed training—how to train models that don't fit on a single GPU. Understand data parallelism, tensor parallelism, pipeline parallelism, ZeRO optimization, and the engineering behind training frontier LLMs.
Attention Mechanisms: From Self-Attention to FlashAttention
A comprehensive deep dive into attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.
Mixture of Experts: Scaling LLMs Beyond Dense Models
A comprehensive deep dive into Mixture of Experts (MoE) architecture—how models like Mixtral and GPT-4 achieve massive capacity without proportional compute costs. Understand routing mechanisms, expert specialization, load balancing, and why MoE represents the future of LLM scaling.