Coded Caching: From Information Theory to AI-Optimized Edge Networks
Detailed look at coded caching—from Maddah-Ali & Niesen's seminal information-theoretic foundations to modern AI-driven cache optimization. Deep analysis of local and global caching gains, decentralized schemes, and the integration of deep reinforcement learning, federated learning, and graph neural networks for 5G/6G MEC systems.
Introduction
The growth of mobile data traffic presents one of the most pressing challenges in modern telecommunications. Video streaming alone accounts for over 70% of mobile traffic, with users increasingly expecting instant access to high-definition content anywhere, anytime. Traditional network architectures—designed around the principle of fetching content from distant servers on demand—simply cannot scale to meet this demand economically or sustainably.
Caching offers a fundamentally different paradigm: rather than repeatedly transmitting the same popular content across congested backhaul links, we store it closer to users during off-peak hours and serve requests locally during peak times. This simple idea powers content delivery networks (CDNs), browser caches, and increasingly, mobile edge computing (MEC) systems.
But conventional caching has a fundamental limitation: each cache operates independently. If User A and User B both want different files, and each file is partially cached at both users, conventional caching cannot exploit this. The server must transmit missing portions separately to each user.
Coded caching, pioneered by Mohammad Ali Maddah-Ali and Urs Niesen in their seminal 2014 paper "Fundamental Limits of Caching," revolutionizes this paradigm. Through clever use of coding theory, coded caching achieves something remarkable: a global caching gain that scales with the aggregate cache size across all users, even when users don't cooperate or communicate with each other.
The implications are profound. In a system with 30 users, each caching 10 out of 30 files, conventional uncoded caching requires transmitting 20 file-equivalents during peak hours. Coded caching reduces this to just 1.8 file-equivalents—an 11× improvement. This isn't a small optimization; it's a fundamental change in what's achievable.
The AI revolution takes coded caching from theory to practice. While information theory provides optimal schemes for idealized settings with known, static popularity distributions, real networks face:
- Dynamic popularity: Content trends change rapidly; viral videos emerge unpredictably
- Heterogeneous users: Different users have different preferences and access patterns
- Limited coordination: Practical systems can't implement the full complexity of optimal schemes
- Privacy constraints: Users may not want to share their viewing preferences
Modern AI/ML techniques—deep reinforcement learning for adaptive cache placement, federated learning for privacy-preserving popularity prediction, graph neural networks for modeling user-content relationships—are essential for practical deployment.
This post provides a comprehensive technical exploration of coded caching: the information-theoretic foundations that explain why it works, the mathematical analysis of local and global caching gains, extensions to decentralized and multi-access settings, AI-driven optimization for practical deployment, and integration with 5G/6G mobile edge computing architectures.
Prerequisites: Information theory basics (entropy, mutual information), probability theory, familiarity with optimization concepts.
Key Papers:
- Fundamental Limits of Caching (Maddah-Ali & Niesen, 2014)
- Learning to Code: Coded Caching via Deep RL
- Graph Federated Learning for Proactive Caching (2025)
- Coded Caching Optimization via A2C (2025)
Part I: The Information-Theoretic Foundations
The Caching Problem: A Precise Formulation
Consider a content delivery system with the following components:
- A server with access to a library of N files W₁, W₂, ..., Wₙ, each of size F bits
- K users, each equipped with a cache capable of storing MF bits (equivalently, M complete files)
- A shared broadcast link connecting the server to all users
The system operates in two distinct phases:
Placement Phase (Off-peak hours): The server fills user caches without knowing what files users will request later. This constraint is crucial—if the server knew future requests, optimal caching would be trivial (just cache what will be requested).
Delivery Phase (Peak hours): Each user k requests exactly one file W_{dₖ}. The server must transmit enough information over the shared link for every user to recover their requested file, using both the transmission and their cached content.
The goal: Minimize the rate R—the number of file-equivalents transmitted during delivery. Lower rate means less backhaul congestion, lower latency, and better scalability.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE CODED CACHING SYSTEM MODEL │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ SERVER │ │
│ │ │ │
│ │ File Library: │ │
│ │ W₁, W₂, ..., Wₙ │ │
│ │ │ │
│ │ Each file: F bits │ │
│ │ Total: N files │ │
│ └──────────┬──────────┘ │
│ │ │
│ PHASE 1: PLACEMENT │
│ (Server fills caches │
│ without knowing │
│ future requests) │
│ │ │
│ PHASE 2: DELIVERY │
│ (Server broadcasts │
│ to satisfy all │
│ requests) │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ │ SHARED LINK │ │
│ │ Rate = R files │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ USER 1 │ │ USER 2 │ │ USER K │ │
│ │ │ │ │ │ │ │
│ │ Cache: │ │ Cache: │ │ Cache: │ │
│ │ MF bits │ │ MF bits │ │ MF bits │ │
│ │ │ │ │ │ │ │
│ │ Request: │ │ Request: │ │ Request: │ │
│ │ d₁ │ │ d₂ │ │ dₖ │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FUNDAMENTAL PARAMETERS: │
│ │
│ • N: Number of files in the library │
│ • K: Number of users in the system │
│ • M: Normalized cache size (can store M file-equivalents) │
│ • F: Size of each file in bits │
│ • R: Delivery rate (file-equivalents transmitted) │
│ │
│ CONSTRAINTS: │
│ • 0 ≤ M ≤ N (cache size between empty and full library) │
│ • Placement is UNCODED: caches store actual file bits │
│ • Delivery CAN BE CODED: XOR combinations allowed │
│ │
│ THE QUESTION: What is the minimum achievable rate R for given N,K,M? │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Uncoded Caching: The Conventional Approach
Before exploring coded caching, let's understand what conventional (uncoded) caching achieves.
Strategy: Each user independently caches portions of files, typically the most popular ones. When a user requests a file, they receive what's already in their cache locally, and the server transmits only the missing portions.
Best case: All users request files that are fully cached → R = 0
Worst case (distinct requests, nothing useful cached): Server must transmit complete files → R = K
Average case (assuming uniform popularity): Each user has cached a fraction M/N of each file. The missing fraction is 1 − M/N. With K users making independent requests:

R_uncoded = K(1 − M/N)
This formula reveals the local caching gain: each user benefits from their own cache, reducing their individual demand by the factor 1 − M/N. But there's no benefit from what other users have cached—the caches operate in isolation.
Example: With K = 30 users, N = 30 files, and M = 10:

R_uncoded = 30 × (1 − 10/30) = 20
Even with substantial caching (each user stores 1/3 of the library), we still need to transmit 20 file-equivalents during peak hours.
The Maddah-Ali & Niesen Breakthrough
The key insight of Maddah-Ali and Niesen was that caches can be coordinated—not through user cooperation, but through clever placement design—so that a single coded transmission simultaneously benefits multiple users.
The Central Idea: Split each file into carefully designed pieces. Place different pieces at different users such that when the server transmits coded combinations (XORs) of file pieces, each user can decode the piece they need using other pieces they already have cached.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE MADDAH-ALI & NIESEN SCHEME: DETAILED EXAMPLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SETUP: K = 3 users, N = 3 files (A, B, C), M = 1 (each caches 1 file) │
│ │
│ STEP 1: COMPUTE THE CACHING RATIO │
│ │
│ t = KM/N = (3 × 1)/3 = 1 │
│ │
│ This parameter t determines the subpacketization: each file │
│ is split into C(K,t) = C(3,1) = 3 pieces. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 2: PLACEMENT PHASE │
│ │
│ Split each file into 3 pieces, indexed by which user caches them: │
│ │
│ File A = { A₁, A₂, A₃ } where Aᵢ is cached by user i │
│ File B = { B₁, B₂, B₃ } │
│ File C = { C₁, C₂, C₃ } │
│ │
│ Cache contents after placement: │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ USER 1 │ │ USER 2 │ │ USER 3 │ │
│ │ │ │ │ │ │ │
│ │ Caches: │ │ Caches: │ │ Caches: │ │
│ │ A₁, B₁, C₁ │ │ A₂, B₂, C₂ │ │ A₃, B₃, C₃ │ │
│ │ │ │ │ │ │ │
│ │ (1/3 of │ │ (1/3 of │ │ (1/3 of │ │
│ │ each file) │ │ each file) │ │ each file) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Each user caches 1/3 of A + 1/3 of B + 1/3 of C = 1 file total ✓ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 3: DELIVERY PHASE │
│ │
│ Suppose: User 1 wants A, User 2 wants B, User 3 wants C │
│ (Worst case: all different requests) │
│ │
│ What each user NEEDS (doesn't have cached): │
│ • User 1 needs: A₂, A₃ (has A₁, wants complete A) │
│ • User 2 needs: B₁, B₃ (has B₂, wants complete B) │
│ • User 3 needs: C₁, C₂ (has C₃, wants complete C) │
│ │
│ UNCODED approach: Send A₂, A₃, B₁, B₃, C₁, C₂ = 6 pieces = 2 files │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CODED DELIVERY (the magic): │
│ │
│ Transmission 1: X₁ = A₂ ⊕ B₁ │
│ │
│ • User 1 receives X₁, has B₁ cached │
│ → Computes A₂ = X₁ ⊕ B₁ ✓ │
│ │
│ • User 2 receives X₁, has A₂ cached │
│ → Computes B₁ = X₁ ⊕ A₂ ✓ │
│ │
│ ONE transmission satisfies TWO users! │
│ │
│ Transmission 2: X₂ = A₃ ⊕ C₁ │
│ • User 1 gets A₃ (has C₁), User 3 gets C₁ (has A₃) │
│ │
│ Transmission 3: X₃ = B₃ ⊕ C₂ │
│ • User 2 gets B₃ (has C₂), User 3 gets C₂ (has B₃) │
│ │
│ TOTAL: 3 pieces transmitted = 1 file worth of data! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON: │
│ │
│ Uncoded rate: R = K(1 - M/N) = 3(1 - 1/3) = 2 files │
│ Coded rate: R = K(1 - M/N)/(1 + KM/N) = 2/(1 + 1) = 1 file │
│ │
│ Improvement factor: 2× (global caching gain = 1 + t = 2) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
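The coded delivery in the example above can be checked mechanically. A minimal Python sketch, with short byte strings standing in for the three files (all variable names are ours):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Three files of F = 6 bytes, each split into 3 pieces of 2 bytes.
files = {"A": b"aaAAaa", "B": b"bbBBbb", "C": b"ccCCcc"}
pieces = {n: [f[0:2], f[2:4], f[4:6]] for n, f in files.items()}

# Placement: user i caches piece i of every file (users 0-indexed here).
cache = {i: {n: pieces[n][i] for n in files} for i in range(3)}

# Worst-case demand: all users want different files.
demand = {0: "A", 1: "B", 2: "C"}

# Coded delivery: one XOR per pair of users -> 3 transmissions of
# 2 bytes each = 6 bytes = exactly one file-equivalent.
tx = {(i, j): xor(pieces[demand[i]][j], pieces[demand[j]][i])
      for i in range(3) for j in range(3) if i < j}

# Decoding: user i recovers missing piece j by XORing transmission
# {i, j} with the cached piece i of user j's requested file.
for i in range(3):
    got = [None, None, None]
    got[i] = cache[i][demand[i]]
    for j in range(3):
        if j != i:
            t = tx[(min(i, j), max(i, j))]
            got[j] = xor(t, cache[i][demand[j]])
    assert b"".join(got) == files[demand[i]]

print("3 transmissions of 1/3 file each -> rate R = 1")
```

The assertions confirm that each user reconstructs its full file from one cached piece plus two XOR transmissions, matching the rate-1 delivery in the diagram.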
The General Scheme and Achievable Rate
The example above illustrates the principle; the general scheme works as follows:
Placement: Divide each file into C(K, t) subfiles, where t = KM/N is the caching ratio. Index each subfile by a subset S ⊆ {1, ..., K} of size t:

Wₙ = { Wₙ,S : S ⊆ {1, ..., K}, |S| = t }
User k caches all subfiles whose index contains k:

Zₖ = { Wₙ,S : k ∈ S, n = 1, ..., N }
Delivery: For each subset T ⊆ {1, ..., K} of size t + 1, transmit the coded message:

X_T = ⊕_{k ∈ T} W_{dₖ, T∖{k}}
Each user k ∈ T needs W_{dₖ, T∖{k}} and has all other terms in the XOR cached, so they can decode their needed piece by XORing X_T with cached content.
Achievable Rate:

R_coded(M) = K(1 − M/N) · 1/(1 + KM/N)
This beautiful formula reveals the structure of the gain.
Local vs. Global Caching Gain: A Deep Dive
The coded caching rate can be decomposed to understand where the gain comes from:
Local Caching Gain (1 − M/N): This is what conventional caching provides. Each user caches a fraction M/N of the library, so they need only the fraction 1 − M/N from the network. This gain is additive across users—it reduces each user's demand independently.
Global Caching Gain (1/(1 + KM/N)): This is unique to coded caching. It represents the multicast opportunity—each coded transmission benefits 1 + KM/N users simultaneously. This gain is multiplicative and scales with the aggregate cache size KM, not the individual cache size M.
┌─────────────────────────────────────────────────────────────────────────┐
│ UNDERSTANDING LOCAL vs GLOBAL CACHING GAIN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM: K = 30 users, N = 30 files, M = 10 (33% cache ratio) │
│ │
│ LOCAL CACHING GAIN (what conventional caching provides): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Each user caches 10/30 = 33% of each file. │
│ They need 67% of their requested file from the network. │
│ │
│ Without local caching: Each user needs 1 full file → 30 files total │
│ With local caching: Each user needs 0.67 file → 20 files total │
│ │
│ Local gain = 30/20 = 1.5× │
│ │
│ This gain is INDEPENDENT across users—my cache helps me, not you. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GLOBAL CACHING GAIN (unique to coded caching): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Caching ratio t = KM/N = 30 × 10 / 30 = 10 │
│ Global gain = 1 + t = 1 + 10 = 11× │
│ │
│ Each coded transmission benefits t + 1 = 11 users simultaneously! │
│ │
│ This is because: │
│ • When server sends X = A₁ ⊕ B₂ ⊕ C₃ ⊕ ... (11 terms) │
│ • User wanting A has all terms except A₁ cached │
│ • User wanting B has all terms except B₂ cached │
│ • ... and so on for all 11 users in this group │
│ │
│ The server makes ONE transmission; 11 users each extract their need. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMBINED EFFECT: │
│ │
│ Uncoded rate: R = K(1 - M/N) = 30 × 0.67 = 20 files │
│ Coded rate: R = 20 / (1 + 10) = 20/11 ≈ 1.8 files │
│ │
│ Total improvement: 20 / 1.8 ≈ 11× better than uncoded! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE REMARKABLE INSIGHT: │
│ │
│ Global gain scales with AGGREGATE cache size KM, not individual M. │
│ │
│ Double the number of users K (same M per user)? │
│ → Global gain INCREASES (more multicast opportunity) │
│ → Even though total demand also increases │
│ → Net effect: Rate grows sublinearly with K │
│ │
│ This is counter-intuitive: More users → More efficient delivery! │
│ │
│ In conventional systems, more users = more load = worse performance. │
│ In coded caching, more users = more coding opportunity = better gain. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Information-Theoretic Optimality
A natural question: Is the Maddah-Ali & Niesen scheme optimal? Can we do better?
The authors proved a converse bound—a lower limit on the rate that any scheme must satisfy:

R*(M) ≥ max_{s ∈ {1, ..., min(N,K)}} ( s − (s/⌊N/s⌋) · M )
The gap between achievable rate and lower bound is at most a factor of 12 for all parameter values. This means coded caching is order-optimal: we cannot hope to do fundamentally better (by more than a constant factor).
Converse Bound Derivation (Cut-Set Argument)
The converse uses a cut-set bound argument, a fundamental technique in network information theory. Here's the rigorous derivation:
Setup: Consider any subset S ⊆ {1, ..., K} of s users. We'll lower bound the rate by considering what information must cross the "cut" separating the server from these users.
Key Insight: Consider the worst-case demand where each user in S requests a distinct file (possible whenever s ≤ N). The server must transmit enough information so that together with the cached content, each user can decode their file.
Step 1: Total Information Requirement
The s users collectively need s complete files (sF bits). They collectively have cached at most sM files worth of content (sMF bits). Therefore, the transmission must provide at least:

RF ≥ sF − sMF  ⟹  R ≥ s(1 − M)
But this ignores the structure. A tighter bound uses the following:
Step 2: Cut-Set Entropy Bound
Let Z_S denote the caches of users in S, and X denote the transmitted message. For the users to decode their requested files W_{d_S} = (W_{dₖ})_{k ∈ S}, we need:

H(W_{d_S} | X, Z_S) = 0

By the chain rule and independence of files:

sF = H(W_{d_S}) = I(W_{d_S}; X, Z_S) + H(W_{d_S} | X, Z_S) = I(W_{d_S}; X, Z_S)
Step 3: Bounding the Mutual Information
The transmission entropy satisfies H(X) ≤ RF, and the cache entropy is bounded by:

H(Z_S) ≤ sMF

Thus:

sF = I(W_{d_S}; X, Z_S) ≤ H(X, Z_S) ≤ H(X) + H(Z_S) ≤ RF + sMF

which recovers R ≥ s(1 − M).
Step 4: Accounting for Broadcast Advantage
The above bound doesn't exploit the fact that X is a broadcast—all users receive it, and the same caches must serve many successive demands sent over it. Consider ⌊N/s⌋ demand rounds in which the s users request s⌊N/s⌋ distinct files in total, served by broadcast transmissions X₁, ..., X_{⌊N/s⌋}.

All s⌊N/s⌋ requested files must be decodable from these transmissions together with the fixed caches Z_S:

s⌊N/s⌋ F ≤ H(X₁, ..., X_{⌊N/s⌋}) + H(Z_S) ≤ ⌊N/s⌋ RF + sMF

Rearranging yields:

R ≥ s − (s/⌊N/s⌋) M
Step 5: Optimizing Over s
Since this bound holds for any subset size s ∈ {1, ..., min(N, K)}, we take the maximum:

R*(M) ≥ max_{s ∈ {1, ..., min(N,K)}} ( s − (s/⌊N/s⌋) M )
Tightness Analysis: Treat s as continuous and approximate ⌊N/s⌋ ≈ N/s, so the bound becomes f(s) = s − (s²/N)M. Taking the derivative:

f′(s) = 1 − 2sM/N

Since f′(s) > 0 for s < N/(2M), the maximum is achieved near s* = N/(2M).
The Optimality Gap: Comparing achievable rate to lower bound:

R_coded(M) / R*(M) ≤ 12 for all N, K, M

At the extreme points M = 0 (with K ≤ N) and M = N, the achievable rate meets the bound and the ratio is exactly 1—the MAN scheme is optimal there.
More recent work has tightened this gap for many parameter regimes, showing that the MAN scheme is often exactly optimal or within a factor of 2.
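The cut-set bound and the achievable rate can be compared numerically; a sketch using the formulas above (function names are ours):

```python
from math import floor

def cutset_lower_bound(K: int, N: int, M: float) -> float:
    """Cut-set bound: R*(M) >= max_s ( s - s*M / floor(N/s) )."""
    return max(s - s * M / floor(N / s) for s in range(1, min(N, K) + 1))

def man_rate(K: int, N: int, M: float) -> float:
    return K * (1 - M / N) / (1 + K * M / N)

K, N, M = 30, 30, 10
lb, ach = cutset_lower_bound(K, N, M), man_rate(K, N, M)
print(lb, ach, ach / lb)  # gap stays well inside the factor-12 guarantee
```

For this example the ratio is under 3, illustrating that the factor-12 guarantee is conservative for many parameter choices.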
Part II: Extensions and Practical Variants
Decentralized Coded Caching
The centralized MAN scheme requires the server to coordinate placement across all users—each user must cache specific subfiles determined by a global plan. This is impractical for large systems where users join and leave dynamically.
Decentralized coded caching (also by Maddah-Ali and Niesen) removes this requirement: each user independently caches a random fraction of each file. Remarkably, coded delivery still works!
Placement: Each user independently caches each bit of each file with probability M/N, filling their MF-bit cache in expectation.
Delivery: The server identifies which users have which bits cached (through feedback), then constructs coded transmissions that exploit random overlaps.
Achievable Rate (for large file size F):

R_dec(M) = (N/M − 1) · (1 − (1 − M/N)^K)
Remarkably, the decentralized scheme achieves essentially the same rate as the coordinated scheme, with only a vanishing penalty as file size grows.
Why it works: With high probability, the random placement creates sufficient overlap structure: any subset of users will have complementary cached content that enables coding. The law of large numbers ensures this works well for large files.
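The centralized and decentralized rates can be compared numerically (a sketch using the formulas above; function names are ours):

```python
def man_rate(K: int, N: int, M: float) -> float:
    """Centralized MAN rate."""
    return K * (1 - M / N) / (1 + K * M / N)

def decentralized_rate(K: int, N: int, M: float) -> float:
    """Decentralized rate: (N/M - 1) * (1 - (1 - M/N)^K)."""
    q = M / N
    return (1 / q - 1) * (1 - (1 - q) ** K)

K, N = 30, 30
for M in (3, 10, 20):
    print(M, round(man_rate(K, N, M), 3),
          round(decentralized_rate(K, N, M), 3))
```

The decentralized rate is slightly higher at every cache size, but both retain the global gain and stay far below the uncoded rate K(1 − M/N).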
Multi-Access Coded Caching
In practical networks, users often have access to multiple caches—for example, a user at the boundary between two cells might connect to both base stations.
Multi-access coded caching exploits this:
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-ACCESS CODED CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SERVER │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Cache 1 │ │Cache 2 │ │Cache 3 │ │
│ │ (C₁) │ │ (C₂) │ │ (C₃) │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ │ ┌──────────┼───────────┐ │ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ User 1 │ │ User 2 │ │ User 3 │ │
│ │ │ │ │ │ │ │
│ │ Access: │ │ Access: │ │ Access: │ │
│ │ C₁, C₂ │ │ C₂, C₃ │ │ C₁, C₃ │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Each user can retrieve content from L = 2 caches. │
│ │
│ BENEFITS: │
│ • Effective cache size per user increases │
│ • More coding opportunities from overlapping access │
│ • Natural model for cell-edge users │
│ │
│ The MAN scheme extends to this setting with modified placement │
│ that accounts for shared cache access patterns. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The multi-access setting offers additional gains because users can combine content from multiple sources, increasing effective cache size and creating new coding opportunities.
Device-to-Device (D2D) Coded Caching
In D2D coded caching, users can also receive content from nearby users through direct device-to-device links. This adds a spatial dimension:
- Server broadcasts reach all users (but limited capacity)
- D2D transmissions have limited range but can occur simultaneously in different locations
The combination of coded caching with D2D spatial reuse can provide multiplicative gains: coding gain from cached content coordination, plus spatial reuse gain from simultaneous D2D transmissions in different areas.
Hierarchical Coded Caching
Real networks often have multiple cache layers:
- Core caches: Large, centrally located (CDN servers)
- Edge caches: Medium-sized, at base stations (MEC)
- User caches: Small, on end devices
Hierarchical coded caching applies the coded caching principle at each layer, with careful coordination to avoid redundant caching. Content that's already coded for delivery at one layer need not be re-cached at the layer below.
The analysis becomes complex, but the fundamental insight remains: coordinated placement plus coded delivery provides multiplicative gains at each layer, compounding through the hierarchy.
Part III: Challenges in Practical Deployment
The Subpacketization Problem
The MAN scheme requires splitting each file into C(K, t) subfiles. For realistic parameters, this number explodes:
- K = 100 users, M/N = 0.1 (cache 10% of library), so t = KM/N = 10
- Subfiles per file: C(100, 10) ≈ 1.7 × 10¹³
Each subfile needs a unique index, and the delivery phase must identify and combine specific subfiles. This is clearly impractical.
Solutions:
- Placement Delivery Arrays (PDAs): Combinatorial designs that achieve coded caching gain with polynomial subpacketization
- User grouping: Divide users into smaller groups, apply coded caching within groups
- Approximate schemes: Accept some loss in coding gain for practical subpacketization
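The explosion, and the effect of the user-grouping workaround, can be seen with a two-line computation (the grouping numbers are illustrative, not a specific paper's scheme):

```python
from math import comb

# Subpacketization C(K, t) for the full MAN scheme.
K, cache_ratio = 100, 0.1
t = round(K * cache_ratio)  # t = KM/N = 10
print(f"{comb(K, t):.2e} subfiles per file for K = 100")

# Grouping: 10 groups of 10 users, MAN applied per group (t = 1 inside),
# at the cost of a smaller per-group coding gain.
print(comb(10, 1), "subfiles per file with 10 groups of 10 users")
```

Grouping cuts subpacketization from ~10¹³ to 10, while the coding gain shrinks from t + 1 = 11 system-wide to 2 within each group.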
Placement Delivery Arrays: A Rigorous Construction
A Placement Delivery Array (PDA) is a combinatorial structure that specifies both cache placement and coded delivery with controlled subpacketization.
Definition: A (K, F, Z, S)-PDA is a K × F array where:
- K = number of users (one row per user)
- F = number of subfiles per file (subpacketization; one column per subfile)
- Z = number of entries in each row that are "*" (determines cache size)
- S = number of distinct integers used (determines delivery transmissions)
The array entries are either "*" (indicating cached content) or integers from {1, ..., S}.
Properties:
- Each row contains exactly Z stars → cache size M/N = Z/F
- Each integer appears at most once in each row
- If P(k₁, f₁) = P(k₂, f₂) = s with k₁ ≠ k₂, then P(k₁, f₂) and P(k₂, f₁) are both "*": the users sharing integer s form a valid coding group, since each already caches the others' pieces
Example: A (4, 4, 2, 4)-PDA for K = 4 Users

One array satisfying the definition (rows = users, columns = subfiles):

           f=1   f=2   f=3   f=4
  User 1:   *     *     1     2
  User 2:   3     *     *     4
  User 3:   1     2     *     *
  User 4:   *     3     4     *

Interpretation:
- Rows = Users (1, 2, 3, 4)
- Columns = Subfiles (Wₙ,₁, ..., Wₙ,₄ for each file n)
- "*" = User caches this subfile
- Integer s = This subfile is delivered in transmission s

Cache contents (from stars):
- User 1 caches: Wₙ,₁, Wₙ,₂ for all n → 2/4 = 50% of each file
- User 2 caches: Wₙ,₂, Wₙ,₃ for all n
- User 3 caches: Wₙ,₃, Wₙ,₄ for all n
- User 4 caches: Wₙ,₄, Wₙ,₁ for all n

Delivery (for demand d = (d₁, d₂, d₃, d₄)):

Transmission 1 (integer 1 appears in positions (1,3) and (3,1)):
X₁ = W_{d₁,3} ⊕ W_{d₃,1}
User 1 wants W_{d₁,3}, has W_{d₃,1} cached (subfile 1 of every file)
User 3 wants W_{d₃,1}, has W_{d₁,3} cached (subfile 3 of every file)

Similar for transmissions 2, 3, 4:
X₂ = W_{d₁,4} ⊕ W_{d₃,2},  X₃ = W_{d₂,1} ⊕ W_{d₄,2},  X₄ = W_{d₂,4} ⊕ W_{d₄,3}

Rate achieved: R = S/F = 4/4 = 1 file
Comparison to MAN scheme: For K = 4, M/N = 1/2:
- MAN scheme (t = 2): C(4, 2) = 6 subfiles, R = 2/3
- PDA scheme: 4 subfiles, R = 1
The PDA trades slightly higher rate (1 vs 2/3) for much lower subpacketization (4 vs 6).
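The PDA conditions can be verified mechanically. Below is a concrete (4, 4, 2, 4) array consistent with the example's transmission positions, checked against the definition (the array instance is ours):

```python
# Rows = users, columns = subfiles; '*' marks cached subfiles and
# integer s marks content sent in transmission s.
P = [
    ['*', '*',  1,   2 ],
    [ 3,  '*', '*',  4 ],
    [ 1,   2,  '*', '*'],
    ['*',  3,   4,  '*'],
]
K, F, Z, S = 4, 4, 2, 4

# Each row has exactly Z stars -> cache ratio M/N = Z/F = 1/2.
assert all(row.count('*') == Z for row in P)

# Each integer appears at most once per row.
for row in P:
    ints = [e for e in row if e != '*']
    assert len(ints) == len(set(ints))

# Coding condition: if P[k1][f1] == P[k2][f2] == s (k1 != k2), then
# P[k1][f2] and P[k2][f1] are stars, so both users can decode.
for s in range(1, S + 1):
    cells = [(k, f) for k in range(K) for f in range(F) if P[k][f] == s]
    for (k1, f1) in cells:
        for (k2, f2) in cells:
            if k1 != k2:
                assert P[k1][f2] == '*' and P[k2][f1] == '*'

print("valid PDA: rate R = S/F =", S / F)
```

All assertions pass, confirming the array delivers every demand with S = 4 transmissions of size F⁻¹ file each.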
General Constructions: Several PDA families achieve polynomial subpacketization:
- Maddah-Ali Niesen PDA: F = C(K, t), Z = C(K−1, t−1), S = C(K, t+1) (optimal rate, exponential F)
- Resolvable design PDA: subexponential F, with a modest rate penalty
- Ruzsa-Szemerédi graph PDA: near-linear F in K, at the cost of a reduced (but still scaling) coding gain
The Fundamental Trade-off:
Lower subpacketization F generally requires a higher rate R; known impossibility results show the two cannot be made small simultaneously.
Recent work has reduced subpacketization from exponential to polynomial in while retaining most of the coding gain.
Non-Uniform Popularity
The MAN scheme assumes all files are equally likely to be requested. Real content follows highly skewed popularity distributions (Zipf's law):

pₙ = n^(−γ) / Σ_{m=1}^{N} m^(−γ)
where the exponent γ typically falls around 0.6–1.2 for video content (empirically validated across Netflix, YouTube, and CDN traces).
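A quick stdlib sketch of the Zipf pmf makes the skew concrete:

```python
# Zipf popularity: p_n proportional to n^(-gamma), n = 1..N.
def zipf_pmf(N: int, gamma: float) -> list:
    w = [n ** -gamma for n in range(1, N + 1)]
    total = sum(w)
    return [x / total for x in w]

p = zipf_pmf(N=100, gamma=0.8)
print(round(sum(p), 6))  # pmf sums to 1
print(p[0] / p[-1])      # most popular file vs least popular
```

Even at a moderate γ = 0.8, the head of the distribution is requested tens of times more often than the tail.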
Implication: Most requests concentrate on a few popular files. Should we:
- Cache popular files completely (maximize local hits)?
- Cache diverse content (maximize coding opportunities)?
Optimal Placement Under Zipf: A Convex Optimization
For non-uniform popularity, the optimal cache allocation can be formulated as a convex optimization problem.
Decision Variables: Let qₙ ∈ [0, 1] denote the fraction of file n cached at each user.
Constraint: Total cache size is M files:

Σ_{n=1}^{N} qₙ ≤ M

Objective: Minimize expected delivery rate. For uncoded caching:

E[R] = K Σ_{n=1}^{N} pₙ (1 − qₙ)
For coded caching with non-uniform popularity, the rate expression becomes more complex. A tractable approximation uses the grouped coded caching framework:
Grouped Coded Caching Formulation:
Partition files into G groups based on popularity. Within each group, apply coded caching with uniform treatment:

R ≈ Σ_{g=1}^{G} K_g (1 − q_g) / (1 + K_g q_g)

where K_g is the expected number of users requesting files in group g and q_g is the average cache fraction for files in group g.
Closed-Form Solution for Special Cases:
Case 1: Pure Local Caching (no coding gain exploited)
The optimal allocation follows a water-filling solution:

qₙ* = min( 1, [1 − λ/(K pₙ)]₊ )

where λ is the Lagrange multiplier set so that Σₙ qₙ* = M, and [x]₊ = max(x, 0).
Case 2: Two-Group Approximation
Partition files into "popular" (n ≤ n₀) and "unpopular" (n > n₀):
- Popular files: Cache completely (qₙ = 1)
- Unpopular files: Apply coded caching with qₙ = q for all n > n₀
The optimization becomes:

min_{n₀, q}  K P̄(n₀) (1 − q) / (1 + K P̄(n₀) q),  where P̄(n₀) = Σ_{n > n₀} pₙ

subject to n₀ + (N − n₀) q ≤ M, 0 ≤ q ≤ 1
The Critical Trade-off:
Fully caching a file forgoes the multicast opportunities it could have contributed to; this is its coding opportunity cost. Files with high pₙ benefit more from being fully cached (no coding), while files with moderate pₙ benefit from participating in coded delivery.
Numerical Example: For a Zipf-distributed library of N = 100 files:
- Optimal strategy caches top ~5 files completely
- Applies coded caching to files 6-40
- Ignores files 41-100 (too unpopular)
- Rate reduction vs. uniform MAN: ~25%
AI/ML contribution: Predicting popularity and optimally allocating cache space across files is precisely where machine learning excels.
Unknown and Time-Varying Popularity
Real popularity isn't just non-uniform—it's also unknown and changes over time. Viral content emerges suddenly; trending topics shift daily.
Online learning approaches:
- Bandit algorithms: Explore-exploit tradeoff for cache updates
- Prediction models: LSTM, transformers to forecast future popularity from past patterns
- Reinforcement learning: Learn cache replacement policies that adapt to popularity dynamics
This is where the information-theoretic foundations meet practical machine learning.
Part IV: AI/ML for Coded Caching
Why AI is Essential
Information theory tells us what is achievable—the minimum rate under idealized assumptions. But practical deployment requires:
- Predicting content popularity without complete information
- Adapting to time-varying demands in real-time
- Handling heterogeneous user preferences across locations
- Respecting privacy constraints in data collection
- Making decisions under computational constraints
AI/ML provides the tools to bridge theory and practice.
Deep Reinforcement Learning for Cache Placement
Cache placement can be formulated as a Markov Decision Process (MDP):
State: Current cache contents, recent request history, time of day, user context
Action: Which content to cache, which to evict, how to split cache across files
Reward: Negative delivery rate (or equivalently, cache hit rate, latency reduction, backhaul savings)
The challenge: The action space is enormous (all possible cache configurations), and the reward depends on future requests that are stochastic.
┌─────────────────────────────────────────────────────────────────────────┐
│ DEEP RL FOR CODED CACHING OPTIMIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE MDP FORMULATION: │
│ │
│ STATE SPACE S: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ s = ( C, H, P̂, τ ) │ │
│ │ │ │
│ │ C: Current cache configuration │ │
│ │ • Which files are cached │ │
│ │ • What fraction of each file │ │
│ │ • Which subfiles (for coded caching) │ │
│ │ │ │
│ │ H: Request history │ │
│ │ • Recent requests from each user │ │
│ │ • Temporal patterns (time of day, day of week) │ │
│ │ • User context (location, device type) │ │
│ │ │ │
│ │ P̂: Estimated popularity distribution │ │
│ │ • Current best estimate of file popularities │ │
│ │ • Uncertainty/confidence in estimates │ │
│ │ │ │
│ │ τ: Time features │ │
│ │ • Hour of day, day of week (cyclic encoding) │ │
│ │ • Special events (sports, news, holidays) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ACTION SPACE A: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ For conventional caching: │ │
│ │ • a = (evict_file, cache_file) discrete choices │ │
│ │ │ │
│ │ For coded caching: │ │
│ │ • a = (placement_strategy, subfile_allocation) │ │
│ │ • More complex: must maintain coding structure │ │
│ │ │ │
│ │ For practical systems: │ │
│ │ • a = (cache_probabilities) continuous vector │ │
│ │ • Probabilistic placement: cache file n with probability aₙ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ REWARD FUNCTION: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ r(s, a) = -R(s, a) (negative delivery rate) │ │
│ │ │ │
│ │ Or multi-objective: │ │
│ │ r = α·(cache_hit_rate) + β·(latency_reduction) + │ │
│ │ γ·(backhaul_savings) - λ·(update_cost) │ │
│ │ │ │
│ │ The reward captures both immediate benefit (hits on current │ │
│ │ requests) and long-term value (positioned for future demand). │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ALGORITHM: Advantage Actor-Critic (A2C) or Proximal Policy │
│ Optimization (PPO) work well for this continuous control problem. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS FROM RECENT RESEARCH (2025): │
│ │
│ • A2C-based coded caching achieves 15-25% rate reduction │
│ over heuristic policies (LRU, LFU) │
│ │
│ • Converges in ~5000 episodes for moderate-sized systems │
│ │
│ • Adapts automatically to popularity shifts within ~100 requests │
│ │
│ • Outperforms optimal static policy under time-varying popularity │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Bellman Equation for Optimal Caching
The optimal caching policy satisfies the Bellman optimality equation. For infinite-horizon discounted reward with discount factor γ ∈ [0, 1):
Value Function: The optimal value function V*(s) represents the maximum expected cumulative reward starting from state s:

V*(s) = max_a E[ r(s, a) + γ V*(s′) | s, a ]

Q-Function: The action-value function Q*(s, a) represents the value of taking action a in state s:

Q*(s, a) = E[ r(s, a) + γ max_{a′} Q*(s′, a′) | s, a ]
For Caching Specifically:
Let s = (C, H) be the state (cache configuration C, request history H), and a be the caching action.
The reward for cache configuration C when request d arrives is the (negative) fraction of the requested file that must be fetched over the link:

r(s, a) = −(1 − q_d)

where q_d is the cached fraction of file d. For coded caching, the reward becomes:

r(s, a) = −R(C, d)

where R(C, d) is the delivery rate given cache configuration C and demand vector d.
State Transition: Under demand distribution p(d):

P(s′ | s, a) = Σ_d p(d) · 1{ s′ = T(s, a, d) }

where T is the deterministic state transition function.
Function Approximation: For large state spaces, we approximate the value function and policy using neural networks. The A2C algorithm updates:
Critic (value function): minimize the TD loss

L(θ_V) = ( r + γ V_{θ_V}(s′) − V_{θ_V}(s) )²

Actor (policy): ascend the policy gradient

θ_π ← θ_π + α ∇_{θ_π} log π_{θ_π}(a | s) · A(s, a)

where the advantage A(s, a) = r + γ V_{θ_V}(s′) − V_{θ_V}(s).
Competitive Ratio and Regret Analysis
For online caching algorithms, we compare against the optimal offline algorithm (which knows all future requests) using competitive ratio and regret.
Competitive Ratio: An online algorithm ALG has competitive ratio ρ if:

Cost_ALG(σ) ≤ ρ · Cost_OPT(σ) + c

for all request sequences σ, where c is an additive constant.
Classical Results:
- LRU (Least Recently Used): Competitive ratio = k for cache size k
- LFU (Least Frequently Used): Competitive ratio = unbounded in the worst case
- Marker algorithm (randomized): Competitive ratio = O(log k)
- Lower bound: No deterministic algorithm achieves a ratio better than k
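For reference, the LRU policy analyzed above fits in a few lines; a minimal sketch:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache over hashable items."""
    def __init__(self, k: int):
        self.k, self.d = k, OrderedDict()

    def request(self, item) -> bool:
        """Returns True on a hit; evicts the least recently used on a miss."""
        hit = item in self.d
        if hit:
            self.d.move_to_end(item)          # refresh recency
        else:
            if len(self.d) >= self.k:
                self.d.popitem(last=False)    # evict LRU item
            self.d[item] = True
        return hit

c = LRUCache(k=2)
print([c.request(x) for x in "ABAB"])  # [False, False, True, True]
print([c.request(x) for x in "CAC"])   # adversarial pattern forces misses
```

The second request pattern hints at why LRU is only k-competitive: an adversary that always asks for the just-evicted item makes LRU miss on every request while the offline optimum misses far less often.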
Regret for Learning Algorithms: For RL-based caching, we measure regret—the difference between the cumulative reward of the optimal policy and that of the learning agent:

Regret(T) = Σ_{t=1}^{T} [ r(s_t, π*(s_t)) − r(s_t, a_t) ]

Sublinear Regret: A good learning algorithm achieves Regret(T) = o(T), meaning the average per-step regret vanishes as T → ∞.
Results for RL Caching:
For a contextual bandit formulation (stateless):

Regret(T) = O( √(N T log N) )

where N is the number of files.
For the full MDP formulation with finite state space S and action space A:

Regret(T) = Õ( √(|S| |A| T) )
Practical Implications:
- After T requests, the regret per request is ~ O(1/√T)
- RL policies become near-optimal within ~1000 requests for moderate-sized systems
- The learning phase cost is amortized over the deployment lifetime
Comparison: RL vs. Heuristics:
| Algorithm | Competitive Ratio | Adaptivity | Complexity |
|---|---|---|---|
| LRU | k | Low | O(1) per request |
| LFU | Unbounded | Medium | O(log N) per request |
| Static Optimal | 1 (known dist.) | None | Offline computation |
| RL (after training) | ~1.1–1.3 | High | One forward pass (NN) |
Deep Q-Networks (DQN) for Discrete Caching Decisions
For caching problems with discrete action spaces (cache/evict specific files), Deep Q-Networks (DQN) provide a powerful framework. DQN combines Q-learning with deep neural networks to handle large state spaces.
The DQN Algorithm
Core Idea: Approximate the optimal Q-function $Q^*(s,a)$ with a neural network $Q_\theta(s,a)$.
Loss Function: Minimize the temporal difference (TD) error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]$$

where $\mathcal{D}$ is the replay buffer storing past experiences, and $\theta^-$ are the target network parameters (updated slowly).
Key DQN Innovations for Stability:
- Experience Replay: Store transitions $(s, a, r, s')$ in a buffer $\mathcal{D}$ of fixed size. Sample mini-batches uniformly to break correlation between consecutive samples.
- Target Network: Maintain a separate target network $Q_{\theta^-}$ updated every $C$ steps: $\theta^- \leftarrow \theta$. Or with soft updates (Polyak averaging): $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$.
- $\epsilon$-Greedy Exploration: With probability $\epsilon$, select a random action; otherwise select $\arg\max_a Q_\theta(s,a)$. Anneal $\epsilon$ from 1.0 to 0.01 over training.
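The three mechanisms above fit in a few lines; the sketch below uses a random Q-table as a stand-in for the network (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 4
Q = rng.normal(size=(n_states, n_actions))   # online "network" Q_theta
Q_target = Q.copy()                          # target network Q_theta^-
gamma, tau, eps = 0.99, 0.001, 0.1

def td_target(r, s_next, done):
    # y = r + gamma * max_a' Q_target(s', a'); no bootstrap at episode end
    return r + (0.0 if done else gamma * Q_target[s_next].max())

def epsilon_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))  # explore
    return int(Q[s].argmax())                # exploit

# Polyak (soft) target update: theta^- <- tau*theta + (1-tau)*theta^-
Q_target = tau * Q + (1 - tau) * Q_target
```

In a real agent, `Q` and `Q_target` would be neural networks and `td_target` would feed the TD loss above; the structure of the computation is the same.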
DQN Variants for Caching
Double DQN: Reduces overestimation bias by decoupling action selection and evaluation:

$$y = r + \gamma\, Q_{\theta^-}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right)$$

instead of $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$.
Dueling DQN: Separates state value $V(s)$ from action advantage $A(s,a)$:

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$$

This architecture learns that some states are inherently good/bad regardless of action—useful for caching, where the cache state strongly determines performance.
Prioritized Experience Replay: Sample transitions with probability proportional to TD error:

$$P(i) = \frac{|\delta_i|^\alpha}{\sum_j |\delta_j|^\alpha}$$

where $\delta_i$ is the TD error and $\alpha$ controls prioritization strength.
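The PER sampling rule is a one-liner in practice; a sketch with made-up TD errors, including the importance-sampling weights that correct the bias prioritized sampling introduces:

```python
import numpy as np

# prioritized replay: P(i) proportional to |delta_i|^alpha
td_errors = np.array([0.1, 2.0, 0.5, 1.0])   # illustrative TD errors
alpha = 0.6
priorities = np.abs(td_errors) ** alpha
probs = priorities / priorities.sum()

# importance-sampling weights w_i = (N * P(i))^(-beta), normalized by max
beta = 0.4
weights = (len(probs) * probs) ** (-beta)
weights /= weights.max()
```

The transition with the largest TD error (here 2.0) gets the highest sampling probability and, correspondingly, the smallest importance weight.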
DQN Architecture for Caching
┌─────────────────────────────────────────────────────────────────────────┐
│ DQN ARCHITECTURE FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT STATE REPRESENTATION: │
│ │
│ s = [ cache_vector | popularity_estimate | temporal_features ] │
│ └─────────────┘ └──────────────────┘ └─────────────────┘ │
│ N-dim 0/1 N-dim float d_t-dim │
│ │
│ NETWORK ARCHITECTURE: │
│ │
│ State s ∈ ℝ^{N + N + d_t} │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Input Layer │ │
│ │ Dimension: 2N + d_t (e.g., 2×1000 + 24 = 2024) │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 1 │ │
│ │ • Linear(2024, 512) │ │
│ │ • ReLU activation │ │
│ │ • LayerNorm (optional) │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 2 │ │
│ │ • Linear(512, 256) │ │
│ │ • ReLU activation │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 3 │ │
│ │ • Linear(256, 128) │ │
│ │ • ReLU activation │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ┌───────┴───────┐ (Dueling architecture) │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Value Head │ │ Adv. Head │ │
│ │ Linear(128,│ │ Linear(128,│ │
│ │ 1) │ │ |A|) │ │
│ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ Q(s,a) = V(s) + (A(s,a) - mean(A)) │
│ │
│ OUTPUT: Q-values for each action │
│ Dimension: |A| = N (for single-file decisions) or 2N (cache/evict) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
DQN Hyperparameters for Caching (Recommended Values)
| Hyperparameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | $\eta$ | $10^{-4}$ | Adam optimizer |
| Discount factor | $\gamma$ | 0.99 | Long-term reward focus |
| Replay buffer size | $\lvert\mathcal{D}\rvert$ | $10^5$–$10^6$ | |
| Batch size | $B$ | 64-128 | |
| Target update frequency | $C$ | 1000 steps | Hard update |
| Soft update rate | $\tau$ | 0.001 | If using Polyak |
| Initial exploration | $\epsilon_{\text{start}}$ | 1.0 | |
| Final exploration | $\epsilon_{\text{end}}$ | 0.01 | |
| Exploration decay | - | Linear | Annealed over training steps |
| Hidden dimensions | - | [512, 256, 128] | |
| Prioritization exponent | $\alpha$ | 0.6 | For PER |
| Importance sampling | $\beta$ | 0.4 → 1.0 | Annealed |
Proximal Policy Optimization (PPO) for Continuous Cache Allocation
For continuous action spaces (allocating fractional cache to each file), Proximal Policy Optimization (PPO) is the state-of-the-art algorithm, combining sample efficiency with stable training.
The PPO Algorithm
PPO is an actor-critic method that constrains policy updates to prevent destructive large changes.
Policy (Actor): $\pi_\theta(a \mid s)$ outputs a probability distribution over actions. For continuous actions:

$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$$

where $\mu_\theta$ and $\sigma_\theta$ are neural networks.
Value Function (Critic): $V_\phi(s)$ estimates the expected return from state $s$.
PPO-Clip Objective: The key innovation is clipping the probability ratio to prevent large updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio
- $\hat{A}_t$ is the estimated advantage
- $\epsilon$ is the clip range (typically 0.2)
Generalized Advantage Estimation (GAE): Compute advantages using a weighted combination of n-step returns:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual.
Typically $\lambda = 0.95$ balances bias (low $\lambda$) and variance (high $\lambda$).
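In code, GAE reduces to a backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; a sketch over a toy three-step rollout (rewards and values below are made up):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout.

    `values` has length T+1: it includes the bootstrap value of the
    state after the last step. TD residuals: delta_t = r_t + g*V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # backward recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.2])   # V(s_0), ..., V(s_3)
adv = gae(rewards, values)
```

The last advantage equals its own TD residual, and earlier advantages fold in discounted future residuals.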
Full PPO Loss:

$$L(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1 L^{\text{VF}}(\phi) - c_2 H[\pi_\theta]$$

where:
- $L^{\text{VF}}(\phi) = \left(V_\phi(s_t) - \hat{R}_t\right)^2$ is the value function loss
- $H[\pi_\theta]$ is an entropy bonus for exploration
- $c_1$, $c_2$ are weighting coefficients
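The clipped surrogate itself is easy to sketch on a toy batch (the log-probabilities and advantages below are made up for illustration):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # probability ratio r_t(theta) = pi_new / pi_old, computed in log space
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # take the pessimistic (min) surrogate; negate because we minimize
    return -np.minimum(unclipped, clipped).mean()

logp_old = np.log(np.array([0.2, 0.5, 0.3]))
logp_new = np.log(np.array([0.4, 0.5, 0.1]))
adv = np.array([1.0, -0.5, 2.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

For the first sample the ratio is 2.0, so with a positive advantage the surrogate is clipped at 1.2; the clip is what keeps a single update from moving the policy too far.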
PPO Architecture for Cache Allocation
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO ACTOR-CRITIC FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SHARED FEATURE EXTRACTOR: │
│ │
│ State s ──► [ MLP: 2N+d_t → 256 → 256 ] ──► Features f(s) ∈ ℝ^{256} │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ACTOR (Policy Network): │
│ │
│ f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 2N) ] │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ │ │
│ ▼ ▼ │
│ μ(s) ∈ ℝ^N log σ(s) ∈ ℝ^N │
│ (mean) (log std dev) │
│ │ │ │
│ └─────────┬─────────┘ │
│ │ │
│ ▼ │
│ a ~ N(μ(s), exp(log σ(s))²) │
│ Cache allocation: a_n = sigmoid(a_n) × M/N │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CRITIC (Value Network): │
│ │
│ f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 1) ] ──► V(s) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING LOOP (per epoch): │
│ │
│ 1. Collect T timesteps using current policy π_θ │
│ 2. Compute advantages Â_t using GAE │
│ 3. For K epochs (typically K=10): │
│ a. Sample mini-batches from collected data │
│ b. Compute PPO-Clip loss │
│ c. Update θ, φ using Adam │
│ 4. Repeat │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PPO Hyperparameters for Caching (Recommended Values)
| Hyperparameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | $\eta$ | $3 \times 10^{-4}$ | Adam, can decay |
| Clip range | $\epsilon$ | 0.2 | PPO-Clip parameter |
| GAE lambda | $\lambda$ | 0.95 | Advantage estimation |
| Discount factor | $\gamma$ | 0.99 | |
| Value loss coefficient | $c_1$ | 0.5 | |
| Entropy coefficient | $c_2$ | 0.01 | Encourage exploration |
| Epochs per update | $K$ | 10 | Reuse collected data |
| Mini-batch size | $B$ | 64 | |
| Rollout length | $T$ | 2048 | Steps before update |
| Number of parallel envs | - | 8-16 | For variance reduction |
| Hidden dimensions | - | [256, 256, 128] | Shared + heads |
| Max gradient norm | - | 0.5 | Gradient clipping |
A2C vs. PPO: When to Use Which
| Criterion | A2C | PPO |
|---|---|---|
| Sample efficiency | Lower | Higher (reuses data) |
| Training stability | Moderate | High (clipping) |
| Wall-clock time | Faster per update | Slower per update |
| Hyperparameter sensitivity | Higher | Lower |
| Best for | Fast prototyping | Production deployment |
| Parallelization | Synchronous workers | Parallel envs |
Recommendation: Use PPO for production coded caching systems due to its stability. Use A2C for rapid experimentation.
Federated Learning for Privacy-Preserving Popularity Prediction
Predicting content popularity requires observing user behavior—but users may not want their viewing habits shared with a central server.
Federated learning solves this:
- Local training: Each edge server trains a popularity prediction model on local user data
- Gradient aggregation: Only model gradients (not raw data) are sent to the central server
- Global model update: Central server averages gradients to update a global model
- Model distribution: Updated global model is sent back to edges
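The aggregation in steps 2–3 can be sketched as plain FedAvg-style gradient averaging; `local_gradient` below is a hypothetical stand-in for training on an edge's private request log:

```python
import numpy as np

theta_global = np.zeros(4)   # global popularity-model parameters
eta = 0.1                    # server learning rate

def local_gradient(theta, edge_id):
    # hypothetical stand-in for a gradient computed on this edge's
    # private request log -- raw user data never leaves the edge
    rng_e = np.random.default_rng(edge_id)
    return theta - rng_e.normal(size=theta.shape)

# each of K=5 edges reports only its gradient; the server averages them:
# theta <- theta - eta * (1/K) * sum_i grad_i
grads = [local_gradient(theta_global, k) for k in range(5)]
theta_global = theta_global - eta * np.mean(grads, axis=0)
```

The server sees only the gradient average; adding calibrated noise to each `grad_i` before upload would give the differential-privacy variant discussed below.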
┌─────────────────────────────────────────────────────────────────────────┐
│ FEDERATED LEARNING FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ CENTRAL │ │
│ │ SERVER │ │
│ │ │ │
│ │ Global Model │ │
│ │ θ_global │ │
│ └───────┬───────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ Distribute θ Aggregate Distribute θ │
│ │ Gradients │ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌────────────┐ ┌──────┴──────┐ ┌────────────┐ │
│ │ EDGE 1 │ │ │ │ EDGE K │ │
│ │ │ │ Gradients │ │ │ │
│ │ Local Data │────│ ∇θ₁,...,∇θₖ│────│ Local Data │ │
│ │ (private) │ │ │ │ (private) │ │
│ │ │ │ │ │ │ │
│ │ Train on │ │ Aggregate: │ │ Train on │ │
│ │ local │ │ θ ← θ - │ │ local │ │
│ │ requests │ │ η·Σ∇θᵢ/K │ │ requests │ │
│ └────────────┘ └─────────────┘ └────────────┘ │
│ │
│ PRIVACY GUARANTEES: │
│ │
│ • Raw user data NEVER leaves the edge │
│ • Central server sees only aggregated gradients │
│ • With differential privacy noise, even gradients don't leak │
│ individual user preferences │
│ │
│ BENEFITS FOR CODED CACHING: │
│ │
│ • Learn global popularity patterns (what's trending everywhere) │
│ • Preserve local personalization (edge models fine-tune) │
│ • Scale to millions of users without central data collection │
│ • Comply with privacy regulations (GDPR, CCPA) │
│ │
│ RECENT WORK (Graph Federated Learning, arXiv Feb 2025): │
│ │
│ • Integrate Graph Neural Networks with Federated Learning │
│ • LightGCN captures user-content relationships locally │
│ • Global aggregation learns cross-user patterns │
│ • 20-30% improvement in cache hit rate over non-federated baseline │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Graph Neural Networks for User-Content Relationships
Content requests aren't independent—they form a graph structure:
- User-content edges: User $u$ requested content $c$
- Content-content edges: Contents $c_i$ and $c_j$ are often requested together
- User-user edges: Users $u_i$ and $u_j$ have similar preferences
Graph Neural Networks (GNNs) can learn representations that capture this structure, enabling powerful popularity prediction for cache placement.
The Bipartite User-Content Graph
Graph Definition: Let $G = (V, E)$ where:
- $U = \{u_1, \ldots, u_K\}$ are user nodes
- $C = \{c_1, \ldots, c_N\}$ are content nodes, with $V = U \cup C$
- $E \subseteq U \times C$ are edges
Adjacency Matrix: The bipartite adjacency matrix $A \in \{0,1\}^{(K+N) \times (K+N)}$:

$$A = \begin{pmatrix} 0 & R \\ R^\top & 0 \end{pmatrix}$$

where $R \in \{0,1\}^{K \times N}$ is the user-content interaction matrix.
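A quick sketch of assembling the block adjacency from a toy interaction matrix (3 users, 4 contents; the entries of `R` are illustrative):

```python
import numpy as np

K, N = 3, 4                                   # 3 users, 4 contents
R = np.array([[1, 0, 1, 0],                   # R[u, c] = 1 iff user u requested c
              [0, 1, 1, 0],
              [1, 1, 0, 1]])

# A = [[0, R], [R^T, 0]] -- users connect only to contents, and vice versa
A = np.block([[np.zeros((K, K)), R],
              [R.T, np.zeros((N, N))]])
```

The bipartite structure shows up as zero diagonal blocks, and the graph is undirected, so `A` is symmetric.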
GNN Message Passing Framework
The general GNN update rule for node $v$ at layer $l$:

$$h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)},\ \text{AGGREGATE}\left(\left\{h_u^{(l)} : u \in \mathcal{N}(v)\right\}\right)\right)$$
Different GNN architectures define different AGGREGATE and UPDATE functions.
Graph Convolutional Networks (GCN) for Caching
GCN Layer: Propagates information using the normalized adjacency:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where:
- $\tilde{A} = A + I$ (add self-loops)
- $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ (degree matrix)
- $W^{(l)}$ is learnable
Per-Node Update:

$$h_v^{(l+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_v \tilde{d}_u}}\, W^{(l)} h_u^{(l)}\right)$$

The normalization prevents nodes with high degree from dominating.
LightGCN: State-of-the-Art for Recommendation/Caching
LightGCN simplifies GCN by removing nonlinearities and feature transformations, which proves more effective for collaborative filtering:
Layer Propagation:

$$e_u^{(l+1)} = \sum_{c \in \mathcal{N}(u)} \frac{1}{\sqrt{|\mathcal{N}(u)|\,|\mathcal{N}(c)|}}\, e_c^{(l)}, \qquad e_c^{(l+1)} = \sum_{u \in \mathcal{N}(c)} \frac{1}{\sqrt{|\mathcal{N}(c)|\,|\mathcal{N}(u)|}}\, e_u^{(l)}$$

Final Embedding: Average over all layers:

$$e_u = \frac{1}{L+1} \sum_{l=0}^{L} e_u^{(l)}$$

Prediction: Request probability via dot product:

$$\hat{y}_{uc} = e_u^\top e_c$$

Loss Function: Bayesian Personalized Ranking (BPR):

$$\mathcal{L}_{\text{BPR}} = -\sum_{(u, c^+, c^-)} \ln \sigma\left(\hat{y}_{uc^+} - \hat{y}_{uc^-}\right) + \lambda \|\Theta\|^2$$

where $(u, c^+, c^-)$ are triplets with a positive (requested) and a negative (not requested) content.
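One propagation layer plus the layer-averaged prediction can be sketched directly in matrix form (toy graph, random initial embeddings; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, d = 3, 4, 8
R = np.array([[1, 0, 1, 0],                   # user-content interactions
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)
E_u = rng.normal(size=(K, d))                 # user embeddings, layer 0
E_c = rng.normal(size=(N, d))                 # content embeddings, layer 0

deg_u = R.sum(axis=1)                         # |N(u)|
deg_c = R.sum(axis=0)                         # |N(c)|
norm = R / np.sqrt(np.outer(deg_u, deg_c))    # 1 / sqrt(|N(u)| |N(c)|)

# one LightGCN layer: degree-normalized neighbor sum, no weights/nonlinearity
E_u1 = norm @ E_c                             # e_u^(1)
E_c1 = norm.T @ E_u                           # e_c^(1)

# final embeddings: average across layers (here layers 0 and 1)
e_u = (E_u + E_u1) / 2
e_c = (E_c + E_c1) / 2
scores = e_u @ e_c.T                          # y_uc = e_u . e_c for all pairs
```

The absence of weight matrices and activations is the whole point of LightGCN: propagation is a fixed linear smoothing, and only the layer-0 embeddings are learned (via the BPR loss above).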
Graph Attention Networks (GAT) for Caching
GAT learns attention weights to determine neighbor importance:
Attention Coefficient:

$$e_{vu} = \text{LeakyReLU}\left(\mathbf{a}^\top \left[W h_v \,\|\, W h_u\right]\right)$$

Normalized Attention:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}$$

Aggregation:

$$h_v' = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, W h_u\right)$$

Multi-Head Attention: Use $M$ parallel attention heads and concatenate:

$$h_v' = \Big\Vert_{m=1}^{M}\ \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(m)}\, W^{(m)} h_u\right)$$
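A sketch of the single-head attention computation for one node (weights are random, and `tanh` stands in for the nonlinearity $\sigma$; all of this is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))             # shared linear transform
a = rng.normal(size=2 * d_out)                 # attention vector

h_v = rng.normal(size=d_in)                    # center node features
neighbors = rng.normal(size=(3, d_in))         # features of N(v)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# e_vu = LeakyReLU(a^T [W h_v || W h_u]) for each neighbor u
Wh_v = h_v @ W
scores = np.array([leaky_relu(a @ np.concatenate([Wh_v, hu @ W]))
                   for hu in neighbors])

# softmax over the neighborhood gives alpha_vu
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# aggregation: h_v' = sigma(sum_u alpha_vu * W h_u)
h_v_new = np.tanh(alpha @ (neighbors @ W))
```

With multiple heads, this computation runs $M$ times with independent `W` and `a`, and the resulting `h_v_new` vectors are concatenated.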
Why GAT for Caching: Attention learns that some user-content interactions are more informative than others. A user who watched 1000 videos provides less signal per video than one who watched 10.
┌─────────────────────────────────────────────────────────────────────────┐
│ GNN ARCHITECTURE FOR CACHE PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: Bipartite graph G = (Users ∪ Content, Edges) │
│ │
│ User initial embeddings: E_U ∈ ℝ^{K × d}, d = 64 │
│ Content initial embeddings: E_C ∈ ℝ^{N × d} │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 0 (Initial) │ │
│ │ │ │
│ │ e_u^(0) = E_U[u] (learnable user embedding) │ │
│ │ e_c^(0) = E_C[c] (learnable content embedding) │ │
│ │ │ │
│ │ Optional: Concatenate content features (genre, duration, etc.) │ │
│ │ e_c^(0) = [E_C[c] || f_c] where f_c ∈ ℝ^{d_feat} │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 1 (First-hop neighbors) │ │
│ │ │ │
│ │ For each user u: │ │
│ │ e_u^(1) = AGG({ e_c^(0) : c ∈ N(u) }) │ │
│ │ │ │
│ │ For each content c: │ │
│ │ e_c^(1) = AGG({ e_u^(0) : u ∈ N(c) }) │ │
│ │ │ │
│ │ AGG = LightGCN (normalized sum) or GAT (attention-weighted) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 2 (Second-hop = user-user, content-content similarity) │ │
│ │ │ │
│ │ User embedding now contains info about: │ │
│ │ • Directly watched content (Layer 1) │ │
│ │ • Other users who watched same content (Layer 2) │ │
│ │ │ │
│ │ Content embedding now contains info about: │ │
│ │ • Users who watched it (Layer 1) │ │
│ │ • Other content those users watched (Layer 2) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FINAL EMBEDDINGS (LightGCN: average across layers) │ │
│ │ │ │
│ │ e_u = (e_u^(0) + e_u^(1) + e_u^(2)) / 3 │ │
│ │ e_c = (e_c^(0) + e_c^(1) + e_c^(2)) / 3 │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PREDICTION & CACHE PLACEMENT │ │
│ │ │ │
│ │ Popularity prediction for content c: │ │
│ │ p_c = Σ_u σ(e_u^T e_c) / K (average predicted probability) │ │
│ │ │ │
│ │ Cache placement: Sort by p_c, cache top M files │ │
│ │ │ │
│ │ For coded caching: Also consider embedding diversity │ │
│ │ Maximize: Σ p_c × cache_c + λ × diversity(cached_embeddings)│ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ GNN HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Embedding dimension d: 64 │
│ Number of layers L: 2-3 │
│ Learning rate: 1e-3 │
│ Batch size: 2048 (user-content pairs) │
│ Negative samples per positive: 4 │
│ Regularization λ: 1e-4 │
│ Optimizer: Adam │
│ Training epochs: 100-500 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
For coded caching, GNN-based popularity prediction informs placement decisions. Content with similar embeddings should be cached together (they'll be requested by similar users). Content with complementary embeddings should be spread across caches (for coding opportunities).
Attention Mechanisms for Content Popularity Prediction
Beyond GNNs, attention mechanisms can directly model user-content interactions without explicit graph structure.
Self-Attention for Request Sequences
Model each user's request history as a sequence and use self-attention to capture patterns:
Input: User $u$'s request history $(c_1, c_2, \ldots, c_T)$
Embedding: $X = [e_{c_1}; e_{c_2}; \ldots; e_{c_T}] \in \mathbb{R}^{T \times d}$
Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
Multi-Head Attention:

$$\text{MultiHead}(X) = \left[\text{head}_1 \,\|\, \cdots \,\|\, \text{head}_h\right] W_O$$

where $\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$
Position Encoding: Add positional information to distinguish recent vs. older requests:

$$x_t = e_{c_t} + \text{PE}(t), \qquad \text{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \quad \text{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

For temporal patterns in caching, we can also add a time-of-day encoding:

$$\text{TE}(t) = \left[\sin\!\left(\tfrac{2\pi h_t}{24}\right), \cos\!\left(\tfrac{2\pi h_t}{24}\right), \sin\!\left(\tfrac{2\pi d_t}{7}\right), \cos\!\left(\tfrac{2\pi d_t}{7}\right)\right]$$

where $h_t$ is hour of day and $d_t$ is day of week.
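Both encodings are a few lines; the cyclic time encoding is what makes 23:00 and 01:00 neighbors in feature space, which a raw hour number would not:

```python
import numpy as np

def positional_encoding(t, d):
    # sinusoidal PE: even slots sin, odd slots cos, geometric frequencies
    i = np.arange(d // 2)
    angles = t / (10000 ** (2 * i / d))
    pe = np.zeros(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def temporal_encoding(hour, weekday):
    # cyclic encoding of hour-of-day and day-of-week
    return np.array([np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
                     np.sin(2 * np.pi * weekday / 7), np.cos(2 * np.pi * weekday / 7)])

pe = positional_encoding(t=5, d=8)
te_23, te_01 = temporal_encoding(23, 0), temporal_encoding(1, 0)
close = np.linalg.norm(te_23 - te_01)                  # 23:00 vs 01:00
far = np.linalg.norm(te_23 - temporal_encoding(11, 0)) # 23:00 vs 11:00
```

Here `close < far`: the wrap-around at midnight is handled by the sine/cosine pair, which matters for diurnal request patterns.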
Cross-Attention for User-Content Matching
Use cross-attention to match user preferences with content features:
User Query: $q_u$ = user embedding from a GNN or MLP. Content Keys/Values: $K, V$ = content embeddings for all content:

$$z_u = \text{softmax}\left(\frac{q_u K^\top}{\sqrt{d}}\right) V$$

This gives a user representation $z_u$ that is a weighted combination of content embeddings, weighted by relevance.
Transformer-Based Popularity Prediction
Full Transformer architectures can model complex temporal patterns in content requests.
Transformer Encoder for Request Prediction
┌─────────────────────────────────────────────────────────────────────────┐
│ TRANSFORMER FOR POPULARITY PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: Recent request history across all users │
│ X = [request_1, request_2, ..., request_T] │
│ Each request = (user_id, content_id, timestamp) │
│ │
│ EMBEDDING LAYER: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ e_t = Embed_content(c_t) + Embed_user(u_t) + PE(t) + TE(timestamp) │
│ │
│ where PE = positional encoding, TE = temporal encoding │
│ │
│ X_embed ∈ ℝ^{T × d}, d = 256 │
│ │
│ TRANSFORMER ENCODER STACK (N=4 layers): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ For each layer n: │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention (8 heads) │ │
│ │ │ │
│ │ Q, K, V = X @ W_Q, X @ W_K, X @ W_V │ │
│ │ Attn = softmax(QK^T / √d_k) V │ │
│ │ X = LayerNorm(X + Dropout(Attn)) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ │ │
│ │ FFN(x) = ReLU(x @ W_1 + b_1) @ W_2 + b_2 │ │
│ │ W_1 ∈ ℝ^{d × 4d}, W_2 ∈ ℝ^{4d × d} │ │
│ │ X = LayerNorm(X + Dropout(FFN(X))) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ OUTPUT HEAD (Popularity Prediction): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ h_final = X[-1] (last position embedding) │
│ │
│ For each content c: │
│ p_c = σ(h_final @ W_out @ e_c) (predicted request probability) │
│ │
│ Or: p = softmax(h_final @ E_content^T) (distribution over content) │
│ │
│ TRAINING: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Next-request prediction: Given history, predict next requested content│
│ │
│ Loss = CrossEntropy(p, c_{T+1}) │
│ = -log(p_{c_{T+1}}) │
│ │
│ For multi-step prediction, use autoregressive generation or │
│ predict distribution over next K requests simultaneously. │
│ │
│ HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Model dimension d: 256 │
│ Number of layers N: 4 │
│ Number of attention heads: 8 │
│ Feed-forward dimension: 1024 (4×d) │
│ Dropout: 0.1 │
│ Sequence length T: 100-500 │
│ Learning rate: 1e-4 (with warmup) │
│ Batch size: 64 sequences │
│ Warmup steps: 4000 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pre-Training Strategies for Caching Transformers
Masked Content Prediction: Like BERT, mask random requests and predict them:

$$\mathcal{L}_{\text{mask}} = -\sum_{t \in \mathcal{M}} \log p_\theta\!\left(c_t \mid X_{\setminus \mathcal{M}}\right)$$

Contrastive Learning: Learn representations where similar request patterns are close (InfoNCE-style):

$$\mathcal{L}_{\text{contrast}} = -\log \frac{\exp\left(\text{sim}(z_i, z_i^+)/\tau\right)}{\sum_j \exp\left(\text{sim}(z_i, z_j)/\tau\right)}$$
Multi-Task Pre-Training:
- Predict next content
- Predict time until next request
- Predict user of next request
- Classify content category
Meta-Reinforcement Learning for Fast Adaptation
Content popularity can shift suddenly—a viral video emerges, breaking news changes demand patterns. Standard RL agents take thousands of interactions to adapt.
Meta-reinforcement learning addresses this by training agents that can learn to adapt quickly:
- Meta-training: Train on many different popularity distributions
- Meta-objective: Not just high reward, but fast adaptation to new distributions
- Deployment: When popularity shifts, the agent adapts in ~10-100 interactions instead of thousands
MAML: Model-Agnostic Meta-Learning for Caching
MAML learns an initialization $\theta$ such that a few gradient steps on a new task yield good performance.
Algorithm:
- Sample task batch: $\{\mathcal{T}_i\} \sim p(\mathcal{T})$, where each $\mathcal{T}_i$ is a different popularity distribution
- Inner loop (task-specific adaptation): For each task $\mathcal{T}_i$:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$$

This is one (or a few) gradient steps on task-specific data.
- Outer loop (meta-update): Update $\theta$ to improve post-adaptation performance:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Note: This requires computing gradients through the inner gradient step (second-order).
First-Order Approximation (FOMAML): Ignore second-order terms:

$$\theta \leftarrow \theta - \beta \sum_i \nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Much cheaper computationally, and works nearly as well in practice.
MAML for Caching:

$$\min_\theta\ \mathbb{E}_{p_i \sim p(\mathcal{T})}\left[\mathcal{L}_{p_i}\!\left(\theta - \alpha \nabla_\theta \mathcal{L}_{p_i}(\theta)\right)\right]$$

where $\mathcal{L}_{p_i}$ is the loss on requests sampled from popularity distribution $p_i$.
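A sketch of FOMAML on toy quadratic "tasks"—each task optimum standing in for one popularity regime—where the meta-trained initialization ends up close to all task optima at once:

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.zeros(2)                                    # meta-initialization
alpha, beta, K = 0.1, 0.01, 5                          # inner lr, outer lr, inner steps
task_optima = [rng.normal(size=2) for _ in range(8)]   # one toy "popularity regime" each

def grad(theta, opt):
    # gradient of the toy task loss 0.5 * ||theta - opt||^2
    return theta - opt

for it in range(200):                                  # meta-training
    meta_grad = np.zeros(2)
    for opt in task_optima:
        phi = theta.copy()
        for _ in range(K):                             # inner loop: adapt to this task
            phi = phi - alpha * grad(phi, opt)
        meta_grad += grad(phi, opt)                    # FOMAML: gradient at adapted phi
    theta = theta - beta * meta_grad / len(task_optima)
```

On these quadratics the meta-update provably pulls `theta` toward the centroid of the task optima, so post-adaptation loss is small for every task after only `K` inner steps—the toy analogue of fast adaptation to a new popularity distribution.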
┌─────────────────────────────────────────────────────────────────────────┐
│ MAML FOR ADAPTIVE CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ META-TRAINING PHASE: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Task Distribution: Different Zipf parameters α ∈ [0.5, 1.5] │
│ Different content libraries │
│ Different diurnal patterns │
│ │
│ For each meta-iteration: │
│ │
│ 1. Sample B=4 tasks (popularity distributions) │
│ │
│ 2. For each task i: │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inner Loop (K=5 gradient steps): │ │
│ │ │ │
│ │ θ_i^(0) = θ │ │
│ │ for k = 1 to K: │ │
│ │ Sample batch of requests from task i │ │
│ │ θ_i^(k) = θ_i^(k-1) - α ∇ L_i(θ_i^(k-1)) │ │
│ │ │ │
│ │ θ_i' = θ_i^(K) (adapted parameters) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ 3. Meta-update: │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Evaluate L_i(θ_i') on NEW data from task i │ │
│ │ │ │
│ │ θ ← θ - β ∇_θ Σ_i L_i(θ_i') │ │
│ │ │ │
│ │ (This gradient flows through the inner loop!) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ DEPLOYMENT PHASE: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ When popularity shifts detected: │
│ │
│ 1. Start with meta-learned θ* │
│ 2. Collect ~50-100 requests from new distribution │
│ 3. Take K=5 gradient steps: θ_new = θ* - α ∇ L_new(θ*) │
│ 4. Deploy θ_new │
│ │
│ Result: Adaptation in 50-100 requests instead of 5000+ │
│ │
│ HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Inner learning rate α: 0.01 │
│ Outer learning rate β: 0.001 │
│ Inner gradient steps K: 5 │
│ Tasks per batch B: 4-8 │
│ Meta-training iterations: 10,000-50,000 │
│ Policy architecture: Same as PPO (shared backbone) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Reptile: A Simpler Alternative
Reptile is a first-order meta-learning algorithm that's simpler than MAML:
Algorithm:
- Sample task $\mathcal{T}_i$
- Initialize $\phi = \theta$
- Run $k$ gradient steps on $\mathcal{T}_i$: $\phi \leftarrow \phi - \alpha \nabla_\phi \mathcal{L}_{\mathcal{T}_i}(\phi)$
- Meta-update: $\theta \leftarrow \theta + \epsilon(\phi - \theta)$
Intuition: Move the initialization toward the final parameters after training on each task. The optimal $\theta$ is (roughly) equidistant from all task-specific optima.
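A sketch of Reptile on four symmetric toy tasks: starting far away, the initialization drifts toward the point equidistant from the task optima (here the origin); all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.array([2.0, 2.0])                           # deliberately bad start
alpha, eps_meta, k = 0.1, 0.1, 10                      # inner lr, meta step, inner steps
task_optima = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
               np.array([0.0, 1.0]), np.array([0.0, -1.0])]

for it in range(500):
    opt = task_optima[rng.integers(len(task_optima))]  # sample one task
    phi = theta.copy()
    for _ in range(k):                                 # inner SGD on 0.5*||phi - opt||^2
        phi = phi - alpha * (phi - opt)
    theta = theta + eps_meta * (phi - theta)           # Reptile meta-update
```

No gradients ever flow through the inner loop, which is why Reptile needs no second-order machinery and no stored computation graph.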
Reptile vs. MAML:
| Aspect | MAML | Reptile |
|---|---|---|
| Gradients | Second-order (or FOMAML) | First-order only |
| Memory | High (store computation graph) | Low |
| Implementation | Complex | Simple |
| Performance | Slightly better | Nearly as good |
| Computation | ~2× slower | Faster |
RL² (Learning to Reinforcement Learn)
RL² uses a recurrent neural network that learns to adapt within an episode:
Architecture: LSTM-based policy that receives $(s_t, a_{t-1}, r_{t-1})$ as input
Training: Train on many tasks where each episode is a new task. The LSTM learns to:
- Explore efficiently at the start of episodes
- Exploit learned knowledge later
- Adapt its behavior based on observed rewards
For Caching: The LSTM receives:
- Current cache state
- Previous caching action
- Previous reward (hit rate, latency)
Over time, it learns to recognize the current popularity pattern and adapt its caching strategy accordingly—all within a single forward pass, no gradient updates needed at deployment.
RL² Update Rule: Standard policy gradient on the recurrent policy:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t, h_t)\, \hat{A}_t\right]$$

where $h_t$ is the LSTM hidden state encoding the history.
Recent work (2025) on "Fast Adaptive Caching Algorithm for Mobile Edge Networks Based on Meta-reinforcement Learning" shows:
- 5× faster adaptation to popularity shifts
- Only 2-3% performance loss during adaptation periods
- Robust across different types of popularity changes (gradual drift, sudden shift, periodic patterns)
When to Use Which Meta-RL Algorithm
| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Discrete distribution shifts | MAML/FOMAML | Best for distinct task boundaries |
| Continuous drift | RL² | No explicit adaptation step needed |
| Limited compute budget | Reptile | Simpler, faster training |
| Online adaptation | RL² | No gradient steps at deployment |
| Periodic patterns (daily/weekly) | MAML with time-aware tasks | Can pre-learn periodic structure |
Part V: Integration with 5G/6G Mobile Edge Computing
MEC Architecture and Caching
Mobile Edge Computing (MEC) places computational and storage resources at the network edge—at or near cellular base stations. This is the natural deployment point for coded caching:
┌─────────────────────────────────────────────────────────────────────────┐
│ CODED CACHING IN 5G/6G MEC ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ CORE NETWORK │ │
│ │ (Content CDN) │ │
│ └──────────┬──────────┘ │
│ │ │
│ Backhaul Link │
│ (Expensive, congested) │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ MEC 1 │ │ MEC 2 │ │ MEC 3 │ │
│ │ │ │ │ │ │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ CACHE │ │ │ │ CACHE │ │ │ │ CACHE │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ Coded │ │ │ │ Coded │ │ │ │ Coded │ │ │
│ │ │Content │ │ │ │Content │ │ │ │Content │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ AI │ │ │ │ AI │ │ │ │ AI │ │ │
│ │ │ ENGINE │ │ │ │ ENGINE │ │ │ │ ENGINE │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ Fronthaul Fronthaul Fronthaul │
│ │ │ │ │
│ ┌─────┴──────┐ ┌─────┴──────┐ ┌─────┴──────┐ │
│ │ gNB 1 │ │ gNB 2 │ │ gNB 3 │ │
│ │ │ │ │ │ │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
│ │ UEs │ │ UEs │ │ UEs │ │
│ └─────┘ └─────┘ └─────┘ │
│ │
│ AI ENGINE FUNCTIONS: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 1. POPULARITY PREDICTION │
│ • Federated learning across edges │
│ • GNN for user-content modeling │
│ • LSTM/Transformer for temporal patterns │
│ │
│ 2. CACHE PLACEMENT OPTIMIZATION │
│ • Deep RL for dynamic placement │
│ • Coordinated across MEC nodes for coding gain │
│ • Trade-off: local hits vs. coding opportunity │
│ │
│ 3. CODED DELIVERY SCHEDULING │
│ • Identify multicast groups │
│ • Construct optimal coded messages │
│ • Schedule transmissions across time/frequency │
│ │
│ 4. USER MOBILITY PREDICTION │
│ • Predict user movement across cells │
│ • Proactive content migration │
│ • Handover-aware cache coordination │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BENEFITS OF CODED CACHING IN MEC: │
│ │
│ • Reduced backhaul: Coded multicasting reduces core traffic │
│ • Lower latency: Edge-served content, no core round-trip │
│ • Improved QoE: Proactive caching before user requests │
│ • Network efficiency: Same backhaul serves more users │
│ • Scalability: Global caching gain grows with user density │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Semantic Communication and Coded Caching in 6G
6G networks will introduce semantic communication—transmitting meaning rather than bits. This fundamentally changes caching:
Traditional caching: Store exact file bits. Transmit missing bits.
Semantic caching: Store semantic representations (compressed meaning). Transmit semantic differences. Receiver reconstructs content using local knowledge base.
Example: Instead of caching a full 4K video file:
- Cache a semantic description + key frames + neural decoder model
- When user requests video, transmit only the scene-specific information
- Local AI reconstructs the full video using cached knowledge
Coded semantic caching combines these ideas:
- Cache semantic representations (100-1000× smaller than raw content)
- Code across semantic representations for delivery
- Achieve both compression gain AND coding gain
The potential improvement is multiplicative: if semantic compression gives 100× and coded caching gives 10×, the combined gain approaches 1000×.
Performance Analysis: Putting It All Together
┌─────────────────────────────────────────────────────────────────────────┐
│ CODED CACHING PERFORMANCE IN 5G/6G MEC │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM PARAMETERS: │
│ • N = 1000 files (video library) │
│ • K = 100 users per MEC node │
│ • M = 100 files per user cache (10% of library) │
│ • 10 MEC nodes in deployment │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CACHING STRATEGY COMPARISON: │
│ │
│ Strategy Delivery Rate Backhaul Latency │
│ (files/period) Reduction (ms) │
│ ───────────────────────────────────────────────────────────────────── │
│ No caching 100 0% 150 │
│ LRU (uncoded) 75 25% 120 │
│ LFU (uncoded) 68 32% 115 │
│ Popularity-optimal (uncoded) 60 40% 105 │
│ MAN coded caching 12 88% 35 │
│ AI-optimized coded 10 90% 30 │
│ Semantic coded (6G) 1 99% 10 │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY INSIGHTS: │
│ │
│ 1. Coded caching provides 5-6× improvement over best uncoded policy │
│ This is the "global caching gain" predicted by theory │
│ │
│ 2. AI optimization adds another 15-20% on top of theoretical scheme │
│ By adapting to actual (non-uniform, time-varying) popularity │
│ │
│ 3. Semantic communication offers order-of-magnitude further gains │
│ By transmitting meaning rather than bits │
│ │
│ 4. Latency reduction is dramatic │
│ From 150ms (core) to 10ms (edge semantic) — enables new apps │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEPLOYMENT CONSIDERATIONS: │
│ │
│ Subpacketization: With K=100, t=10, need C(100,10) ≈ 10^13 subfiles │
│ Solution: User grouping (10 groups of 10) reduces to C(10,1) = 10 │
│ Trade-off: Lose some coding gain (factor ~2) but remain practical │
│ │
│ Coordination overhead: MEC nodes must share cache state │
│ Solution: Periodic synchronization (every 1-10 minutes sufficient) │
│ AI handles short-term dynamics; coordination handles long-term │
│ │
│ Heterogeneous caches: Different MEC nodes have different sizes │
│ Solution: Weighted placement—larger caches store more popular files │
│ AI learns optimal allocation across heterogeneous infrastructure │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part VI: Current Research Frontiers
Coded Caching with Security and Privacy
Problem: Coded delivery reveals information about what other users requested. If I receive $W_A \oplus W_B$, I learn that someone requested file B.
Solutions being researched:
- Secret sharing: Split cache content so no single coded message reveals complete information
- Private information retrieval (PIR): Users download content without revealing which file they want
- Differential privacy: Add noise to coded messages that masks individual requests while preserving utility
Coded Caching with Heterogeneous Files
Real content libraries have diverse file sizes—short clips, long movies, live streams. The MAN scheme assumes equal file sizes.
Extensions:
- Weighted placement: Allocate cache space proportional to file size × popularity
- Layered coding: Split large files into layers, apply coded caching at each layer
- Adaptive subpacketization: Different subpacketization for different file sizes
Online Coded Caching
The MAN scheme assumes batch processing—all requests arrive together, then delivery phase begins. Real systems have continuous request arrivals.
Online approaches:
- Sliding window: Group requests in time windows, apply coded delivery within windows
- Queue-based: Delay requests briefly to accumulate coding opportunities
- Predictive delivery: Start delivering predicted requests before they're actually made
Coded Caching for Machine Learning
An emerging application: distributed ML training generates massive data movement (gradients, parameters), and coded caching principles can help:
- Gradient caching: Workers cache parts of the gradient; coordinator sends coded combinations
- Model caching: Different workers cache different model components; federated aggregation uses coding
- Data shuffling: Shuffle training data across workers using coded communication
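The data-shuffling bullet has a minimal two-worker illustration: if each worker happens to cache the block the other needs for the next epoch, the coordinator can serve both with a single XOR broadcast instead of two unicasts. The setup is a toy; real coded-shuffling schemes plan this placement deliberately across many workers.

```python
import secrets

# Two data blocks the workers need for the next training epoch.
block_A = secrets.token_bytes(8)   # needed by worker 1
block_B = secrets.token_bytes(8)   # needed by worker 2

# Placement from the previous epoch: each worker caches the other block.
cache_w1, cache_w2 = block_B, block_A

def xor(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

# One coded broadcast replaces two unicast transmissions.
broadcast = xor(block_A, block_B)

# Each worker recovers its missing block using its cache as the key.
assert xor(broadcast, cache_w1) == block_A   # worker 1 decodes A
assert xor(broadcast, cache_w2) == block_B   # worker 2 decodes B
```

This is the same side-information mechanism as in coded content delivery, just with training-data blocks in place of video subfiles.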
Sources:
- Fundamental Limits of Caching (Maddah-Ali & Niesen, 2014) - arXiv
- Learning to Code: Coded Caching via Deep RL - arXiv
- Graph Federated Learning Based Proactive Content Caching (2025) - arXiv
- Coded Caching Optimization via A2C Learning (2025) - MDPI
- ML Techniques for Edge Caching Survey - ScienceDirect
- Multimodal Semantic Communication for 6G MEC - ScienceDirect
- Epidemic Dynamics Edge Caching for 6G (2024) - Frontiers
- Deep RL for Adaptive Caching - arXiv
- Fast Adaptive Caching via Meta-RL (2025) - Springer
- Privacy-Preserving Edge Caching Survey (2024) - ACM
Related Articles
AI-RAN: The AI-Native Foundation for 6G Networks
In-depth tour of AI-Radio Access Networks (AI-RAN)—the foundational architecture transforming 5G and enabling 6G. From traditional RAN to AI-native systems, understand the RAN Intelligent Controller (RIC), real-time optimization, and production deployment patterns.
6G Network Architecture: AI at Every Layer - A Complete Technical Vision for IMT-2030
Detailed look at 6G (IMT-2030) network architecture—from AI-native air interfaces and semantic communication to integrated sensing, digital twins, and self-evolving protocols. The complete technical roadmap for next-generation wireless beyond 2030.
Deep Learning for Channel Estimation in Massive MIMO Systems
In-depth technical deep dive into deep learning approaches for channel estimation in massive MIMO—from traditional methods to state-of-the-art CNN-LSTM-Transformer hybrid architectures. Complete with equations, implementations, and performance analysis showing 90%+ NMSE reduction.
Building Intelligent RAN: O-RAN and RIC Architecture Deep Dive
A practical deep dive into Open RAN and RAN Intelligent Controller architecture—from E2 interface specifications to xApp/rApp development, deployment patterns, and real-world production implementations powering modern 5G networks.
AI for Channel Coding: Neural Decoders and End-to-End Learned Codes
In-depth exploration of AI-powered channel coding—from neural belief propagation decoders for LDPC and Polar codes to end-to-end learned codes with Turbo Autoencoders. Deep theoretical foundations, architectural innovations, performance analysis, and the path toward 6G learned physical layers.