
Coded Caching: From Information Theory to AI-Optimized Edge Networks

Detailed look at coded caching—from Maddah-Ali & Niesen's seminal information-theoretic foundations to modern AI-driven cache optimization. Deep analysis of local and global caching gains, decentralized schemes, and the integration of deep reinforcement learning, federated learning, and graph neural networks for 5G/6G MEC systems.


Introduction

The growth of mobile data traffic presents one of the most pressing challenges in modern telecommunications. Video streaming alone accounts for over 70% of mobile traffic, with users increasingly expecting instant access to high-definition content anywhere, anytime. Traditional network architectures—designed around the principle of fetching content from distant servers on demand—simply cannot scale to meet this demand economically or sustainably.

Caching offers a fundamentally different paradigm: rather than repeatedly transmitting the same popular content across congested backhaul links, we store it closer to users during off-peak hours and serve requests locally during peak times. This simple idea powers content delivery networks (CDNs), browser caches, and increasingly, mobile edge computing (MEC) systems.

But conventional caching has a fundamental limitation: each cache operates independently. If User A and User B both want different files, and each file is partially cached at both users, conventional caching cannot exploit this. The server must transmit missing portions separately to each user.

Coded caching, pioneered by Mohammad Ali Maddah-Ali and Urs Niesen in their seminal 2014 paper "Fundamental Limits of Caching," revolutionizes this paradigm. Through clever use of coding theory, coded caching achieves something remarkable: a global caching gain that scales with the aggregate cache size across all users, even when users don't cooperate or communicate with each other.

The implications are profound. In a system with 30 users, each caching 10 out of 30 files, conventional uncoded caching requires transmitting 20 file-equivalents during peak hours. Coded caching reduces this to just 1.8 file-equivalents—an 11× improvement. This isn't a small optimization; it's a fundamental change in what's achievable.

The AI revolution takes coded caching from theory to practice. While information theory provides optimal schemes for idealized settings with known, static popularity distributions, real networks face:

  • Dynamic popularity: Content trends change rapidly; viral videos emerge unpredictably
  • Heterogeneous users: Different users have different preferences and access patterns
  • Limited coordination: Practical systems can't implement the full complexity of optimal schemes
  • Privacy constraints: Users may not want to share their viewing preferences

Modern AI/ML techniques—deep reinforcement learning for adaptive cache placement, federated learning for privacy-preserving popularity prediction, graph neural networks for modeling user-content relationships—are essential for practical deployment.

This post provides a comprehensive technical exploration of coded caching: the information-theoretic foundations that explain why it works, the mathematical analysis of local and global caching gains, extensions to decentralized and multi-access settings, AI-driven optimization for practical deployment, and integration with 5G/6G mobile edge computing architectures.

Prerequisites: Information theory basics (entropy, mutual information), probability theory, familiarity with optimization concepts.

Part I: The Information-Theoretic Foundations

The Caching Problem: A Precise Formulation

Consider a content delivery system with the following components:

  • A server with access to a library of $N$ files $\{W_1, W_2, \ldots, W_N\}$, each of size $F$ bits
  • $K$ users, each equipped with a cache capable of storing $MF$ bits (equivalently, $M$ complete files)
  • A shared broadcast link connecting the server to all users

The system operates in two distinct phases:

Placement Phase (Off-peak hours): The server fills user caches without knowing what files users will request later. This constraint is crucial—if the server knew future requests, optimal caching would be trivial (just cache what will be requested).

Delivery Phase (Peak hours): Each user $k$ requests exactly one file $d_k \in \{1, \ldots, N\}$. The server must transmit enough information over the shared link for every user to recover their requested file, using both the transmission and their cached content.

The goal: Minimize the rate $R$—the number of file-equivalents transmitted during delivery. A lower rate means less backhaul congestion, lower latency, and better scalability.

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE CODED CACHING SYSTEM MODEL                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                          ┌─────────────────────┐                        │
│                          │       SERVER        │                        │
│                          │                     │                        │
│                          │  File Library:      │                        │
│                          │  W₁, W₂, ..., Wₙ    │                        │
│                          │                     │                        │
│                          │  Each file: F bits  │                        │
│                          │  Total: N files     │                        │
│                          └──────────┬──────────┘                        │
│                                     │                                    │
│                          PHASE 1: PLACEMENT                             │
│                          (Server fills caches                           │
│                           without knowing                               │
│                           future requests)                              │
│                                     │                                    │
│                          PHASE 2: DELIVERY                              │
│                          (Server broadcasts                             │
│                           to satisfy all                                │
│                           requests)                                     │
│                                     │                                    │
│                      ┌──────────────┴──────────────┐                    │
│                      │      SHARED LINK            │                    │
│                      │      Rate = R files         │                    │
│                      └──────────────┬──────────────┘                    │
│                                     │                                    │
│         ┌───────────────────────────┼───────────────────────────┐       │
│         │                           │                           │       │
│         ▼                           ▼                           ▼       │
│   ┌───────────┐               ┌───────────┐               ┌───────────┐ │
│   │  USER 1   │               │  USER 2   │               │  USER K   │ │
│   │           │               │           │               │           │ │
│   │  Cache:   │               │  Cache:   │               │  Cache:   │ │
│   │  MF bits  │               │  MF bits  │               │  MF bits  │ │
│   │           │               │           │               │           │ │
│   │ Request:  │               │ Request:  │               │ Request:  │ │
│   │   d₁      │               │   d₂      │               │   dₖ      │ │
│   └───────────┘               └───────────┘               └───────────┘ │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FUNDAMENTAL PARAMETERS:                                                 │
│                                                                          │
│  • N: Number of files in the library                                    │
│  • K: Number of users in the system                                     │
│  • M: Normalized cache size (can store M file-equivalents)             │
│  • F: Size of each file in bits                                         │
│  • R: Delivery rate (file-equivalents transmitted)                     │
│                                                                          │
│  CONSTRAINTS:                                                            │
│  • 0 ≤ M ≤ N (cache size between empty and full library)               │
│  • Placement is UNCODED: caches store actual file bits                 │
│  • Delivery CAN BE CODED: XOR combinations allowed                     │
│                                                                          │
│  THE QUESTION: What is the minimum achievable rate R for given N,K,M?  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Uncoded Caching: The Conventional Approach

Before exploring coded caching, let's understand what conventional (uncoded) caching achieves.

Strategy: Each user independently caches portions of files, typically the most popular ones. When a user requests a file, they receive what's already in their cache locally, and the server transmits only the missing portions.

Best case: All users request files that are fully cached → $R = 0$

Worst case (distinct requests, nothing cached): The server must transmit $K$ complete files → $R = K$

Average case (assuming uniform popularity): Each user has cached a fraction $M/N$ of each file. The missing fraction is $(1 - M/N)$. With $K$ users making independent requests:

$$R_{\text{uncoded}} = K\left(1 - \frac{M}{N}\right)$$

This formula reveals the local caching gain: each user benefits from their own cache, reducing their individual demand by the fraction $M/N$. But there's no benefit from what other users have cached—the caches operate in isolation.

Example: With $K = 30$ users, $N = 30$ files, and $M = 10$:

$$R_{\text{uncoded}} = 30 \cdot (1 - 10/30) = 30 \cdot (2/3) = 20 \text{ files}$$

Even with substantial caching (each user stores 1/3 of the library), we still need to transmit 20 file-equivalents during peak hours.
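As a quick sanity check, the uncoded rate above is a one-liner (a sketch; the function name is mine):

```python
def uncoded_rate(K: int, N: int, M: float) -> float:
    """Delivery rate of conventional caching: each user's own cache covers
    fraction M/N of its request, and the caches don't interact."""
    return K * (1 - M / N)

# The running example: 30 users, 30 files, 10 files of cache per user.
print(uncoded_rate(30, 30, 10))  # ≈ 20 file-equivalents
```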

The Maddah-Ali & Niesen Breakthrough

The key insight of Maddah-Ali and Niesen was that caches can be coordinated—not through user cooperation, but through clever placement design—so that a single coded transmission simultaneously benefits multiple users.

The Central Idea: Split each file into carefully designed pieces. Place different pieces at different users such that when the server transmits coded combinations (XORs) of file pieces, each user can decode the piece they need using other pieces they already have cached.

┌─────────────────────────────────────────────────────────────────────────┐
│              THE MADDAH-ALI & NIESEN SCHEME: DETAILED EXAMPLE           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SETUP: K = 3 users, N = 3 files (A, B, C), M = 1 (each caches 1 file) │
│                                                                          │
│  STEP 1: COMPUTE THE CACHING RATIO                                      │
│                                                                          │
│  t = KM/N = (3 × 1)/3 = 1                                              │
│                                                                          │
│  This parameter t determines the subpacketization: each file           │
│  is split into C(K,t) = C(3,1) = 3 pieces.                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 2: PLACEMENT PHASE                                                 │
│                                                                          │
│  Split each file into 3 pieces, indexed by which user caches them:     │
│                                                                          │
│  File A = { A₁, A₂, A₃ }  where Aᵢ is cached by user i                │
│  File B = { B₁, B₂, B₃ }                                               │
│  File C = { C₁, C₂, C₃ }                                               │
│                                                                          │
│  Cache contents after placement:                                        │
│                                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │
│  │   USER 1    │    │   USER 2    │    │   USER 3    │                 │
│  │             │    │             │    │             │                 │
│  │ Caches:     │    │ Caches:     │    │ Caches:     │                 │
│  │  A₁, B₁, C₁ │    │  A₂, B₂, C₂ │    │  A₃, B₃, C₃ │                 │
│  │             │    │             │    │             │                 │
│  │ (1/3 of     │    │ (1/3 of     │    │ (1/3 of     │                 │
│  │  each file) │    │  each file) │    │  each file) │                 │
│  └─────────────┘    └─────────────┘    └─────────────┘                 │
│                                                                          │
│  Each user caches 1/3 of A + 1/3 of B + 1/3 of C = 1 file total ✓     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 3: DELIVERY PHASE                                                  │
│                                                                          │
│  Suppose: User 1 wants A, User 2 wants B, User 3 wants C               │
│  (Worst case: all different requests)                                   │
│                                                                          │
│  What each user NEEDS (doesn't have cached):                           │
│  • User 1 needs: A₂, A₃  (has A₁, wants complete A)                   │
│  • User 2 needs: B₁, B₃  (has B₂, wants complete B)                   │
│  • User 3 needs: C₁, C₂  (has C₃, wants complete C)                   │
│                                                                          │
│  UNCODED approach: Send A₂, A₃, B₁, B₃, C₁, C₂ = 6 pieces = 2 files  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CODED DELIVERY (the magic):                                            │
│                                                                          │
│  Transmission 1: X₁ = A₂ ⊕ B₁                                          │
│                                                                          │
│  • User 1 receives X₁, has B₁ cached                                   │
│    → Computes A₂ = X₁ ⊕ B₁ ✓                                          │
│                                                                          │
│  • User 2 receives X₁, has A₂ cached                                   │
│    → Computes B₁ = X₁ ⊕ A₂ ✓                                          │
│                                                                          │
│  ONE transmission satisfies TWO users!                                  │
│                                                                          │
│  Transmission 2: X₂ = A₃ ⊕ C₁                                          │
│  • User 1 gets A₃ (has C₁), User 3 gets C₁ (has A₃)                   │
│                                                                          │
│  Transmission 3: X₃ = B₃ ⊕ C₂                                          │
│  • User 2 gets B₃ (has C₂), User 3 gets C₂ (has B₃)                   │
│                                                                          │
│  TOTAL: 3 pieces transmitted = 1 file worth of data!                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMPARISON:                                                             │
│                                                                          │
│  Uncoded rate:  R = K(1 - M/N) = 3(1 - 1/3) = 2 files                 │
│  Coded rate:    R = K(1 - M/N)/(1 + KM/N) = 2/(1 + 1) = 1 file        │
│                                                                          │
│  Improvement factor: 2× (global caching gain = 1 + t = 2)              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
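The 3-user example can be run end to end on random bytes (a minimal sketch; the helper names and the 9-byte file size are my own choices):

```python
import os

# K = 3 users, N = 3 files (A, B, C), M = 1: t = KM/N = 1, so each file
# splits into C(3,1) = 3 pieces, piece i being cached by user i.
F = 9                                   # bytes per file (divisible by 3)
files = {name: os.urandom(F) for name in "ABC"}

def split3(data):
    n = len(data) // 3
    return [data[i * n:(i + 1) * n] for i in range(3)]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

A, B, C = split3(files["A"]), split3(files["B"]), split3(files["C"])

# Placement: user i caches piece i of every file.
cache = [{"A": A[i], "B": B[i], "C": C[i]} for i in range(3)]

# Delivery for demands (A, B, C): three XOR broadcasts, each serving 2 users.
X1 = xor(A[1], B[0])                    # A2 xor B1
X2 = xor(A[2], C[0])                    # A3 xor C1
X3 = xor(B[2], C[1])                    # B3 xor C2

# Each user reassembles its file from its own cache plus two broadcasts.
user1_A = cache[0]["A"] + xor(X1, cache[0]["B"]) + xor(X2, cache[0]["C"])
user2_B = xor(X1, cache[1]["A"]) + cache[1]["B"] + xor(X3, cache[1]["C"])
user3_C = xor(X2, cache[2]["A"]) + xor(X3, cache[2]["B"]) + cache[2]["C"]

assert (user1_A, user2_B, user3_C) == (files["A"], files["B"], files["C"])
print(len(X1 + X2 + X3) / F)            # → 1.0 file-equivalent transmitted
```

The three XORs together carry 3 pieces = 1 file-equivalent, matching the coded rate in the box, versus 6 pieces (2 files) uncoded.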

The General Scheme and Achievable Rate

The example above illustrates the principle; the general scheme works as follows:

Placement: Divide each file $W_n$ into $\binom{K}{t}$ subfiles, where $t = KM/N$ is the caching ratio. Index each subfile by a subset $\mathcal{S} \subseteq \{1, \ldots, K\}$ of size $t$:

$$W_n = \{W_{n,\mathcal{S}} : \mathcal{S} \subseteq [K],\ |\mathcal{S}| = t\}$$

User $k$ caches all subfiles whose index contains $k$:

$$Z_k = \{W_{n,\mathcal{S}} : k \in \mathcal{S},\ n \in [N]\}$$

Delivery: For each subset $\mathcal{T}$ of size $t+1$, transmit the coded message:

$$X_{\mathcal{T}} = \bigoplus_{k \in \mathcal{T}} W_{d_k,\, \mathcal{T} \setminus \{k\}}$$

Each user $k \in \mathcal{T}$ needs $W_{d_k,\, \mathcal{T} \setminus \{k\}}$ and has every other term of the XOR cached, so they can decode their needed piece by XORing the transmission with cached content.

Achievable Rate:

$$R_{\text{coded}} = \frac{K(1 - M/N)}{1 + KM/N}$$

This beautiful formula reveals the structure of the gain.
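The general placement and delivery rules can be exercised in a short script (a sketch; function names and the 4-user demo parameters are my own):

```python
from itertools import combinations
import os

def man_placement(K, t, files):
    """Split each file into C(K, t) subfiles indexed by t-subsets of users;
    user k caches every subfile W_{n,S} with k in S."""
    subsets = list(combinations(range(K), t))
    sub, caches = {}, {k: {} for k in range(K)}
    for name, data in files.items():
        size = len(data) // len(subsets)
        sub[name] = {S: data[i * size:(i + 1) * size]
                     for i, S in enumerate(subsets)}
        for S in subsets:
            for k in S:
                caches[k][(name, S)] = sub[name][S]
    return sub, caches

def xor_all(parts):
    out = bytearray(parts[0])
    for p in parts[1:]:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def man_delivery(K, t, sub, demands):
    """For every (t+1)-subset T of users, broadcast the XOR of the subfiles
    W_{d_k, T minus {k}} over all k in T."""
    return {T: xor_all([sub[demands[k]][tuple(j for j in T if j != k)]
                        for k in T])
            for T in combinations(range(K), t + 1)}

# Demo: K = 4, N = 4 files, M = 2, so t = KM/N = 2 and C(4,2) = 6 subfiles.
K, t = 4, 2
files = {n: os.urandom(12) for n in "WXYZ"}       # 12 bytes → 2 per subfile
sub, caches = man_placement(K, t, files)
demands = ["W", "X", "Y", "Z"]
msgs = man_delivery(K, t, sub, demands)

# User 0 decodes each missing subfile using only broadcasts + its own cache.
k, decoded = 0, {}
for T, X in msgs.items():
    if k in T:
        cached_terms = [caches[k][(demands[j], tuple(i for i in T if i != j))]
                        for j in T if j != k]
        decoded[tuple(j for j in T if j != k)] = xor_all([X] + cached_terms)

recovered = b"".join(caches[k][(demands[k], S)] if k in S else decoded[S]
                     for S in combinations(range(K), t))
assert recovered == files["W"]
print(len(msgs), "broadcasts of 1/6 file each =", len(msgs) / 6, "files")
```

With $K=4$, $t=2$ the server sends $\binom{4}{3} = 4$ messages of $1/\binom{4}{2} = 1/6$ file each, i.e. $2/3$ of a file—exactly $K(1-M/N)/(1+KM/N)$ for $M/N = 1/2$.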

Local vs. Global Caching Gain: A Deep Dive

The coded caching rate can be decomposed to understand where the gain comes from:

$$R_{\text{coded}} = \frac{K(1 - M/N)}{1 + KM/N} = \underbrace{K(1 - M/N)}_{\text{Uncoded demand}} \cdot \underbrace{\frac{1}{1 + KM/N}}_{\text{Global gain factor}}$$

Local Caching Gain ($1/(1 - M/N)$): This is what conventional caching provides. Each user caches a fraction $M/N$ of the library, so they need only the fraction $(1 - M/N)$ from the network. This gain is additive across users—it reduces each user's demand independently.

Global Caching Gain ($1 + KM/N = 1 + t$): This is unique to coded caching. It represents the multicast opportunity—each coded transmission benefits $t+1$ users simultaneously. This gain is multiplicative and scales with the aggregate cache size $KM$, not the individual cache size $M$.

┌─────────────────────────────────────────────────────────────────────────┐
│              UNDERSTANDING LOCAL vs GLOBAL CACHING GAIN                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SYSTEM: K = 30 users, N = 30 files, M = 10 (33% cache ratio)          │
│                                                                          │
│  LOCAL CACHING GAIN (what conventional caching provides):               │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Each user caches 10/30 = 33% of each file.                            │
│  They need 67% of their requested file from the network.               │
│                                                                          │
│  Without local caching: Each user needs 1 full file → 30 files total  │
│  With local caching: Each user needs 0.67 file → 20 files total       │
│                                                                          │
│  Local gain = 30/20 = 1.5×                                             │
│                                                                          │
│  This gain is INDEPENDENT across users—my cache helps me, not you.    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GLOBAL CACHING GAIN (unique to coded caching):                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Caching ratio t = KM/N = 30 × 10 / 30 = 10                           │
│  Global gain = 1 + t = 1 + 10 = 11×                                    │
│                                                                          │
│  Each coded transmission benefits t + 1 = 11 users simultaneously!     │
│                                                                          │
│  This is because:                                                       │
│  • When server sends X = A₁ ⊕ B₂ ⊕ C₃ ⊕ ... (11 terms)               │
│  • User wanting A has all terms except A₁ cached                       │
│  • User wanting B has all terms except B₂ cached                       │
│  • ... and so on for all 11 users in this group                       │
│                                                                          │
│  The server makes ONE transmission; 11 users each extract their need.  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMBINED EFFECT:                                                        │
│                                                                          │
│  Uncoded rate:  R = K(1 - M/N) = 30 × 0.67 = 20 files                 │
│  Coded rate:    R = 20 / (1 + 10) = 20/11 ≈ 1.8 files                 │
│                                                                          │
│  Total improvement: 20 / 1.8 ≈ 11× better than uncoded!               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE REMARKABLE INSIGHT:                                                 │
│                                                                          │
│  Global gain scales with AGGREGATE cache size KM, not individual M.    │
│                                                                          │
│  Double the number of users K (same M per user)?                       │
│  → Global gain INCREASES (more multicast opportunity)                  │
│  → Even though total demand also increases                             │
│  → Net effect: Rate grows sublinearly with K                          │
│                                                                          │
│  This is counter-intuitive: More users → More efficient delivery!     │
│                                                                          │
│  In conventional systems, more users = more load = worse performance.  │
│  In coded caching, more users = more coding opportunity = better gain. │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
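The boxed claim that more users make delivery more efficient is easy to check numerically (a sketch; it fixes the $M/N = 1/3$ ratio of the running example):

```python
def uncoded_rate(K, N, M):
    return K * (1 - M / N)

def coded_rate(K, N, M):
    return K * (1 - M / N) / (1 + K * M / N)

# Hold the per-user cache ratio M/N = 1/3 fixed and grow the user count.
N, M = 30, 10
for K in (3, 30, 300, 3000):
    print(f"K={K:5d}  uncoded={uncoded_rate(K, N, M):8.2f}  "
          f"coded={coded_rate(K, N, M):.3f}")
# The uncoded load grows linearly with K, while the coded rate stays
# bounded, approaching N/M - 1 = 2 file-equivalents as K grows.
```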

Information-Theoretic Optimality

A natural question: Is the Maddah-Ali & Niesen scheme optimal? Can we do better?

The authors proved a converse bound—a lower bound that the rate of any scheme must satisfy:

$$R^* \geq \max_{s \in \{1, \ldots, \min(K, N)\}} \frac{s - sM/N}{1 + sM/N}$$

The gap between achievable rate and lower bound is at most a factor of 12 for all parameter values. This means coded caching is order-optimal: we cannot hope to do fundamentally better (by more than a constant factor).

Converse Bound Derivation (Cut-Set Argument)

The converse uses a cut-set bound argument, a fundamental technique in network information theory. Here's the rigorous derivation:

Setup: Consider any subset $\mathcal{S} \subseteq \{1, \ldots, K\}$ of $s = |\mathcal{S}|$ users. We'll lower-bound the rate by considering what information must cross the "cut" separating the server from these $s$ users.

Key Insight: Consider the worst-case demand in which each user in $\mathcal{S}$ requests a distinct file. The server must transmit enough information so that, together with the cached content, each user can decode their file.

Step 1: Total Information Requirement

The $s$ users collectively need $s$ complete files ($sF$ bits). They collectively have cached at most $sM$ files' worth of content ($sMF$ bits). Therefore, the transmission must provide at least:

$$H(X) \geq sF - sMF$$

But this ignores the structure. A tighter bound uses the following:

Step 2: Cut-Set Entropy Bound

Let $Z_{\mathcal{S}} = (Z_k : k \in \mathcal{S})$ denote the caches of the users in $\mathcal{S}$, and let $X$ denote the transmitted message. For the users to decode their requested files $W_{d_k}$, we need:

$$H(W_{d_k} : k \in \mathcal{S} \mid X, Z_{\mathcal{S}}) = 0$$

By the chain rule and independence of files:

$$H(W_{d_k} : k \in \mathcal{S}) = H(W_{d_k} : k \in \mathcal{S} \mid X, Z_{\mathcal{S}}) + I(W_{d_k} : k \in \mathcal{S};\, X, Z_{\mathcal{S}})$$

$$sF = 0 + I(W_{d_k} : k \in \mathcal{S};\, X, Z_{\mathcal{S}})$$

Step 3: Bounding the Mutual Information

$$I(W_{d_k} : k \in \mathcal{S};\, X, Z_{\mathcal{S}}) \leq H(X, Z_{\mathcal{S}}) \leq H(X) + H(Z_{\mathcal{S}})$$

The cache entropy is bounded by $H(Z_{\mathcal{S}}) \leq sMF$.

Thus $sF \leq H(X) + sMF$, i.e., $H(X) \geq sF - sMF$, recovering the bound above.

Step 4: Accounting for Broadcast Advantage

The above bound doesn't exploit the fact that $X$ is a broadcast—all users receive it. A refined argument considers that the transmission must help each user individually:

For any user $k \in \mathcal{S}$:

$$F = H(W_{d_k}) \leq H(X) + H(Z_k) \leq H(X) + MF$$

Summing over the $s$ users of the shared broadcast: $sF \leq sH(X) + sMF$.

But since $X$ is common to all of them:

$$sF \leq H(X) + sMF + (s-1)H(X \mid Z_{\mathcal{S}})$$

A careful analysis using the submodularity of entropy yields:

$$R \cdot F = H(X) \geq \frac{s(1 - M/N)}{1 + sM/N} \cdot F$$

Step 5: Optimizing Over $s$

Since this bound holds for every subset size $s$, we take the maximum:

$$R^* \geq \max_{s \in \{1, \ldots, \min(K, N)\}} \frac{s(1 - M/N)}{1 + sM/N}$$

Tightness Analysis: Let $g(s) = \frac{s(1-M/N)}{1+sM/N}$. Taking the derivative:

$$g'(s) = \frac{(1-M/N)(1+sM/N) - s(1-M/N)(M/N)}{(1+sM/N)^2} = \frac{1-M/N}{(1+sM/N)^2}$$

Since $g'(s) > 0$ for $M < N$, the maximum is achieved at $s^* = \min(K, N)$.

The Optimality Gap: Comparing the achievable rate to the lower bound:

$$\frac{R_{\text{MAN}}}{R_{\text{lower}}} = \frac{K(1-M/N)/(1+KM/N)}{s^*(1-M/N)/(1+s^*M/N)} \leq \frac{K(1+s^*M/N)}{s^*(1+KM/N)}$$

When $K = N$ and $s^* = N$, this ratio is exactly 1—the MAN scheme is optimal.

More recent work has tightened this gap for many parameter regimes, showing that the MAN scheme is often exactly optimal or within a factor of 2.
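Both the achievable rate and the cut-set bound are one-liners, so the gap claim is easy to sanity-check (a sketch; function names are mine):

```python
def man_rate(K, N, M):
    """Achievable rate of the Maddah-Ali & Niesen scheme."""
    return K * (1 - M / N) / (1 + K * M / N)

def cut_set_bound(K, N, M):
    """The cut-set lower bound derived above, maximized over subset size s."""
    return max(s * (1 - M / N) / (1 + s * M / N)
               for s in range(1, min(K, N) + 1))

# For K = N the derivation says the maximizer is s* = N and the gap is 1:
K, N, M = 30, 30, 10
print(man_rate(K, N, M) / cut_set_bound(K, N, M))  # → 1.0
```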


Part II: Extensions and Practical Variants

Decentralized Coded Caching

The centralized MAN scheme requires the server to coordinate placement across all users—each user must cache specific subfiles determined by a global plan. This is impractical for large systems where users join and leave dynamically.

Decentralized coded caching (also by Maddah-Ali and Niesen) removes this requirement: each user independently caches a random fraction $M/N$ of each file. Remarkably, coded delivery still works!

Placement: Each user independently caches each bit of each file with probability M/NM/N.

Delivery: The server identifies which users have which bits cached (through feedback), then constructs coded transmissions that exploit random overlaps.

Achievable Rate (for large file size $F$):

$$R_{\text{decentralized}} \approx \frac{K(1 - M/N)}{1 + KM/N} \cdot (1 + o(1))$$

Remarkably, the decentralized scheme achieves essentially the same rate as the coordinated scheme, with only a vanishing penalty as file size grows.

Why it works: With high probability, the random placement creates sufficient overlap structure. Any $t+1$ users will have complementary cached content that enables coding. The law of large numbers ensures this works well for large files.
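For a concrete feel, the decentralized scheme's delivery rate has a closed form—$(N/M - 1)(1 - (1 - M/N)^K)$, as I recall it from the decentralized Maddah-Ali & Niesen paper (treat the exact expression as an assumption)—which can be compared with the centralized rate:

```python
def centralized_rate(K, N, M):
    return K * (1 - M / N) / (1 + K * M / N)

def decentralized_rate(K, N, M):
    # Assumed closed form (N/M - 1)(1 - (1 - M/N)^K); like the centralized
    # rate, it converges to N/M - 1 as the number of users K grows.
    return (N / M - 1) * (1 - (1 - M / N) ** K)

K, N, M = 30, 30, 10
print(round(centralized_rate(K, N, M), 3),    # ≈ 1.818
      round(decentralized_rate(K, N, M), 3))  # ≈ 2.0
```

At $K = 30$ the random placement pays roughly a 10% penalty over the coordinated scheme; both rates approach $N/M - 1 = 2$ as $K$ grows, which is the sense in which coordination costs essentially nothing at scale.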

Multi-Access Coded Caching

In practical networks, users often have access to multiple caches—for example, a user at the boundary between two cells might connect to both base stations.

Multi-access coded caching exploits this:

┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-ACCESS CODED CACHING                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                            SERVER                                        │
│                               │                                          │
│               ┌───────────────┼───────────────┐                         │
│               │               │               │                         │
│               ▼               ▼               ▼                         │
│          ┌────────┐      ┌────────┐      ┌────────┐                    │
│          │Cache 1 │      │Cache 2 │      │Cache 3 │                    │
│          │  (C₁)  │      │  (C₂)  │      │  (C₃)  │                    │
│          └───┬────┘      └───┬────┘      └───┬────┘                    │
│              │               │               │                          │
│              │    ┌──────────┼───────────┐   │                          │
│              │    │          │           │   │                          │
│              ▼    ▼          ▼           ▼   ▼                          │
│           ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
│           │ User 1  │    │ User 2  │    │ User 3  │                     │
│           │         │    │         │    │         │                     │
│           │ Access: │    │ Access: │    │ Access: │                     │
│           │ C₁, C₂  │    │ C₂, C₃  │    │ C₁, C₃  │                     │
│           └─────────┘    └─────────┘    └─────────┘                     │
│                                                                          │
│  Each user can retrieve content from L = 2 caches.                     │
│                                                                          │
│  BENEFITS:                                                               │
│  • Effective cache size per user increases                             │
│  • More coding opportunities from overlapping access                   │
│  • Natural model for cell-edge users                                   │
│                                                                          │
│  The MAN scheme extends to this setting with modified placement        │
│  that accounts for shared cache access patterns.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The multi-access setting offers additional gains because users can combine content from multiple sources, increasing effective cache size and creating new coding opportunities.

Device-to-Device (D2D) Coded Caching

In D2D coded caching, users can also receive content from nearby users through direct device-to-device links. This adds a spatial dimension:

  • Server broadcasts reach all users (but limited capacity)
  • D2D transmissions have limited range but can occur simultaneously in different locations

The combination of coded caching with D2D spatial reuse can provide multiplicative gains: coding gain from cached content coordination, plus spatial reuse gain from simultaneous D2D transmissions in different areas.

Hierarchical Coded Caching

Real networks often have multiple cache layers:

  1. Core caches: Large, centrally located (CDN servers)
  2. Edge caches: Medium-sized, at base stations (MEC)
  3. User caches: Small, on end devices

Hierarchical coded caching applies the coded caching principle at each layer, with careful coordination to avoid redundant caching. Content that's already coded for delivery at layer $i$ need not be re-cached at layer $i+1$.

The analysis becomes complex, but the fundamental insight remains: coordinated placement plus coded delivery provides multiplicative gains at each layer, compounding through the hierarchy.


Part III: Challenges in Practical Deployment

The Subpacketization Problem

The MAN scheme requires splitting each file into $\binom{K}{KM/N}$ subfiles. For realistic parameters, this number explodes:

  • $K = 100$ users, $M/N = 0.1$ (cache 10% of library)
  • $t = KM/N = 10$
  • Subfiles per file: $\binom{100}{10} \approx 1.7 \times 10^{13}$

Each subfile needs a unique index, and the delivery phase must identify and combine specific subfiles. This is clearly impractical.
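The growth is easy to check directly. A quick sketch (plain Python, using exact integer arithmetic via `math.comb`; the helper name is ours):

```python
# Subpacketization of the MAN scheme: F = C(K, K*M/N) subfiles per file.
from math import comb

def man_subpacketization(K: int, cache_fraction: float) -> int:
    """Number of subfiles per file in the Maddah-Ali--Niesen scheme."""
    t = round(K * cache_fraction)  # t = KM/N must be an integer
    return comb(K, t)

for K in (10, 20, 50, 100):
    # each file must be split into F subfiles, each separately indexed
    print(f"K={K:4d}: F = {man_subpacketization(K, 0.1):,}")
```

Already at $K=100$ the count exceeds $10^{13}$, which is why the constructions below trade a little rate for drastically smaller $F$.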

Solutions:

  1. Placement Delivery Arrays (PDAs): Combinatorial designs that achieve coded caching gain with polynomial subpacketization
  2. User grouping: Divide users into smaller groups, apply coded caching within groups
  3. Approximate schemes: Accept some loss in coding gain for practical subpacketization

Placement Delivery Arrays: A Rigorous Construction

A Placement Delivery Array (PDA) is a combinatorial structure that specifies both cache placement and coded delivery with controlled subpacketization.

Definition: A $(K, F, Z, S)$-PDA is a $K \times F$ array $\mathbf{P}$ where:

  • $K$ = number of users
  • $F$ = number of subfiles per file (subpacketization)
  • $Z$ = number of entries in each row that are "*" (determines cache size)
  • $S$ = number of distinct integers used (determines delivery transmissions)

The array entries are either "*" (indicating cached content) or integers from $\{1, 2, \ldots, S\}$.

Properties:

  1. Each row contains exactly $Z$ stars → cache size $M = ZN/F$
  2. Each integer appears at most once per column
  3. If an integer $s$ appears at positions $(k_1, f_1)$ and $(k_2, f_2)$, then $\mathbf{P}[k_1][f_2] = \mathbf{P}[k_2][f_1] = *$, so each user in the pair has cached exactly the subfile the other one wants and can decode the XOR

Example: A $(4, 4, 2, 4)$-PDA for $K=4$ Users

$$\mathbf{P} = \begin{pmatrix} * & * & 1 & 2 \\ 3 & * & 4 & * \\ * & 1 & * & 3 \\ 2 & 4 & * & * \end{pmatrix}$$

Interpretation:

  • Rows = users (1, 2, 3, 4)
  • Columns = subfiles ($W_{n,1}, W_{n,2}, W_{n,3}, W_{n,4}$ for each file $W_n$)
  • "*" = this user caches this subfile
  • Integer $s$ = this subfile is delivered in transmission $s$

Cache contents (from stars):

  • User 1 caches: $W_{n,1}, W_{n,2}$ for all $n$ → 2/4 = 50% of each file
  • User 2 caches: $W_{n,2}, W_{n,4}$ for all $n$
  • User 3 caches: $W_{n,1}, W_{n,3}$ for all $n$
  • User 4 caches: $W_{n,3}, W_{n,4}$ for all $n$

Delivery (for demands $d_1, d_2, d_3, d_4$):

  • Transmission 1 (integer 1 appears at positions $(1,3)$ and $(3,2)$): $X_1 = W_{d_1,3} \oplus W_{d_3,2}$. User 1 wants $W_{d_1,3}$ and has $W_{d_3,2}$ cached; User 3 wants $W_{d_3,2}$ and has $W_{d_1,3}$ cached. Each recovers its subfile by XORing out what it already knows.

  • Transmissions 2, 3, 4 serve the remaining integer pairs the same way.

Rate achieved: $R = S/F = 4/4 = 1$ file

Comparison to MAN scheme: For $K=4$, $M/N=1/2$:

  • MAN scheme: $F = \binom{4}{2} = 6$ subfiles, $R = 4(1-1/2)/(1+2) = 2/3$
  • PDA scheme: $F = 4$ subfiles, $R = 1$

The PDA trades slightly higher rate (1 vs 2/3) for much lower subpacketization (4 vs 6).
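The defining properties are easy to machine-check. Below is a minimal sketch (plain Python, written for this post) that verifies an array is a valid PDA, run on one valid $(4,4,2,4)$-PDA of the size discussed above:

```python
STAR = "*"

def is_pda(P, Z):
    """Check the defining properties of a placement delivery array."""
    K, F = len(P), len(P[0])
    # (1) each row has exactly Z stars (cache size M = Z*N/F)
    if any(row.count(STAR) != Z for row in P):
        return False
    # (2) each integer appears at most once per column
    for j in range(F):
        col = [P[i][j] for i in range(K) if P[i][j] != STAR]
        if len(col) != len(set(col)):
            return False
    # (3) equal integers at (i,j) and (i2,j2) need stars at (i,j2) and (i2,j),
    #     so each user in the pair can XOR out the other's subfile
    cells = [(i, j, P[i][j]) for i in range(K) for j in range(F) if P[i][j] != STAR]
    for i, j, s in cells:
        for i2, j2, s2 in cells:
            if s == s2 and (i, j) != (i2, j2):
                if i == i2 or P[i][j2] != STAR or P[i2][j] != STAR:
                    return False
    return True

P = [
    [STAR, STAR, 1, 2],
    [3, STAR, 4, STAR],
    [STAR, 1, STAR, 3],
    [2, 4, STAR, STAR],
]
print(is_pda(P, Z=2), "rate R = S/F =", 4 / 4)
```

The same checker rejects arrays that reuse an integer within a column or break the star-pairing condition, which is a handy sanity test when experimenting with hand-built PDAs.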

General Constructions: Several PDA families achieve polynomial subpacketization:

  1. Maddah-Ali–Niesen PDA: $F = \binom{K}{t}$, $S = \binom{K}{t+1}$, $R = \frac{K-t}{t+1}$
  2. Resolvable design PDA: $F = O(K^2)$, with rate penalty $\leq 2$
  3. Ruzsa–Szemerédi graph PDA: $F = O(K^{1+\epsilon})$ for any $\epsilon > 0$

The Fundamental Trade-off:

$$R \cdot F \geq K(1-M/N) \cdot \frac{F}{1 + KM/N}$$

Equivalently, $R \geq \frac{K(1-M/N)}{1+KM/N}$: lowering the subpacketization $F$ generally forces a higher rate $R$, but no placement can push the rate below the MAN benchmark.

Recent work has reduced subpacketization from exponential to polynomial in KK while retaining most of the coding gain.

Non-Uniform Popularity

The MAN scheme assumes all files are equally likely to be requested. Real content follows highly skewed popularity distributions (Zipf's law):

$$P(\text{file } n \text{ requested}) = p_n = \frac{n^{-\alpha}}{\sum_{j=1}^N j^{-\alpha}}$$

where $\alpha \approx 0.8$ for typical video content (empirically validated across Netflix, YouTube, and CDN traces).
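The concentration this implies is easy to see numerically. A small sketch (plain Python; catalog size and $\alpha$ are illustrative):

```python
# Zipf popularity p_n proportional to n^(-alpha): the head of the catalog
# draws a disproportionate share of all requests.
def zipf_pmf(N, alpha):
    weights = [n ** -alpha for n in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = zipf_pmf(1000, 0.8)
print(f"top 1% of files draw {sum(p[:10]):.1%} of requests")
print(f"top 10% of files draw {sum(p[:100]):.1%} of requests")
```

With $\alpha = 0.8$ the top tenth of a 1000-file catalog already captures over half the demand, which is what makes the placement question below non-trivial.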

Implication: Most requests concentrate on a few popular files. Should we:

  • Cache popular files completely (maximize local hits)?
  • Cache diverse content (maximize coding opportunities)?

Optimal Placement Under Zipf: A Convex Optimization

For non-uniform popularity, the optimal cache allocation can be formulated as a convex optimization problem.

Decision Variables: Let $m_n \in [0,1]$ denote the fraction of file $n$ cached at each user.

Constraint: Total cache size is $M$ files: $\sum_{n=1}^N m_n = M$

Objective: Minimize the expected delivery rate. For uncoded caching: $R_{\text{uncoded}} = K \sum_{n=1}^N p_n (1 - m_n)$

For coded caching with non-uniform popularity, the rate expression becomes more complex. A tractable approximation uses the grouped coded caching framework:

Grouped Coded Caching Formulation:

Partition files into groups $\mathcal{G}_1, \mathcal{G}_2, \ldots$ based on popularity. Within each group, apply coded caching with uniform treatment.

$$\min_{m_n, \mathcal{G}} \sum_{g} \sum_{n \in \mathcal{G}_g} p_n \cdot \frac{K(1 - m_n)}{1 + K \bar{m}_g}$$

where $\bar{m}_g$ is the average cache fraction for files in group $g$.

Closed-Form Solution for Special Cases:

Case 1: Pure Local Caching (no coding gain exploited)

The optimal allocation follows a water-filling solution: $m_n^* = \left[ \frac{p_n^{1/2}}{\lambda} - 1 \right]_0^1$

where $\lambda$ is the Lagrange multiplier set so that $\sum_n m_n^* = M$, and $[x]_a^b = \max(a, \min(b, x))$.
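In practice $\lambda$ has no closed form and is found numerically; since the allocated total is monotone in $\lambda$, bisection suffices. A sketch implementing the stated closed form (the Zipf popularity is illustrative):

```python
def waterfill(p, M, tol=1e-10):
    """Allocate m_n = clip(sqrt(p_n)/lam - 1, 0, 1), choosing the
    multiplier lam by bisection so that sum(m_n) = M."""
    def alloc(lam):
        return [min(1.0, max(0.0, pn ** 0.5 / lam - 1.0)) for pn in p]

    lo, hi = 1e-12, max(pn ** 0.5 for pn in p)  # sum(alloc) is decreasing in lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(alloc(mid)) > M:
            lo = mid  # total allocation too large: raise the water level
        else:
            hi = mid
    return alloc(hi)

weights = [n ** -0.8 for n in range(1, 21)]
popularity = [w / sum(weights) for w in weights]
m = waterfill(popularity, M=5)
print([round(x, 3) for x in m])
```

As expected, the allocation is monotone: more popular files get a larger cached fraction, with the head clipped at 1 and the tail at 0.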

Case 2: Two-Group Approximation

Partition files into "popular" ($n \leq N_p$) and "unpopular" ($n > N_p$):

  • Popular files: Cache completely ($m_n = 1$)
  • Unpopular files: Apply coded caching with $m_n = m_u$ for all $n > N_p$

The optimization becomes:

$$\min_{N_p, m_u} \; \sum_{n=1}^{N_p} p_n \cdot 0 \; + \sum_{n=N_p+1}^{N} p_n \cdot \frac{K(1-m_u)}{1 + Km_u}$$

subject to $N_p + (N - N_p) m_u = M$
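A brute-force search over a three-tier variant of this design (fully cached head, coded middle tier, uncached tail) can be sketched in a few lines. Note this uses a deliberately coarse rate model, so its optimum need not reproduce the exact figures quoted in the numerical example below; it is meant only to show the search structure:

```python
def zipf(N, a):
    w = [n ** -a for n in range(1, N + 1)]
    s = sum(w)
    return [x / s for x in w]

def grouped_rate(p, K, Np, Nc, mu):
    # coded middle tier (files Np+1..Nc, each cached at fraction mu),
    # plus an uncached tail served at full cost
    mid = sum(p[Np:Nc])
    tail = sum(p[Nc:])
    return mid * K * (1 - mu) / (1 + K * mu) + tail * K

def best_grouping(p, K, M):
    N = len(p)
    best = (float("inf"), 0, N, 0.0)
    for Np in range(0, int(M) + 1):        # head files cached whole
        for Nc in range(Np + 1, N + 1):    # last file in the coded tier
            mu = (M - Np) / (Nc - Np)      # leftover budget spread over the tier
            if not 0.0 <= mu <= 1.0:
                continue
            r = grouped_rate(p, K, Np, Nc, mu)
            if r < best[0]:
                best = (r, Np, Nc, mu)
    return best

p = zipf(100, 0.8)
r, Np, Nc, mu = best_grouping(p, K=20, M=10)
print(f"cache whole: top {Np}; coded tier: files {Np+1}..{Nc} at mu={mu:.3f}; rate={r:.2f}")
print(f"uniform MAN rate for comparison: {20 * (1 - 0.1) / (1 + 2):.2f}")
```

Since the uniform MAN placement ($N_p = 0$, $N_c = N$) is inside the search space, the returned rate can never be worse than the uniform baseline.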

The Critical Trade-off:

Define the coding opportunity cost:

$$\text{COC}(n) = p_n \cdot \frac{\partial}{\partial m_n} \left[ \frac{K(1-m_n)}{1 + Km_n} \right] = -p_n \cdot \frac{K(1 + K)}{(1 + Km_n)^2}$$

Files with high $p_n$ benefit more from being fully cached (no coding), while files with moderate $p_n$ benefit from participating in coded delivery.

Numerical Example: For $N = 100$, $K = 20$, $M = 10$, $\alpha = 0.8$:

  • Optimal strategy caches top ~5 files completely
  • Applies coded caching to files 6-40
  • Ignores files 41-100 (too unpopular)
  • Rate reduction vs. uniform MAN: ~25%

AI/ML contribution: Predicting popularity and optimally allocating cache space across files is precisely where machine learning excels.

Unknown and Time-Varying Popularity

Real popularity isn't just non-uniform—it's also unknown and changes over time. Viral content emerges suddenly; trending topics shift daily.

Online learning approaches:

  • Bandit algorithms: Explore-exploit tradeoff for cache updates
  • Prediction models: LSTM, transformers to forecast future popularity from past patterns
  • Reinforcement learning: Learn cache replacement policies that adapt to popularity dynamics

This is where the information-theoretic foundations meet practical machine learning.


Part IV: AI/ML for Coded Caching

Why AI is Essential

Information theory tells us what is achievable—the minimum rate under idealized assumptions. But practical deployment requires:

  1. Predicting content popularity without complete information
  2. Adapting to time-varying demands in real-time
  3. Handling heterogeneous user preferences across locations
  4. Respecting privacy constraints in data collection
  5. Making decisions under computational constraints

AI/ML provides the tools to bridge theory and practice.

Deep Reinforcement Learning for Cache Placement

Cache placement can be formulated as a Markov Decision Process (MDP):

State: Current cache contents, recent request history, time of day, user context

Action: Which content to cache, which to evict, how to split cache across files

Reward: Negative delivery rate (or equivalently, cache hit rate, latency reduction, backhaul savings)

The challenge: The action space is enormous (all possible cache configurations), and the reward depends on future requests that are stochastic.

┌─────────────────────────────────────────────────────────────────────────┐
│              DEEP RL FOR CODED CACHING OPTIMIZATION                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE MDP FORMULATION:                                                    │
│                                                                          │
│  STATE SPACE S:                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                                                                  │   │
│  │  s = ( C, H, P̂, τ )                                             │   │
│  │                                                                  │   │
│  │  C: Current cache configuration                                 │   │
│  │     • Which files are cached                                    │   │
│  │     • What fraction of each file                                │   │
│  │     • Which subfiles (for coded caching)                        │   │
│  │                                                                  │   │
│  │  H: Request history                                              │   │
│  │     • Recent requests from each user                            │   │
│  │     • Temporal patterns (time of day, day of week)             │   │
│  │     • User context (location, device type)                     │   │
│  │                                                                  │   │
│  │  P̂: Estimated popularity distribution                          │   │
│  │     • Current best estimate of file popularities               │   │
│  │     • Uncertainty/confidence in estimates                       │   │
│  │                                                                  │   │
│  │  τ: Time features                                                │   │
│  │     • Hour of day, day of week (cyclic encoding)               │   │
│  │     • Special events (sports, news, holidays)                  │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ACTION SPACE A:                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                                                                  │   │
│  │  For conventional caching:                                       │   │
│  │  • a = (evict_file, cache_file) discrete choices               │   │
│  │                                                                  │   │
│  │  For coded caching:                                              │   │
│  │  • a = (placement_strategy, subfile_allocation)                │   │
│  │  • More complex: must maintain coding structure                 │   │
│  │                                                                  │   │
│  │  For practical systems:                                          │   │
│  │  • a = (cache_probabilities) continuous vector                 │   │
│  │  • Probabilistic placement: cache file n with probability aₙ  │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  REWARD FUNCTION:                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                                                                  │   │
│  │  r(s, a) = -R(s, a)  (negative delivery rate)                  │   │
│  │                                                                  │   │
│  │  Or multi-objective:                                             │   │
│  │  r = α·(cache_hit_rate) + β·(latency_reduction) +              │   │
│  │      γ·(backhaul_savings) - λ·(update_cost)                    │   │
│  │                                                                  │   │
│  │  The reward captures both immediate benefit (hits on current   │   │
│  │  requests) and long-term value (positioned for future demand). │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ALGORITHM: Advantage Actor-Critic (A2C) or Proximal Policy            │
│  Optimization (PPO) work well for this continuous control problem.     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RESULTS FROM RECENT RESEARCH (2025):                                   │
│                                                                          │
│  • A2C-based coded caching achieves 15-25% rate reduction             │
│    over heuristic policies (LRU, LFU)                                  │
│                                                                          │
│  • Converges in ~5000 episodes for moderate-sized systems             │
│                                                                          │
│  • Adapts automatically to popularity shifts within ~100 requests     │
│                                                                          │
│  • Outperforms optimal static policy under time-varying popularity    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The Bellman Equation for Optimal Caching

The optimal caching policy satisfies the Bellman optimality equation. For infinite-horizon discounted reward with discount factor $\gamma \in (0,1)$:

Value Function: The optimal value function $V^*(s)$ represents the maximum expected cumulative reward starting from state $s$:

$$V^*(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right]$$

Q-Function: The action-value function $Q^*(s, a)$ represents the value of taking action $a$ in state $s$:

$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s'|s, a) \max_{a'} Q^*(s', a')$$

For Caching Specifically:

Let $s = (C, H)$ be the state (cache configuration $C$, history $H$), and $a = (\text{evict}, \text{cache})$ be the caching action.

The reward for caching configuration $C$ when request $d$ arrives:

$$r(C, d) = \begin{cases} +1 & \text{if } d \in C \text{ (cache hit)} \\ -c_{\text{backhaul}} & \text{if } d \notin C \text{ (cache miss)} \end{cases}$$

For coded caching, the reward becomes:

$$r(C, \mathbf{d}) = -R_{\text{coded}}(C, \mathbf{d})$$

where $R_{\text{coded}}$ is the delivery rate given cache configuration $C$ and demand vector $\mathbf{d}$.

State Transition: Under demand distribution $\mathbf{p}$:

$$P(s'|s, a) = \sum_{\mathbf{d}} P(\mathbf{d}) \cdot \mathbf{1}[s' = T(s, a, \mathbf{d})]$$

where $T$ is the deterministic state transition function.

Function Approximation: For large state spaces, we approximate $Q^*(s,a) \approx Q_\theta(s,a)$ using neural networks. The A2C algorithm updates:

Critic (value function): $\mathcal{L}_{\text{critic}} = \mathbb{E}\left[ (r + \gamma V_\phi(s') - V_\phi(s))^2 \right]$

Actor (policy): $\nabla_\theta J = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a) \right]$

where the advantage $A(s,a) = Q(s,a) - V(s) \approx r + \gamma V(s') - V(s)$.
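The Bellman machinery can be exercised on a toy caching MDP small enough to solve exactly. A sketch (plain Python; two files, a cache of size one, and all numbers illustrative, not from the post):

```python
# Toy MDP: state = file currently cached, action = file to cache next.
# Reward = expected hit (+1) / miss (-1) under popularity p, minus an
# update cost when the cache contents change.
p = [0.8, 0.2]   # request probabilities for files 0 and 1 (illustrative)
gamma = 0.9
c_update = 0.3   # cost of swapping cache contents

def reward(s, a):
    hit = p[a] * 1.0 + (1 - p[a]) * (-1.0)  # expected hit/miss payoff
    return hit - (c_update if a != s else 0.0)

# Value iteration on V(s) = max_a [ r(s,a) + gamma * V(a) ];
# the transition is deterministic: the next state is the file we cache.
V = [0.0, 0.0]
for _ in range(500):
    V = [max(reward(s, a) + gamma * V[a] for a in (0, 1)) for s in (0, 1)]

policy = [max((0, 1), key=lambda a: reward(s, a) + gamma * V[a]) for s in (0, 1)]
print("V =", [round(v, 3) for v in V], "policy =", policy)
```

The optimal policy caches the popular file from either state, and pays the one-time swap cost if it starts on the unpopular one: exactly the structure the Bellman recursion encodes.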

Competitive Ratio and Regret Analysis

For online caching algorithms, we compare against the optimal offline algorithm (which knows all future requests) using competitive ratio and regret.

Competitive Ratio: An online algorithm $\mathcal{A}$ has competitive ratio $c$ if:

$$\text{Cost}(\mathcal{A}) \leq c \cdot \text{Cost}(\text{OPT}) + b$$

for all request sequences, where $b$ is an additive constant.

Classical Results:

  • LRU (Least Recently Used): Competitive ratio = $k$ for cache size $k$
  • LFU (Least Frequently Used): Competitive ratio = $O(\log n)$
  • Marker algorithm (randomized): Competitive ratio = $2H_k \approx 2\ln k$
  • Lower bound: No deterministic algorithm achieves a ratio better than $k$
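The worst case behind LRU's ratio of $k$ is concrete: cycle through $k+1$ distinct pages with a size-$k$ cache and LRU misses every single request, while the offline optimum (Belady's rule: evict the page reused farthest in the future) misses only occasionally. A small sketch written for this post:

```python
from collections import OrderedDict

def lru_misses(seq, k):
    cache, misses = OrderedDict(), 0
    for x in seq:
        if x in cache:
            cache.move_to_end(x)          # mark as most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict least recently used
            cache[x] = True
    return misses

def belady_misses(seq, k):
    cache, misses = set(), 0
    for i, x in enumerate(seq):
        if x in cache:
            continue
        misses += 1
        if len(cache) >= k:
            def next_use(y):  # index of y's next request, inf if never
                try:
                    return seq.index(y, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses

k = 4
seq = list(range(k + 1)) * 20  # 0,1,2,3,4,0,1,2,3,4,...
print("LRU misses:", lru_misses(seq, k), "OPT misses:", belady_misses(seq, k))
```

LRU thrashes (100 misses out of 100 requests on this sequence), while Belady's offline policy misses far less, which is the gap the competitive ratio measures.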

Regret for Learning Algorithms: For RL-based caching, we measure regret—the difference between cumulative reward of the learning agent and the optimal policy:

$$\text{Regret}(T) = \sum_{t=1}^T \left[ V^*(s_t) - r_t \right]$$

Sublinear Regret: A good learning algorithm achieves $\text{Regret}(T) = o(T)$, meaning the average per-step regret vanishes as $T \to \infty$.

Results for RL Caching:

For the contextual bandit formulation (stateless): $\text{Regret}(T) = O(\sqrt{NT \log N})$

where $N$ is the number of files.

For the full MDP formulation with finite state space $|\mathcal{S}|$ and action space $|\mathcal{A}|$: $\text{Regret}(T) = \tilde{O}(\sqrt{|\mathcal{S}||\mathcal{A}|T})$

Practical Implications:

  • After $T = 10^4$ requests, the regret per request is $O(1/\sqrt{T}) \approx 1\%$
  • RL policies become near-optimal within ~1000 requests for moderate-sized systems
  • The learning phase cost is amortized over the deployment lifetime

Comparison: RL vs. Heuristics:

| Algorithm | Competitive Ratio | Adaptivity | Complexity |
|---|---|---|---|
| LRU | $k$ | Low | $O(1)$ |
| LFU | $O(\log N)$ | Medium | $O(\log N)$ |
| Static Optimal | 1 (known dist.) | None | $O(N)$ |
| RL (after training) | ~1.1–1.3 | High | $O(d)$ (NN) |

Deep Q-Networks (DQN) for Discrete Caching Decisions

For caching problems with discrete action spaces (cache/evict specific files), Deep Q-Networks (DQN) provide a powerful framework. DQN combines Q-learning with deep neural networks to handle large state spaces.

The DQN Algorithm

Core Idea: Approximate the optimal Q-function $Q^*(s,a)$ with a neural network $Q_\theta(s,a)$.

Loss Function: Minimize the temporal difference (TD) error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a) \right)^2 \right]$$

where $\mathcal{D}$ is the replay buffer storing past experiences, and $\theta^-$ are the target network parameters (updated slowly).

Key DQN Innovations for Stability:

  1. Experience Replay: Store transitions $(s, a, r, s')$ in a buffer $\mathcal{D}$ of size $|\mathcal{D}| = 10^5$–$10^6$. Sample mini-batches uniformly to break the correlation between consecutive samples.

  2. Target Network: Maintain a separate target network $Q_{\theta^-}$ updated every $C = 1000$–$10000$ steps: $\theta^- \leftarrow \theta$ (hard update). Or with soft updates (Polyak averaging): $\theta^- \leftarrow \tau \theta + (1-\tau) \theta^-$ where $\tau = 0.001$.

  3. $\epsilon$-Greedy Exploration: With probability $\epsilon$, select a random action; otherwise select $\arg\max_a Q_\theta(s,a)$. Anneal $\epsilon$ from 1.0 to 0.01 over training.
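Two of these stabilizers are small enough to sketch directly (plain Python scaffolding, not a full DQN; capacities and schedule values mirror the representative numbers above):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay with uniform sampling."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest transitions drop off automatically

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

def epsilon(step, eps0=1.0, eps_f=0.01, decay_steps=100_000):
    """Linearly annealed epsilon-greedy schedule, clamped at eps_f."""
    frac = min(step / decay_steps, 1.0)
    return eps0 + frac * (eps_f - eps0)

buf = ReplayBuffer(capacity=1000)
for t in range(1500):                 # overfill to show the capacity bound
    buf.push(t, t % 4, -1.0, t + 1)
print(len(buf), epsilon(0), epsilon(100_000))
```

Uniform sampling from the buffer is what breaks temporal correlation; the annealed schedule front-loads exploration while the Q-estimates are still unreliable.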

DQN Variants for Caching

Double DQN: Reduces overestimation bias by decoupling action selection from evaluation:

$$y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$$

instead of $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$.

Dueling DQN: Separates the state value $V(s)$ from the action advantage $A(s,a)$:

$$Q_\theta(s,a) = V_\theta(s) + \left( A_\theta(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A_\theta(s,a') \right)$$

This architecture learns that some states are inherently good/bad regardless of action—useful for caching where cache state strongly determines performance.

Prioritized Experience Replay: Sample transitions with probability proportional to the TD error:

$$P(i) \propto |\delta_i|^\alpha + \epsilon$$

where $\delta_i = r + \gamma \max_{a'} Q(s',a') - Q(s,a)$ is the TD error and $\alpha \in [0,1]$ controls prioritization strength.

DQN Architecture for Caching

┌─────────────────────────────────────────────────────────────────────────┐
│                    DQN ARCHITECTURE FOR CACHING                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT STATE REPRESENTATION:                                             │
│                                                                          │
│  s = [ cache_vector | popularity_estimate | temporal_features ]         │
│      └─────────────┘ └──────────────────┘ └─────────────────┘          │
│         N-dim 0/1        N-dim float         d_t-dim                    │
│                                                                          │
│  NETWORK ARCHITECTURE:                                                   │
│                                                                          │
│  State s ∈ ℝ^{N + N + d_t}                                              │
│      │                                                                   │
│      ▼                                                                   │
│  ┌─────────────────────────────────────┐                                │
│  │ Input Layer                          │                                │
│  │ Dimension: 2N + d_t (e.g., 2×1000 + 24 = 2024)                      │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│                   ▼                                                      │
│  ┌─────────────────────────────────────┐                                │
│  │ Hidden Layer 1                       │                                │
│  │ • Linear(2024, 512)                 │                                │
│  │ • ReLU activation                    │                                │
│  │ • LayerNorm (optional)              │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│                   ▼                                                      │
│  ┌─────────────────────────────────────┐                                │
│  │ Hidden Layer 2                       │                                │
│  │ • Linear(512, 256)                  │                                │
│  │ • ReLU activation                    │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│                   ▼                                                      │
│  ┌─────────────────────────────────────┐                                │
│  │ Hidden Layer 3                       │                                │
│  │ • Linear(256, 128)                  │                                │
│  │ • ReLU activation                    │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│          ┌───────┴───────┐  (Dueling architecture)                     │
│          │               │                                              │
│          ▼               ▼                                              │
│   ┌────────────┐  ┌────────────┐                                       │
│   │ Value Head │  │ Adv. Head  │                                       │
│   │ Linear(128,│  │ Linear(128,│                                       │
│   │    1)      │  │    |A|)    │                                       │
│   └─────┬──────┘  └─────┬──────┘                                       │
│         │               │                                               │
│         └───────┬───────┘                                               │
│                 │                                                        │
│                 ▼                                                        │
│         Q(s,a) = V(s) + (A(s,a) - mean(A))                             │
│                                                                          │
│  OUTPUT: Q-values for each action                                       │
│  Dimension: |A| = N (for single-file decisions) or 2N (cache/evict)   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

DQN Hyperparameters for Caching (Recommended Values)

| Hyperparameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | $\eta$ | $3 \times 10^{-4}$ | Adam optimizer |
| Discount factor | $\gamma$ | 0.99 | Long-term reward focus |
| Replay buffer size | $\lvert\mathcal{D}\rvert$ | $10^5$–$10^6$ | |
| Batch size | $B$ | 64–128 | |
| Target update frequency | $C$ | 1000 steps | Hard update |
| Soft update rate | $\tau$ | 0.001 | If using Polyak |
| Initial exploration | $\epsilon_0$ | 1.0 | |
| Final exploration | $\epsilon_f$ | 0.01 | |
| Exploration decay | – | Linear over $10^5$ steps | |
| Hidden dimensions | – | [512, 256, 128] | |
| Prioritization exponent | $\alpha$ | 0.6 | For PER |
| Importance sampling | $\beta$ | 0.4 → 1.0 | Annealed |

Proximal Policy Optimization (PPO) for Continuous Cache Allocation

For continuous action spaces (allocating fractional cache to each file), Proximal Policy Optimization (PPO) is the state-of-the-art algorithm, combining sample efficiency with stable training.

The PPO Algorithm

PPO is an actor-critic method that constrains policy updates to prevent destructive large changes.

Policy (Actor): $\pi_\theta(a|s)$ outputs a probability distribution over actions. For continuous actions:

$$a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

where $\mu_\theta$ and $\sigma_\theta$ are neural networks.

Value Function (Critic): $V_\phi(s)$ estimates the expected return from state $s$.

PPO-Clip Objective: The key innovation is clipping the probability ratio to prevent large updates:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where:

  • $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio
  • $\hat{A}_t$ is the estimated advantage
  • $\epsilon = 0.2$ is the clip range

Generalized Advantage Estimation (GAE): Compute advantages using an exponentially weighted combination of $n$-step returns:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual.

Typically $\lambda = 0.95$ balances bias (low $\lambda$) against variance (high $\lambda$).
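Over a finite rollout, GAE is computed with a single backward pass. A minimal sketch (the rewards and values are illustrative; the last entry of `values` is the bootstrap value for the final state):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lam*A_{t+1},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

adv = gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.3, 0.0], gamma=0.9, lam=0.95)
print([round(a, 4) for a in adv])
```

Setting `lam=0` recovers the one-step TD residual, and `lam=1` the full discounted Monte Carlo advantage, which is the bias-variance dial the text describes.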

Full PPO Loss:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}^{\text{CLIP}}(\theta) - c_1 \mathcal{L}^{\text{VF}}(\phi) + c_2 S[\pi_\theta]$$

where:

  • $\mathcal{L}^{\text{VF}} = (V_\phi(s_t) - V_t^{\text{target}})^2$ is the value function loss
  • $S[\pi_\theta] = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)$ is an entropy bonus for exploration
  • $c_1 = 0.5$, $c_2 = 0.01$ are weighting coefficients
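The clipped surrogate for a single (state, action) sample reduces to a few lines; a sketch with log-probabilities standing in for actual policy networks:

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate: min of the unclipped and clipped
    ratio-weighted advantage, so moving the ratio outside [1-eps, 1+eps]
    earns no extra objective improvement."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A ratio of 1.5 with positive advantage is capped at 1.2 * A:
print(ppo_clip_term(math.log(1.5), 0.0, 1.0))  # 1.2
```

Note the `min` is pessimistic in both directions: with a negative advantage, the clipped term dominates when the ratio shrinks below $1-\epsilon$, which is what keeps updates conservative.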

PPO Architecture for Cache Allocation

┌─────────────────────────────────────────────────────────────────────────┐
│                    PPO ACTOR-CRITIC FOR CACHING                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SHARED FEATURE EXTRACTOR:                                               │
│                                                                          │
│  State s ──► [ MLP: 2N+d_t → 256 → 256 ] ──► Features f(s) ∈ ℝ^{256}  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ACTOR (Policy Network):                                                 │
│                                                                          │
│  f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 2N) ]                │
│                              │                                           │
│                    ┌─────────┴─────────┐                                │
│                    │                   │                                │
│                    ▼                   ▼                                │
│              μ(s) ∈ ℝ^N         log σ(s) ∈ ℝ^N                        │
│              (mean)            (log std dev)                            │
│                    │                   │                                │
│                    └─────────┬─────────┘                                │
│                              │                                           │
│                              ▼                                           │
│               a ~ N(μ(s), exp(log σ(s))²)                              │
│               Cache allocation: a_n = sigmoid(a_n) × M/N               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRITIC (Value Network):                                                 │
│                                                                          │
│  f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 1) ] ──► V(s)       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TRAINING LOOP (per epoch):                                              │
│                                                                          │
│  1. Collect T timesteps using current policy π_θ                        │
│  2. Compute advantages Â_t using GAE                                    │
│  3. For K epochs (typically K=10):                                      │
│     a. Sample mini-batches from collected data                         │
│     b. Compute PPO-Clip loss                                           │
│     c. Update θ, φ using Adam                                          │
│  4. Repeat                                                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
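Step 2 of the training loop above (advantage estimation) can be sketched in a few lines. `gae_advantages` is a hypothetical helper implementing the standard GAE recursion Â_t = δ_t + γλÂ_{t+1}, with δ_t the one-step TD error:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    `values` has length T+1: it includes a bootstrap value for the final state."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                         # recursion
        adv[t] = gae
    return adv
```

With γ = λ = 1 this degenerates to the Monte Carlo advantage (sum of future TD errors); the defaults γ = 0.99, λ = 0.95 match the hyperparameter table below.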

PPO Hyperparameters for Caching (Recommended Values)

| Hyperparameter | Symbol | Value | Notes |
|----------------|--------|-------|-------|
| Learning rate | \eta | 3 \times 10^{-4} | Adam, can decay |
| Clip range | \epsilon | 0.2 | PPO-Clip parameter |
| GAE lambda | \lambda | 0.95 | Advantage estimation |
| Discount factor | \gamma | 0.99 | |
| Value loss coefficient | c_1 | 0.5 | |
| Entropy coefficient | c_2 | 0.01 | Encourages exploration |
| Epochs per update | K | 10 | Reuse collected data |
| Mini-batch size | B | 64 | |
| Rollout length | T | 2048 | Steps before update |
| Number of parallel envs | - | 8-16 | For variance reduction |
| Hidden dimensions | - | [256, 256, 128] | Shared + heads |
| Max gradient norm | - | 0.5 | Gradient clipping |

A2C vs. PPO: When to Use Which

| Criterion | A2C | PPO |
|-----------|-----|-----|
| Sample efficiency | Lower | Higher (reuses data) |
| Training stability | Moderate | High (clipping) |
| Wall-clock time | Faster per update | Slower per update |
| Hyperparameter sensitivity | Higher | Lower |
| Best for | Fast prototyping | Production deployment |
| Parallelization | Synchronous workers | Parallel envs |

Recommendation: Use PPO for production coded caching systems due to its stability. Use A2C for rapid experimentation.

Federated Learning for Privacy-Preserving Popularity Prediction

Predicting content popularity requires observing user behavior—but users may not want their viewing habits shared with a central server.

Federated learning solves this:

  1. Local training: Each edge server trains a popularity prediction model on local user data
  2. Gradient aggregation: Only model gradients (not raw data) are sent to the central server
  3. Global model update: Central server averages gradients to update a global model
  4. Model distribution: Updated global model is sent back to edges
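Steps 2-3 above amount to federated averaging. A minimal sketch, assuming full participation and equal weighting across the K edges (the helper name `fedavg_round` and the dict-of-arrays model representation are my own):

```python
import numpy as np

def fedavg_round(global_w, local_grads, lr=0.01):
    """One synchronous federated round: the server averages the gradients
    reported by the K edges and applies a single SGD step to the global model.
    global_w: dict name -> parameter array; local_grads: list of such dicts."""
    K = len(local_grads)
    return {name: w - lr * sum(g[name] for g in local_grads) / K
            for name, w in global_w.items()}
```

The updated dictionary is what the server redistributes in step 4; only gradient tensors, never raw request logs, cross the edge boundary.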
┌─────────────────────────────────────────────────────────────────────────┐
│              FEDERATED LEARNING FOR CACHING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         ┌───────────────┐                               │
│                         │    CENTRAL    │                               │
│                         │    SERVER     │                               │
│                         │               │                               │
│                         │  Global Model │                               │
│                         │  θ_global     │                               │
│                         └───────┬───────┘                               │
│                                 │                                        │
│              ┌──────────────────┼──────────────────┐                    │
│              │                  │                  │                    │
│      Distribute θ       Aggregate              Distribute θ            │
│              │           Gradients                 │                    │
│              │                  │                  │                    │
│              ▼                  │                  ▼                    │
│       ┌────────────┐    ┌──────┴──────┐    ┌────────────┐             │
│       │   EDGE 1   │    │             │    │   EDGE K   │             │
│       │            │    │  Gradients  │    │            │             │
│       │ Local Data │────│  ∇θ₁,...,∇θₖ│────│ Local Data │             │
│       │ (private)  │    │             │    │ (private)  │             │
│       │            │    │             │    │            │             │
│       │ Train on   │    │  Aggregate: │    │ Train on   │             │
│       │ local      │    │  θ ← θ -    │    │ local      │             │
│       │ requests   │    │  η·Σ∇θᵢ/K   │    │ requests   │             │
│       └────────────┘    └─────────────┘    └────────────┘             │
│                                                                          │
│  PRIVACY GUARANTEES:                                                    │
│                                                                          │
│  • Raw user data NEVER leaves the edge                                 │
│  • Central server sees only aggregated gradients                       │
│  • With differential privacy noise, even gradients don't leak         │
│    individual user preferences                                          │
│                                                                          │
│  BENEFITS FOR CODED CACHING:                                            │
│                                                                          │
│  • Learn global popularity patterns (what's trending everywhere)       │
│  • Preserve local personalization (edge models fine-tune)             │
│  • Scale to millions of users without central data collection         │
│  • Comply with privacy regulations (GDPR, CCPA)                       │
│                                                                          │
│  RECENT WORK (Graph Federated Learning, arXiv Feb 2025):               │
│                                                                          │
│  • Integrate Graph Neural Networks with Federated Learning             │
│  • LightGCN captures user-content relationships locally               │
│  • Global aggregation learns cross-user patterns                      │
│  • 20-30% improvement in cache hit rate over non-federated baseline   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Graph Neural Networks for User-Content Relationships

Content requests aren't independent—they form a graph structure:

  • User-content edges: user u requested content c
  • Content-content edges: contents c_1 and c_2 are often requested together
  • User-user edges: users u_1 and u_2 have similar preferences

Graph Neural Networks (GNNs) can learn representations that capture this structure, enabling powerful popularity prediction for cache placement.

The Bipartite User-Content Graph

Graph Definition: Let \mathcal{G} = (\mathcal{U} \cup \mathcal{C}, \mathcal{E}) where:

  • \mathcal{U} = \{u_1, \ldots, u_K\} are user nodes
  • \mathcal{C} = \{c_1, \ldots, c_N\} are content nodes
  • \mathcal{E} = \{(u, c) : \text{user } u \text{ requested content } c\} are edges

Adjacency Matrix: The bipartite adjacency matrix \mathbf{A} \in \{0,1\}^{(K+N) \times (K+N)} is

\mathbf{A} = \begin{pmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^T & \mathbf{0} \end{pmatrix}

where \mathbf{R} \in \{0,1\}^{K \times N} is the user-content interaction matrix.

GNN Message Passing Framework

The general GNN update rule for node vv at layer ll:

\mathbf{h}_v^{(l+1)} = \text{UPDATE}\left( \mathbf{h}_v^{(l)}, \text{AGGREGATE}\left( \{ \mathbf{h}_u^{(l)} : u \in \mathcal{N}(v) \} \right) \right)

Different GNN architectures define different AGGREGATE and UPDATE functions.

Graph Convolutional Networks (GCN) for Caching

GCN Layer: Propagates information using normalized adjacency:

\mathbf{H}^{(l+1)} = \sigma\left( \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right)

where:

  • \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I} (add self-loops)
  • \tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij} (degree matrix)
  • \mathbf{W}^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}} is learnable

Per-Node Update:

\mathbf{h}_v^{(l+1)} = \sigma\left( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{|\mathcal{N}(v)||\mathcal{N}(u)|}} \mathbf{W}^{(l)} \mathbf{h}_u^{(l)} \right)

The normalization prevents nodes with high degree from dominating.
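The dense form of a single GCN layer is compact enough to sketch directly in NumPy (the function name `gcn_layer` is my own; ReLU stands in for the generic nonlinearity σ):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D~^{-1/2} (A+I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W, 0.0)
```

For two connected nodes with identity features and weights, each output row is simply the degree-normalized average of the node and its neighbor, which is exactly the high-degree damping described above.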

LightGCN: State-of-the-Art for Recommendation/Caching

LightGCN simplifies GCN by removing nonlinearities and feature transformations, which proves more effective for collaborative filtering:

Layer Propagation:

\mathbf{e}_u^{(l+1)} = \sum_{c \in \mathcal{N}(u)} \frac{1}{\sqrt{|\mathcal{N}(u)||\mathcal{N}(c)|}} \mathbf{e}_c^{(l)}

\mathbf{e}_c^{(l+1)} = \sum_{u \in \mathcal{N}(c)} \frac{1}{\sqrt{|\mathcal{N}(c)||\mathcal{N}(u)|}} \mathbf{e}_u^{(l)}

Final Embedding: Average over all layers: \mathbf{e}_u = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{e}_u^{(l)}, \quad \mathbf{e}_c = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{e}_c^{(l)}

Prediction: Request probability via dot product: \hat{y}_{uc} = \mathbf{e}_u^T \mathbf{e}_c

Loss Function: Bayesian Personalized Ranking (BPR): \mathcal{L}_{\text{BPR}} = -\sum_{(u,c^+,c^-)} \log \sigma(\hat{y}_{uc^+} - \hat{y}_{uc^-}) + \lambda \|\mathbf{E}\|_2^2

where (u, c^+, c^-) are triplets with c^+ a positive (requested) and c^- a negative (not requested) content.
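Because LightGCN has no nonlinearities or weight matrices, the full propagation is just repeated multiplication by the normalized interaction matrix. A dense NumPy sketch (the helper name `lightgcn_embeddings` is my own; degrees are clamped to 1 only to avoid division by zero for isolated nodes):

```python
import numpy as np

def lightgcn_embeddings(R, E_u, E_c, L=2):
    """LightGCN propagation on interaction matrix R (K users x N contents).
    Returns layer-averaged user/content embeddings and the score matrix."""
    du = R.sum(axis=1)                          # |N(u)| for each user
    dc = R.sum(axis=0)                          # |N(c)| for each content
    # entries 1/sqrt(|N(u)||N(c)|) on each (u, c) edge:
    norm = R / np.sqrt(np.outer(np.maximum(du, 1), np.maximum(dc, 1)))
    users, items = [E_u], [E_c]
    for _ in range(L):
        users.append(norm @ items[-1])          # e_u^(l+1) from neighbor contents
        items.append(norm.T @ users[-2])        # e_c^(l+1) from neighbor users
    e_u = sum(users) / (L + 1)                  # average across layers 0..L
    e_c = sum(items) / (L + 1)
    return e_u, e_c, e_u @ e_c.T                # scores y_hat_uc = e_u^T e_c
```

The returned score matrix feeds directly into the BPR triplets above: for each user, requested contents should outscore sampled negatives.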

Graph Attention Networks (GAT) for Caching

GAT learns attention weights to determine neighbor importance:

Attention Coefficient: e_{vu} = \text{LeakyReLU}\left( \mathbf{a}^T [\mathbf{W}\mathbf{h}_v \| \mathbf{W}\mathbf{h}_u] \right)

Normalized Attention: \alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}

Aggregation: \mathbf{h}_v' = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} \mathbf{W} \mathbf{h}_u \right)

Multi-Head Attention: Use H parallel attention heads and concatenate: \mathbf{h}_v' = \|_{h=1}^{H} \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(h)} \mathbf{W}^{(h)} \mathbf{h}_u \right)

Why GAT for Caching: Attention learns that some user-content interactions are more informative than others. A user who watched 1000 videos provides less signal per video than one who watched 10.
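A single-head GAT update for one node can be sketched as follows (the function name `gat_attention`, the LeakyReLU slope of 0.2, and ReLU as the output nonlinearity σ are my own choices for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_attention(h_v, neighbors, W, a):
    """Single-head GAT update for node v.
    e_vu = LeakyReLU(a^T [W h_v || W h_u]); alpha = softmax over neighbors."""
    Wh_v = W @ h_v
    scores = []
    for h_u in neighbors:
        z = np.concatenate([Wh_v, W @ h_u])   # [W h_v || W h_u]
        e = a @ z
        scores.append(e if e > 0 else 0.2 * e)  # LeakyReLU, slope 0.2
    alpha = softmax(np.array(scores))           # normalized attention
    agg = sum(al * (W @ h_u) for al, h_u in zip(alpha, neighbors))
    return np.maximum(agg, 0.0), alpha          # sigma(sum alpha_vu W h_u)
```

When all neighbors look identical the attention weights collapse to uniform, recovering GCN-style averaging; the learned vector `a` is what lets the model down-weight low-signal interactions.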

┌─────────────────────────────────────────────────────────────────────────┐
│              GNN ARCHITECTURE FOR CACHE PREDICTION                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: Bipartite graph G = (Users ∪ Content, Edges)                   │
│                                                                          │
│  User initial embeddings: E_U ∈ ℝ^{K × d}, d = 64                      │
│  Content initial embeddings: E_C ∈ ℝ^{N × d}                           │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ LAYER 0 (Initial)                                                │   │
│  │                                                                  │   │
│  │  e_u^(0) = E_U[u]  (learnable user embedding)                   │   │
│  │  e_c^(0) = E_C[c]  (learnable content embedding)                │   │
│  │                                                                  │   │
│  │  Optional: Concatenate content features (genre, duration, etc.) │   │
│  │  e_c^(0) = [E_C[c] || f_c] where f_c ∈ ℝ^{d_feat}              │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ LAYER 1 (First-hop neighbors)                                    │   │
│  │                                                                  │   │
│  │  For each user u:                                               │   │
│  │    e_u^(1) = AGG({ e_c^(0) : c ∈ N(u) })                       │   │
│  │                                                                  │   │
│  │  For each content c:                                            │   │
│  │    e_c^(1) = AGG({ e_u^(0) : u ∈ N(c) })                       │   │
│  │                                                                  │   │
│  │  AGG = LightGCN (normalized sum) or GAT (attention-weighted)   │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ LAYER 2 (Second-hop = user-user, content-content similarity)    │   │
│  │                                                                  │   │
│  │  User embedding now contains info about:                        │   │
│  │    • Directly watched content (Layer 1)                        │   │
│  │    • Other users who watched same content (Layer 2)            │   │
│  │                                                                  │   │
│  │  Content embedding now contains info about:                     │   │
│  │    • Users who watched it (Layer 1)                            │   │
│  │    • Other content those users watched (Layer 2)               │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ FINAL EMBEDDINGS (LightGCN: average across layers)              │   │
│  │                                                                  │   │
│  │  e_u = (e_u^(0) + e_u^(1) + e_u^(2)) / 3                       │   │
│  │  e_c = (e_c^(0) + e_c^(1) + e_c^(2)) / 3                       │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ PREDICTION & CACHE PLACEMENT                                     │   │
│  │                                                                  │   │
│  │  Popularity prediction for content c:                           │   │
│  │    p_c = Σ_u σ(e_u^T e_c) / K  (average predicted probability) │   │
│  │                                                                  │   │
│  │  Cache placement: Sort by p_c, cache top M files                │   │
│  │                                                                  │   │
│  │  For coded caching: Also consider embedding diversity           │   │
│  │    Maximize: Σ p_c × cache_c + λ × diversity(cached_embeddings)│   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  GNN HYPERPARAMETERS:                                                   │
│  ─────────────────────────────────────────────────────────────────────  │
│  Embedding dimension d: 64                                              │
│  Number of layers L: 2-3                                               │
│  Learning rate: 1e-3                                                    │
│  Batch size: 2048 (user-content pairs)                                 │
│  Negative samples per positive: 4                                      │
│  Regularization λ: 1e-4                                                │
│  Optimizer: Adam                                                        │
│  Training epochs: 100-500                                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

For coded caching, GNN-based popularity prediction informs placement decisions. Content with similar embeddings should be cached together (they'll be requested by similar users). Content with complementary embeddings should be spread across caches (for coding opportunities).

Attention Mechanisms for Content Popularity Prediction

Beyond GNNs, attention mechanisms can directly model user-content interactions without explicit graph structure.

Self-Attention for Request Sequences

Model each user's request history as a sequence and use self-attention to capture patterns:

Input: User uu's request history [c1,c2,,cT][c_1, c_2, \ldots, c_T]

Embedding: X=[ec1,ec2,,ecT]RT×d\mathbf{X} = [\mathbf{e}_{c_1}, \mathbf{e}_{c_2}, \ldots, \mathbf{e}_{c_T}] \in \mathbb{R}^{T \times d}

Self-Attention: Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}

where Q=XWQ\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, K=XWK\mathbf{K} = \mathbf{X}\mathbf{W}_K, V=XWV\mathbf{V} = \mathbf{X}\mathbf{W}_V.

Multi-Head Attention: MultiHead(X)=Concat(head1,,headH)WO\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \mathbf{W}_O

where headi=Attention(XWQi,XWKi,XWVi)\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_Q^i, \mathbf{X}\mathbf{W}_K^i, \mathbf{X}\mathbf{W}_V^i)

Position Encoding: Add positional information to distinguish recent from older requests:

\mathbf{PE}_{(t, 2i)} = \sin\left( \frac{t}{10000^{2i/d}} \right), \quad \mathbf{PE}_{(t, 2i+1)} = \cos\left( \frac{t}{10000^{2i/d}} \right)

For temporal patterns in caching, we can also add a time-of-day encoding: \mathbf{TE}_{(t)} = [\sin(2\pi h/24), \cos(2\pi h/24), \sin(2\pi d/7), \cos(2\pi d/7)]

where h is the hour of day and d is the day of week.
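Both pieces, the sinusoidal position table and scaled dot-product attention, fit in a short NumPy sketch (function names are my own; the sketch assumes an even embedding dimension):

```python
import numpy as np

def positional_encoding(T, d):
    """Sinusoidal PE: PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(...)."""
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]
    div = 10000 ** (np.arange(0, d, 2) / d)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a request sequence X (T x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])              # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ V
```

In practice the PE (and TE) rows are added to the content embeddings before attention, so that two identical requests at different times produce different inputs.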

Cross-Attention for User-Content Matching

Use cross-attention to match user preferences with content features:

User Query: \mathbf{q}_u = user embedding from a GNN or MLP. Content Keys/Values: \mathbf{K}_C, \mathbf{V}_C = content embeddings for all content.

\mathbf{h}_u = \text{softmax}\left( \frac{\mathbf{q}_u \mathbf{K}_C^T}{\sqrt{d}} \right) \mathbf{V}_C

This gives a user representation that's a weighted combination of content embeddings, weighted by relevance.

Transformer-Based Popularity Prediction

Full Transformer architectures can model complex temporal patterns in content requests.

Transformer Encoder for Request Prediction

┌─────────────────────────────────────────────────────────────────────────┐
│              TRANSFORMER FOR POPULARITY PREDICTION                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: Recent request history across all users                         │
│         X = [request_1, request_2, ..., request_T]                      │
│         Each request = (user_id, content_id, timestamp)                 │
│                                                                          │
│  EMBEDDING LAYER:                                                        │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  e_t = Embed_content(c_t) + Embed_user(u_t) + PE(t) + TE(timestamp)   │
│                                                                          │
│  where PE = positional encoding, TE = temporal encoding                 │
│                                                                          │
│  X_embed ∈ ℝ^{T × d}, d = 256                                          │
│                                                                          │
│  TRANSFORMER ENCODER STACK (N=4 layers):                                │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  For each layer n:                                                       │
│                                                                          │
│  ┌───────────────────────────────────────────────────────────────┐     │
│  │ Multi-Head Self-Attention (8 heads)                           │     │
│  │                                                                │     │
│  │ Q, K, V = X @ W_Q, X @ W_K, X @ W_V                          │     │
│  │ Attn = softmax(QK^T / √d_k) V                                │     │
│  │ X = LayerNorm(X + Dropout(Attn))                             │     │
│  └───────────────────────────────────────────────────────────────┘     │
│                              │                                          │
│                              ▼                                          │
│  ┌───────────────────────────────────────────────────────────────┐     │
│  │ Feed-Forward Network                                          │     │
│  │                                                                │     │
│  │ FFN(x) = ReLU(x @ W_1 + b_1) @ W_2 + b_2                     │     │
│  │ W_1 ∈ ℝ^{d × 4d}, W_2 ∈ ℝ^{4d × d}                          │     │
│  │ X = LayerNorm(X + Dropout(FFN(X)))                           │     │
│  └───────────────────────────────────────────────────────────────┘     │
│                                                                          │
│  OUTPUT HEAD (Popularity Prediction):                                   │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  h_final = X[-1]  (last position embedding)                            │
│                                                                          │
│  For each content c:                                                    │
│    p_c = σ(h_final @ W_out @ e_c)  (predicted request probability)     │
│                                                                          │
│  Or: p = softmax(h_final @ E_content^T)  (distribution over content)   │
│                                                                          │
│  TRAINING:                                                               │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Next-request prediction: Given history, predict next requested content│
│                                                                          │
│  Loss = CrossEntropy(p, c_{T+1})                                       │
│       = -log(p_{c_{T+1}})                                              │
│                                                                          │
│  For multi-step prediction, use autoregressive generation or           │
│  predict distribution over next K requests simultaneously.             │
│                                                                          │
│  HYPERPARAMETERS:                                                        │
│  ─────────────────────────────────────────────────────────────────────  │
│  Model dimension d: 256                                                 │
│  Number of layers N: 4                                                  │
│  Number of attention heads: 8                                          │
│  Feed-forward dimension: 1024 (4×d)                                    │
│  Dropout: 0.1                                                           │
│  Sequence length T: 100-500                                            │
│  Learning rate: 1e-4 (with warmup)                                     │
│  Batch size: 64 sequences                                              │
│  Warmup steps: 4000                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Pre-Training Strategies for Caching Transformers

Masked Content Prediction: As in BERT, mask random requests and predict them: \mathcal{L}_{\text{MLM}} = -\sum_{t \in \text{masked}} \log P(c_t | \mathbf{X}_{\backslash t})

Contrastive Learning: Learn representations where similar request patterns are close: \mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j)/\tau)}{\sum_k \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k)/\tau)}
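The contrastive loss for a single anchor can be sketched as follows (the name `info_nce`, cosine similarity as sim, and τ = 0.1 are my own illustrative choices):

```python
import numpy as np

def info_nce(h_anchor, h_pos, h_negs, tau=0.1):
    """Contrastive loss: pull h_anchor toward h_pos, away from h_negs.
    Uses cosine similarity; returns -log softmax over {pos} and negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(h_anchor, h_pos)] +
                    [cos(h_anchor, n) for n in h_negs]) / tau
    sims -= sims.max()                       # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())
```

The loss goes to zero when the positive dominates the negatives and equals log 2 when one negative is indistinguishable from the positive, which is what drives similar request patterns together during pre-training.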

Multi-Task Pre-Training:

  1. Predict next content
  2. Predict time until next request
  3. Predict user of next request
  4. Classify content category

Meta-Reinforcement Learning for Fast Adaptation

Content popularity can shift suddenly—a viral video emerges, breaking news changes demand patterns. Standard RL agents take thousands of interactions to adapt.

Meta-reinforcement learning addresses this by training agents that can learn to adapt quickly:

  1. Meta-training: Train on many different popularity distributions
  2. Meta-objective: Not just high reward, but fast adaptation to new distributions
  3. Deployment: When popularity shifts, the agent adapts in ~10-100 interactions instead of thousands

MAML: Model-Agnostic Meta-Learning for Caching

MAML learns an initialization \theta^* such that a few gradient steps on a new task yield good performance.

Algorithm:

  1. Sample task batch: \{\mathcal{T}_1, \ldots, \mathcal{T}_B\}, where each \mathcal{T}_i is a different popularity distribution

  2. Inner loop (task-specific adaptation): For each task \mathcal{T}_i: \theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)

    This is one (or a few) gradient steps on task-specific data.

  3. Outer loop (meta-update): Update \theta to improve post-adaptation performance: \theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(\theta_i')

    Note: This requires computing gradients through the inner gradient step (second-order).

First-Order Approximation (FOMAML): Ignore second-order terms: \theta \leftarrow \theta - \beta \sum_i \nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')

Much cheaper computationally, works nearly as well in practice.

MAML for Caching:

\mathcal{L}_{\mathcal{T}_i}(\theta) = -\mathbb{E}_{(s,a,r,s') \sim \mathcal{T}_i} \left[ \sum_t \gamma^t r_t \right]

where \mathcal{T}_i represents requests sampled from popularity distribution i.
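The inner/outer structure is easiest to see on a toy problem. The sketch below runs one FOMAML meta-update where each "task" is a quadratic loss L_i(θ) = ‖θ − t_i‖² with optimum t_i, a stand-in for a popularity profile; the function name `fomaml_step` and all constants are my own:

```python
import numpy as np

def fomaml_step(theta, tasks, alpha=0.1, beta=0.05, K=5):
    """One FOMAML meta-update on toy quadratic tasks L_i(theta) = ||theta - t_i||^2.
    tasks: list of per-task optima t_i."""
    meta_grad = np.zeros_like(theta)
    for t_i in tasks:
        th = theta.copy()
        for _ in range(K):                  # inner loop: adapt to task i
            th -= alpha * 2 * (th - t_i)    # grad of L_i is 2(theta - t_i)
        meta_grad += 2 * (th - t_i)         # first-order: gradient at adapted params
    return theta - beta * meta_grad         # outer meta-update
```

With two symmetric tasks (optima at +1 and −1) the per-task meta-gradients cancel and the initialization stays at the center, which is exactly the "good starting point for all tasks" behavior MAML is after.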

┌─────────────────────────────────────────────────────────────────────────┐
│              MAML FOR ADAPTIVE CACHING                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  META-TRAINING PHASE:                                                    │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Task Distribution: Different Zipf parameters α ∈ [0.5, 1.5]           │
│                     Different content libraries                         │
│                     Different diurnal patterns                          │
│                                                                          │
│  For each meta-iteration:                                               │
│                                                                          │
│    1. Sample B=4 tasks (popularity distributions)                       │
│                                                                          │
│    2. For each task i:                                                   │
│       ┌─────────────────────────────────────────────────────────────┐  │
│       │ Inner Loop (K=5 gradient steps):                            │  │
│       │                                                              │  │
│       │   θ_i^(0) = θ                                               │  │
│       │   for k = 1 to K:                                           │  │
│       │     Sample batch of requests from task i                    │  │
│       │     θ_i^(k) = θ_i^(k-1) - α ∇ L_i(θ_i^(k-1))               │  │
│       │                                                              │  │
│       │   θ_i' = θ_i^(K)  (adapted parameters)                      │  │
│       └─────────────────────────────────────────────────────────────┘  │
│                                                                          │
│    3. Meta-update:                                                       │
│       ┌─────────────────────────────────────────────────────────────┐  │
│       │ Evaluate L_i(θ_i') on NEW data from task i                  │  │
│       │                                                              │  │
│       │ θ ← θ - β ∇_θ Σ_i L_i(θ_i')                                │  │
│       │                                                              │  │
│       │ (This gradient flows through the inner loop!)               │  │
│       └─────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  DEPLOYMENT PHASE:                                                       │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  When popularity shifts detected:                                       │
│                                                                          │
│    1. Start with meta-learned θ*                                        │
│    2. Collect ~50-100 requests from new distribution                   │
│    3. Take K=5 gradient steps: θ_new = θ* - α ∇ L_new(θ*)             │
│    4. Deploy θ_new                                                      │
│                                                                          │
│  Result: Adaptation in 50-100 requests instead of 5000+                │
│                                                                          │
│  HYPERPARAMETERS:                                                        │
│  ─────────────────────────────────────────────────────────────────────  │
│  Inner learning rate α: 0.01                                            │
│  Outer learning rate β: 0.001                                           │
│  Inner gradient steps K: 5                                              │
│  Tasks per batch B: 4-8                                                 │
│  Meta-training iterations: 10,000-50,000                               │
│  Policy architecture: Same as PPO (shared backbone)                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
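The meta-training loop in the box above can be sketched in a few lines. Below is an illustrative first-order MAML (FOMAML) on a toy linear popularity-prediction task — a stand-in for the caching policy, not the article's implementation; all function names are ours, and the hyperparameters follow the box (α = 0.01, β = 0.001, K = 5, B = 4):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    """A toy 'task': one popularity regime, modeled as its own
    linear mapping from file features to request counts."""
    w = rng.normal(size=3)
    def batch(n=32):
        X = rng.normal(size=(n, 3))
        return X, X @ w + 0.01 * rng.normal(size=n)
    return batch

def grad(theta, X, y):
    """Gradient of the MSE loss of a linear predictor."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

alpha, beta, K, B = 0.01, 0.001, 5, 4   # inner lr, outer lr, steps, tasks
theta = np.zeros(3)

for _ in range(2000):
    meta_grad = np.zeros_like(theta)
    for _ in range(B):
        task = make_task()
        th = theta.copy()
        for _ in range(K):                # inner loop: adapt to the task
            X, y = task()
            th -= alpha * grad(th, X, y)
        Xq, yq = task()                   # evaluate on NEW data from task
        meta_grad += grad(th, Xq, yq)     # first-order approximation
    theta -= beta * meta_grad             # meta-update
```

At deployment, adaptation is just the inner loop alone: K gradient steps from the meta-learned θ on a small batch from the new distribution. Full MAML would backpropagate through the inner loop; FOMAML drops those second-order terms.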

Reptile: A Simpler Alternative

Reptile is a first-order meta-learning algorithm that's simpler than MAML:

Algorithm:

  1. Sample task T_i
  2. Initialize θ̃ = θ
  3. Run K gradient steps on T_i: θ̃ ← θ̃ − α ∇L_{T_i}(θ̃)
  4. Meta-update: θ ← θ + ε (θ̃ − θ)

Intuition: Move the initialization toward the final parameters after training on each task. Over many tasks, the meta-learned θ* settles at a point close to every task-specific optimum, so a few gradient steps from it reach any of them.
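The four steps map almost line-for-line onto code. A minimal sketch on toy linear tasks (illustrative names and hyperparameters, not the Reptile paper's reference code):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_task():
    """Toy task: one popularity regime as a linear mapping."""
    w = rng.normal(size=3)
    def batch(n=32):
        X = rng.normal(size=(n, 3))
        return X, X @ w
    return batch

alpha, K, eps = 0.01, 5, 0.1          # inner lr, inner steps, outer step
theta = np.zeros(3)

for _ in range(1000):
    task = make_task()                # 1. sample task T_i
    th = theta.copy()                 # 2. initialize θ̃ = θ
    for _ in range(K):                # 3. K gradient steps on T_i
        X, y = task()
        th -= alpha * 2.0 * X.T @ (X @ th - y) / len(y)
    theta += eps * (th - theta)       # 4. move θ toward adapted params
```

Note that there is no second derivative and no separate query batch — which is exactly why Reptile is lighter than MAML.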

Reptile vs. MAML:

  Aspect           MAML                              Reptile
  ───────────────  ────────────────────────────────  ─────────────────
  Gradients        Second-order (or FOMAML)          First-order only
  Memory           High (stores computation graph)   Low
  Implementation   Complex                           Simple
  Performance      Slightly better                   Nearly as good
  Computation      ~2× slower                        Faster

RL² (Learning to Reinforcement Learn)

RL² uses a recurrent neural network that learns to adapt within an episode:

Architecture: LSTM-based policy that receives (s_t, a_{t-1}, r_{t-1}) as input

Training: Train on many tasks where each episode is a new task. The LSTM learns to:

  1. Explore efficiently at the start of episodes
  2. Exploit learned knowledge later
  3. Adapt its behavior based on observed rewards

For Caching: The LSTM receives:

  • Current cache state
  • Previous caching action
  • Previous reward (hit rate, latency)

Over time, it learns to recognize the current popularity pattern and adapt its caching strategy accordingly—all within a single forward pass, no gradient updates needed at deployment.

RL² Update Rule: Standard policy gradient on the recurrent policy:

  ∇_θ J = E[ Σ_t ∇_θ log π_θ(a_t | h_t) A_t ]

where h_t is the LSTM hidden state encoding the history.
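For the caching setting, the main implementation subtlety is the augmented input: the policy must see its own previous action and reward alongside the cache state. A sketch of that input construction (the names are ours; the LSTM itself would come from any deep learning framework):

```python
import numpy as np

def rl2_input(cache_state, prev_action, prev_reward, num_files):
    """Build the RL² observation (s_t, a_{t-1}, r_{t-1}).

    cache_state : binary vector, cache_state[f] = 1 if file f is cached
    prev_action : index of the file cached/evicted at the last step
    prev_reward : scalar, e.g. last-step hit rate or negative latency
    """
    a_onehot = np.zeros(num_files)
    a_onehot[prev_action] = 1.0           # one-hot encode the last action
    return np.concatenate([cache_state, a_onehot, [prev_reward]])

# One step of an episode: the recurrent policy consumes this vector plus
# its hidden state h_t; no gradient updates happen at deployment.
obs = rl2_input(np.array([1.0, 0.0, 1.0, 0.0]),
                prev_action=2, prev_reward=0.75, num_files=4)
# obs has length 4 (cache) + 4 (action) + 1 (reward) = 9
```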

Recent work (2025) on "Fast Adaptive Caching Algorithm for Mobile Edge Networks Based on Meta-reinforcement Learning" shows:

  • 5× faster adaptation to popularity shifts
  • Only 2-3% performance loss during adaptation periods
  • Robust across different types of popularity changes (gradual drift, sudden shift, periodic patterns)

When to Use Which Meta-RL Algorithm

  Scenario                          Recommended Algorithm       Reason
  ────────────────────────────────  ──────────────────────────  ──────────────────────────────────
  Discrete distribution shifts      MAML/FOMAML                 Best for distinct task boundaries
  Continuous drift                  RL²                         No explicit adaptation step needed
  Limited compute budget            Reptile                     Simpler, faster training
  Online adaptation                 RL²                         No gradient steps at deployment
  Periodic patterns (daily/weekly)  MAML with time-aware tasks  Can pre-learn periodic structure

Part V: Integration with 5G/6G Mobile Edge Computing

MEC Architecture and Caching

Mobile Edge Computing (MEC) places computational and storage resources at the network edge—at or near cellular base stations. This is the natural deployment point for coded caching:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              CODED CACHING IN 5G/6G MEC ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                       ┌─────────────────────┐                           │
│                       │    CORE NETWORK     │                           │
│                       │    (Content CDN)    │                           │
│                       └──────────┬──────────┘                           │
│                                  │                                       │
│                           Backhaul Link                                  │
│                      (Expensive, congested)                             │
│                                  │                                       │
│               ┌──────────────────┼──────────────────┐                   │
│               │                  │                  │                   │
│               ▼                  ▼                  ▼                   │
│        ┌────────────┐     ┌────────────┐     ┌────────────┐            │
│        │   MEC 1    │     │   MEC 2    │     │   MEC 3    │            │
│        │            │     │            │     │            │            │
│        │ ┌────────┐ │     │ ┌────────┐ │     │ ┌────────┐ │            │
│        │ │ CACHE  │ │     │ │ CACHE  │ │     │ │ CACHE  │ │            │
│        │ │        │ │     │ │        │ │     │ │        │ │            │
│        │ │ Coded  │ │     │ │ Coded  │ │     │ │ Coded  │ │            │
│        │ │Content │ │     │ │Content │ │     │ │Content │ │            │
│        │ └────────┘ │     │ └────────┘ │     │ └────────┘ │            │
│        │            │     │            │     │            │            │
│        │ ┌────────┐ │     │ ┌────────┐ │     │ ┌────────┐ │            │
│        │ │   AI   │ │     │ │   AI   │ │     │ │   AI   │ │            │
│        │ │ ENGINE │ │     │ │ ENGINE │ │     │ │ ENGINE │ │            │
│        │ └────────┘ │     │ └────────┘ │     │ └────────┘ │            │
│        └─────┬──────┘     └─────┬──────┘     └─────┬──────┘            │
│              │                  │                  │                    │
│         Fronthaul          Fronthaul          Fronthaul                │
│              │                  │                  │                    │
│        ┌─────┴──────┐     ┌─────┴──────┐     ┌─────┴──────┐            │
│        │    gNB 1   │     │    gNB 2   │     │    gNB 3   │            │
│        │            │     │            │     │            │            │
│        └─────┬──────┘     └─────┬──────┘     └─────┬──────┘            │
│              │                  │                  │                    │
│           ┌──┴──┐            ┌──┴──┐            ┌──┴──┐                │
│           │ UEs │            │ UEs │            │ UEs │                │
│           └─────┘            └─────┘            └─────┘                │
│                                                                          │
│  AI ENGINE FUNCTIONS:                                                   │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  1. POPULARITY PREDICTION                                               │
│     • Federated learning across edges                                  │
│     • GNN for user-content modeling                                    │
│     • LSTM/Transformer for temporal patterns                          │
│                                                                          │
│  2. CACHE PLACEMENT OPTIMIZATION                                        │
│     • Deep RL for dynamic placement                                    │
│     • Coordinated across MEC nodes for coding gain                    │
│     • Trade-off: local hits vs. coding opportunity                    │
│                                                                          │
│  3. CODED DELIVERY SCHEDULING                                          │
│     • Identify multicast groups                                        │
│     • Construct optimal coded messages                                 │
│     • Schedule transmissions across time/frequency                    │
│                                                                          │
│  4. USER MOBILITY PREDICTION                                            │
│     • Predict user movement across cells                              │
│     • Proactive content migration                                      │
│     • Handover-aware cache coordination                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS OF CODED CACHING IN MEC:                                      │
│                                                                          │
│  • Reduced backhaul: Coded multicasting reduces core traffic          │
│  • Lower latency: Edge-served content, no core round-trip             │
│  • Improved QoE: Proactive caching before user requests               │
│  • Network efficiency: Same backhaul serves more users                │
│  • Scalability: Global caching gain grows with user density           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Semantic Communication and Coded Caching in 6G

6G networks will introduce semantic communication—transmitting meaning rather than bits. This fundamentally changes caching:

Traditional caching: Store exact file bits. Transmit missing bits.

Semantic caching: Store semantic representations (compressed meaning). Transmit semantic differences. Receiver reconstructs content using local knowledge base.

Example: Instead of caching a full 4K video file:

  • Cache a semantic description + key frames + neural decoder model
  • When user requests video, transmit only the scene-specific information
  • Local AI reconstructs the full video using cached knowledge

Coded semantic caching combines these ideas:

  • Cache semantic representations (100-1000× smaller than raw content)
  • Code across semantic representations for delivery
  • Achieve both compression gain AND coding gain

The potential improvement is multiplicative: if semantic compression gives 100× and coded caching gives 10×, the combined gain approaches 1000×.

Performance Analysis: Putting It All Together

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              CODED CACHING PERFORMANCE IN 5G/6G MEC                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SYSTEM PARAMETERS:                                                      │
│  • N = 1000 files (video library)                                      │
│  • K = 100 users per MEC node                                          │
│  • M = 100 files per user cache (10% of library)                       │
│  • 10 MEC nodes in deployment                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CACHING STRATEGY COMPARISON:                                           │
│                                                                          │
│  Strategy                  Delivery Rate    Backhaul         Latency   │
│                            (files/period)   Reduction        (ms)      │
│  ─────────────────────────────────────────────────────────────────────  │
│  No caching                    100           0%              150       │
│  LRU (uncoded)                  75           25%             120       │
│  LFU (uncoded)                  68           32%             115       │
│  Popularity-optimal (uncoded)   60           40%             105       │
│  MAN coded caching              12           88%              35       │
│  AI-optimized coded             10           90%              30       │
│  Semantic coded (6G)             1           99%              10       │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHTS:                                                           │
│                                                                          │
│  1. Coded caching provides 5-6× improvement over best uncoded policy   │
│     This is the "global caching gain" predicted by theory              │
│                                                                          │
│  2. AI optimization adds another 15-20% on top of theoretical scheme   │
│     By adapting to actual (non-uniform, time-varying) popularity       │
│                                                                          │
│  3. Semantic communication offers order-of-magnitude further gains     │
│     By transmitting meaning rather than bits                           │
│                                                                          │
│  4. Latency reduction is dramatic                                       │
│     From 150ms (core) to 10ms (edge semantic) — enables new apps      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DEPLOYMENT CONSIDERATIONS:                                              │
│                                                                          │
│  Subpacketization: With K=100, t=10, need C(100,10) ≈ 10^13 subfiles  │
│  Solution: User grouping (10 groups of 10) reduces to C(10,1) = 10    │
│  Trade-off: Lose some coding gain (factor ~2) but remain practical    │
│                                                                          │
│  Coordination overhead: MEC nodes must share cache state               │
│  Solution: Periodic synchronization (every 1-10 minutes sufficient)   │
│  AI handles short-term dynamics; coordination handles long-term       │
│                                                                          │
│  Heterogeneous caches: Different MEC nodes have different sizes       │
│  Solution: Weighted placement—larger caches store more popular files  │
│  AI learns optimal allocation across heterogeneous infrastructure     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
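The subpacketization numbers in the box are easy to check directly:

```python
from math import comb

K, M, N = 100, 100, 1000
t = K * M // N                      # cache replication factor: t = 10

# Direct MAN scheme: each file is split into C(K, t) subfiles
print(comb(K, t))                   # ≈ 1.7 × 10^13 — infeasible

# User grouping: 10 groups of 10 users; within a group t' = 10 × M/N = 1,
# so each file needs only C(10, 1) = 10 subfiles
groups, users_per_group = 10, 10
t_group = users_per_group * M // N
print(comb(users_per_group, t_group))   # 10
```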

Part VI: Current Research Frontiers

Coded Caching with Security and Privacy

Problem: Coded delivery reveals information about what other users requested. If I receive A_1 ⊕ B_2, I learn that someone requested file B.

Solutions being researched:

  • Secret sharing: Split cache content so no single coded message reveals complete information
  • Private information retrieval (PIR): Users download content without revealing which file they want
  • Differential privacy: Add noise to coded messages that masks individual requests while preserving utility

Coded Caching with Heterogeneous Files

Real content libraries have diverse file sizes—short clips, long movies, live streams. The MAN scheme assumes equal file sizes.

Extensions:

  • Weighted placement: Allocate cache space proportional to file size × popularity
  • Layered coding: Split large files into layers, apply coded caching at each layer
  • Adaptive subpacketization: Different subpacketization for different file sizes
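The weighted-placement idea reduces to a simple proportional allocator. A sketch under our own naming (an illustrative heuristic, not a published scheme):

```python
def weighted_cache_allocation(cache_bytes, files):
    """Split cache space in proportion to file size × popularity.

    files : list of (name, size_bytes, popularity) tuples
    Returns a dict mapping each file name to its cache-space budget.
    """
    weights = {name: size * pop for name, size, pop in files}
    total = sum(weights.values())
    return {name: cache_bytes * w / total for name, w in weights.items()}

# A short, very popular clip can out-weigh a long, rarely watched movie:
alloc = weighted_cache_allocation(
    1_000_000,
    [("clip", 50_000, 0.8), ("movie", 4_000_000, 0.005)],
)
```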

Online Coded Caching

The MAN scheme assumes batch processing—all requests arrive together, then delivery phase begins. Real systems have continuous request arrivals.

Online approaches:

  • Sliding window: Group requests in time windows, apply coded delivery within windows
  • Queue-based: Delay requests briefly to accumulate coding opportunities
  • Predictive delivery: Start delivering predicted requests before they're actually made
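The queue-based approach is essentially a delay buffer in front of the coded-delivery scheduler. A minimal sketch (illustrative class and parameter names; time is passed explicitly to keep it deterministic):

```python
from collections import deque

class CodingWindow:
    """Hold requests briefly so the server can batch them into coded
    multicast transmissions (queue-based online coded caching)."""

    def __init__(self, window=0.1):
        self.window = window          # max delay in seconds
        self.queue = deque()          # (arrival_time, user, file)

    def add(self, user, file, now):
        self.queue.append((now, user, file))

    def flush(self, now):
        """Release every request that has waited at least `window`
        seconds; the caller forms coded messages over this batch."""
        batch = []
        while self.queue and now - self.queue[0][0] >= self.window:
            _, user, file = self.queue.popleft()
            batch.append((user, file))
        return batch

w = CodingWindow(window=0.1)
w.add("user_a", "file_A", now=0.00)
w.add("user_b", "file_B", now=0.02)
assert w.flush(now=0.05) == []                      # nobody waited long enough
assert w.flush(now=0.20) == [("user_a", "file_A"),  # both released as one
                             ("user_b", "file_B")]  # coding opportunity
```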

Coded Caching for Machine Learning

An emerging application: distributed ML training generates massive data movement (gradients, parameters). Coded caching principles can help:

  • Gradient caching: Workers cache parts of the gradient; coordinator sends coded combinations
  • Model caching: Different workers cache different model components; federated aggregation uses coding
  • Data shuffling: Shuffle training data across workers using coded communication



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
