Coded Caching: From Information Theory to AI-Optimized Edge Networks
Detailed look at coded caching—from Maddah-Ali & Niesen's seminal information-theoretic foundations to modern AI-driven cache optimization. Deep analysis of local and global caching gains, decentralized schemes, and the integration of deep reinforcement learning, federated learning, and graph neural networks for 5G/6G MEC systems.
Introduction
The growth of mobile data traffic presents one of the most pressing challenges in modern telecommunications. Video streaming alone accounts for over 70% of mobile traffic, with users increasingly expecting instant access to high-definition content anywhere, anytime. Traditional network architectures—designed around the principle of fetching content from distant servers on demand—simply cannot scale to meet this demand economically or sustainably.
Caching offers a fundamentally different paradigm: rather than repeatedly transmitting the same popular content across congested backhaul links, we store it closer to users during off-peak hours and serve requests locally during peak times. This simple idea powers content delivery networks (CDNs), browser caches, and increasingly, mobile edge computing (MEC) systems.
But conventional caching has a fundamental limitation: each cache operates independently. If User A and User B both want different files, and each file is partially cached at both users, conventional caching cannot exploit this. The server must transmit missing portions separately to each user.
Coded caching, pioneered by Mohammad Ali Maddah-Ali and Urs Niesen in their seminal 2014 paper "Fundamental Limits of Caching," revolutionizes this paradigm. Through clever use of coding theory, coded caching achieves something remarkable: a global caching gain that scales with the aggregate cache size across all users, even when users don't cooperate or communicate with each other.
The implications are profound. In a system with 30 users, each caching 10 out of 30 files, conventional uncoded caching requires transmitting 20 file-equivalents during peak hours. Coded caching reduces this to just 1.8 file-equivalents—an 11× improvement. This isn't a small optimization; it's a fundamental change in what's achievable.
The AI revolution takes coded caching from theory to practice. While information theory provides optimal schemes for idealized settings with known, static popularity distributions, real networks face:
- Dynamic popularity: Content trends change rapidly; viral videos emerge unpredictably
- Heterogeneous users: Different users have different preferences and access patterns
- Limited coordination: Practical systems can't implement the full complexity of optimal schemes
- Privacy constraints: Users may not want to share their viewing preferences
Modern AI/ML techniques—deep reinforcement learning for adaptive cache placement, federated learning for privacy-preserving popularity prediction, graph neural networks for modeling user-content relationships—are essential for practical deployment.
This post provides a comprehensive technical exploration of coded caching: the information-theoretic foundations that explain why it works, the mathematical analysis of local and global caching gains, extensions to decentralized and multi-access settings, AI-driven optimization for practical deployment, and integration with 5G/6G mobile edge computing architectures.
Prerequisites: Information theory basics (entropy, mutual information), probability theory, familiarity with optimization concepts.
Key Papers:
- Fundamental Limits of Caching (Maddah-Ali & Niesen, 2014)
- Learning to Code: Coded Caching via Deep RL
- Graph Federated Learning for Proactive Caching (2025)
- Coded Caching Optimization via A2C (2025)
Part I: The Information-Theoretic Foundations
The Caching Problem: A Precise Formulation
Consider a content delivery system with the following components:
- A server with access to a library of N files W₁, W₂, ..., Wₙ, each of size F bits
- K users, each equipped with a cache capable of storing MF bits (equivalently, M complete files)
- A shared broadcast link connecting the server to all users
The system operates in two distinct phases:
Placement Phase (Off-peak hours): The server fills user caches without knowing what files users will request later. This constraint is crucial—if the server knew future requests, optimal caching would be trivial (just cache what will be requested).
Delivery Phase (Peak hours): Each user k requests exactly one file W_{dₖ}. The server must transmit enough information over the shared link for every user to recover their requested file, using both the transmission and their cached content.
The goal: Minimize the rate R—the number of file-equivalents transmitted during delivery. Lower rate means less backhaul congestion, lower latency, and better scalability.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE CODED CACHING SYSTEM MODEL │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ SERVER │ │
│ │ │ │
│ │ File Library: │ │
│ │ W₁, W₂, ..., Wₙ │ │
│ │ │ │
│ │ Each file: F bits │ │
│ │ Total: N files │ │
│ └──────────┬──────────┘ │
│ │ │
│ PHASE 1: PLACEMENT │
│ (Server fills caches │
│ without knowing │
│ future requests) │
│ │ │
│ PHASE 2: DELIVERY │
│ (Server broadcasts │
│ to satisfy all │
│ requests) │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ │ SHARED LINK │ │
│ │ Rate = R files │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ USER 1 │ │ USER 2 │ │ USER K │ │
│ │ │ │ │ │ │ │
│ │ Cache: │ │ Cache: │ │ Cache: │ │
│ │ MF bits │ │ MF bits │ │ MF bits │ │
│ │ │ │ │ │ │ │
│ │ Request: │ │ Request: │ │ Request: │ │
│ │ d₁ │ │ d₂ │ │ dₖ │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FUNDAMENTAL PARAMETERS: │
│ │
│ • N: Number of files in the library │
│ • K: Number of users in the system │
│ • M: Normalized cache size (can store M file-equivalents) │
│ • F: Size of each file in bits │
│ • R: Delivery rate (file-equivalents transmitted) │
│ │
│ CONSTRAINTS: │
│ • 0 ≤ M ≤ N (cache size between empty and full library) │
│ • Placement is UNCODED: caches store actual file bits │
│ • Delivery CAN BE CODED: XOR combinations allowed │
│ │
│ THE QUESTION: What is the minimum achievable rate R for given N,K,M? │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Uncoded Caching: The Conventional Approach
Before exploring coded caching, let's understand what conventional (uncoded) caching achieves.
Strategy: Each user independently caches portions of files, typically the most popular ones. When a user requests a file, they receive what's already in their cache locally, and the server transmits only the missing portions.
Best case: All users request files that are fully cached → R = 0
Worst case (distinct requests, nothing useful cached): Server must transmit complete files → R = K
Average case (assuming uniform popularity): Each user has cached a fraction M/N of each file. The missing fraction is 1 − M/N. With K users making independent requests:

R_uncoded = K(1 − M/N)
This formula reveals the local caching gain: each user benefits from their own cache, reducing their individual demand by the factor 1 − M/N. But there's no benefit from what other users have cached—the caches operate in isolation.
Example: With K = 30 users, N = 30 files, and M = 10:

R_uncoded = 30 × (1 − 10/30) = 20
Even with substantial caching (each user stores 1/3 of the library), we still need to transmit 20 file-equivalents during peak hours.
The Maddah-Ali & Niesen Breakthrough
The key insight of Maddah-Ali and Niesen was that caches can be coordinated—not through user cooperation, but through clever placement design—so that a single coded transmission simultaneously benefits multiple users.
The Central Idea: Split each file into carefully designed pieces. Place different pieces at different users such that when the server transmits coded combinations (XORs) of file pieces, each user can decode the piece they need using other pieces they already have cached.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE MADDAH-ALI & NIESEN SCHEME: DETAILED EXAMPLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SETUP: K = 3 users, N = 3 files (A, B, C), M = 1 (each caches 1 file) │
│ │
│ STEP 1: COMPUTE THE CACHING RATIO │
│ │
│ t = KM/N = (3 × 1)/3 = 1 │
│ │
│ This parameter t determines the subpacketization: each file │
│ is split into C(K,t) = C(3,1) = 3 pieces. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 2: PLACEMENT PHASE │
│ │
│ Split each file into 3 pieces, indexed by which user caches them: │
│ │
│ File A = { A₁, A₂, A₃ } where Aᵢ is cached by user i │
│ File B = { B₁, B₂, B₃ } │
│ File C = { C₁, C₂, C₃ } │
│ │
│ Cache contents after placement: │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ USER 1 │ │ USER 2 │ │ USER 3 │ │
│ │ │ │ │ │ │ │
│ │ Caches: │ │ Caches: │ │ Caches: │ │
│ │ A₁, B₁, C₁ │ │ A₂, B₂, C₂ │ │ A₃, B₃, C₃ │ │
│ │ │ │ │ │ │ │
│ │ (1/3 of │ │ (1/3 of │ │ (1/3 of │ │
│ │ each file) │ │ each file) │ │ each file) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Each user caches 1/3 of A + 1/3 of B + 1/3 of C = 1 file total ✓ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 3: DELIVERY PHASE │
│ │
│ Suppose: User 1 wants A, User 2 wants B, User 3 wants C │
│ (Worst case: all different requests) │
│ │
│ What each user NEEDS (doesn't have cached): │
│ • User 1 needs: A₂, A₃ (has A₁, wants complete A) │
│ • User 2 needs: B₁, B₃ (has B₂, wants complete B) │
│ • User 3 needs: C₁, C₂ (has C₃, wants complete C) │
│ │
│ UNCODED approach: Send A₂, A₃, B₁, B₃, C₁, C₂ = 6 pieces = 2 files │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CODED DELIVERY (the magic): │
│ │
│ Transmission 1: X₁ = A₂ ⊕ B₁ │
│ │
│ • User 1 receives X₁, has B₁ cached │
│ → Computes A₂ = X₁ ⊕ B₁ ✓ │
│ │
│ • User 2 receives X₁, has A₂ cached │
│ → Computes B₁ = X₁ ⊕ A₂ ✓ │
│ │
│ ONE transmission satisfies TWO users! │
│ │
│ Transmission 2: X₂ = A₃ ⊕ C₁ │
│ • User 1 gets A₃ (has C₁), User 3 gets C₁ (has A₃) │
│ │
│ Transmission 3: X₃ = B₃ ⊕ C₂ │
│ • User 2 gets B₃ (has C₂), User 3 gets C₂ (has B₃) │
│ │
│ TOTAL: 3 pieces transmitted = 1 file worth of data! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON: │
│ │
│ Uncoded rate: R = K(1 - M/N) = 3(1 - 1/3) = 2 files │
│ Coded rate: R = K(1 - M/N)/(1 + KM/N) = 2/(1 + 1) = 1 file │
│ │
│ Improvement factor: 2× (global caching gain = 1 + t = 2) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
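The coded delivery in the example above can be checked mechanically. A minimal Python sketch, with short byte strings standing in for the three files (all variable names are ours):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Three files of F = 6 bytes, each split into 3 pieces of 2 bytes.
files = {"A": b"aaAAaa", "B": b"bbBBbb", "C": b"ccCCcc"}
pieces = {n: [f[0:2], f[2:4], f[4:6]] for n, f in files.items()}

# Placement: user i caches piece i of every file (users 0-indexed here).
cache = {i: {n: pieces[n][i] for n in files} for i in range(3)}

# Worst-case demand: all users want different files.
demand = {0: "A", 1: "B", 2: "C"}

# Coded delivery: one XOR per pair of users -> 3 transmissions of
# 2 bytes each = 6 bytes = exactly one file-equivalent.
tx = {(i, j): xor(pieces[demand[i]][j], pieces[demand[j]][i])
      for i in range(3) for j in range(3) if i < j}

# Decoding: user i recovers missing piece j by XORing transmission
# {i, j} with the cached piece i of user j's requested file.
for i in range(3):
    got = [None, None, None]
    got[i] = cache[i][demand[i]]
    for j in range(3):
        if j != i:
            t = tx[(min(i, j), max(i, j))]
            got[j] = xor(t, cache[i][demand[j]])
    assert b"".join(got) == files[demand[i]]

print("3 transmissions of 1/3 file each -> rate R = 1")
```

The assertions confirm that each user reconstructs its full file from one cached piece plus two XOR transmissions, matching the rate-1 delivery in the diagram.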
The General Scheme and Achievable Rate
The example above illustrates the principle; the general scheme works as follows:
Placement: Divide each file into C(K, t) subfiles, where t = KM/N is the caching ratio. Index each subfile by a subset S ⊆ {1, ..., K} of size t:

Wₙ = { Wₙ,S : S ⊆ {1, ..., K}, |S| = t }
User k caches all subfiles whose index contains k:

Zₖ = { Wₙ,S : k ∈ S, n = 1, ..., N }
Delivery: For each subset T ⊆ {1, ..., K} of size t + 1, transmit the coded message:

X_T = ⊕_{k ∈ T} W_{dₖ, T∖{k}}
Each user k ∈ T needs W_{dₖ, T∖{k}} and has all other terms in the XOR cached, so they can decode their needed piece by XORing X_T with cached content.
Achievable Rate:

R_coded(M) = K(1 − M/N) · 1/(1 + KM/N)
This beautiful formula reveals the structure of the gain.
Local vs. Global Caching Gain: A Deep Dive
The coded caching rate can be decomposed to understand where the gain comes from:
Local Caching Gain (1 − M/N): This is what conventional caching provides. Each user caches a fraction M/N of the library, so they need only the fraction 1 − M/N from the network. This gain is additive across users—it reduces each user's demand independently.
Global Caching Gain (1/(1 + KM/N)): This is unique to coded caching. It represents the multicast opportunity—each coded transmission benefits 1 + KM/N users simultaneously. This gain is multiplicative and scales with the aggregate cache size KM, not the individual cache size M.
┌─────────────────────────────────────────────────────────────────────────┐
│ UNDERSTANDING LOCAL vs GLOBAL CACHING GAIN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM: K = 30 users, N = 30 files, M = 10 (33% cache ratio) │
│ │
│ LOCAL CACHING GAIN (what conventional caching provides): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Each user caches 10/30 = 33% of each file. │
│ They need 67% of their requested file from the network. │
│ │
│ Without local caching: Each user needs 1 full file → 30 files total │
│ With local caching: Each user needs 0.67 file → 20 files total │
│ │
│ Local gain = 30/20 = 1.5× │
│ │
│ This gain is INDEPENDENT across users—my cache helps me, not you. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GLOBAL CACHING GAIN (unique to coded caching): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Caching ratio t = KM/N = 30 × 10 / 30 = 10 │
│ Global gain = 1 + t = 1 + 10 = 11× │
│ │
│ Each coded transmission benefits t + 1 = 11 users simultaneously! │
│ │
│ This is because: │
│ • When server sends X = A₁ ⊕ B₂ ⊕ C₃ ⊕ ... (11 terms) │
│ • User wanting A has all terms except A₁ cached │
│ • User wanting B has all terms except B₂ cached │
│ • ... and so on for all 11 users in this group │
│ │
│ The server makes ONE transmission; 11 users each extract their need. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMBINED EFFECT: │
│ │
│ Uncoded rate: R = K(1 - M/N) = 30 × 0.67 = 20 files │
│ Coded rate: R = 20 / (1 + 10) = 20/11 ≈ 1.8 files │
│ │
│ Total improvement: 20 / 1.8 ≈ 11× better than uncoded! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE REMARKABLE INSIGHT: │
│ │
│ Global gain scales with AGGREGATE cache size KM, not individual M. │
│ │
│ Double the number of users K (same M per user)? │
│ → Global gain INCREASES (more multicast opportunity) │
│ → Even though total demand also increases │
│ → Net effect: Rate grows sublinearly with K │
│ │
│ This is counter-intuitive: More users → More efficient delivery! │
│ │
│ In conventional systems, more users = more load = worse performance. │
│ In coded caching, more users = more coding opportunity = better gain. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Information-Theoretic Optimality
A natural question: Is the Maddah-Ali & Niesen scheme optimal? Can we do better?
The authors proved a converse bound—a lower limit on the rate that any scheme must satisfy:

R*(M) ≥ max_{s ∈ {1, ..., min(N,K)}} ( s − (s/⌊N/s⌋) · M )
The gap between achievable rate and lower bound is at most a factor of 12 for all parameter values. This means coded caching is order-optimal: we cannot hope to do fundamentally better (by more than a constant factor).
Converse Bound Derivation (Cut-Set Argument)
The converse uses a cut-set bound argument, a fundamental technique in network information theory. Here's the rigorous derivation:
Setup: Consider any subset S ⊆ {1, ..., K} of s users. We'll lower bound the rate by considering what information must cross the "cut" separating the server from these users.
Key Insight: Consider the worst-case demand where each user in S requests a distinct file (possible whenever s ≤ N). The server must transmit enough information so that together with the cached content, each user can decode their file.
Step 1: Total Information Requirement
The s users collectively need s complete files (sF bits). They collectively have cached at most sM files worth of content (sMF bits). Therefore, the transmission must provide at least:

RF ≥ sF − sMF  ⟹  R ≥ s(1 − M)
But this ignores the structure. A tighter bound uses the following:
Step 2: Cut-Set Entropy Bound
Let Z_S denote the caches of users in S, and X denote the transmitted message. For the users to decode their requested files W_{d_S} = (W_{dₖ})_{k ∈ S}, we need:

H(W_{d_S} | X, Z_S) = 0

By the chain rule and independence of files:

sF = H(W_{d_S}) = I(W_{d_S}; X, Z_S) + H(W_{d_S} | X, Z_S) = I(W_{d_S}; X, Z_S)
Step 3: Bounding the Mutual Information
The transmission entropy satisfies H(X) ≤ RF, and the cache entropy is bounded by:

H(Z_S) ≤ sMF

Thus:

sF = I(W_{d_S}; X, Z_S) ≤ H(X, Z_S) ≤ H(X) + H(Z_S) ≤ RF + sMF

which recovers R ≥ s(1 − M).
Step 4: Accounting for Broadcast Advantage
The above bound doesn't exploit the fact that X is a broadcast—all users receive it, and the same caches must serve many successive demands sent over it. Consider ⌊N/s⌋ demand rounds in which the s users request s⌊N/s⌋ distinct files in total, served by broadcast transmissions X₁, ..., X_{⌊N/s⌋}.

All s⌊N/s⌋ requested files must be decodable from these transmissions together with the fixed caches Z_S:

s⌊N/s⌋ F ≤ H(X₁, ..., X_{⌊N/s⌋}) + H(Z_S) ≤ ⌊N/s⌋ RF + sMF

Rearranging yields:

R ≥ s − (s/⌊N/s⌋) M
Step 5: Optimizing Over s
Since this bound holds for any subset size s ∈ {1, ..., min(N, K)}, we take the maximum:

R*(M) ≥ max_{s ∈ {1, ..., min(N,K)}} ( s − (s/⌊N/s⌋) M )
Tightness Analysis: Treat s as continuous and approximate ⌊N/s⌋ ≈ N/s, so the bound becomes f(s) = s − (s²/N)M. Taking the derivative:

f′(s) = 1 − 2sM/N

Since f′(s) > 0 for s < N/(2M), the maximum is achieved near s* = N/(2M).
The Optimality Gap: Comparing achievable rate to lower bound:

R_coded(M) / R*(M) ≤ 12 for all N, K, M

At the extreme points M = 0 (with K ≤ N) and M = N, the achievable rate meets the bound and the ratio is exactly 1—the MAN scheme is optimal there.
More recent work has tightened this gap for many parameter regimes, showing that the MAN scheme is often exactly optimal or within a factor of 2.
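The cut-set bound and the achievable rate can be compared numerically; a sketch using the formulas above (function names are ours):

```python
from math import floor

def cutset_lower_bound(K: int, N: int, M: float) -> float:
    """Cut-set bound: R*(M) >= max_s ( s - s*M / floor(N/s) )."""
    return max(s - s * M / floor(N / s) for s in range(1, min(N, K) + 1))

def man_rate(K: int, N: int, M: float) -> float:
    return K * (1 - M / N) / (1 + K * M / N)

K, N, M = 30, 30, 10
lb, ach = cutset_lower_bound(K, N, M), man_rate(K, N, M)
print(lb, ach, ach / lb)  # gap stays well inside the factor-12 guarantee
```

For this example the ratio is under 3, illustrating that the factor-12 guarantee is conservative for many parameter choices.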
Part II: Extensions and Practical Variants
Decentralized Coded Caching
The centralized MAN scheme requires the server to coordinate placement across all users—each user must cache specific subfiles determined by a global plan. This is impractical for large systems where users join and leave dynamically.
Decentralized coded caching (also by Maddah-Ali and Niesen) removes this requirement: each user independently caches a random fraction of each file. Remarkably, coded delivery still works!
Placement: Each user independently caches each bit of each file with probability M/N, filling their MF-bit cache in expectation.
Delivery: The server identifies which users have which bits cached (through feedback), then constructs coded transmissions that exploit random overlaps.
Achievable Rate (for large file size F):

R_dec(M) = (N/M − 1) · (1 − (1 − M/N)^K)
Remarkably, the decentralized scheme achieves essentially the same rate as the coordinated scheme, with only a vanishing penalty as file size grows.
Why it works: With high probability, the random placement creates sufficient overlap structure: any subset of users will have complementary cached content that enables coding. The law of large numbers ensures this works well for large files.
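The centralized and decentralized rates can be compared numerically (a sketch using the formulas above; function names are ours):

```python
def man_rate(K: int, N: int, M: float) -> float:
    """Centralized MAN rate."""
    return K * (1 - M / N) / (1 + K * M / N)

def decentralized_rate(K: int, N: int, M: float) -> float:
    """Decentralized rate: (N/M - 1) * (1 - (1 - M/N)^K)."""
    q = M / N
    return (1 / q - 1) * (1 - (1 - q) ** K)

K, N = 30, 30
for M in (3, 10, 20):
    print(M, round(man_rate(K, N, M), 3),
          round(decentralized_rate(K, N, M), 3))
```

The decentralized rate is slightly higher at every cache size, but both retain the global gain and stay far below the uncoded rate K(1 − M/N).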
Multi-Access Coded Caching
In practical networks, users often have access to multiple caches—for example, a user at the boundary between two cells might connect to both base stations.
Multi-access coded caching exploits this:
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-ACCESS CODED CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SERVER │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Cache 1 │ │Cache 2 │ │Cache 3 │ │
│ │ (C₁) │ │ (C₂) │ │ (C₃) │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ │ ┌──────────┼───────────┐ │ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ User 1 │ │ User 2 │ │ User 3 │ │
│ │ │ │ │ │ │ │
│ │ Access: │ │ Access: │ │ Access: │ │
│ │ C₁, C₂ │ │ C₂, C₃ │ │ C₁, C₃ │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Each user can retrieve content from L = 2 caches. │
│ │
│ BENEFITS: │
│ • Effective cache size per user increases │
│ • More coding opportunities from overlapping access │
│ • Natural model for cell-edge users │
│ │
│ The MAN scheme extends to this setting with modified placement │
│ that accounts for shared cache access patterns. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The multi-access setting offers additional gains because users can combine content from multiple sources, increasing effective cache size and creating new coding opportunities.
Device-to-Device (D2D) Coded Caching
In D2D coded caching, users can also receive content from nearby users through direct device-to-device links. This adds a spatial dimension:
- Server broadcasts reach all users (but limited capacity)
- D2D transmissions have limited range but can occur simultaneously in different locations
The combination of coded caching with D2D spatial reuse can provide multiplicative gains: coding gain from cached content coordination, plus spatial reuse gain from simultaneous D2D transmissions in different areas.
Hierarchical Coded Caching
Real networks often have multiple cache layers:
- Core caches: Large, centrally located (CDN servers)
- Edge caches: Medium-sized, at base stations (MEC)
- User caches: Small, on end devices
Hierarchical coded caching applies the coded caching principle at each layer, with careful coordination to avoid redundant caching. Content that's already coded for delivery at one layer need not be re-cached at the layer below.
The analysis becomes complex, but the fundamental insight remains: coordinated placement plus coded delivery provides multiplicative gains at each layer, compounding through the hierarchy.
Part III: Challenges in Practical Deployment
The Subpacketization Problem
The MAN scheme requires splitting each file into C(K, t) subfiles. For realistic parameters, this number explodes:
- K = 100 users, M/N = 0.1 (cache 10% of library), so t = KM/N = 10
- Subfiles per file: C(100, 10) ≈ 1.7 × 10¹³
Each subfile needs a unique index, and the delivery phase must identify and combine specific subfiles. This is clearly impractical.
Solutions:
- Placement Delivery Arrays (PDAs): Combinatorial designs that achieve coded caching gain with polynomial subpacketization
- User grouping: Divide users into smaller groups, apply coded caching within groups
- Approximate schemes: Accept some loss in coding gain for practical subpacketization
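The explosion, and the effect of the user-grouping workaround, can be seen with a two-line computation (the grouping numbers are illustrative, not a specific paper's scheme):

```python
from math import comb

# Subpacketization C(K, t) for the full MAN scheme.
K, cache_ratio = 100, 0.1
t = round(K * cache_ratio)  # t = KM/N = 10
print(f"{comb(K, t):.2e} subfiles per file for K = 100")

# Grouping: 10 groups of 10 users, MAN applied per group (t = 1 inside),
# at the cost of a smaller per-group coding gain.
print(comb(10, 1), "subfiles per file with 10 groups of 10 users")
```

Grouping cuts subpacketization from ~10¹³ to 10, while the coding gain shrinks from t + 1 = 11 system-wide to 2 within each group.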
Placement Delivery Arrays: A Rigorous Construction
A Placement Delivery Array (PDA) is a combinatorial structure that specifies both cache placement and coded delivery with controlled subpacketization.
Definition: A (K, F, Z, S)-PDA is a K × F array where:
- K = number of users (one row per user)
- F = number of subfiles per file (subpacketization; one column per subfile)
- Z = number of entries in each row that are "*" (determines cache size)
- S = number of distinct integers used (determines delivery transmissions)
The array entries are either "*" (indicating cached content) or integers from {1, ..., S}.
Properties:
- Each row contains exactly Z stars → cache size M/N = Z/F
- Each integer appears at most once in each row
- If P(k₁, f₁) = P(k₂, f₂) = s with k₁ ≠ k₂, then P(k₁, f₂) and P(k₂, f₁) are both "*": the users sharing integer s form a valid coding group, since each already caches the others' pieces
Example: A (4, 4, 2, 4)-PDA for K = 4 Users

One array satisfying the definition (rows = users, columns = subfiles):

           f=1   f=2   f=3   f=4
  User 1:   *     *     1     2
  User 2:   3     *     *     4
  User 3:   1     2     *     *
  User 4:   *     3     4     *

Interpretation:
- Rows = Users (1, 2, 3, 4)
- Columns = Subfiles (Wₙ,₁, ..., Wₙ,₄ for each file n)
- "*" = User caches this subfile
- Integer s = This subfile is delivered in transmission s

Cache contents (from stars):
- User 1 caches: Wₙ,₁, Wₙ,₂ for all n → 2/4 = 50% of each file
- User 2 caches: Wₙ,₂, Wₙ,₃ for all n
- User 3 caches: Wₙ,₃, Wₙ,₄ for all n
- User 4 caches: Wₙ,₄, Wₙ,₁ for all n

Delivery (for demand d = (d₁, d₂, d₃, d₄)):

Transmission 1 (integer 1 appears in positions (1,3) and (3,1)):
X₁ = W_{d₁,3} ⊕ W_{d₃,1}
User 1 wants W_{d₁,3}, has W_{d₃,1} cached (subfile 1 of every file)
User 3 wants W_{d₃,1}, has W_{d₁,3} cached (subfile 3 of every file)

Similar for transmissions 2, 3, 4:
X₂ = W_{d₁,4} ⊕ W_{d₃,2},  X₃ = W_{d₂,1} ⊕ W_{d₄,2},  X₄ = W_{d₂,4} ⊕ W_{d₄,3}

Rate achieved: R = S/F = 4/4 = 1 file
Comparison to MAN scheme: For K = 4, M/N = 1/2:
- MAN scheme (t = 2): C(4, 2) = 6 subfiles, R = 2/3
- PDA scheme: 4 subfiles, R = 1
The PDA trades slightly higher rate (1 vs 2/3) for much lower subpacketization (4 vs 6).
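The PDA conditions can be verified mechanically. Below is a concrete (4, 4, 2, 4) array consistent with the example's transmission positions, checked against the definition (the array instance is ours):

```python
# Rows = users, columns = subfiles; '*' marks cached subfiles and
# integer s marks content sent in transmission s.
P = [
    ['*', '*',  1,   2 ],
    [ 3,  '*', '*',  4 ],
    [ 1,   2,  '*', '*'],
    ['*',  3,   4,  '*'],
]
K, F, Z, S = 4, 4, 2, 4

# Each row has exactly Z stars -> cache ratio M/N = Z/F = 1/2.
assert all(row.count('*') == Z for row in P)

# Each integer appears at most once per row.
for row in P:
    ints = [e for e in row if e != '*']
    assert len(ints) == len(set(ints))

# Coding condition: if P[k1][f1] == P[k2][f2] == s (k1 != k2), then
# P[k1][f2] and P[k2][f1] are stars, so both users can decode.
for s in range(1, S + 1):
    cells = [(k, f) for k in range(K) for f in range(F) if P[k][f] == s]
    for (k1, f1) in cells:
        for (k2, f2) in cells:
            if k1 != k2:
                assert P[k1][f2] == '*' and P[k2][f1] == '*'

print("valid PDA: rate R = S/F =", S / F)
```

All assertions pass, confirming the array delivers every demand with S = 4 transmissions of size F⁻¹ file each.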
General Constructions: Several PDA families achieve polynomial subpacketization:
- Maddah-Ali Niesen PDA: F = C(K, t), Z = C(K−1, t−1), S = C(K, t+1) (optimal rate, exponential F)
- Resolvable design PDA: subexponential F, with a modest rate penalty
- Ruzsa-Szemerédi graph PDA: near-linear F in K, at the cost of a reduced (but still scaling) coding gain
The Fundamental Trade-off:
Lower subpacketization F generally requires a higher rate R; known impossibility results show the two cannot be made small simultaneously.
Recent work has reduced subpacketization from exponential to polynomial in while retaining most of the coding gain.
Non-Uniform Popularity
The MAN scheme assumes all files are equally likely to be requested. Real content follows highly skewed popularity distributions (Zipf's law):

pₙ = n^(−γ) / Σ_{m=1}^{N} m^(−γ)
where the exponent γ typically falls around 0.6–1.2 for video content (empirically validated across Netflix, YouTube, and CDN traces).
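A quick stdlib sketch of the Zipf pmf makes the skew concrete:

```python
# Zipf popularity: p_n proportional to n^(-gamma), n = 1..N.
def zipf_pmf(N: int, gamma: float) -> list:
    w = [n ** -gamma for n in range(1, N + 1)]
    total = sum(w)
    return [x / total for x in w]

p = zipf_pmf(N=100, gamma=0.8)
print(round(sum(p), 6))  # pmf sums to 1
print(p[0] / p[-1])      # most popular file vs least popular
```

Even at a moderate γ = 0.8, the head of the distribution is requested tens of times more often than the tail.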
Implication: Most requests concentrate on a few popular files. Should we:
- Cache popular files completely (maximize local hits)?
- Cache diverse content (maximize coding opportunities)?
Optimal Placement Under Zipf: A Convex Optimization
For non-uniform popularity, the optimal cache allocation can be formulated as a convex optimization problem.
Decision Variables: Let qₙ ∈ [0, 1] denote the fraction of file n cached at each user.
Constraint: Total cache size is M files:

Σ_{n=1}^{N} qₙ ≤ M

Objective: Minimize expected delivery rate. For uncoded caching:

E[R] = K Σ_{n=1}^{N} pₙ (1 − qₙ)
For coded caching with non-uniform popularity, the rate expression becomes more complex. A tractable approximation uses the grouped coded caching framework:
Grouped Coded Caching Formulation:
Partition files into G groups based on popularity. Within each group, apply coded caching with uniform treatment:

R ≈ Σ_{g=1}^{G} K_g (1 − q_g) / (1 + K_g q_g)

where K_g is the expected number of users requesting files in group g and q_g is the average cache fraction for files in group g.
Closed-Form Solution for Special Cases:
Case 1: Pure Local Caching (no coding gain exploited)
The optimal allocation follows a water-filling solution:

qₙ* = min( 1, [1 − λ/(K pₙ)]₊ )

where λ is the Lagrange multiplier set so that Σₙ qₙ* = M, and [x]₊ = max(x, 0).
Case 2: Two-Group Approximation
Partition files into "popular" (n ≤ n₀) and "unpopular" (n > n₀):
- Popular files: Cache completely (qₙ = 1)
- Unpopular files: Apply coded caching with qₙ = q for all n > n₀
The optimization becomes:

min_{n₀, q}  K P̄(n₀) (1 − q) / (1 + K P̄(n₀) q),  where P̄(n₀) = Σ_{n > n₀} pₙ

subject to n₀ + (N − n₀) q ≤ M, 0 ≤ q ≤ 1
The Critical Trade-off:
Fully caching a file forgoes the multicast opportunities it could have contributed to; this is its coding opportunity cost. Files with high pₙ benefit more from being fully cached (no coding), while files with moderate pₙ benefit from participating in coded delivery.
Numerical Example: For a Zipf-distributed library of N = 100 files:
- Optimal strategy caches top ~5 files completely
- Applies coded caching to files 6-40
- Ignores files 41-100 (too unpopular)
- Rate reduction vs. uniform MAN: ~25%
AI/ML contribution: Predicting popularity and optimally allocating cache space across files is precisely where machine learning excels.
Unknown and Time-Varying Popularity
Real popularity isn't just non-uniform—it's also unknown and changes over time. Viral content emerges suddenly; trending topics shift daily.
Online learning approaches:
- Bandit algorithms: Explore-exploit tradeoff for cache updates
- Prediction models: LSTM, transformers to forecast future popularity from past patterns
- Reinforcement learning: Learn cache replacement policies that adapt to popularity dynamics
This is where the information-theoretic foundations meet practical machine learning.
Part IV: AI/ML for Coded Caching
Why AI is Essential
Information theory tells us what is achievable—the minimum rate under idealized assumptions. But practical deployment requires:
- Predicting content popularity without complete information
- Adapting to time-varying demands in real-time
- Handling heterogeneous user preferences across locations
- Respecting privacy constraints in data collection
- Making decisions under computational constraints
AI/ML provides the tools to bridge theory and practice.
Deep Reinforcement Learning for Cache Placement
Cache placement can be formulated as a Markov Decision Process (MDP):
State: Current cache contents, recent request history, time of day, user context
Action: Which content to cache, which to evict, how to split cache across files
Reward: Negative delivery rate (or equivalently, cache hit rate, latency reduction, backhaul savings)
The challenge: The action space is enormous (all possible cache configurations), and the reward depends on future requests that are stochastic.
┌─────────────────────────────────────────────────────────────────────────┐
│ DEEP RL FOR CODED CACHING OPTIMIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE MDP FORMULATION: │
│ │
│ STATE SPACE S: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ s = ( C, H, P̂, τ ) │ │
│ │ │ │
│ │ C: Current cache configuration │ │
│ │ • Which files are cached │ │
│ │ • What fraction of each file │ │
│ │ • Which subfiles (for coded caching) │ │
│ │ │ │
│ │ H: Request history │ │
│ │ • Recent requests from each user │ │
│ │ • Temporal patterns (time of day, day of week) │ │
│ │ • User context (location, device type) │ │
│ │ │ │
│ │ P̂: Estimated popularity distribution │ │
│ │ • Current best estimate of file popularities │ │
│ │ • Uncertainty/confidence in estimates │ │
│ │ │ │
│ │ τ: Time features │ │
│ │ • Hour of day, day of week (cyclic encoding) │ │
│ │ • Special events (sports, news, holidays) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ACTION SPACE A: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ For conventional caching: │ │
│ │ • a = (evict_file, cache_file) discrete choices │ │
│ │ │ │
│ │ For coded caching: │ │
│ │ • a = (placement_strategy, subfile_allocation) │ │
│ │ • More complex: must maintain coding structure │ │
│ │ │ │
│ │ For practical systems: │ │
│ │ • a = (cache_probabilities) continuous vector │ │
│ │ • Probabilistic placement: cache file n with probability aₙ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ REWARD FUNCTION: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ r(s, a) = -R(s, a) (negative delivery rate) │ │
│ │ │ │
│ │ Or multi-objective: │ │
│ │ r = α·(cache_hit_rate) + β·(latency_reduction) + │ │
│ │ γ·(backhaul_savings) - λ·(update_cost) │ │
│ │ │ │
│ │ The reward captures both immediate benefit (hits on current │ │
│ │ requests) and long-term value (positioned for future demand). │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ALGORITHM: Advantage Actor-Critic (A2C) or Proximal Policy │
│ Optimization (PPO) work well for this continuous control problem. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESULTS FROM RECENT RESEARCH (2025): │
│ │
│ • A2C-based coded caching achieves 15-25% rate reduction │
│ over heuristic policies (LRU, LFU) │
│ │
│ • Converges in ~5000 episodes for moderate-sized systems │
│ │
│ • Adapts automatically to popularity shifts within ~100 requests │
│ │
│ • Outperforms optimal static policy under time-varying popularity │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Bellman Equation for Optimal Caching
The optimal caching policy satisfies the Bellman optimality equation. For infinite-horizon discounted reward with discount factor γ ∈ [0, 1):
Value Function: The optimal value function V*(s) represents the maximum expected cumulative reward starting from state s:

V*(s) = max_a E[ r(s, a) + γ V*(s′) | s, a ]

Q-Function: The action-value function Q*(s, a) represents the value of taking action a in state s:

Q*(s, a) = E[ r(s, a) + γ max_{a′} Q*(s′, a′) | s, a ]
For Caching Specifically:
Let s = (C, H) be the state (cache configuration C, request history H), and a be the caching action.
The reward for cache configuration C when request d arrives is the (negative) fraction of the requested file that must be fetched over the link:

r(s, a) = −(1 − q_d)

where q_d is the cached fraction of file d. For coded caching, the reward becomes:

r(s, a) = −R(C, d)

where R(C, d) is the delivery rate given cache configuration C and demand vector d.
State Transition: Under demand distribution p(d):

P(s′ | s, a) = Σ_d p(d) · 1{ s′ = T(s, a, d) }

where T is the deterministic state transition function.
Function Approximation: For large state spaces, we approximate the value function and policy using neural networks. The A2C algorithm updates:
Critic (value function): minimize the TD loss

L(θ_V) = ( r + γ V_{θ_V}(s′) − V_{θ_V}(s) )²

Actor (policy): ascend the policy gradient

θ_π ← θ_π + α ∇_{θ_π} log π_{θ_π}(a | s) · A(s, a)

where the advantage A(s, a) = r + γ V_{θ_V}(s′) − V_{θ_V}(s).
Competitive Ratio and Regret Analysis
For online caching algorithms, we compare against the optimal offline algorithm (which knows all future requests) using competitive ratio and regret.
Competitive Ratio: An online algorithm ALG has competitive ratio ρ if:

Cost_ALG(σ) ≤ ρ · Cost_OPT(σ) + c

for all request sequences σ, where c is an additive constant.
Classical Results:
- LRU (Least Recently Used): Competitive ratio = k for cache size k
- LFU (Least Frequently Used): Competitive ratio = unbounded in the worst case
- Marker algorithm (randomized): Competitive ratio = O(log k)
- Lower bound: No deterministic algorithm achieves a ratio better than k
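For reference, the LRU policy analyzed above fits in a few lines; a minimal sketch:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache over hashable items."""
    def __init__(self, k: int):
        self.k, self.d = k, OrderedDict()

    def request(self, item) -> bool:
        """Returns True on a hit; evicts the least recently used on a miss."""
        hit = item in self.d
        if hit:
            self.d.move_to_end(item)          # refresh recency
        else:
            if len(self.d) >= self.k:
                self.d.popitem(last=False)    # evict LRU item
            self.d[item] = True
        return hit

c = LRUCache(k=2)
print([c.request(x) for x in "ABAB"])  # [False, False, True, True]
print([c.request(x) for x in "CAC"])   # adversarial pattern forces misses
```

The second request pattern hints at why LRU is only k-competitive: an adversary that always asks for the just-evicted item makes LRU miss on every request while the offline optimum misses far less often.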
Regret for Learning Algorithms: For RL-based caching, we measure regret—the difference between the cumulative reward of the optimal policy and that of the learning agent:

Regret(T) = Σ_{t=1}^{T} [ r(s_t, π*(s_t)) − r(s_t, a_t) ]

Sublinear Regret: A good learning algorithm achieves Regret(T) = o(T), meaning the average per-step regret vanishes as T → ∞.
Results for RL Caching:
For a contextual bandit formulation (stateless):

Regret(T) = O( √(N T log N) )

where N is the number of files.
For the full MDP formulation with finite state space S and action space A:

Regret(T) = Õ( √(|S| |A| T) )
Practical Implications:
- After T requests, the regret per request is ~ O(1/√T)
- RL policies become near-optimal within ~1000 requests for moderate-sized systems
- The learning phase cost is amortized over the deployment lifetime
Comparison: RL vs. Heuristics:
| Algorithm | Competitive Ratio | Adaptivity | Complexity |
|---|---|---|---|
| LRU | k | Low | O(1) per request |
| LFU | Unbounded | Medium | O(log N) per request |
| Static Optimal | 1 (known dist.) | None | Offline computation |
| RL (after training) | ~1.1–1.3 | High | One forward pass (NN) |
Deep Q-Networks (DQN) for Discrete Caching Decisions
For caching problems with discrete action spaces (cache/evict specific files), Deep Q-Networks (DQN) provide a powerful framework. DQN combines Q-learning with deep neural networks to handle large state spaces.
The DQN Algorithm
Core Idea: Approximate the optimal Q-function $Q^*(s,a)$ with a neural network $Q_\theta(s,a)$.
Loss Function: Minimize the temporal difference (TD) error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]$$

where $\mathcal{D}$ is the replay buffer storing past experiences, and $\theta^-$ are the target network parameters (updated slowly).
Key DQN Innovations for Stability:
- Experience Replay: Store transitions $(s, a, r, s')$ in a buffer $\mathcal{D}$ of fixed size. Sample mini-batches uniformly to break correlation between consecutive samples.
- Target Network: Maintain a separate target network $Q_{\theta^-}$ updated every $C$ steps: $\theta^- \leftarrow \theta$. Or with soft updates (Polyak averaging): $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$.
- $\epsilon$-Greedy Exploration: With probability $\epsilon$, select a random action; otherwise select $\arg\max_a Q_\theta(s,a)$. Anneal $\epsilon$ from 1.0 to 0.01 over training.
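The three mechanisms above fit in a few lines; the sketch below uses a random Q-table as a stand-in for the network (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 4
Q = rng.normal(size=(n_states, n_actions))   # online "network" Q_theta
Q_target = Q.copy()                          # target network Q_theta^-
gamma, tau, eps = 0.99, 0.001, 0.1

def td_target(r, s_next, done):
    # y = r + gamma * max_a' Q_target(s', a'); no bootstrap at episode end
    return r + (0.0 if done else gamma * Q_target[s_next].max())

def epsilon_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))  # explore
    return int(Q[s].argmax())                # exploit

# Polyak (soft) target update: theta^- <- tau*theta + (1-tau)*theta^-
Q_target = tau * Q + (1 - tau) * Q_target
```

In a real agent, `Q` and `Q_target` would be neural networks and `td_target` would feed the TD loss above; the structure of the computation is the same.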
DQN Variants for Caching
Double DQN: Reduces overestimation bias by decoupling action selection and evaluation:

$$y = r + \gamma\, Q_{\theta^-}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right)$$

instead of $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$.
Dueling DQN: Separates state value $V(s)$ from action advantage $A(s,a)$:

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$$

This architecture learns that some states are inherently good/bad regardless of action—useful for caching, where the cache state strongly determines performance.
Prioritized Experience Replay: Sample transitions with probability proportional to TD error:

$$P(i) = \frac{|\delta_i|^\alpha}{\sum_j |\delta_j|^\alpha}$$

where $\delta_i$ is the TD error and $\alpha$ controls prioritization strength.
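The PER sampling rule is a one-liner in practice; a sketch with made-up TD errors, including the importance-sampling weights that correct the bias prioritized sampling introduces:

```python
import numpy as np

# prioritized replay: P(i) proportional to |delta_i|^alpha
td_errors = np.array([0.1, 2.0, 0.5, 1.0])   # illustrative TD errors
alpha = 0.6
priorities = np.abs(td_errors) ** alpha
probs = priorities / priorities.sum()

# importance-sampling weights w_i = (N * P(i))^(-beta), normalized by max
beta = 0.4
weights = (len(probs) * probs) ** (-beta)
weights /= weights.max()
```

The transition with the largest TD error (here 2.0) gets the highest sampling probability and, correspondingly, the smallest importance weight.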
DQN Architecture for Caching
┌─────────────────────────────────────────────────────────────────────────┐
│ DQN ARCHITECTURE FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT STATE REPRESENTATION: │
│ │
│ s = [ cache_vector | popularity_estimate | temporal_features ] │
│ └─────────────┘ └──────────────────┘ └─────────────────┘ │
│ N-dim 0/1 N-dim float d_t-dim │
│ │
│ NETWORK ARCHITECTURE: │
│ │
│ State s ∈ ℝ^{N + N + d_t} │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Input Layer │ │
│ │ Dimension: 2N + d_t (e.g., 2×1000 + 24 = 2024) │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 1 │ │
│ │ • Linear(2024, 512) │ │
│ │ • ReLU activation │ │
│ │ • LayerNorm (optional) │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 2 │ │
│ │ • Linear(512, 256) │ │
│ │ • ReLU activation │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Hidden Layer 3 │ │
│ │ • Linear(256, 128) │ │
│ │ • ReLU activation │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ┌───────┴───────┐ (Dueling architecture) │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Value Head │ │ Adv. Head │ │
│ │ Linear(128,│ │ Linear(128,│ │
│ │ 1) │ │ |A|) │ │
│ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ Q(s,a) = V(s) + (A(s,a) - mean(A)) │
│ │
│ OUTPUT: Q-values for each action │
│ Dimension: |A| = N (for single-file decisions) or 2N (cache/evict) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
DQN Hyperparameters for Caching (Recommended Values)
| Hyperparameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | $\eta$ | $10^{-4}$ | Adam optimizer |
| Discount factor | $\gamma$ | 0.99 | Long-term reward focus |
| Replay buffer size | $\lvert\mathcal{D}\rvert$ | $10^5$–$10^6$ | |
| Batch size | $B$ | 64-128 | |
| Target update frequency | $C$ | 1000 steps | Hard update |
| Soft update rate | $\tau$ | 0.001 | If using Polyak |
| Initial exploration | $\epsilon_{\text{start}}$ | 1.0 | |
| Final exploration | $\epsilon_{\text{end}}$ | 0.01 | |
| Exploration decay | - | Linear | Annealed over training steps |
| Hidden dimensions | - | [512, 256, 128] | |
| Prioritization exponent | $\alpha$ | 0.6 | For PER |
| Importance sampling | $\beta$ | 0.4 → 1.0 | Annealed |
Proximal Policy Optimization (PPO) for Continuous Cache Allocation
For continuous action spaces (allocating fractional cache to each file), Proximal Policy Optimization (PPO) is the state-of-the-art algorithm, combining sample efficiency with stable training.
The PPO Algorithm
PPO is an actor-critic method that constrains policy updates to prevent destructive large changes.
Policy (Actor): $\pi_\theta(a \mid s)$ outputs a probability distribution over actions. For continuous actions:

$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$$

where $\mu_\theta$ and $\sigma_\theta$ are neural networks.
Value Function (Critic): $V_\phi(s)$ estimates the expected return from state $s$.
PPO-Clip Objective: The key innovation is clipping the probability ratio to prevent large updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio
- $\hat{A}_t$ is the estimated advantage
- $\epsilon$ is the clip range (typically 0.2)
Generalized Advantage Estimation (GAE): Compute advantages using a weighted combination of n-step returns:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual.
Typically $\lambda = 0.95$ balances bias (low $\lambda$) and variance (high $\lambda$).
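In code, GAE reduces to a backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; a sketch over a toy three-step rollout (rewards and values below are made up):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout.

    `values` has length T+1: it includes the bootstrap value of the
    state after the last step. TD residuals: delta_t = r_t + g*V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # backward recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.2])   # V(s_0), ..., V(s_3)
adv = gae(rewards, values)
```

The last advantage equals its own TD residual, and earlier advantages fold in discounted future residuals.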
Full PPO Loss:

$$L(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1 L^{\text{VF}}(\phi) - c_2 H[\pi_\theta]$$

where:
- $L^{\text{VF}}(\phi) = \left(V_\phi(s_t) - \hat{R}_t\right)^2$ is the value function loss
- $H[\pi_\theta]$ is an entropy bonus for exploration
- $c_1$, $c_2$ are weighting coefficients
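The clipped surrogate itself is easy to sketch on a toy batch (the log-probabilities and advantages below are made up for illustration):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # probability ratio r_t(theta) = pi_new / pi_old, computed in log space
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # take the pessimistic (min) surrogate; negate because we minimize
    return -np.minimum(unclipped, clipped).mean()

logp_old = np.log(np.array([0.2, 0.5, 0.3]))
logp_new = np.log(np.array([0.4, 0.5, 0.1]))
adv = np.array([1.0, -0.5, 2.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

For the first sample the ratio is 2.0, so with a positive advantage the surrogate is clipped at 1.2; the clip is what keeps a single update from moving the policy too far.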
PPO Architecture for Cache Allocation
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO ACTOR-CRITIC FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SHARED FEATURE EXTRACTOR: │
│ │
│ State s ──► [ MLP: 2N+d_t → 256 → 256 ] ──► Features f(s) ∈ ℝ^{256} │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ACTOR (Policy Network): │
│ │
│ f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 2N) ] │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ │ │
│ ▼ ▼ │
│ μ(s) ∈ ℝ^N log σ(s) ∈ ℝ^N │
│ (mean) (log std dev) │
│ │ │ │
│ └─────────┬─────────┘ │
│ │ │
│ ▼ │
│ a ~ N(μ(s), exp(log σ(s))²) │
│ Cache allocation: a_n = sigmoid(a_n) × M/N │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CRITIC (Value Network): │
│ │
│ f(s) ──► [ Linear(256, 128) → ReLU → Linear(128, 1) ] ──► V(s) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING LOOP (per epoch): │
│ │
│ 1. Collect T timesteps using current policy π_θ │
│ 2. Compute advantages Â_t using GAE │
│ 3. For K epochs (typically K=10): │
│ a. Sample mini-batches from collected data │
│ b. Compute PPO-Clip loss │
│ c. Update θ, φ using Adam │
│ 4. Repeat │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PPO Hyperparameters for Caching (Recommended Values)
| Hyperparameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | $\eta$ | $3 \times 10^{-4}$ | Adam, can decay |
| Clip range | $\epsilon$ | 0.2 | PPO-Clip parameter |
| GAE lambda | $\lambda$ | 0.95 | Advantage estimation |
| Discount factor | $\gamma$ | 0.99 | |
| Value loss coefficient | $c_1$ | 0.5 | |
| Entropy coefficient | $c_2$ | 0.01 | Encourage exploration |
| Epochs per update | $K$ | 10 | Reuse collected data |
| Mini-batch size | $B$ | 64 | |
| Rollout length | $T$ | 2048 | Steps before update |
| Number of parallel envs | - | 8-16 | For variance reduction |
| Hidden dimensions | - | [256, 256, 128] | Shared + heads |
| Max gradient norm | - | 0.5 | Gradient clipping |
A2C vs. PPO: When to Use Which
| Criterion | A2C | PPO |
|---|---|---|
| Sample efficiency | Lower | Higher (reuses data) |
| Training stability | Moderate | High (clipping) |
| Wall-clock time | Faster per update | Slower per update |
| Hyperparameter sensitivity | Higher | Lower |
| Best for | Fast prototyping | Production deployment |
| Parallelization | Synchronous workers | Parallel envs |
Recommendation: Use PPO for production coded caching systems due to its stability. Use A2C for rapid experimentation.
Federated Learning for Privacy-Preserving Popularity Prediction
Predicting content popularity requires observing user behavior—but users may not want their viewing habits shared with a central server.
Federated learning solves this:
- Local training: Each edge server trains a popularity prediction model on local user data
- Gradient aggregation: Only model gradients (not raw data) are sent to the central server
- Global model update: Central server averages gradients to update a global model
- Model distribution: Updated global model is sent back to edges
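The aggregation in steps 2–3 can be sketched as plain FedAvg-style gradient averaging; `local_gradient` below is a hypothetical stand-in for training on an edge's private request log:

```python
import numpy as np

theta_global = np.zeros(4)   # global popularity-model parameters
eta = 0.1                    # server learning rate

def local_gradient(theta, edge_id):
    # hypothetical stand-in for a gradient computed on this edge's
    # private request log -- raw user data never leaves the edge
    rng_e = np.random.default_rng(edge_id)
    return theta - rng_e.normal(size=theta.shape)

# each of K=5 edges reports only its gradient; the server averages them:
# theta <- theta - eta * (1/K) * sum_i grad_i
grads = [local_gradient(theta_global, k) for k in range(5)]
theta_global = theta_global - eta * np.mean(grads, axis=0)
```

The server sees only the gradient average; adding calibrated noise to each `grad_i` before upload would give the differential-privacy variant discussed below.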
┌─────────────────────────────────────────────────────────────────────────┐
│ FEDERATED LEARNING FOR CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ CENTRAL │ │
│ │ SERVER │ │
│ │ │ │
│ │ Global Model │ │
│ │ θ_global │ │
│ └───────┬───────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ Distribute θ Aggregate Distribute θ │
│ │ Gradients │ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌────────────┐ ┌──────┴──────┐ ┌────────────┐ │
│ │ EDGE 1 │ │ │ │ EDGE K │ │
│ │ │ │ Gradients │ │ │ │
│ │ Local Data │────│ ∇θ₁,...,∇θₖ│────│ Local Data │ │
│ │ (private) │ │ │ │ (private) │ │
│ │ │ │ │ │ │ │
│ │ Train on │ │ Aggregate: │ │ Train on │ │
│ │ local │ │ θ ← θ - │ │ local │ │
│ │ requests │ │ η·Σ∇θᵢ/K │ │ requests │ │
│ └────────────┘ └─────────────┘ └────────────┘ │
│ │
│ PRIVACY GUARANTEES: │
│ │
│ • Raw user data NEVER leaves the edge │
│ • Central server sees only aggregated gradients │
│ • With differential privacy noise, even gradients don't leak │
│ individual user preferences │
│ │
│ BENEFITS FOR CODED CACHING: │
│ │
│ • Learn global popularity patterns (what's trending everywhere) │
│ • Preserve local personalization (edge models fine-tune) │
│ • Scale to millions of users without central data collection │
│ • Comply with privacy regulations (GDPR, CCPA) │
│ │
│ RECENT WORK (Graph Federated Learning, arXiv Feb 2025): │
│ │
│ • Integrate Graph Neural Networks with Federated Learning │
│ • LightGCN captures user-content relationships locally │
│ • Global aggregation learns cross-user patterns │
│ • 20-30% improvement in cache hit rate over non-federated baseline │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Graph Neural Networks for User-Content Relationships
Content requests aren't independent—they form a graph structure:
- User-content edges: User $u$ requested content $c$
- Content-content edges: Contents $c_i$ and $c_j$ are often requested together
- User-user edges: Users $u_i$ and $u_j$ have similar preferences
Graph Neural Networks (GNNs) can learn representations that capture this structure, enabling powerful popularity prediction for cache placement.
The Bipartite User-Content Graph
Graph Definition: Let $G = (V, E)$ where:
- $U = \{u_1, \ldots, u_K\}$ are user nodes
- $C = \{c_1, \ldots, c_N\}$ are content nodes, with $V = U \cup C$
- $E \subseteq U \times C$ are edges
Adjacency Matrix: The bipartite adjacency matrix $A \in \{0,1\}^{(K+N) \times (K+N)}$:

$$A = \begin{pmatrix} 0 & R \\ R^\top & 0 \end{pmatrix}$$

where $R \in \{0,1\}^{K \times N}$ is the user-content interaction matrix.
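A quick sketch of assembling the block adjacency from a toy interaction matrix (3 users, 4 contents; the entries of `R` are illustrative):

```python
import numpy as np

K, N = 3, 4                                   # 3 users, 4 contents
R = np.array([[1, 0, 1, 0],                   # R[u, c] = 1 iff user u requested c
              [0, 1, 1, 0],
              [1, 1, 0, 1]])

# A = [[0, R], [R^T, 0]] -- users connect only to contents, and vice versa
A = np.block([[np.zeros((K, K)), R],
              [R.T, np.zeros((N, N))]])
```

The bipartite structure shows up as zero diagonal blocks, and the graph is undirected, so `A` is symmetric.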
GNN Message Passing Framework
The general GNN update rule for node $v$ at layer $l$:

$$h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)},\ \text{AGGREGATE}\left(\left\{h_u^{(l)} : u \in \mathcal{N}(v)\right\}\right)\right)$$
Different GNN architectures define different AGGREGATE and UPDATE functions.
Graph Convolutional Networks (GCN) for Caching
GCN Layer: Propagates information using the normalized adjacency:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where:
- $\tilde{A} = A + I$ (add self-loops)
- $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ (degree matrix)
- $W^{(l)}$ is learnable
Per-Node Update:

$$h_v^{(l+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_v \tilde{d}_u}}\, W^{(l)} h_u^{(l)}\right)$$

The normalization prevents nodes with high degree from dominating.
LightGCN: State-of-the-Art for Recommendation/Caching
LightGCN simplifies GCN by removing nonlinearities and feature transformations, which proves more effective for collaborative filtering:
Layer Propagation:

$$e_u^{(l+1)} = \sum_{c \in \mathcal{N}(u)} \frac{1}{\sqrt{|\mathcal{N}(u)|\,|\mathcal{N}(c)|}}\, e_c^{(l)}, \qquad e_c^{(l+1)} = \sum_{u \in \mathcal{N}(c)} \frac{1}{\sqrt{|\mathcal{N}(c)|\,|\mathcal{N}(u)|}}\, e_u^{(l)}$$

Final Embedding: Average over all layers:

$$e_u = \frac{1}{L+1} \sum_{l=0}^{L} e_u^{(l)}$$

Prediction: Request probability via dot product:

$$\hat{y}_{uc} = e_u^\top e_c$$

Loss Function: Bayesian Personalized Ranking (BPR):

$$\mathcal{L}_{\text{BPR}} = -\sum_{(u, c^+, c^-)} \ln \sigma\left(\hat{y}_{uc^+} - \hat{y}_{uc^-}\right) + \lambda \|\Theta\|^2$$

where $(u, c^+, c^-)$ are triplets with a positive (requested) and a negative (not requested) content.
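One propagation layer plus the layer-averaged prediction can be sketched directly in matrix form (toy graph, random initial embeddings; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, d = 3, 4, 8
R = np.array([[1, 0, 1, 0],                   # user-content interactions
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)
E_u = rng.normal(size=(K, d))                 # user embeddings, layer 0
E_c = rng.normal(size=(N, d))                 # content embeddings, layer 0

deg_u = R.sum(axis=1)                         # |N(u)|
deg_c = R.sum(axis=0)                         # |N(c)|
norm = R / np.sqrt(np.outer(deg_u, deg_c))    # 1 / sqrt(|N(u)| |N(c)|)

# one LightGCN layer: degree-normalized neighbor sum, no weights/nonlinearity
E_u1 = norm @ E_c                             # e_u^(1)
E_c1 = norm.T @ E_u                           # e_c^(1)

# final embeddings: average across layers (here layers 0 and 1)
e_u = (E_u + E_u1) / 2
e_c = (E_c + E_c1) / 2
scores = e_u @ e_c.T                          # y_uc = e_u . e_c for all pairs
```

The absence of weight matrices and activations is the whole point of LightGCN: propagation is a fixed linear smoothing, and only the layer-0 embeddings are learned (via the BPR loss above).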
Graph Attention Networks (GAT) for Caching
GAT learns attention weights to determine neighbor importance:
Attention Coefficient:

$$e_{vu} = \text{LeakyReLU}\left(\mathbf{a}^\top \left[W h_v \,\|\, W h_u\right]\right)$$

Normalized Attention:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}$$

Aggregation:

$$h_v' = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, W h_u\right)$$

Multi-Head Attention: Use $M$ parallel attention heads and concatenate:

$$h_v' = \Big\Vert_{m=1}^{M}\ \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(m)}\, W^{(m)} h_u\right)$$
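A sketch of the single-head attention computation for one node (weights are random, and `tanh` stands in for the nonlinearity $\sigma$; all of this is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))             # shared linear transform
a = rng.normal(size=2 * d_out)                 # attention vector

h_v = rng.normal(size=d_in)                    # center node features
neighbors = rng.normal(size=(3, d_in))         # features of N(v)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# e_vu = LeakyReLU(a^T [W h_v || W h_u]) for each neighbor u
Wh_v = h_v @ W
scores = np.array([leaky_relu(a @ np.concatenate([Wh_v, hu @ W]))
                   for hu in neighbors])

# softmax over the neighborhood gives alpha_vu
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# aggregation: h_v' = sigma(sum_u alpha_vu * W h_u)
h_v_new = np.tanh(alpha @ (neighbors @ W))
```

With multiple heads, this computation runs $M$ times with independent `W` and `a`, and the resulting `h_v_new` vectors are concatenated.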
Why GAT for Caching: Attention learns that some user-content interactions are more informative than others. A user who watched 1000 videos provides less signal per video than one who watched 10.
┌─────────────────────────────────────────────────────────────────────────┐
│ GNN ARCHITECTURE FOR CACHE PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: Bipartite graph G = (Users ∪ Content, Edges) │
│ │
│ User initial embeddings: E_U ∈ ℝ^{K × d}, d = 64 │
│ Content initial embeddings: E_C ∈ ℝ^{N × d} │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 0 (Initial) │ │
│ │ │ │
│ │ e_u^(0) = E_U[u] (learnable user embedding) │ │
│ │ e_c^(0) = E_C[c] (learnable content embedding) │ │
│ │ │ │
│ │ Optional: Concatenate content features (genre, duration, etc.) │ │
│ │ e_c^(0) = [E_C[c] || f_c] where f_c ∈ ℝ^{d_feat} │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 1 (First-hop neighbors) │ │
│ │ │ │
│ │ For each user u: │ │
│ │ e_u^(1) = AGG({ e_c^(0) : c ∈ N(u) }) │ │
│ │ │ │
│ │ For each content c: │ │
│ │ e_c^(1) = AGG({ e_u^(0) : u ∈ N(c) }) │ │
│ │ │ │
│ │ AGG = LightGCN (normalized sum) or GAT (attention-weighted) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LAYER 2 (Second-hop = user-user, content-content similarity) │ │
│ │ │ │
│ │ User embedding now contains info about: │ │
│ │ • Directly watched content (Layer 1) │ │
│ │ • Other users who watched same content (Layer 2) │ │
│ │ │ │
│ │ Content embedding now contains info about: │ │
│ │ • Users who watched it (Layer 1) │ │
│ │ • Other content those users watched (Layer 2) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FINAL EMBEDDINGS (LightGCN: average across layers) │ │
│ │ │ │
│ │ e_u = (e_u^(0) + e_u^(1) + e_u^(2)) / 3 │ │
│ │ e_c = (e_c^(0) + e_c^(1) + e_c^(2)) / 3 │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PREDICTION & CACHE PLACEMENT │ │
│ │ │ │
│ │ Popularity prediction for content c: │ │
│ │ p_c = Σ_u σ(e_u^T e_c) / K (average predicted probability) │ │
│ │ │ │
│ │ Cache placement: Sort by p_c, cache top M files │ │
│ │ │ │
│ │ For coded caching: Also consider embedding diversity │ │
│ │ Maximize: Σ p_c × cache_c + λ × diversity(cached_embeddings)│ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ GNN HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Embedding dimension d: 64 │
│ Number of layers L: 2-3 │
│ Learning rate: 1e-3 │
│ Batch size: 2048 (user-content pairs) │
│ Negative samples per positive: 4 │
│ Regularization λ: 1e-4 │
│ Optimizer: Adam │
│ Training epochs: 100-500 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
For coded caching, GNN-based popularity prediction informs placement decisions. Content with similar embeddings should be cached together (they'll be requested by similar users). Content with complementary embeddings should be spread across caches (for coding opportunities).
Attention Mechanisms for Content Popularity Prediction
Beyond GNNs, attention mechanisms can directly model user-content interactions without explicit graph structure.
Self-Attention for Request Sequences
Model each user's request history as a sequence and use self-attention to capture patterns:
Input: User $u$'s request history $(c_1, c_2, \ldots, c_T)$
Embedding: $X = [e_{c_1}; e_{c_2}; \ldots; e_{c_T}] \in \mathbb{R}^{T \times d}$
Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
Multi-Head Attention:

$$\text{MultiHead}(X) = \left[\text{head}_1 \,\|\, \cdots \,\|\, \text{head}_h\right] W_O$$

where $\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$
Position Encoding: Add positional information to distinguish recent vs. older requests:

$$x_t = e_{c_t} + \text{PE}(t), \qquad \text{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \quad \text{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

For temporal patterns in caching, we can also add a time-of-day encoding:

$$\text{TE}(t) = \left[\sin\!\left(\tfrac{2\pi h_t}{24}\right), \cos\!\left(\tfrac{2\pi h_t}{24}\right), \sin\!\left(\tfrac{2\pi d_t}{7}\right), \cos\!\left(\tfrac{2\pi d_t}{7}\right)\right]$$

where $h_t$ is hour of day and $d_t$ is day of week.
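Both encodings are a few lines; the cyclic time encoding is what makes 23:00 and 01:00 neighbors in feature space, which a raw hour number would not:

```python
import numpy as np

def positional_encoding(t, d):
    # sinusoidal PE: even slots sin, odd slots cos, geometric frequencies
    i = np.arange(d // 2)
    angles = t / (10000 ** (2 * i / d))
    pe = np.zeros(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def temporal_encoding(hour, weekday):
    # cyclic encoding of hour-of-day and day-of-week
    return np.array([np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
                     np.sin(2 * np.pi * weekday / 7), np.cos(2 * np.pi * weekday / 7)])

pe = positional_encoding(t=5, d=8)
te_23, te_01 = temporal_encoding(23, 0), temporal_encoding(1, 0)
close = np.linalg.norm(te_23 - te_01)                  # 23:00 vs 01:00
far = np.linalg.norm(te_23 - temporal_encoding(11, 0)) # 23:00 vs 11:00
```

Here `close < far`: the wrap-around at midnight is handled by the sine/cosine pair, which matters for diurnal request patterns.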
Cross-Attention for User-Content Matching
Use cross-attention to match user preferences with content features:
User Query: $q_u$ = user embedding from a GNN or MLP. Content Keys/Values: $K, V$ = content embeddings for all content:

$$z_u = \text{softmax}\left(\frac{q_u K^\top}{\sqrt{d}}\right) V$$

This gives a user representation $z_u$ that is a weighted combination of content embeddings, weighted by relevance.
Transformer-Based Popularity Prediction
Full Transformer architectures can model complex temporal patterns in content requests.
Transformer Encoder for Request Prediction
┌─────────────────────────────────────────────────────────────────────────┐
│ TRANSFORMER FOR POPULARITY PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: Recent request history across all users │
│ X = [request_1, request_2, ..., request_T] │
│ Each request = (user_id, content_id, timestamp) │
│ │
│ EMBEDDING LAYER: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ e_t = Embed_content(c_t) + Embed_user(u_t) + PE(t) + TE(timestamp) │
│ │
│ where PE = positional encoding, TE = temporal encoding │
│ │
│ X_embed ∈ ℝ^{T × d}, d = 256 │
│ │
│ TRANSFORMER ENCODER STACK (N=4 layers): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ For each layer n: │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention (8 heads) │ │
│ │ │ │
│ │ Q, K, V = X @ W_Q, X @ W_K, X @ W_V │ │
│ │ Attn = softmax(QK^T / √d_k) V │ │
│ │ X = LayerNorm(X + Dropout(Attn)) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ │ │
│ │ FFN(x) = ReLU(x @ W_1 + b_1) @ W_2 + b_2 │ │
│ │ W_1 ∈ ℝ^{d × 4d}, W_2 ∈ ℝ^{4d × d} │ │
│ │ X = LayerNorm(X + Dropout(FFN(X))) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ OUTPUT HEAD (Popularity Prediction): │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ h_final = X[-1] (last position embedding) │
│ │
│ For each content c: │
│ p_c = σ(h_final @ W_out @ e_c) (predicted request probability) │
│ │
│ Or: p = softmax(h_final @ E_content^T) (distribution over content) │
│ │
│ TRAINING: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Next-request prediction: Given history, predict next requested content│
│ │
│ Loss = CrossEntropy(p, c_{T+1}) │
│ = -log(p_{c_{T+1}}) │
│ │
│ For multi-step prediction, use autoregressive generation or │
│ predict distribution over next K requests simultaneously. │
│ │
│ HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Model dimension d: 256 │
│ Number of layers N: 4 │
│ Number of attention heads: 8 │
│ Feed-forward dimension: 1024 (4×d) │
│ Dropout: 0.1 │
│ Sequence length T: 100-500 │
│ Learning rate: 1e-4 (with warmup) │
│ Batch size: 64 sequences │
│ Warmup steps: 4000 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Pre-Training Strategies for Caching Transformers
Masked Content Prediction: Like BERT, mask random requests and predict them:

$$\mathcal{L}_{\text{mask}} = -\sum_{t \in \mathcal{M}} \log p_\theta\!\left(c_t \mid X_{\setminus \mathcal{M}}\right)$$

Contrastive Learning: Learn representations where similar request patterns are close (InfoNCE-style):

$$\mathcal{L}_{\text{contrast}} = -\log \frac{\exp\left(\text{sim}(z_i, z_i^+)/\tau\right)}{\sum_j \exp\left(\text{sim}(z_i, z_j)/\tau\right)}$$
Multi-Task Pre-Training:
- Predict next content
- Predict time until next request
- Predict user of next request
- Classify content category
Meta-Reinforcement Learning for Fast Adaptation
Content popularity can shift suddenly—a viral video emerges, breaking news changes demand patterns. Standard RL agents take thousands of interactions to adapt.
Meta-reinforcement learning addresses this by training agents that can learn to adapt quickly:
- Meta-training: Train on many different popularity distributions
- Meta-objective: Not just high reward, but fast adaptation to new distributions
- Deployment: When popularity shifts, the agent adapts in ~10-100 interactions instead of thousands
MAML: Model-Agnostic Meta-Learning for Caching
MAML learns an initialization $\theta$ such that a few gradient steps on a new task yield good performance.
Algorithm:
- Sample task batch: $\{\mathcal{T}_i\} \sim p(\mathcal{T})$, where each $\mathcal{T}_i$ is a different popularity distribution
- Inner loop (task-specific adaptation): For each task $\mathcal{T}_i$:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$$

This is one (or a few) gradient steps on task-specific data.
- Outer loop (meta-update): Update $\theta$ to improve post-adaptation performance:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Note: This requires computing gradients through the inner gradient step (second-order).
First-Order Approximation (FOMAML): Ignore second-order terms:

$$\theta \leftarrow \theta - \beta \sum_i \nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Much cheaper computationally, and works nearly as well in practice.
MAML for Caching:

$$\min_\theta\ \mathbb{E}_{p_i \sim p(\mathcal{T})}\left[\mathcal{L}_{p_i}\!\left(\theta - \alpha \nabla_\theta \mathcal{L}_{p_i}(\theta)\right)\right]$$

where $\mathcal{L}_{p_i}$ is the loss on requests sampled from popularity distribution $p_i$.
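A sketch of FOMAML on toy quadratic "tasks"—each task optimum standing in for one popularity regime—where the meta-trained initialization ends up close to all task optima at once:

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.zeros(2)                                    # meta-initialization
alpha, beta, K = 0.1, 0.01, 5                          # inner lr, outer lr, inner steps
task_optima = [rng.normal(size=2) for _ in range(8)]   # one toy "popularity regime" each

def grad(theta, opt):
    # gradient of the toy task loss 0.5 * ||theta - opt||^2
    return theta - opt

for it in range(200):                                  # meta-training
    meta_grad = np.zeros(2)
    for opt in task_optima:
        phi = theta.copy()
        for _ in range(K):                             # inner loop: adapt to this task
            phi = phi - alpha * grad(phi, opt)
        meta_grad += grad(phi, opt)                    # FOMAML: gradient at adapted phi
    theta = theta - beta * meta_grad / len(task_optima)
```

On these quadratics the meta-update provably pulls `theta` toward the centroid of the task optima, so post-adaptation loss is small for every task after only `K` inner steps—the toy analogue of fast adaptation to a new popularity distribution.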
┌─────────────────────────────────────────────────────────────────────────┐
│ MAML FOR ADAPTIVE CACHING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ META-TRAINING PHASE: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Task Distribution: Different Zipf parameters α ∈ [0.5, 1.5] │
│ Different content libraries │
│ Different diurnal patterns │
│ │
│ For each meta-iteration: │
│ │
│ 1. Sample B=4 tasks (popularity distributions) │
│ │
│ 2. For each task i: │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inner Loop (K=5 gradient steps): │ │
│ │ │ │
│ │ θ_i^(0) = θ │ │
│ │ for k = 1 to K: │ │
│ │ Sample batch of requests from task i │ │
│ │ θ_i^(k) = θ_i^(k-1) - α ∇ L_i(θ_i^(k-1)) │ │
│ │ │ │
│ │ θ_i' = θ_i^(K) (adapted parameters) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ 3. Meta-update: │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Evaluate L_i(θ_i') on NEW data from task i │ │
│ │ │ │
│ │ θ ← θ - β ∇_θ Σ_i L_i(θ_i') │ │
│ │ │ │
│ │ (This gradient flows through the inner loop!) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ DEPLOYMENT PHASE: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ When popularity shifts detected: │
│ │
│ 1. Start with meta-learned θ* │
│ 2. Collect ~50-100 requests from new distribution │
│ 3. Take K=5 gradient steps: θ_new = θ* - α ∇ L_new(θ*) │
│ 4. Deploy θ_new │
│ │
│ Result: Adaptation in 50-100 requests instead of 5000+ │
│ │
│ HYPERPARAMETERS: │
│ ───────────────────────────────────────────────────────────────────── │
│ Inner learning rate α: 0.01 │
│ Outer learning rate β: 0.001 │
│ Inner gradient steps K: 5 │
│ Tasks per batch B: 4-8 │
│ Meta-training iterations: 10,000-50,000 │
│ Policy architecture: Same as PPO (shared backbone) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Reptile: A Simpler Alternative
Reptile is a first-order meta-learning algorithm that's simpler than MAML:
Algorithm:
- Sample task $\mathcal{T}_i$
- Initialize $\phi = \theta$
- Run $k$ gradient steps on $\mathcal{T}_i$: $\phi \leftarrow \phi - \alpha \nabla_\phi \mathcal{L}_{\mathcal{T}_i}(\phi)$
- Meta-update: $\theta \leftarrow \theta + \epsilon(\phi - \theta)$
Intuition: Move the initialization toward the final parameters after training on each task. The optimal $\theta$ is (roughly) equidistant from all task-specific optima.
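A sketch of Reptile on four symmetric toy tasks: starting far away, the initialization drifts toward the point equidistant from the task optima (here the origin); all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.array([2.0, 2.0])                           # deliberately bad start
alpha, eps_meta, k = 0.1, 0.1, 10                      # inner lr, meta step, inner steps
task_optima = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
               np.array([0.0, 1.0]), np.array([0.0, -1.0])]

for it in range(500):
    opt = task_optima[rng.integers(len(task_optima))]  # sample one task
    phi = theta.copy()
    for _ in range(k):                                 # inner SGD on 0.5*||phi - opt||^2
        phi = phi - alpha * (phi - opt)
    theta = theta + eps_meta * (phi - theta)           # Reptile meta-update
```

No gradients ever flow through the inner loop, which is why Reptile needs no second-order machinery and no stored computation graph.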
Reptile vs. MAML:
| Aspect | MAML | Reptile |
|---|---|---|
| Gradients | Second-order (or FOMAML) | First-order only |
| Memory | High (store computation graph) | Low |
| Implementation | Complex | Simple |
| Performance | Slightly better | Nearly as good |
| Computation | ~2× slower | Faster |
RL² (Learning to Reinforcement Learn)
RL² uses a recurrent neural network that learns to adapt within an episode:
Architecture: LSTM-based policy that receives $(s_t, a_{t-1}, r_{t-1})$ as input
Training: Train on many tasks where each episode is a new task. The LSTM learns to:
- Explore efficiently at the start of episodes
- Exploit learned knowledge later
- Adapt its behavior based on observed rewards
For Caching: The LSTM receives:
- Current cache state
- Previous caching action
- Previous reward (hit rate, latency)
Over time, it learns to recognize the current popularity pattern and adapt its caching strategy accordingly—all within a single forward pass, no gradient updates needed at deployment.
RL² Update Rule: Standard policy gradient on the recurrent policy:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t, h_t)\, \hat{A}_t\right]$$

where $h_t$ is the LSTM hidden state encoding the history.
Recent work (2025) on "Fast Adaptive Caching Algorithm for Mobile Edge Networks Based on Meta-reinforcement Learning" shows:
- 5× faster adaptation to popularity shifts
- Only 2-3% performance loss during adaptation periods
- Robust across different types of popularity changes (gradual drift, sudden shift, periodic patterns)
When to Use Which Meta-RL Algorithm
| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Discrete distribution shifts | MAML/FOMAML | Best for distinct task boundaries |
| Continuous drift | RL² | No explicit adaptation step needed |
| Limited compute budget | Reptile | Simpler, faster training |
| Online adaptation | RL² | No gradient steps at deployment |
| Periodic patterns (daily/weekly) | MAML with time-aware tasks | Can pre-learn periodic structure |
Part V: Integration with 5G/6G Mobile Edge Computing
MEC Architecture and Caching
Mobile Edge Computing (MEC) places computational and storage resources at the network edge—at or near cellular base stations. This is the natural deployment point for coded caching:
┌─────────────────────────────────────────────────────────────────────────┐
│ CODED CACHING IN 5G/6G MEC ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ CORE NETWORK │ │
│ │ (Content CDN) │ │
│ └──────────┬──────────┘ │
│ │ │
│ Backhaul Link │
│ (Expensive, congested) │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ MEC 1 │ │ MEC 2 │ │ MEC 3 │ │
│ │ │ │ │ │ │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ CACHE │ │ │ │ CACHE │ │ │ │ CACHE │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ Coded │ │ │ │ Coded │ │ │ │ Coded │ │ │
│ │ │Content │ │ │ │Content │ │ │ │Content │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ AI │ │ │ │ AI │ │ │ │ AI │ │ │
│ │ │ ENGINE │ │ │ │ ENGINE │ │ │ │ ENGINE │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ Fronthaul Fronthaul Fronthaul │
│ │ │ │ │
│ ┌─────┴──────┐ ┌─────┴──────┐ ┌─────┴──────┐ │
│ │ gNB 1 │ │ gNB 2 │ │ gNB 3 │ │
│ │ │ │ │ │ │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
│ │ UEs │ │ UEs │ │ UEs │ │
│ └─────┘ └─────┘ └─────┘ │
│ │
│ AI ENGINE FUNCTIONS: │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 1. POPULARITY PREDICTION │
│ • Federated learning across edges │
│ • GNN for user-content modeling │
│ • LSTM/Transformer for temporal patterns │
│ │
│ 2. CACHE PLACEMENT OPTIMIZATION │
│ • Deep RL for dynamic placement │
│ • Coordinated across MEC nodes for coding gain │
│ • Trade-off: local hits vs. coding opportunity │
│ │
│ 3. CODED DELIVERY SCHEDULING │
│ • Identify multicast groups │
│ • Construct optimal coded messages │
│ • Schedule transmissions across time/frequency │
│ │
│ 4. USER MOBILITY PREDICTION │
│ • Predict user movement across cells │
│ • Proactive content migration │
│ • Handover-aware cache coordination │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BENEFITS OF CODED CACHING IN MEC: │
│ │
│ • Reduced backhaul: Coded multicasting reduces core traffic │
│ • Lower latency: Edge-served content, no core round-trip │
│ • Improved QoE: Proactive caching before user requests │
│ • Network efficiency: Same backhaul serves more users │
│ • Scalability: Global caching gain grows with user density │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Semantic Communication and Coded Caching in 6G
6G networks will introduce semantic communication—transmitting meaning rather than bits. This fundamentally changes caching:
Traditional caching: Store exact file bits. Transmit missing bits.
Semantic caching: Store semantic representations (compressed meaning). Transmit semantic differences. Receiver reconstructs content using local knowledge base.
Example: Instead of caching a full 4K video file:
- Cache a semantic description + key frames + neural decoder model
- When user requests video, transmit only the scene-specific information
- Local AI reconstructs the full video using cached knowledge
Coded semantic caching combines these ideas:
- Cache semantic representations (100-1000× smaller than raw content)
- Code across semantic representations for delivery
- Achieve both compression gain AND coding gain
The potential improvement is multiplicative: if semantic compression gives 100× and coded caching gives 10×, the combined gain approaches 1000×.
Performance Analysis: Putting It All Together
┌─────────────────────────────────────────────────────────────────────────┐
│ CODED CACHING PERFORMANCE IN 5G/6G MEC │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM PARAMETERS: │
│ • N = 1000 files (video library) │
│ • K = 100 users per MEC node │
│ • M = 100 files per user cache (10% of library) │
│ • 10 MEC nodes in deployment │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CACHING STRATEGY COMPARISON: │
│ │
│ Strategy Delivery Rate Backhaul Latency │
│ (files/period) Reduction (ms) │
│ ───────────────────────────────────────────────────────────────────── │
│ No caching 100 0% 150 │
│ LRU (uncoded) 75 25% 120 │
│ LFU (uncoded) 68 32% 115 │
│ Popularity-optimal (uncoded) 60 40% 105 │
│ MAN coded caching 12 88% 35 │
│ AI-optimized coded 10 90% 30 │
│ Semantic coded (6G) 1 99% 10 │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY INSIGHTS: │
│ │
│ 1. Coded caching provides 5-6× improvement over best uncoded policy │
│ This is the "global caching gain" predicted by theory │
│ │
│ 2. AI optimization adds another 15-20% on top of theoretical scheme │
│ By adapting to actual (non-uniform, time-varying) popularity │
│ │
│ 3. Semantic communication offers order-of-magnitude further gains │
│ By transmitting meaning rather than bits │
│ │
│ 4. Latency reduction is dramatic │
│ From 150ms (core) to 10ms (edge semantic) — enables new apps │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DEPLOYMENT CONSIDERATIONS: │
│ │
│ Subpacketization: With K=100, t=10, need C(100,10) ≈ 10^13 subfiles │
│ Solution: User grouping (10 groups of 10) reduces to C(10,1) = 10 │
│ Trade-off: Lose some coding gain (factor ~2) but remain practical │
│ │
│ Coordination overhead: MEC nodes must share cache state │
│ Solution: Periodic synchronization (every 1-10 minutes sufficient) │
│ AI handles short-term dynamics; coordination handles long-term │
│ │
│ Heterogeneous caches: Different MEC nodes have different sizes │
│ Solution: Weighted placement—larger caches store more popular files │
│ AI learns optimal allocation across heterogeneous infrastructure │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part VI: Current Research Frontiers
Coded Caching with Security and Privacy
Problem: Coded delivery reveals information about what other users requested. If I receive $W_A \oplus W_B$, I learn that someone requested file B.
Solutions being researched:
- Secret sharing: Split cache content so no single coded message reveals complete information
- Private information retrieval (PIR): Users download content without revealing which file they want
- Differential privacy: Add noise to coded messages that masks individual requests while preserving utility
Coded Caching with Heterogeneous Files
Real content libraries have diverse file sizes—short clips, long movies, live streams. The MAN scheme assumes equal file sizes.
Extensions:
- Weighted placement: Allocate cache space proportional to file size × popularity
- Layered coding: Split large files into layers, apply coded caching at each layer
- Adaptive subpacketization: Different subpacketization for different file sizes
Online Coded Caching
The MAN scheme assumes batch processing—all requests arrive together, then delivery phase begins. Real systems have continuous request arrivals.
Online approaches:
- Sliding window: Group requests in time windows, apply coded delivery within windows
- Queue-based: Delay requests briefly to accumulate coding opportunities
- Predictive delivery: Start delivering predicted requests before they're actually made
Coded Caching for Machine Learning
An emerging application: distributed ML training generates massive data movement (gradients, parameters), and coded caching principles can help:
- Gradient caching: Workers cache parts of the gradient; coordinator sends coded combinations
- Model caching: Different workers cache different model components; federated aggregation uses coding
- Data shuffling: Shuffle training data across workers using coded communication
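The data-shuffling bullet has a minimal two-worker illustration: if each worker happens to cache the block the other needs for the next epoch, the coordinator can serve both with a single XOR broadcast instead of two unicasts. The setup is a toy; real coded-shuffling schemes plan this placement deliberately across many workers.

```python
import secrets

# Two data blocks the workers need for the next training epoch.
block_A = secrets.token_bytes(8)   # needed by worker 1
block_B = secrets.token_bytes(8)   # needed by worker 2

# Placement from the previous epoch: each worker caches the other block.
cache_w1, cache_w2 = block_B, block_A

def xor(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

# One coded broadcast replaces two unicast transmissions.
broadcast = xor(block_A, block_B)

# Each worker recovers its missing block using its cache as the key.
assert xor(broadcast, cache_w1) == block_A   # worker 1 decodes A
assert xor(broadcast, cache_w2) == block_B   # worker 2 decodes B
```

This is the same side-information mechanism as in coded content delivery, just with training-data blocks in place of video subfiles.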
Sources:
- Fundamental Limits of Caching (Maddah-Ali & Niesen, 2014) - arXiv
- Learning to Code: Coded Caching via Deep RL - arXiv
- Graph Federated Learning Based Proactive Content Caching (2025) - arXiv
- Coded Caching Optimization via A2C Learning (2025) - MDPI
- ML Techniques for Edge Caching Survey - ScienceDirect
- Multimodal Semantic Communication for 6G MEC - ScienceDirect
- Epidemic Dynamics Edge Caching for 6G (2024) - Frontiers
- Deep RL for Adaptive Caching - arXiv
- Fast Adaptive Caching via Meta-RL (2025) - Springer
- Privacy-Preserving Edge Caching Survey (2024) - ACM
Related Articles
AI-RAN: The AI-Native Foundation for 6G Networks
In-depth tour of AI-Radio Access Networks (AI-RAN)—the foundational architecture transforming 5G and enabling 6G. From traditional RAN to AI-native systems, understand the RAN Intelligent Controller (RIC), real-time optimization, and production deployment patterns.
6G Network Architecture: AI at Every Layer - A Complete Technical Vision for IMT-2030
Detailed look at 6G (IMT-2030) network architecture—from AI-native air interfaces and semantic communication to integrated sensing, digital twins, and self-evolving protocols. The complete technical roadmap for next-generation wireless beyond 2030.
Deep Learning for Channel Estimation in Massive MIMO Systems
In-depth technical deep dive into deep learning approaches for channel estimation in massive MIMO—from traditional methods to state-of-the-art CNN-LSTM-Transformer hybrid architectures. Complete with equations, implementations, and performance analysis showing 90%+ NMSE reduction.
Building Intelligent RAN: O-RAN and RIC Architecture Deep Dive
A practical deep dive into Open RAN and RAN Intelligent Controller architecture—from E2 interface specifications to xApp/rApp development, deployment patterns, and real-world production implementations powering modern 5G networks.
AI for Channel Coding: Neural Decoders and End-to-End Learned Codes
In-depth exploration of AI-powered channel coding—from neural belief propagation decoders for LDPC and Polar codes to end-to-end learned codes with Turbo Autoencoders. Deep theoretical foundations, architectural innovations, performance analysis, and the path toward 6G learned physical layers.