Multi-Armed Bandits: Theory, Algorithms, and Applications

Comprehensive guide to multi-armed bandits from theory to practice. From regret bounds and UCB algorithms to Thompson Sampling, contextual bandits, and real-world applications in A/B testing, recommendations, and beyond.

22 min read

The Exploration-Exploitation Dilemma

Imagine you're at a casino with multiple slot machines (one-armed bandits). Each machine has an unknown payout probability. You want to maximize your total winnings over many plays. Should you keep playing the machine that's paid well so far (exploit), or try other machines that might be even better (explore)?

This is the exploration-exploitation dilemma—one of the most fundamental problems in decision-making under uncertainty. It appears everywhere:

  • A/B testing: Keep showing the winning variant, or test new ones?
  • Recommendations: Show items users will like, or discover new preferences?
  • Clinical trials: Give patients the best-known treatment, or test alternatives?
  • Ad selection: Display high-CTR ads, or explore new creatives?
  • Hyperparameter tuning: Exploit promising configurations, or explore new regions?

Multi-armed bandits (MABs) provide a principled framework for balancing exploration and exploitation.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE EXPLORATION-EXPLOITATION TRADEOFF                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PURE EXPLOITATION:                                                      │
│  ──────────────────                                                      │
│  Always play the arm with highest observed reward                       │
│                                                                          │
│  Problem: May get stuck on suboptimal arm                               │
│                                                                          │
│  Example: Arm A paid $1 on first pull, Arm B paid $0                    │
│  → Always pull A, never discover B actually has 90% win rate            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PURE EXPLORATION:                                                       │
│  ─────────────────                                                       │
│  Pull each arm equally regardless of rewards                            │
│                                                                          │
│  Problem: Wastes pulls on clearly inferior arms                         │
│                                                                          │
│  Example: After 1000 pulls, Arm A averages $0.90, Arm B averages $0.10 │
│  → Still pulling B 50% of the time, losing expected value               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTIMAL STRATEGY:                                                       │
│  ────────────────                                                        │
│  Explore enough to identify the best arm with high confidence           │
│  Then exploit that arm while occasionally checking others               │
│                                                                          │
│  The balance depends on:                                                │
│  • Time horizon (more time → more exploration affordable)               │
│  • Uncertainty (more uncertain → more exploration needed)               │
│  • Arm differences (similar arms → more exploration to distinguish)     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part I: Foundations and Problem Formulation

The Stochastic Multi-Armed Bandit Problem

Setting: You have K arms (actions). At each time step t = 1, 2, ..., T:

  1. You select an arm A_t ∈ {1, 2, ..., K}
  2. You receive a reward R_t drawn from the selected arm's distribution

Arm reward distributions: Each arm a has an unknown reward distribution ν_a with mean μ_a:

R_t \mid A_t = a \sim \nu_a, \quad \mathbb{E}[R_t \mid A_t = a] = \mu_a

Optimal arm: The arm with the highest expected reward:

a^* = \arg\max_{a \in \{1,...,K\}} \mu_a, \quad \mu^* = \mu_{a^*} = \max_a \mu_a

Goal: Maximize cumulative reward over T rounds:

\max \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]

Regret: The Performance Metric

Instead of maximizing reward directly, we typically minimize regret—the difference between what we earned and what we could have earned by always playing the optimal arm.

Cumulative regret:

\text{Regret}(T) = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \mathbb{E}\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right]

Interpretation: If the best arm has mean reward μ* = 0.8 and we played it every round for T = 1000 rounds, we'd expect 800 total reward. If our algorithm achieved expected reward 750, our regret is 50.

Gap-based regret decomposition:

Define the gap (or suboptimality) of arm a:

\Delta_a = \mu^* - \mu_a

Then regret decomposes as:

\text{Regret}(T) = \sum_{a=1}^{K} \Delta_a \cdot \mathbb{E}[N_a(T)]

where N_a(T) = \sum_{t=1}^{T} \mathbb{I}[A_t = a] is the number of times arm a was pulled.

Insight: To minimize regret, we need to minimize pulls of suboptimal arms, especially those with large gaps.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REGRET DECOMPOSITION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Example: 3 arms, T = 1000 rounds                                       │
│                                                                          │
│  Arm 1: μ₁ = 0.8 (optimal), Δ₁ = 0                                     │
│  Arm 2: μ₂ = 0.6, Δ₂ = 0.2                                             │
│  Arm 3: μ₃ = 0.3, Δ₃ = 0.5                                             │
│                                                                          │
│  Suppose our algorithm pulled:                                          │
│  N₁(T) = 900, N₂(T) = 80, N₃(T) = 20                                   │
│                                                                          │
│  Regret = Δ₁·N₁ + Δ₂·N₂ + Δ₃·N₃                                        │
│         = 0·900 + 0.2·80 + 0.5·20                                       │
│         = 0 + 16 + 10                                                   │
│         = 26                                                            │
│                                                                          │
│  REGRET SOURCES:                                                         │
│  ───────────────                                                         │
│  • Arm 2 contributed 16 regret (moderate gap, pulled often)             │
│  • Arm 3 contributed 10 regret (large gap, pulled rarely)               │
│  • Arm 1 contributed 0 regret (optimal arm)                             │
│                                                                          │
│  Good algorithms minimize E[Nₐ(T)] for suboptimal arms                  │
│  while pulling them enough to confirm they're suboptimal                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
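The arithmetic in the box above is easy to script; a minimal sketch of the gap decomposition, using the arm means and pull counts from the example:

```python
def regret(gaps, pulls):
    """Cumulative regret via the gap decomposition: sum over arms of gap * pulls."""
    return sum(g * n for g, n in zip(gaps, pulls))

# Arms from the example: means 0.8 (optimal), 0.6, 0.3 -> gaps 0, 0.2, 0.5
gaps = [0.0, 0.2, 0.5]
pulls = [900, 80, 20]          # N_a(T) with T = 1000
print(regret(gaps, pulls))     # 0*900 + 0.2*80 + 0.5*20 = 26.0
```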

Bayesian vs. Frequentist Regret

There are two fundamentally different ways to define and analyze regret:

Frequentist (Worst-Case) Regret:

The regret we've defined so far is frequentist—it measures performance for a fixed (but unknown) problem instance:

\text{Regret}_{\text{freq}}(T) = \mathbb{E}\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right]

The expectation is over the reward draws and the algorithm's internal randomness; the problem instance stays fixed. Lower bounds (like Lai-Robbins) hold for every problem instance.

Bayesian Regret:

In the Bayesian view, the problem instance itself is drawn from a prior P(θ):

\text{BayesRegret}(T) = \mathbb{E}_{\theta \sim P}\left[\mathbb{E}\left[\sum_{t=1}^{T} (\mu^*(\theta) - \mu_{A_t}(\theta))\right]\right]

The outer expectation is over problem instances. This measures average performance across problems.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FREQUENTIST vs BAYESIAN REGRET                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FREQUENTIST:                                                            │
│  ────────────                                                            │
│  "For ANY problem instance, my algorithm has low regret"                │
│                                                                          │
│  • Stronger guarantee (worst-case)                                      │
│  • UCB analysis is frequentist                                          │
│  • Lai-Robbins is a frequentist lower bound                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BAYESIAN:                                                               │
│  ─────────                                                               │
│  "On AVERAGE over problems from my prior, I have low regret"            │
│                                                                          │
│  • Weaker guarantee (average-case)                                      │
│  • Thompson Sampling analysis is often Bayesian                         │
│  • Can be much easier to prove                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RELATIONSHIP:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Low Bayesian regret does NOT imply low frequentist regret!             │
│                                                                          │
│  Example: An algorithm could do well on "typical" problems              │
│  but fail badly on adversarial problem instances.                       │
│                                                                          │
│  Thompson Sampling has both:                                             │
│  • O(√(KT ln K)) Bayesian regret (easy to prove)                       │
│  • O(K ln T / Δ) frequentist regret (harder to prove)                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why it matters:

  • Thompson Sampling: Bayesian regret bounds are comparatively easy to establish; the frequentist bounds of Agrawal & Goyal (2012) are more complex.
  • Prior mismatch: If your prior doesn't match reality, Bayesian guarantees may not hold.
  • Practical impact: For most applications, both perspectives give similar practical guidance.

Regret Lower Bounds

How well can any algorithm possibly do? The Lai-Robbins lower bound (1985) establishes a fundamental limit.

Theorem (Lai-Robbins): For any consistent policy and any bandit problem with arm distributions from a single-parameter exponential family:

\liminf_{T \to \infty} \frac{\text{Regret}(T)}{\ln T} \geq \sum_{a: \mu_a < \mu^*} \frac{\Delta_a}{\text{KL}(\nu_a, \nu^*)}

where \text{KL}(\nu_a, \nu^*) is the Kullback-Leibler divergence between arm a's distribution and the optimal arm's distribution.

For Bernoulli arms with means μ_a and μ*:

\text{KL}(\mu_a, \mu^*) = \mu_a \ln\frac{\mu_a}{\mu^*} + (1-\mu_a) \ln\frac{1-\mu_a}{1-\mu^*}

Simplified form (for small gaps, using \text{KL}(\mu, \mu + \epsilon) \approx \frac{\epsilon^2}{2\mu(1-\mu)}):

\text{Regret}(T) \geq \Omega\left(\sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)

Key insight: Regret must grow at least logarithmically with T. We cannot do better than O(\ln T) regret. Algorithms achieving this bound are called optimal.
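The lower-bound constant is computable for concrete instances. A small sketch for Bernoulli arms (the three-arm instance is an illustrative assumption, not part of the theorem):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mus):
    """Coefficient of ln T in the Lai-Robbins lower bound for Bernoulli arms:
    sum over suboptimal arms of gap / KL(mu_a, mu*)."""
    mu_star = max(mus)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in mus if mu < mu_star)

# Example instance: arms with means 0.8, 0.6, 0.3
print(lai_robbins_constant([0.8, 0.6, 0.3]))
```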


Problem Variants

Bounded rewards: R_t ∈ [0, 1] (or [a, b] more generally)

Bernoulli bandits: R_t ∈ {0, 1}, with P(R_t = 1 | A_t = a) = μ_a

Gaussian bandits: R_t \mid A_t = a \sim \mathcal{N}(\mu_a, \sigma^2)

Sub-Gaussian rewards: R_t − μ_{A_t} is σ-sub-Gaussian: \mathbb{E}[\exp(\lambda(R_t - \mu_{A_t}))] \leq \exp\left(\frac{\lambda^2 \sigma^2}{2}\right) \quad \forall \lambda

Sub-Gaussianity is the key assumption enabling concentration inequalities.
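As a quick sanity check of what that concentration buys, here is a Monte Carlo comparison of the empirical overshoot probability against the Hoeffding-style bound, assuming Bernoulli(0.5) rewards (the sample size and threshold are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)
mu, N, eps, trials = 0.5, 100, 0.1, 20000

# Fraction of trials where the empirical mean overshoots mu by at least eps
overshoot = sum(
    sum(random.random() < mu for _ in range(N)) / N - mu >= eps
    for _ in range(trials)
) / trials

hoeffding = math.exp(-2 * N * eps**2)   # exp(-2) ~ 0.135
print(overshoot, hoeffding)
assert overshoot <= hoeffding           # empirical tail sits below the bound
```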


Part II: Classical Algorithms

ε-Greedy

The simplest exploration strategy: exploit most of the time, explore randomly with probability ε.

Algorithm:

At each time t:

  1. With probability ε: pull a uniformly random arm
  2. With probability 1 − ε: pull the greedy arm arg max_a μ̂_a(t)

where \hat{\mu}_a(t) = \frac{1}{N_a(t)} \sum_{s < t: A_s = a} R_s is the empirical mean of arm a.

Regret analysis:

For fixed ε:

\mathbb{E}[N_a(T)] \geq \frac{\epsilon T}{K} \quad \text{for all arms } a

This means we pull every arm linearly in T, giving:

\text{Regret}(T) = \Omega(\epsilon T)

Linear regret! The algorithm never stops exploring suboptimal arms.

Decaying ε-greedy:

To achieve sublinear regret, decay ε over time:

\epsilon_t = \min\left(1, \frac{cK}{d^2 t}\right)

where c > 0 and d = min_{a: Δ_a > 0} Δ_a is the minimum gap.

This achieves O(K ln T / d) regret, but requires knowing the minimum gap d in advance—usually impractical.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ε-GREEDY ANALYSIS                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIXED ε = 0.1:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Explore probability: 10% of the time                                   │
│  Each suboptimal arm pulled: ≥ 0.1 × T / K times                        │
│                                                                          │
│  For T = 10,000, K = 5 arms:                                            │
│  Each arm pulled ≥ 200 times                                            │
│  If Arm 5 has Δ₅ = 0.5:                                                 │
│  Regret from Arm 5 alone ≥ 0.5 × 200 = 100                             │
│                                                                          │
│  Total regret grows linearly: Regret(T) = Θ(εT)                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DECAYING ε_t = c/(t):                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Early: ε large → lots of exploration                                   │
│  Late: ε small → mostly exploitation                                    │
│                                                                          │
│  Regret = O(K ln T) — much better!                                      │
│                                                                          │
│  But requires tuning c based on unknown problem parameters              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
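A minimal ε-greedy simulation on Bernoulli arms, illustrating fixed versus decaying schedules. The decaying schedule here uses a hypothetical constant c = 5 rather than the gap-tuned schedule from the text, since the gap is unknown in practice:

```python
import random

def eps_greedy(means, T, eps=0.1, decay=False, seed=0):
    """Run epsilon-greedy on Bernoulli arms; returns cumulative regret.

    decay=True uses eps_t = min(1, c*K / t) with a hypothetical c = 5,
    a practical stand-in for the gap-tuned schedule in the text."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    best, regret = max(means), 0.0
    for t in range(1, T + 1):
        e = min(1.0, 5 * K / t) if decay else eps
        if rng.random() < e or 0 in counts:
            a = rng.randrange(K)                                  # explore
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i])  # exploit
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1; sums[a] += r
        regret += best - means[a]
    return regret

print(eps_greedy([0.8, 0.6, 0.3], T=10_000))              # fixed eps: linear regret
print(eps_greedy([0.8, 0.6, 0.3], T=10_000, decay=True))  # decaying: much lower
```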

Upper Confidence Bound (UCB)

UCB algorithms embody the principle of optimism in the face of uncertainty: when uncertain about an arm's value, assume the best plausible case.

Key idea: Construct confidence intervals for each arm's mean. Pull the arm with the highest upper confidence bound.

UCB1 Algorithm

Upper confidence bound for arm a at time t:

\text{UCB}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}}

Algorithm: At each time t, pull:

A_t = \arg\max_a \text{UCB}_a(t) = \arg\max_a \left(\hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}}\right)

Intuition:

  • μ̂_a(t): Exploitation term (favor arms with high observed rewards)
  • √(2 ln t / N_a(t)): Exploration bonus (favor under-explored arms)

As N_a(t) increases, the bonus shrinks, and we rely more on the empirical mean.

Why this specific form?

The exploration bonus comes from Hoeffding's inequality. For bounded rewards in [0, 1] and N i.i.d. samples:

P\left(\hat{\mu} - \mu \geq \epsilon\right) \leq \exp(-2N\epsilon^2)

Setting \exp(-2N\epsilon^2) = t^{-4} (a small probability) and solving for ε:

\epsilon = \sqrt{\frac{2 \ln t}{N}}

So with high probability, the true mean is below \hat{\mu} + \sqrt{\frac{2 \ln t}{N}}.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    UCB1 CONFIDENCE INTERVALS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Time t = 100                                                           │
│                                                                          │
│  Arm 1: N₁ = 40, μ̂₁ = 0.70                                             │
│  UCB₁ = 0.70 + √(2 ln 100 / 40) = 0.70 + 0.48 = 1.18                   │
│                                                                          │
│  Arm 2: N₂ = 50, μ̂₂ = 0.65                                             │
│  UCB₂ = 0.65 + √(2 ln 100 / 50) = 0.65 + 0.43 = 1.08                   │
│                                                                          │
│  Arm 3: N₃ = 10, μ̂₃ = 0.40                                             │
│  UCB₃ = 0.40 + √(2 ln 100 / 10) = 0.40 + 0.96 = 1.36  ← HIGHEST!       │
│                                                                          │
│  VISUALIZATION:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Arm 1: |═══════════════════[────]        UCB = 1.18                   │
│                              ↑                                           │
│                           μ̂ = 0.70                                      │
│                                                                          │
│  Arm 2: |════════════════[────]           UCB = 1.08                   │
│                           ↑                                              │
│                        μ̂ = 0.65                                         │
│                                                                          │
│  Arm 3: |═════════[────────────────]      UCB = 1.36  ← SELECTED       │
│                   ↑                                                      │
│                μ̂ = 0.40                                                  │
│                                                                          │
│  Arm 3 selected despite lowest empirical mean because                   │
│  wide confidence interval (few samples) creates high UCB                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
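The selection rule in the box above translates directly to code; a compact UCB1 sketch on simulated Bernoulli arms (the arm means are illustrative):

```python
import math
import random

def ucb1(means, T, seed=0):
    """UCB1 on Bernoulli arms: play each arm once, then maximize
    mu_hat_a + sqrt(2 ln t / N_a). Returns (cumulative regret, pull counts)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    best, regret = max(means), 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                  # initial pass: each arm once
        else:
            a = max(range(K), key=lambda i:
                    sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1; sums[a] += r
        regret += best - means[a]
    return regret, counts

reg, pulls = ucb1([0.8, 0.6, 0.3], T=10_000)
print(reg, pulls)   # suboptimal arms get only O(ln T / gap^2) pulls
```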

UCB1 Regret Bound

Theorem: For UCB1 with K arms and rewards in [0, 1]:

\text{Regret}(T) \leq 8 \sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a=1}^{K} \Delta_a

Asymptotic form:

\text{Regret}(T) = O\left(\frac{K \ln T}{\Delta_{\min}}\right)

where \Delta_{\min} = \min_{a: \Delta_a > 0} \Delta_a.

Gap-dependent vs. gap-independent bounds:

The bound above is gap-dependent (it depends on the gaps Δ_a). A gap-independent bound:

\text{Regret}(T) = O\left(\sqrt{KT \ln T}\right)

This is useful when gaps are small or unknown.

Proof sketch (why UCB1 achieves logarithmic regret):

For a suboptimal arm a to be pulled at time t, its UCB must exceed the optimal arm's UCB:

\hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}} \geq \hat{\mu}^*(t) + \sqrt{\frac{2 \ln t}{N^*(t)}}

With high probability, \hat{\mu}_a(t) \leq \mu_a + \sqrt{\frac{2 \ln t}{N_a(t)}} and \hat{\mu}^*(t) \geq \mu^* - \sqrt{\frac{2 \ln t}{N^*(t)}}.

For the inequality to hold when the optimal arm is well-estimated:

\mu_a + 2\sqrt{\frac{2 \ln t}{N_a(t)}} \geq \mu^*

\sqrt{\frac{2 \ln t}{N_a(t)}} \geq \frac{\Delta_a}{2}

N_a(t) \leq \frac{8 \ln t}{\Delta_a^2}

So arm a is pulled at most O\left(\frac{\ln T}{\Delta_a^2}\right) times, giving regret contribution O\left(\frac{\ln T}{\Delta_a}\right).


UCB Variants

UCB2: Uses epochs of increasing length to reduce computation.

UCB-V (Variance-aware UCB):

If rewards have variance σ_a², tighter bounds are possible:

\text{UCB-V}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \hat{\sigma}_a^2(t) \ln t}{N_a(t)}} + \frac{3 \ln t}{N_a(t)}

where σ̂_a²(t) is the empirical variance.

KL-UCB: Uses the KL divergence for tighter bounds on Bernoulli rewards:

\text{KL-UCB}_a(t) = \max\left\{q : N_a(t) \cdot \text{KL}(\hat{\mu}_a(t), q) \leq \ln t + c \ln \ln t\right\}

KL-UCB is asymptotically optimal (it matches the Lai-Robbins lower bound).

MOSS (Minimax Optimal Strategy in the Stochastic case):

\text{MOSS}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{\max\left(0, \ln\frac{T}{K \cdot N_a(t)}\right)}{N_a(t)}}

Achieves the minimax-optimal O(\sqrt{KT}) regret.
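The KL-UCB index has no closed form, but since KL(μ̂, q) is increasing in q above μ̂, bisection finds it quickly. A sketch for Bernoulli rewards (the exploration constant c is set to 0 here, a common practical simplification):

```python
import math

def kl(p, q):
    """Bernoulli KL divergence, clamped away from 0 and 1 for stability."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, c=0):
    """Largest q >= mu_hat with n * KL(mu_hat, q) <= ln t + c ln ln t,
    found by bisection (KL(mu_hat, q) is increasing in q above mu_hat)."""
    target = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mu_hat, 1.0
    for _ in range(50):                      # bisection to high precision
        mid = (lo + hi) / 2
        if n * kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# With few samples the index sits far above the empirical mean...
print(kl_ucb_index(0.4, n=10, t=100))
# ...and tightens as evidence accumulates
print(kl_ucb_index(0.4, n=1000, t=100))
```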


Thompson Sampling

Thompson Sampling (TS) is a Bayesian approach: maintain a posterior distribution over each arm's mean, sample from posteriors, and play the arm with highest sample.

Algorithm:

  1. Prior: Start with a prior P(μ_a) for each arm (e.g., Beta(1, 1) for Bernoulli)
  2. At each time t:
    • Sample μ̃_a ~ P(μ_a | history) for each arm
    • Pull A_t = arg max_a μ̃_a
  3. Update: Update the pulled arm's posterior with the observed reward

For Bernoulli bandits with Beta prior:

Prior: μ_a ~ Beta(α_a, β_a), typically α_a = β_a = 1 (uniform)

After observing S_a successes and F_a failures on arm a:

\mu_a \mid \text{data} \sim \text{Beta}(\alpha_a + S_a, \beta_a + F_a)

Algorithm for Bernoulli Thompson Sampling:

  1. Initialize: α_a = β_a = 1 for all arms
  2. For t = 1, 2, ..., T:
    • Sample μ̃_a ~ Beta(α_a, β_a) for each arm
    • Pull A_t = arg max_a μ̃_a
    • Observe reward R_t ∈ {0, 1}
    • Update: α_{A_t} ← α_{A_t} + R_t, β_{A_t} ← β_{A_t} + (1 − R_t)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THOMPSON SAMPLING INTUITION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Arm 1: 8 successes, 2 failures → Beta(9, 3)                           │
│  Arm 2: 3 successes, 7 failures → Beta(4, 8)                           │
│  Arm 3: 1 success, 1 failure → Beta(2, 2)                              │
│                                                                          │
│  POSTERIOR DISTRIBUTIONS:                                                │
│  ─────────────────────────                                               │
│                                                                          │
│  P(μ)                                                                   │
│    │      Arm 2                     Arm 1                               │
│    │       ╱╲                         ╱╲                                │
│    │      ╱  ╲        Arm 3         ╱  ╲                               │
│    │     ╱    ╲        ╱╲          ╱    ╲                              │
│    │    ╱      ╲      ╱  ╲        ╱      ╲                             │
│    │   ╱        ╲    ╱    ╲      ╱        ╲                            │
│    │  ╱          ╲  ╱      ╲    ╱          ╲                           │
│    │─────────────────────────────────────────── μ                       │
│    0   0.2  0.3  0.4  0.5  0.6  0.7  0.8  1.0                          │
│                                                                          │
│  SAMPLING PROCESS:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Round 1: Sample μ̃₁ = 0.72, μ̃₂ = 0.28, μ̃₃ = 0.55 → Pull Arm 1        │
│  Round 2: Sample μ̃₁ = 0.68, μ̃₂ = 0.35, μ̃₃ = 0.71 → Pull Arm 3        │
│  Round 3: Sample μ̃₁ = 0.81, μ̃₂ = 0.22, μ̃₃ = 0.48 → Pull Arm 1        │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ───────────                                                             │
│  • Arm 1: Narrow posterior (many samples) → samples cluster near 0.75  │
│  • Arm 3: Wide posterior (few samples) → samples vary widely           │
│  • Wide posteriors mean occasional high samples → exploration!          │
│                                                                          │
│  Thompson Sampling naturally balances exploration (wide posteriors)     │
│  and exploitation (high posterior means)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
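The Beta-Bernoulli update above is only a few lines of code; a minimal Thompson Sampling sketch on simulated arms (the arm means are illustrative):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's Beta
    posterior, pull the argmax, update the conjugate posterior."""
    rng = random.Random(seed)
    K = len(means)
    alpha, beta = [1.0] * K, [1.0] * K        # uniform Beta(1, 1) priors
    best, regret = max(means), 0.0
    for _ in range(T):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = max(range(K), key=samples.__getitem__)
        r = 1 if rng.random() < means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        regret += best - means[a]
    return regret

print(thompson_bernoulli([0.8, 0.6, 0.3], T=10_000))
```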

For Gaussian bandits with known variance σ²:

Prior: \mu_a \sim \mathcal{N}(\mu_0, \sigma_0^2)

Posterior after N_a observations with sample mean X̄_a:

\mu_a \mid \text{data} \sim \mathcal{N}\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{N_a \bar{X}_a}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{N_a}{\sigma^2}}, \left(\frac{1}{\sigma_0^2} + \frac{N_a}{\sigma^2}\right)^{-1}\right)

With an uninformative prior (\sigma_0 \to \infty):

\mu_a \mid \text{data} \sim \mathcal{N}\left(\bar{X}_a, \frac{\sigma^2}{N_a}\right)
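The conjugate update can be sketched directly from the formulas above (the numbers in the usage example are illustrative):

```python
import math
import random

def gaussian_posterior(x_bar, n, sigma2, mu0=0.0, sigma0_2=1e6):
    """Posterior mean and variance for a Gaussian arm with known variance
    sigma2 and conjugate normal prior N(mu0, sigma0_2)."""
    precision = 1 / sigma0_2 + n / sigma2
    mean = (mu0 / sigma0_2 + n * x_bar / sigma2) / precision
    return mean, 1 / precision

# Near-uninformative prior: posterior approaches N(x_bar, sigma^2 / n)
m, v = gaussian_posterior(x_bar=0.7, n=100, sigma2=1.0)
print(m, v)   # approx 0.7 and 0.01

# A posterior draw, as Thompson Sampling would sample it
print(random.gauss(m, math.sqrt(v)))
```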

Thompson Sampling Regret Bound

Theorem (Agrawal & Goyal, 2012): For Thompson Sampling with Bernoulli rewards:

\mathbb{E}[\text{Regret}(T)] = O\left(\sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)

Thompson Sampling achieves the same asymptotic regret as UCB and matches the Lai-Robbins lower bound.

Advantages of Thompson Sampling:

  • Simple and intuitive
  • Naturally handles prior information
  • Often outperforms UCB in practice (better constants)
  • Extends easily to complex settings (contextual, combinatorial)

Disadvantages:

  • Requires specifying a prior
  • Computationally expensive for complex posteriors
  • Harder to analyze theoretically

Comparison: UCB vs. Thompson Sampling

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    UCB vs THOMPSON SAMPLING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  │ Aspect              │ UCB                  │ Thompson Sampling │     │
│  ├─────────────────────┼──────────────────────┼───────────────────┤     │
│  │ Philosophy          │ Frequentist          │ Bayesian          │     │
│  │ Exploration         │ Deterministic        │ Randomized        │     │
│  │ Computation         │ Simple (max UCB)     │ Sampling required │     │
│  │ Prior knowledge     │ Hard to incorporate  │ Natural via prior │     │
│  │ Regret bound        │ O(K ln T / Δ)        │ O(K ln T / Δ)     │     │
│  │ Practical perf.     │ Good                 │ Often better      │     │
│  │ Analysis difficulty │ Easier               │ Harder            │     │
│  │ Delayed feedback    │ Straightforward      │ Straightforward   │     │
│  │ Batched decisions   │ Needs modification   │ Natural           │     │
│                                                                          │
│  EMPIRICAL COMPARISON (typical):                                         │
│  ───────────────────────────────                                         │
│                                                                          │
│  Regret                                                                  │
│    │                                                                     │
│    │    ε-greedy (linear)                                               │
│    │   ╱                                                                 │
│    │  ╱                                                                  │
│    │ ╱      UCB                                                         │
│    │╱      ╱                                                             │
│    │      ╱    Thompson Sampling                                        │
│    │     ╱    ╱                                                          │
│    │    ╱   ╱                                                            │
│    │   ╱  ╱                                                              │
│    │──╱─╱─────────────────────────────────── Time                       │
│                                                                          │
│  Thompson Sampling often has better constants (lower regret)            │
│  despite same asymptotic rate as UCB                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Information-Directed Sampling (IDS)

While UCB and Thompson Sampling are the most common approaches, Information-Directed Sampling offers a principled alternative that directly optimizes the exploration-exploitation tradeoff.

Key idea: At each round, choose the action that achieves the best tradeoff between expected regret and information gained about the optimal action.

Information ratio:

\Psi_t = \frac{(\mathbb{E}[\Delta_{A_t} | H_t])^2}{I_t(A^*; (A_t, R_t) | H_t)}

where:

  • \mathbb{E}[\Delta_{A_t} | H_t] = expected instantaneous regret
  • I_t(A^*; (A_t, R_t) | H_t) = mutual information between the optimal arm and the observation
  • H_t = history up to time t

IDS Algorithm:

At each time t, select a distribution \pi_t over arms minimizing:

\min_{\pi} \frac{(\sum_a \pi(a) \mathbb{E}[\Delta_a | H_t])^2}{\sum_a \pi(a) I_t(A^*; R_a | H_t)}

Then sample A_t \sim \pi_t.

Intuition:

  • Numerator: Squared expected regret (want low)
  • Denominator: Information gain (want high)
  • Minimizing the ratio balances both objectives
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    IDS vs UCB vs THOMPSON SAMPLING                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  UCB:                                                                    │
│  ────                                                                    │
│  "Be optimistic—assume uncertain arms are good"                         │
│  Exploration is implicit via confidence bounds                          │
│                                                                          │
│  THOMPSON SAMPLING:                                                      │
│  ─────────────────                                                       │
│  "Sample from beliefs, act on sample"                                   │
│  Exploration via posterior uncertainty                                  │
│                                                                          │
│  IDS:                                                                    │
│  ────                                                                    │
│  "Explicitly optimize regret vs. information tradeoff"                  │
│  Directly measures value of information                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN IDS HELPS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  • Many similar arms: IDS identifies which comparisons are informative  │
│  • Structured problems: IDS exploits information structure              │
│  • Sparse rewards: IDS values rare informative observations             │
│                                                                          │
│  Example: 100 arms, 99 have mean 0.5, one has mean 0.6                  │
│  - UCB/TS: Explore all arms roughly equally initially                   │
│  - IDS: Quickly identifies that most arms are similar,                  │
│         focuses on finding the unique good arm                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Regret bound:

For K-armed bandits with bounded information ratio \Psi^* = \sup_t \mathbb{E}[\Psi_t]:

\text{Regret}(T) \leq \sqrt{\Psi^* \cdot H(A^*) \cdot T}

where H(A^*) is the entropy of the optimal arm under the prior.

Key insight: \Psi^* \leq K/2 always, but for structured problems \Psi^* can be much smaller, leading to better regret than UCB/TS.

Practical considerations:

  • Computing mutual information can be expensive
  • Approximations needed for complex posteriors
  • Most beneficial for structured or many-armed problems
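For small Bernoulli problems, the information ratio can be approximated from posterior samples. Below is a minimal Python sketch of the variance-based IDS approximation, which replaces the mutual-information term with the variance of the conditional means (a standard lower bound); the Beta-posterior setup, function names, and constants are illustrative assumptions, not from the text.

```python
import numpy as np

def ids_action(alpha, beta, n_samples=2000, rng=None):
    """One variance-based IDS decision for Bernoulli arms with Beta posteriors.

    alpha, beta: per-arm Beta posterior parameters (length-K arrays).
    Picks the arm minimizing the deterministic ratio Delta_a^2 / v_a, where
    v_a = Var over A* of E[theta_a | A*] lower-bounds the information gain.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(alpha)
    theta = rng.beta(alpha, beta, size=(n_samples, K))   # posterior draws
    best = theta.argmax(axis=1)                          # optimal arm per draw

    p_star = np.bincount(best, minlength=K) / n_samples  # P(A* = a)
    mu = theta.mean(axis=0)                              # E[theta_a]
    rho = theta.max(axis=1).mean()                       # E[max_a theta_a]
    delta = rho - mu                                     # expected regret per arm

    # v_a = sum_{a'} P(A* = a') * (E[theta_a | A* = a'] - E[theta_a])^2
    v = np.zeros(K)
    for a_star in range(K):
        mask = best == a_star
        if mask.any():
            v += p_star[a_star] * (theta[mask].mean(axis=0) - mu) ** 2

    ratio = delta ** 2 / np.maximum(v, 1e-12)            # information ratio per arm
    return int(ratio.argmin())

# Quick demo on three hypothetical Bernoulli arms
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
alpha = np.ones(3); beta = np.ones(3)
counts = np.zeros(3, dtype=int)
for t in range(500):
    a = ids_action(alpha, beta, rng=rng)
    r = rng.random() < true_means[a]
    alpha[a] += r; beta[a] += 1 - r
    counts[a] += 1
```

Note that an arm the posterior is nearly sure about contributes almost no information, so its ratio explodes and IDS stops sampling it, which is exactly the behavior described above.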

Part III: Contextual Bandits

In many applications, we observe context (features) before making a decision. The optimal action depends on this context.

Problem Formulation

Setting: At each time t:

  1. Observe context x_t \in \mathcal{X} (e.g., user features)
  2. Select arm A_t \in \{1, ..., K\}
  3. Receive reward R_t where \mathbb{E}[R_t | x_t, A_t = a] = f_a(x_t)

Goal: Learn a policy \pi: \mathcal{X} \to \{1, ..., K\} that maximizes:

\mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \mathbb{E}\left[\sum_{t=1}^{T} f_{A_t}(x_t)\right]

Regret (compared to the best policy in hindsight):

\text{Regret}(T) = \mathbb{E}\left[\sum_{t=1}^{T} \max_a f_a(x_t) - f_{A_t}(x_t)\right]

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONTEXTUAL BANDIT SETTING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: News Article Recommendation                                    │
│  ────────────────────────────────────                                    │
│                                                                          │
│  Context x_t = [user_age, user_gender, time_of_day, device, ...]       │
│                                                                          │
│  Arms = {Sports, Politics, Tech, Entertainment, ...}                    │
│                                                                          │
│  Reward = 1 if user clicks, 0 otherwise                                 │
│                                                                          │
│  Goal: Learn which article category to show each user                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY CONTEXT MATTERS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  User A: young, male, evening, mobile                                   │
│  → Sports might be best                                                 │
│                                                                          │
│  User B: older, female, morning, desktop                                │
│  → Politics might be best                                               │
│                                                                          │
│  Without context: single "best" arm for everyone (suboptimal)           │
│  With context: personalized arm selection (better)                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMPARED TO STANDARD MAB:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Standard MAB:    A_t → R_t                                             │
│  Contextual MAB:  (x_t, A_t) → R_t                                      │
│                                                                          │
│  Standard MAB learns: μ_a (scalar per arm)                              │
│  Contextual MAB learns: f_a(x) (function per arm)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Linear Contextual Bandits (LinUCB)

Assume the expected reward is linear in context:

\mathbb{E}[R_t | x_t, A_t = a] = x_t^{\top} \theta_a^*

where \theta_a^* \in \mathbb{R}^d is an unknown parameter vector for arm a.

LinUCB Algorithm:

For each arm a, maintain:

  • A_a = I_d + \sum_{s: A_s = a} x_s x_s^{\top} (design matrix)
  • b_a = \sum_{s: A_s = a} R_s x_s (reward-weighted contexts)

Ridge regression estimate:

\hat{\theta}_a = A_a^{-1} b_a

Confidence bound:

For \theta_a^*, with probability \geq 1 - \delta:

\left\|\hat{\theta}_a - \theta_a^*\right\|_{A_a} \leq \beta_t(\delta)

where \|v\|_M = \sqrt{v^{\top} M v} and:

\beta_t(\delta) = \sqrt{\lambda} \|\theta_a^*\| + \sigma\sqrt{2 \ln(1/\delta) + d \ln(1 + t/(d\lambda))}

UCB for arm a given context x_t:

\text{UCB}_a(x_t) = x_t^{\top} \hat{\theta}_a + \alpha \sqrt{x_t^{\top} A_a^{-1} x_t}

Algorithm:

  1. Pull A_t = \arg\max_a \text{UCB}_a(x_t)
  2. Observe R_t
  3. Update A_{A_t} \leftarrow A_{A_t} + x_t x_t^{\top} and b_{A_t} \leftarrow b_{A_t} + R_t x_t

Regret bound:

\text{Regret}(T) = \tilde{O}\left(d\sqrt{KT}\right)

where \tilde{O} hides logarithmic factors.
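The bookkeeping above translates almost line-for-line into code. Here is a minimal sketch of disjoint LinUCB (one ridge model per arm, fixed α); the simulation at the bottom, with its two made-up parameter vectors, is purely illustrative.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        # A_a = I_d + sum x x^T, b_a = sum R x, as in the update rules above
        self.A = np.stack([np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta_hat = A_inv @ b_a              # ridge estimate
            bonus = np.sqrt(x @ A_inv @ x)       # uncertainty in direction x
            scores.append(x @ theta_hat + self.alpha * bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Demo (hypothetical): two arms whose rewards are linear in a 3-d context
rng = np.random.default_rng(0)
thetas = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bandit = LinUCB(n_arms=2, dim=3, alpha=1.0)
for _ in range(500):
    x = rng.normal(size=3)
    a = bandit.select(x)
    reward = x @ thetas[a] + 0.1 * rng.normal()
    bandit.update(a, x, reward)
```

For large d, maintaining A_a^{-1} incrementally (e.g., via the Sherman-Morrison update) avoids the per-step matrix inversion.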

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LinUCB GEOMETRIC INTUITION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Parameter space for arm a (2D example):                                │
│                                                                          │
│       θ₂                                                                 │
│        │                                                                 │
│        │     ╭─────────────╮                                            │
│        │    ╱               ╲    Confidence ellipsoid                   │
│        │   ╱    ●───────────────→ True θ*                              │
│        │  │     ↑            │                                          │
│        │  │   θ̂ (estimate)   │                                          │
│        │   ╲                ╱                                            │
│        │    ╲              ╱                                             │
│        │     ╰────────────╯                                             │
│        │                                                                 │
│        └────────────────────────── θ₁                                   │
│                                                                          │
│  Ellipsoid shape: A_a^{-1} (inverse of design matrix)                   │
│  Directions with many observations → narrow (confident)                 │
│  Directions with few observations → wide (uncertain)                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  UCB for context x:                                                      │
│                                                                          │
│  UCB_a(x) = x^T θ̂_a + α √(x^T A_a^{-1} x)                               │
│             ─────────   ─────────────────                                │
│             Exploitation    Exploration                                  │
│             (predicted      (uncertainty                                 │
│              reward)         in direction x)                            │
│                                                                          │
│  If x aligns with uncertain directions → large exploration bonus        │
│  If x aligns with well-explored directions → small exploration bonus    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Linear Thompson Sampling

For linear bandits:

Prior: \theta_a \sim \mathcal{N}(0, \lambda^{-1} I_d)

Posterior (Gaussian conjugate update):

\theta_a | \text{data} \sim \mathcal{N}\left(A_a^{-1} b_a, \sigma^2 A_a^{-1}\right)

Algorithm:

  1. Sample \tilde{\theta}_a \sim \mathcal{N}(\hat{\theta}_a, v^2 A_a^{-1}) for each arm
  2. Pull A_t = \arg\max_a x_t^{\top} \tilde{\theta}_a
  3. Update A_{A_t} and b_{A_t}

The parameter v controls exploration (often set to match the confidence width).
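A single decision step can be sketched as follows, reusing the same A_a, b_a statistics as LinUCB (the function name and the fixed v are illustrative choices):

```python
import numpy as np

def lin_ts_select(x, A_list, b_list, v=0.5, rng=None):
    """One Linear Thompson Sampling decision: sample theta_a from each arm's
    Gaussian posterior N(A_a^{-1} b_a, v^2 A_a^{-1}), then act greedily."""
    rng = np.random.default_rng() if rng is None else rng
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        theta_sample = rng.multivariate_normal(theta_hat, v**2 * A_inv)
        scores.append(x @ theta_sample)
    return int(np.argmax(scores))
```

After observing the reward, the update is identical to LinUCB's step 3, so the only difference between the two algorithms is how the exploration bonus enters: additively (UCB) versus through posterior noise (TS).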


Generalized Linear Bandits

When rewards are non-linear but belong to an exponential family:

\mathbb{E}[R_t | x_t, A_t = a] = g(x_t^{\top} \theta_a^*)

where g is a known link function (e.g., the sigmoid for logistic rewards).

Examples:

  • Logistic bandits: g(\eta) = \frac{1}{1 + e^{-\eta}} (click-through prediction)
  • Poisson bandits: g(\eta) = e^{\eta} (count data)

GLM-UCB and GLM-Thompson Sampling extend the linear approaches using maximum likelihood estimation and appropriate confidence bounds.


Neural Contextual Bandits

For complex contexts (images, text), linear models are insufficient. Neural networks can learn rich representations.

NeuralUCB / NeuralTS:

  1. Train a neural network f_\theta(x, a) to predict rewards
  2. Use neural tangent kernel (NTK) or ensemble methods for uncertainty
  3. Apply UCB or Thompson Sampling on top

Uncertainty estimation methods:

  1. Dropout at test time: Sample multiple predictions with dropout
  2. Ensemble: Train multiple networks, use variance
  3. Neural Tangent Kernel: Treat last layer as linear, apply LinUCB
  4. Bayesian neural networks: Maintain weight distributions

Regret bounds: Under certain assumptions (NTK regime), neural bandits achieve \tilde{O}(\sqrt{T}) regret.


Part IV: Advanced Topics

Adversarial Bandits

In adversarial bandits, rewards are not stochastic but chosen by an adversary who may know your algorithm.

Setting: At each time t:

  1. Adversary secretly assigns rewards r_{t,1}, ..., r_{t,K} to all arms
  2. You select arm A_t
  3. You receive and observe r_{t, A_t} (under "bandit feedback" you see only your own reward, not the other arms')

Regret (vs. the best fixed arm in hindsight):

\text{Regret}(T) = \max_{a \in \{1,...,K\}} \sum_{t=1}^{T} r_{t,a} - \mathbb{E}\left[\sum_{t=1}^{T} r_{t, A_t}\right]

EXP3 (Exponential-weight algorithm for Exploration and Exploitation):

Maintain weights w_{t,a} for each arm. Let W_t = \sum_a w_{t,a}.

Probability of pulling arm a:

p_{t,a} = (1 - \gamma) \frac{w_{t,a}}{W_t} + \frac{\gamma}{K}

where \gamma \in (0, 1] is the exploration parameter.

Weight update after observing reward r_t for arm A_t:

\hat{r}_{t,a} = \frac{r_t \mathbb{I}[A_t = a]}{p_{t,a}}

w_{t+1,a} = w_{t,a} \cdot \exp\left(\frac{\gamma \hat{r}_{t,a}}{K}\right)

Intuition: \hat{r}_{t,a} is an unbiased estimator of r_{t,a} (importance weighting). High-reward arms get higher weights.

EXP3 Regret Bound:

\mathbb{E}[\text{Regret}(T)] \leq 2\sqrt{KT \ln K}

This is minimax optimal (matches lower bound up to constants).
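The update rules translate directly into code. A sketch (the per-step rescaling of the weights is a numerical safeguard against overflow, not part of the textbook algorithm; `reward_fn` is a hypothetical stand-in for the adversary):

```python
import numpy as np

def exp3(reward_fn, K, T, gamma=0.1, rng=None):
    """EXP3 as described above; reward_fn(t, a) must return r_{t,a} in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    total = 0.0
    for t in range(T):
        # Mix the weight distribution with uniform exploration
        p = (1 - gamma) * w / w.sum() + gamma / K
        a = rng.choice(K, p=p)
        r = reward_fn(t, a)
        total += r
        pulls[a] += 1
        r_hat = r / p[a]                       # importance-weighted estimate
        w[a] *= np.exp(gamma * r_hat / K)      # exponential weight update
        w /= w.max()                           # rescale; leaves p unchanged
    return pulls, total
```

Because p_{t,a} \geq \gamma/K, the importance weights are bounded by K/\gamma, which keeps the estimator's variance (and the exponent above) under control.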

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STOCHASTIC vs ADVERSARIAL BANDITS                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STOCHASTIC BANDITS:                                                     │
│  ───────────────────                                                     │
│  Rewards: R_t ~ ν_a (fixed distribution per arm)                        │
│  Adversary: None (nature is i.i.d.)                                     │
│  Best algorithms: UCB, Thompson Sampling                                │
│  Regret: O(K ln T / Δ) or O(√(KT ln K))                                │
│                                                                          │
│  Exploits: Concentration (samples converge to mean)                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVERSARIAL BANDITS:                                                    │
│  ────────────────────                                                    │
│  Rewards: r_{t,a} chosen by adversary (possibly adaptive)               │
│  Adversary: Knows your algorithm, can be malicious                      │
│  Best algorithms: EXP3, EXP3.P                                          │
│  Regret: O(√(KT ln K))                                                  │
│                                                                          │
│  Exploits: Randomization (adversary can't predict your choice)          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY EXP3 IS RANDOMIZED:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  If you're deterministic, adversary sets:                               │
│  - Reward 0 for your chosen arm                                         │
│  - Reward 1 for all other arms                                          │
│  → Linear regret!                                                       │
│                                                                          │
│  Randomization prevents adversary from targeting you                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Best Arm Identification

Instead of minimizing regret (cumulative performance), identify the best arm with high confidence.

Fixed-confidence setting: Given confidence \delta, find \hat{a} such that:

P(\hat{a} \neq a^*) \leq \delta

using as few samples as possible.

Fixed-budget setting: Given budget T, find \hat{a} minimizing:

P(\hat{a} \neq a^*)

Successive Elimination:

  1. Start with all arms active: \mathcal{A} = \{1, ..., K\}
  2. Pull each active arm once per round
  3. Eliminate arms whose UCB is below the LCB of another arm
  4. Stop when one arm remains

Sample complexity (fixed-confidence):

\tau^* = O\left(\sum_{a=1}^{K} \frac{1}{\Delta_a^2} \ln\frac{K}{\delta}\right)

Lower bound (for any \delta-correct algorithm):

\mathbb{E}[\tau] \geq \sum_{a: \Delta_a > 0} \frac{2}{\Delta_a^2} \ln\frac{1}{2.4\delta}
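The elimination loop above fits in a few lines. A sketch with a Hoeffding-style confidence radius (the exact constant inside the logarithm is one common union-bound choice among several; the Bernoulli simulator is illustrative):

```python
import math
import numpy as np

def successive_elimination(means, delta=0.05, rng=None):
    """Fixed-confidence best-arm identification via successive elimination.
    `means` are the hidden Bernoulli parameters; returns (best_arm, rounds)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    active = list(range(K))
    sums = np.zeros(K)
    n = 0
    while len(active) > 1:
        n += 1
        for a in active:                     # one pull of every active arm
            sums[a] += rng.random() < means[a]
        mu_hat = sums / n
        # Hoeffding radius with a union bound over arms and rounds
        rad = math.sqrt(math.log(4 * K * n**2 / delta) / (2 * n))
        best = max(mu_hat[a] for a in active)
        # Keep an arm only while its UCB still reaches the leader's LCB
        active = [a for a in active if mu_hat[a] + rad >= best - rad]
    return active[0], n
```

Arms with large gaps \Delta_a are eliminated after roughly 1/\Delta_a^2 rounds, which is where the sample-complexity sum above comes from.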


Combinatorial Bandits

Select a set of arms (or a combinatorial structure) rather than a single arm.

Examples:

  • Shortest path: Arms are paths, reward is path cost
  • Matching: Arms are matchings, reward is matching value
  • Assortment: Select subset of products, reward is revenue

Semi-bandit feedback: Observe rewards of all selected arms (not just total).

CUCB (Combinatorial UCB):

For a super arm S \subseteq \{1, ..., K\}:

\text{UCB}(S) = \sum_{a \in S} \text{UCB}_a(t)

Select S_t = \arg\max_S \text{UCB}(S) subject to feasibility constraints.

Regret: \tilde{O}\left(\sqrt{mKT}\right) where m = size of the selected set.
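For the simplest feasible family, "any subset of size m", the argmax over super arms reduces to taking the m largest per-arm UCBs. A sketch under that assumption (the Bernoulli rewards and the exploration constant are illustrative):

```python
import math
import numpy as np

def cucb_top_m(means, m, T, rng=None):
    """CUCB with semi-bandit feedback when any m arms form a feasible super arm."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    n = np.zeros(K)          # pull counts
    s = np.zeros(K)          # reward sums
    for t in range(1, T + 1):
        # Per-arm UCBs; unexplored arms get +inf so they are tried first
        ucb = np.where(
            n > 0,
            s / np.maximum(n, 1) + np.sqrt(1.5 * math.log(t) / np.maximum(n, 1)),
            np.inf,
        )
        S = np.argsort(ucb)[-m:]             # super arm: the m largest UCBs
        r = rng.random(K) < means            # Bernoulli rewards for all arms
        for a in S:                          # semi-bandit: observe selected arms
            n[a] += 1
            s[a] += r[a]
    return n
```

For richer constraint families (paths, matchings), the same per-arm statistics feed a combinatorial oracle that maximizes the sum of UCBs over feasible sets.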


Batched and Delayed Feedback

Batched bandits: Make B decisions before observing any feedback.

  • Requires more exploration (can't adapt within a batch)
  • Regret: O(\sqrt{KT \cdot B}) with B batches

Delayed feedback: The reward for the action at time t is observed at time t + d_t.

  • Must account for "in-flight" decisions
  • Regret increases with delay: O(\sqrt{KT} + d_{\max} K)

Non-Stationary Bandits

Arm distributions change over time.

Abruptly changing: Distributions are piecewise stationary with S change points.

Discounted UCB: Weight recent observations more heavily:

\hat{\mu}_a^{(\gamma)}(t) = \frac{\sum_{s < t} \gamma^{t-s} R_s \mathbb{I}[A_s = a]}{\sum_{s < t} \gamma^{t-s} \mathbb{I}[A_s = a]}

Sliding window UCB: Only use observations from the last \tau rounds.

Regret: \tilde{O}\left(S^{1/3} K^{1/3} T^{2/3}\right) is achievable.
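Both estimators are short computations over the reward history. A sketch (function names are mine; a full algorithm would plug these means into the usual UCB index):

```python
import numpy as np

def discounted_mean(rewards, arms, arm, gamma=0.99):
    """Discounted empirical mean from the formula above: recent pulls of
    `arm` dominate, old ones decay geometrically."""
    rewards = np.asarray(rewards, dtype=float)
    arms = np.asarray(arms)
    t = len(rewards)
    w = gamma ** (t - np.arange(t))          # gamma^{t-s} for s = 0..t-1
    mask = arms == arm
    return (w * rewards * mask).sum() / (w * mask).sum()

def sliding_window_mean(rewards, arms, arm, tau=100):
    """Sliding-window mean: only the last tau rounds count."""
    rewards = np.asarray(rewards, dtype=float)[-tau:]
    arms = np.asarray(arms)[-tau:]
    mask = arms == arm
    return rewards[mask].mean()
```

If an arm's mean jumps halfway through the history, the plain average stays near the midpoint while both of these estimators track the new value, which is exactly why they handle change points.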


Bandit Variants

Beyond standard stochastic bandits, many important variants capture real-world constraints and feedback structures.

Restless Bandits

In restless bandits, arm states evolve even when not pulled—a critical feature for many applications.

Setting: Each arm a has a state S_a(t) that evolves according to a Markov chain:

  • If arm a is pulled: S_a(t+1) \sim P_1(\cdot | S_a(t))
  • If arm a is not pulled: S_a(t+1) \sim P_0(\cdot | S_a(t))

Reward depends on the state: R_t = r(S_{A_t}(t))

Examples:

  • Ad fatigue: Ad effectiveness decreases when shown (pulled) but recovers when not shown
  • Machine maintenance: Machines degrade when used, but also when idle
  • Communication channels: Channel quality fluctuates independently of whether you use it

Whittle index policy: Compute an "index" for each arm-state pair:

W_a(s) = \inf\{\lambda : \text{passive is optimal in state } s \text{ under subsidy } \lambda\}

Pull arms with highest Whittle indices. Asymptotically optimal when arms are "indexable."

Challenge: Computing Whittle indices requires solving per-arm MDPs. Full problem is PSPACE-hard.


Dueling Bandits

In dueling bandits, feedback is pairwise comparisons rather than absolute rewards.

Setting: At each round:

  1. Select two arms A_t, B_t
  2. Observe the comparison outcome: A_t \succ B_t or B_t \succ A_t

Preference model (Bradley-Terry-Luce):

P(a \succ b) = \frac{\mu_a}{\mu_a + \mu_b}

or more generally, a preference matrix P_{ab} = P(a \succ b).

Goal: Find the Condorcet winner—the arm that beats all others:

a^* : P_{a^* b} > 0.5 \quad \forall b \neq a^*

Regret: Measured in comparisons not involving the Condorcet winner.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DUELING BANDITS                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STANDARD BANDIT:          DUELING BANDIT:                              │
│  ────────────────          ───────────────                              │
│                                                                          │
│  Pull arm A                Compare A vs B                               │
│  Observe R_A = 0.7         Observe: A wins (or B wins)                  │
│                                                                          │
│  Information: Absolute     Information: Relative                        │
│  score of A                preference only                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • Ranking from user preferences ("Which result is better?")            │
│  • A/B testing with preference feedback                                 │
│  • Tournament design                                                    │
│  • Information retrieval evaluation                                     │
│  • RLHF for LLMs (comparing model outputs)                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Algorithms:

  • Interleaved Filter (IF): Eliminate arms beaten by current champion
  • Beat the Mean (BTM): Compare against average competitor
  • RUCB: UCB-style approach for dueling
  • Double Thompson Sampling: Sample preferences, duel top two

Regret bound: O(K \ln T / \Delta^2), where \Delta is the minimum preference gap.
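A heavily simplified sketch of the Double Thompson Sampling idea: sample a preference matrix from per-pair Beta posteriors, take the sampled Copeland winner as the first arm, then resample to pick its strongest challenger. The tie-breaking and challenger rule here are simplifications of the published algorithm, and the BTL simulator is illustrative.

```python
import numpy as np

def dts_duel(wins, rng):
    """Pick a pair (first, second) to duel. `wins[i, j]` counts i's wins over j."""
    # Sample P[i, j] ~ Beta(wins[i, j] + 1, wins[j, i] + 1)
    P = rng.beta(wins + 1, wins.T + 1)
    np.fill_diagonal(P, 0.5)
    copeland = (P > 0.5).sum(axis=1)          # how many arms each arm beats
    first = int(copeland.argmax())
    P2 = rng.beta(wins + 1, wins.T + 1)       # second, independent sample
    P2[first, :] = -1.0                       # challenger must differ from first
    second = int(P2[:, first].argmax())       # most likely to beat `first`
    return first, second

# Demo (hypothetical): BTL preferences with utilities mu = (1, 2, 4)
mu = np.array([1.0, 2.0, 4.0])
P_true = mu[:, None] / (mu[:, None] + mu[None, :])
rng = np.random.default_rng(0)
wins = np.zeros((3, 3))
firsts = np.zeros(3, dtype=int)
for _ in range(1500):
    i, j = dts_duel(wins, rng)
    if rng.random() < P_true[i, j]:
        wins[i, j] += 1
    else:
        wins[j, i] += 1
    firsts[i] += 1
```

As the win counts accumulate, the sampled Copeland winner concentrates on the Condorcet winner, so most duels end up involving it, matching the regret definition above.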


Sleeping Bandits

In sleeping bandits, not all arms are available at every round.

Setting: At each round t:

  1. Observe the available arm set \mathcal{A}_t \subseteq \{1, ..., K\}
  2. Select A_t \in \mathcal{A}_t
  3. Receive reward R_t

Examples:

  • Product recommendations: Not all products in stock
  • Ad serving: Some ads have budget constraints
  • Resource allocation: Some resources temporarily unavailable

Regret (against best policy):

\text{Regret}(T) = \sum_{t=1}^{T} \left(\max_{a \in \mathcal{A}_t} \mu_a - \mu_{A_t}\right)

Algorithms: Adapt UCB/TS to only consider available arms. Regret bounds depend on availability patterns.
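Adapting UCB is essentially a masking step; a minimal sketch of one decision (function name and the exploration constant are mine):

```python
import math
import numpy as np

def sleeping_ucb_choice(mu_hat, n_pulls, t, available):
    """UCB restricted to the currently available arms (sleeping bandit)."""
    ucb = mu_hat + np.sqrt(2 * math.log(t) / np.maximum(n_pulls, 1))
    ucb[n_pulls == 0] = np.inf                # try unexplored arms first
    masked = np.full_like(ucb, -np.inf)       # unavailable arms can't win
    masked[available] = ucb[available]
    return int(masked.argmax())
```

The subtlety is in the regret definition, not the index: the benchmark is the best available arm each round, so an algorithm is not penalized when the globally best arm is asleep.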


Mortal Bandits

In mortal bandits, arms can permanently disappear.

Setting: Each arm a has a lifetime L_a (possibly random). After L_a pulls, arm a is no longer available.

Examples:

  • Job candidates: Candidates accept other offers
  • Limited inventory: Products sell out
  • Time-sensitive opportunities: Deals expire

Challenge: Must balance learning about an arm vs. exploiting it before it dies.

Key insight: With short lifetimes, pure exploitation can be optimal (no time to learn). With long lifetimes, standard bandit algorithms apply.


Constrained and Safe Bandits

Many real applications require satisfying constraints beyond maximizing reward.

Bandits with Knapsack Constraints

Setting: Each arm pull consumes resources. Stop when budget exhausted.

Formulation:

  • Pulling arm a gives reward R_t and consumes resources C_t \in \mathbb{R}^d
  • Budget constraint: \sum_{t=1}^{\tau} C_{t,i} \leq B_i for each resource i
  • The stopping time \tau is reached when any budget is exhausted

Goal: Maximize total reward subject to budget constraints.

LP relaxation:

\max_{\pi} \sum_a \pi_a \mu_a \quad \text{s.t.} \quad \sum_a \pi_a c_{a,i} \leq B_i / T, \quad \sum_a \pi_a = 1

Algorithms:

  • Primal-dual methods: Maintain dual variables for constraints
  • UCB-BwK: UCB with budget-aware exploration
  • Thompson Sampling with budget: Sample, but respect constraints

Regret: \tilde{O}(\sqrt{T}) relative to the best fixed policy satisfying the constraints.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BANDITS WITH KNAPSACKS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: Ad Campaign with Budget                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  Arms = {Premium placement, Standard placement, Text ad}                │
│                                                                          │
│  Rewards (CTR):    Costs:                                               │
│  Premium: 0.08     Premium: $2.00                                       │
│  Standard: 0.05    Standard: $0.50                                      │
│  Text: 0.02        Text: $0.10                                          │
│                                                                          │
│  Budget: $1000                                                          │
│                                                                          │
│  NAIVE APPROACH:                                                         │
│  ───────────────                                                         │
│  Always pick Premium (highest CTR)                                      │
│  → 500 impressions, 40 clicks                                           │
│                                                                          │
│  OPTIMAL MIX:                                                            │
│  ────────────                                                            │
│  Mix of Standard and Text                                               │
│  → 2500+ impressions, 100+ clicks                                       │
│                                                                          │
│  Bandit must LEARN this while respecting budget!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Safe Bandits

Safe bandits require that performance never drops below a safety threshold.

Setting: There exists a baseline arm a_0 with known performance \mu_0. Constraint:

\mu_{A_t} \geq \mu_0 - \epsilon \quad \text{with high probability}

Applications:

  • Medical treatment: New treatments must be at least as good as standard of care
  • Autonomous systems: Exploration must not cause accidents
  • Production systems: A/B tests shouldn't tank conversion rates

Safe UCB: Only pull arms whose lower confidence bound exceeds threshold:

\text{Pull } a \text{ only if } \hat{\mu}_a(t) - \sqrt{\frac{2 \ln t}{N_a(t)}} \geq \mu_0 - \epsilon

Challenge: Safe exploration is harder—must be confident an arm is safe before trying it.
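The eligibility rule above in code, as a sketch (a full algorithm would fall back to the baseline arm a_0 whenever no other arm qualifies; the function name is mine):

```python
import math
import numpy as np

def safe_ucb_eligible(mu_hat, n_pulls, t, mu0, eps):
    """Arms whose lower confidence bound clears the safety threshold mu0 - eps.

    mu_hat: empirical means; n_pulls: pull counts (assumed > 0 here).
    """
    lcb = mu_hat - np.sqrt(2 * math.log(t) / np.maximum(n_pulls, 1))
    return np.flatnonzero(lcb >= mu0 - eps)
```

Note the asymmetry this creates: an arm only becomes eligible after enough evidence, so the algorithm must "earn" the right to explore it, which is the hardness mentioned above.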


Conservative Bandits

A variant of safe bandits where cumulative performance must stay above baseline.

Constraint:

\sum_{s=1}^{t} R_s \geq (1 - \alpha) \cdot t \cdot \mu_0 \quad \forall t

Must never fall too far behind what baseline would have achieved.

Algorithm: Explore only when "safety budget" accumulated from exploitation allows it.


Fair Bandits

Fair bandits ensure equitable treatment across groups or arms.

Meritocratic fairness: Pull probability proportional to true quality:

P(A_t = a) \propto \mu_a

Group fairness: If arms belong to groups, ensure groups receive fair share of pulls:

\frac{N_{\text{group}_1}(T)}{N_{\text{group}_2}(T)} \approx \frac{|\text{group}_1|}{|\text{group}_2|}

Individual fairness: Similar arms should be pulled with similar frequency.

Tradeoff: Fairness constraints typically increase regret. The fair-bandit literature studies Pareto frontiers between fairness and efficiency.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAIRNESS IN BANDITS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: Job Candidate Screening                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  Arms = candidates, Reward = hiring success                             │
│                                                                          │
│  UNCONSTRAINED OPTIMAL:                                                  │
│  ──────────────────────                                                  │
│  Always interview highest-predicted candidates                          │
│  May systematically exclude certain groups                              │
│                                                                          │
│  FAIR BANDIT:                                                            │
│  ────────────                                                            │
│  Ensure demographic groups receive fair interview rates                 │
│  while still learning who are the best candidates                       │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • Loan approvals across demographics                                   │
│  • Content recommendation diversity                                     │
│  • Resource allocation across regions                                   │
│  • Ad serving to different user groups                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part V: Applications

A/B Testing and Adaptive Experiments

Traditional A/B testing:

  1. Split traffic 50/50 between A and B
  2. Run for fixed duration
  3. Analyze results, declare winner

Problems:

  • Waste: Shows inferior variant to 50% of users
  • Fixed duration: May stop too early or too late
  • No adaptation: Can't respond to early signals

Bandit-based testing:

Replace fixed allocation with bandit algorithm:

  • Thompson Sampling allocates more traffic to better variants
  • Early stopping when confident
  • Reduced regret (fewer users see bad variants)

Multi-armed bandit approach:

\text{Regret}_{\text{A/B}} = T/2 \cdot \Delta \quad \text{(fixed allocation)}

\text{Regret}_{\text{Bandit}} = O(\ln T / \Delta) \quad \text{(Thompson Sampling)}

Caution: Bandit testing complicates statistical analysis (non-uniform allocation, adaptive stopping rules). Hybrid approaches can maintain valid inference while still adapting allocation.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    A/B TESTING vs BANDIT TESTING                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL A/B TEST:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Traffic allocation over time:                                          │
│                                                                          │
│  100% │ ████████████████████████████████████████                        │
│   50% │ ████████████████████  Variant A (50%)                           │
│       │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  Variant B (50%)                           │
│    0% │──────────────────────────────────────── Time                    │
│       Start                                  End                         │
│                                                                          │
│  If B is worse: 50% of users get bad experience throughout              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BANDIT-BASED TEST (Thompson Sampling):                                  │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Traffic allocation over time:                                          │
│                                                                          │
│  100% │ ████████████████████████████████████████                        │
│       │ ███████████████████████████████████████  A dominates            │
│   50% │ ████████████████████                                            │
│       │ ▓▓▓▓▓▓▓▓▓▓                              B fades out             │
│    0% │──────────────────────────────────────── Time                    │
│       Start      ↑                           End                         │
│                  B found inferior                                        │
│                                                                          │
│  Adapts to evidence: worse variants get less traffic over time          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REGRET COMPARISON (Δ = 0.05 difference, T = 10,000 users):             │
│                                                                          │
│  A/B: Regret = 0.5 × 0.05 × 10,000 = 250                               │
│  Bandit: Regret ≈ 20-50 (logarithmic in T)                             │
│                                                                          │
│  5-10x improvement!                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
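The adaptive allocation shown above can be simulated with Beta-Bernoulli Thompson Sampling. This is a toy sketch (the rates, horizon, and seed are illustrative):

```python
import random

def thompson_ab_test(true_rates, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling over two variants (toy simulation):
    sample a conversion rate per variant from its posterior, show the larger."""
    rng = random.Random(seed)
    successes = [1, 1]  # Beta(1, 1) priors
    failures = [1, 1]
    pulls = [0, 0]
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(2)]
        arm = 0 if samples[0] >= samples[1] else 1
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_ab_test(true_rates=[0.05, 0.10], horizon=5000)
print(pulls)  # the better variant (index 1) receives most of the traffic
```

Running this, the inferior variant's share of traffic fades as its posterior concentrates below the winner's, which is exactly the behavior the diagram depicts.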

Recommendation Systems

Cold-start problem: New items/users have no interaction data.

Bandits provide principled exploration:

  • New items get exploration bonus (uncertainty)
  • System learns preferences while maintaining engagement

Explore-exploit in recommendations:

\text{Score}(u, i) = \text{Predicted\_Rating}(u, i) + \alpha \cdot \text{Uncertainty}(u, i)
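This scoring rule can be sketched in a few lines (the item names and numbers are made up for illustration):

```python
def recommend(predicted, uncertainty, alpha=1.0):
    """Optimistic scoring (sketch): rank items by predicted rating plus an
    exploration bonus; both dicts map item id -> float."""
    scores = {i: predicted[i] + alpha * uncertainty[i] for i in predicted}
    return max(scores, key=scores.get)

predicted = {"known_hit": 4.2, "new_item": 3.8}
uncertainty = {"known_hit": 0.1, "new_item": 0.9}
print(recommend(predicted, uncertainty, alpha=1.0))  # → new_item (4.7 vs 4.3)
print(recommend(predicted, uncertainty, alpha=0.0))  # → known_hit (pure exploitation)
```

The bonus lets a promising but uncertain new item win the slot; with α = 0 the system always exploits.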

Cascade bandits for ranked lists:

The user examines items from the top until satisfied. Model this as a bandit over rankings.

Conversational recommendations:

Each question is an arm. Bandit selects questions to efficiently narrow preferences.


Clinical Trials

Adaptive clinical trials: Allocate more patients to more promising treatments.

Ethical motivation: Minimize patients receiving inferior treatment.

Response-adaptive randomization:

P(\text{assign to treatment } a) \propto \text{Success\_Rate}_a + \text{Exploration\_Bonus}_a

Challenges:

  • Regulatory requirements for valid inference
  • Delayed outcomes (treatment effects take time)
  • Non-stationarity (patient population changes)

Gittins index (optimal for discounted infinite-horizon):

Compute an index \nu_a(s) for each arm's state s, and pull the arm with the highest index:

\nu_a(s) = \sup_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t R_t \mid S_0 = s\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t \mid S_0 = s\right]}


Online Advertising

Ad creative selection (covered in our advertising post):

  • Arms = ad creative variants
  • Context = user features
  • Reward = click (or conversion)

Bid optimization:

  • Arms = bid levels
  • Reward = profit = value - cost
  • Challenge: cost depends on auction outcome

Budget-constrained bandits:

Standard bandits assume unlimited pulls. With a budget B and a cost c_t per pull:

\sum_{t=1}^{T} c_t \leq B

Knapsack bandits: Select arms subject to budget, maximize value.


Hyperparameter Optimization

Successive Halving:

  1. Start with n random configurations
  2. Train each for some iterations
  3. Keep top half, double training iterations
  4. Repeat until one remains

Hyperband: Run Successive Halving with different initial budgets.
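The four steps above can be sketched directly (the toy objective below is illustrative; in practice `evaluate` would train a model for `budget` iterations):

```python
def successive_halving(configs, evaluate, budget=1):
    """Successive Halving (sketch): score every config on the current budget,
    keep the top half, double the budget, repeat until one config remains.
    `evaluate(config, budget)` returns a score; higher is better."""
    remaining = list(configs)
    while len(remaining) > 1:
        scores = {c: evaluate(c, budget) for c in remaining}
        remaining.sort(key=lambda c: scores[c], reverse=True)
        remaining = remaining[: max(1, len(remaining) // 2)]
        budget *= 2
    return remaining[0]

# Toy objective: measured quality approaches the config's true quality as budget grows.
best = successive_halving(
    configs=[0.2, 0.5, 0.7, 0.9],
    evaluate=lambda c, b: c * (1 - 1.0 / (b + 1)),
)
print(best)  # → 0.9
```

Halving the pool while doubling the budget keeps total compute roughly constant per round, which is what makes the bracket structure of Hyperband cheap to run.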

Bayesian Optimization (related to bandits):

Model the hyperparameter → performance mapping as a Gaussian Process, then pick the next configuration via an acquisition function such as Expected Improvement:

\text{EI}(x) = \mathbb{E}[\max(0, f(x) - f^*)]

or UCB:

\text{UCB}(x) = \mu(x) + \beta \sigma(x)


Resource Allocation

Dynamic pricing: Set prices to maximize revenue while learning demand.

  • Arms = price levels
  • Reward = revenue = price × demand(price)
  • Challenge: demand function unknown

Network routing: Route traffic to minimize congestion while learning link costs.

Crowdsourcing: Assign tasks to workers while learning worker quality.


Offline Policy Evaluation

Before deploying a new bandit policy, we often want to evaluate it using historical data collected by a different policy. This is offline policy evaluation (OPE).

Setting:

  • Historical data: \{(x_t, A_t, R_t)\}_{t=1}^{n} collected by a logging policy \pi_0
  • Target policy: \pi (the policy we want to evaluate)
  • Goal: Estimate V(\pi) = \mathbb{E}_{A \sim \pi(x)}[R(x, A)]

Challenge: We only observe rewards for the actions taken by \pi_0, not \pi.

Inverse Propensity Scoring (IPS)

Key idea: Reweight samples to correct for distribution mismatch.

IPS estimator:

\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} R_t

Intuition: If \pi would have taken action A_t twice as often as \pi_0, count that sample twice.

Properties:

  • Unbiased: \mathbb{E}[\hat{V}_{\text{IPS}}] = V(\pi)
  • High variance: When \pi and \pi_0 differ significantly, importance weights can be huge

Variance reduction: Clip importance weights or use self-normalized estimator:

\hat{V}_{\text{SNIPS}}(\pi) = \frac{\sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} R_t}{\sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)}}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INVERSE PROPENSITY SCORING                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Logging policy π₀: Shows Ad A 80%, Ad B 20%                            │
│  Target policy π:   Shows Ad A 30%, Ad B 70%                            │
│                                                                          │
│  Historical data (1000 samples):                                        │
│  - Ad A shown 800 times, avg reward 0.05                                │
│  - Ad B shown 200 times, avg reward 0.08                                │
│                                                                          │
│  NAIVE ESTIMATE of π:                                                    │
│  ────────────────────                                                    │
│  0.3 × 0.05 + 0.7 × 0.08 = 0.071                                        │
│  (Wrong! Uses biased sample means)                                      │
│                                                                          │
│  IPS ESTIMATE:                                                           │
│  ─────────────                                                           │
│  For Ad A samples: weight = 0.3/0.8 = 0.375                             │
│  For Ad B samples: weight = 0.7/0.2 = 3.5                               │
│                                                                          │
│  Weighted average accounts for sampling bias                            │
│                                                                          │
│  PROBLEM: Ad B weight is 3.5 → high variance!                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
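The worked example in the box can be reproduced in a few lines. This sketch is context-free for brevity (in practice the weights are π(a|x)/π₀(a|x)); all names are illustrative:

```python
def ips_estimate(logs, target_prob):
    """IPS and self-normalized IPS (sketch). `logs` holds
    (action, reward, logging_prob) tuples; `target_prob(a)` gives π(a)."""
    weighted = [(target_prob(a) / p0, r) for a, r, p0 in logs]
    ips = sum(w * r for w, r in weighted) / len(logs)
    snips = sum(w * r for w, r in weighted) / sum(w for w, _ in weighted)
    return ips, snips

# Reproduces the box: π₀ shows A 80% / B 20%, target π shows A 30% / B 70%.
logs = [("A", 0.05, 0.8)] * 800 + [("B", 0.08, 0.2)] * 200
ips, snips = ips_estimate(logs, {"A": 0.3, "B": 0.7}.get)
print(round(ips, 4), round(snips, 4))  # → 0.071 0.071
```

In this deterministic toy the weights sum exactly to n, so IPS and SNIPS coincide; with noisy per-sample rewards the self-normalized version trades a small bias for lower variance.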

Direct Method (DM)

Key idea: Build a reward model, use it to predict counterfactual outcomes.

Direct Method estimator:

\hat{V}_{\text{DM}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \sum_a \pi(a | x_t) \hat{r}(x_t, a)

where \hat{r}(x, a) is a learned reward model.

Properties:

  • Low variance: No importance weighting
  • Biased: If reward model is wrong, estimate is wrong

Tradeoff: DM has low variance but potential bias; IPS has no bias but high variance.


Doubly Robust (DR) Estimator

Key idea: Combine IPS and DM to get the best of both worlds.

DR estimator:

\hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \left[\sum_a \pi(a | x_t) \hat{r}(x_t, a) + \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} (R_t - \hat{r}(x_t, A_t))\right]

Decomposition:

  • First term: Direct method estimate
  • Second term: IPS correction for reward model errors

Properties:

  • Doubly robust: Unbiased if either \hat{r} is correct or \pi_0 is known
  • Lower variance than IPS: When \hat{r} is good, corrections are small
  • Asymptotically efficient: Achieves optimal variance under regularity conditions

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    OFFLINE POLICY EVALUATION COMPARISON                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  │ Method          │ Bias      │ Variance  │ Requirement            │  │
│  ├─────────────────┼───────────┼───────────┼────────────────────────┤  │
│  │ Direct Method   │ Possible  │ Low       │ Good reward model      │  │
│  │ IPS             │ None      │ High      │ Known π₀               │  │
│  │ Doubly Robust   │ Reduced   │ Medium    │ Either works           │  │
│                                                                          │
│  PRACTICAL GUIDANCE:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. If π and π₀ are similar → IPS works well                           │
│  2. If you have a good reward model → DM may suffice                   │
│  3. Generally → Use DR for robustness                                  │
│  4. Always → Log propensities π₀(a|x) for future evaluation!           │
│                                                                          │
│  CRITICAL: Design logging policies with evaluation in mind              │
│  - Ensure π₀(a|x) > 0 for all actions π might take                     │
│  - Some exploration in logging enables better evaluation                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
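A sketch of the DR estimator, combining the direct-method term with the IPS residual correction. This toy is context-free and all names are illustrative; with a perfect reward model the correction vanishes and DR reduces to the direct method:

```python
def dr_estimate(logs, target_probs, reward_model):
    """Doubly robust estimator (sketch). `logs`: (x, a, r, p0) tuples;
    `target_probs(x)`: dict of π(a|x); `reward_model(x, a)`: learned r̂(x, a)."""
    total = 0.0
    for x, a, r, p0 in logs:
        pi = target_probs(x)
        dm_term = sum(pi[b] * reward_model(x, b) for b in pi)  # direct-method term
        correction = pi[a] / p0 * (r - reward_model(x, a))     # IPS residual term
        total += dm_term + correction
    return total / len(logs)

# Context-free toy matching the IPS example: a perfect reward model
# makes every correction zero, so the estimate equals the DM value.
logs = [(None, "A", 0.05, 0.8)] * 800 + [(None, "B", 0.08, 0.2)] * 200
est = dr_estimate(logs,
                  lambda x: {"A": 0.3, "B": 0.7},
                  lambda x, a: {"A": 0.05, "B": 0.08}[a])
print(round(est, 4))  # → 0.071
```

When the reward model is imperfect, the correction term re-injects the logged rewards with importance weights, which is what restores unbiasedness.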

Offline Policy Learning

Beyond evaluation, we can learn policies from logged data.

Counterfactual Risk Minimization (CRM):

\hat{\pi} = \arg\min_{\pi} -\hat{V}_{\text{IPS}}(\pi) + \lambda \cdot \text{complexity}(\pi)

Challenges:

  • Optimizing over policies is harder than evaluating one
  • Propensity clipping needed for stability
  • Sample complexity depends on logging policy overlap

Practical approach: Use DR estimator with variance regularization:

\hat{\pi} = \arg\min_{\pi} -\hat{V}_{\text{DR}}(\pi) + \lambda \cdot \text{Var}(\hat{V}_{\text{DR}}(\pi))


Part VI: Theoretical Foundations

Concentration Inequalities

Hoeffding's inequality (bounded random variables):

For X_1, \ldots, X_n independent with X_i \in [a_i, b_i] and \bar{X} = \frac{1}{n}\sum_i X_i:

P\left(\bar{X} - \mathbb{E}[\bar{X}] \geq \epsilon\right) \leq \exp\left(-\frac{2n^2\epsilon^2}{\sum_i (b_i - a_i)^2}\right)

For X_i \in [0, 1]:

P\left(\bar{X} - \mu \geq \epsilon\right) \leq \exp(-2n\epsilon^2)
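A quick Monte Carlo sanity check of the [0, 1] case (the parameters n = 100, ε = 0.1, and the Bernoulli(0.5) distribution are illustrative choices):

```python
import math
import random

def upper_tail_rate(n, epsilon, trials=20000, seed=0):
    """Estimate P(sample mean - mu >= epsilon) for X_i i.i.d. Bernoulli(0.5)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if mean - 0.5 >= epsilon:
            hits += 1
    return hits / trials

rate = upper_tail_rate(n=100, epsilon=0.1)
bound = math.exp(-2 * 100 * 0.1 ** 2)  # Hoeffding bound exp(-2nε²) ≈ 0.135
print(rate <= bound)  # → True: the empirical tail respects the bound
```

The empirical tail probability comes out well below the bound, as expected: Hoeffding is distribution-free and therefore conservative for any specific distribution.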

Chernoff bound (Bernoulli random variables):

For X_1, \ldots, X_n i.i.d. Bernoulli(p):

P\left(\bar{X} \geq p + \epsilon\right) \leq \exp(-n \cdot \text{KL}(p + \epsilon, p))

Sub-Gaussian concentration:

If X is \sigma-sub-Gaussian (a class that includes bounded and Gaussian random variables):

P(X - \mathbb{E}[X] \geq t) \leq \exp\left(-\frac{t^2}{2\sigma^2}\right)

For the sample mean of n i.i.d. \sigma-sub-Gaussian variables:

P\left(\bar{X} - \mu \geq t\right) \leq \exp\left(-\frac{nt^2}{2\sigma^2}\right)


Martingale Methods

Azuma-Hoeffding inequality:

For a martingale (M_t) with bounded increments |M_t - M_{t-1}| \leq c_t:

P(M_T - M_0 \geq \epsilon) \leq \exp\left(-\frac{\epsilon^2}{2\sum_t c_t^2}\right)

Application to bandits: Cumulative reward is a martingale (conditioned on arm selections).


Information-Theoretic Lower Bounds

Change-of-measure argument:

For two bandit instances \nu and \nu' differing only in arm a, and any [0, 1]-valued statistic S of the observations:

\sum_{t=1}^{T} \mathbb{E}_\nu[\mathbb{I}[A_t = a]] \cdot \text{KL}(\nu_a, \nu'_a) \geq \text{kl}(\mathbb{E}_\nu[S], \mathbb{E}_{\nu'}[S])

where \text{kl}(p, q) = p \ln(p/q) + (1-p)\ln((1-p)/(1-q)) is the binary KL divergence.

Lai-Robbins bound derivation:

Choosing \nu' to make arm a optimal and applying the inequality above gives:

\mathbb{E}[N_a(T)] \geq \frac{\ln T}{\text{KL}(\nu_a, \nu^*)} (1 - o(1))

Multiplying by Δa\Delta_a and summing gives the regret lower bound.


Finite-Time Analysis

High-probability regret bounds:

With probability at least 1 - \delta:

\text{Regret}(T) \leq O\left(\sqrt{KT \ln(KT/\delta)}\right)

Anytime algorithms: Regret bounds hold for all T simultaneously (not just for a pre-specified T).

Instance-dependent vs. worst-case bounds:

  • Instance-dependent: O\left(\sum_a \frac{\ln T}{\Delta_a}\right) — better when gaps are large
  • Worst-case: O(\sqrt{KT}) — better when gaps are small

Algorithms like MOSS and UCB-V achieve both simultaneously.


Part VII: Connections to Other Fields

Reinforcement Learning

Bandits are one-step RL: a single state, one action chosen per round, immediate reward.

Relationship:

  │ Aspect      │ Bandits   │ Full RL                │
  ├─────────────┼───────────┼────────────────────────┤
  │ States      │ 1         │ Many                   │
  │ Actions     │ K arms    │ Action space per state │
  │ Horizon     │ T steps   │ T or infinite          │
  │ Transitions │ None      │ P(s' | s, a)           │
  │ Reward      │ R(a)      │ R(s, a)                │

Bandits in RL:

  • Exploration in Q-learning uses bandit ideas
  • UCB for tree search (UCT in Monte Carlo Tree Search)
  • Thompson Sampling for posterior-based exploration

Bayesian Optimization

Global optimization of expensive black-box functions:

x^* = \arg\max_{x \in \mathcal{X}} f(x)

where f is expensive to evaluate.

Connection to bandits:

  • Arms = points in \mathcal{X}
  • Infinite arms (continuous space)
  • A Gaussian Process models f
  • Acquisition functions ≈ UCB

Acquisition functions:

  • UCB: \alpha(x) = \mu(x) + \beta \sigma(x)
  • Expected Improvement: \alpha(x) = \mathbb{E}[\max(0, f(x) - f^*)]
  • Thompson Sampling: Sample \tilde{f} \sim GP, then optimize \tilde{f}

Online Learning

Prediction with expert advice:

  • N experts give predictions each round
  • Learner combines predictions
  • Regret against best expert

Hedge algorithm (full information: like EXP3, but the learner observes every expert's loss):

w_{t+1,i} = w_{t,i} \exp(-\eta \ell_{t,i})

Regret: O(\sqrt{T \ln N})
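A minimal Hedge sketch (η and the loss sequence below are illustrative):

```python
import math

def hedge(loss_matrix, eta=0.5):
    """Hedge / exponential weights (sketch): each round, play the weighted
    mixture of experts, then multiply each weight by exp(-eta * loss)."""
    n_experts = len(loss_matrix[0])
    weights = [1.0] * n_experts
    cum_loss = 0.0
    for losses in loss_matrix:
        total = sum(weights)
        probs = [w / total for w in weights]
        cum_loss += sum(p * l for p, l in zip(probs, losses))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return cum_loss, probs

# Toy run: expert 0 always incurs loss 0, expert 1 always loss 1.
cum_loss, probs = hedge([[0.0, 1.0]] * 20)
print(probs[0] > 0.99)  # → True: mass concentrates on the better expert
```

Because every expert's loss is observed, no explicit exploration term is needed; EXP3 adds importance weighting precisely to cope with seeing only the played arm's loss.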

Online convex optimization:

  • Actions = points in convex set
  • Loss = convex function (revealed after action)
  • Gradient descent achieves O(\sqrt{T}) regret

Bandits = online learning with bandit feedback (only see own loss).


Game Theory

Regret minimization ↔ Nash equilibrium:

In repeated games, if all players use no-regret algorithms, play converges to Nash equilibrium.

Fictitious play: Best response to empirical opponent distribution (similar to greedy).

Multi-agent bandits: Multiple agents compete for same arms—game-theoretic considerations arise.


Summary and Key Takeaways

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-ARMED BANDIT SUMMARY                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM:                                                                │
│  ────────                                                                │
│  Balance exploration (learning) with exploitation (earning)             │
│                                                                          │
│  FUNDAMENTAL LIMITS:                                                     │
│  ───────────────────                                                     │
│  • Lai-Robbins: Regret ≥ Ω(∑ ln T / KL(νₐ, ν*))                        │
│  • Minimax: Regret ≥ Ω(√(KT))                                          │
│  • Bayesian vs Frequentist regret: different guarantees                │
│                                                                          │
│  KEY ALGORITHMS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  │ Algorithm         │ Regret         │ Key Idea              │         │
│  ├───────────────────┼────────────────┼───────────────────────┤         │
│  │ ε-greedy (fixed)  │ O(εT)          │ Random exploration    │         │
│  │ ε-greedy (decay)  │ O(K ln T / Δ)  │ Decreasing ε          │         │
│  │ UCB1              │ O(K ln T / Δ)  │ Optimism under uncert │         │
│  │ KL-UCB            │ Optimal        │ KL-based confidence   │         │
│  │ Thompson Sampling │ O(K ln T / Δ)  │ Posterior sampling    │         │
│  │ IDS               │ O(√(Ψ·H·T))   │ Info-regret tradeoff  │         │
│  │ EXP3 (adversarial)│ O(√(KT ln K)) │ Exponential weights   │         │
│                                                                          │
│  CONTEXTUAL BANDITS:                                                     │
│  ───────────────────                                                     │
│  • LinUCB: Linear rewards, O(d√(KT)) regret                            │
│  • Neural bandits: Complex contexts, use NNs for representation         │
│                                                                          │
│  BANDIT VARIANTS:                                                        │
│  ────────────────                                                        │
│  • Restless: Arms evolve when not pulled                                │
│  • Dueling: Pairwise comparison feedback                                │
│  • Sleeping/Mortal: Arms not always available                           │
│  • Constrained: Knapsacks, safety, fairness                             │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • A/B testing (adaptive experiments)                                   │
│  • Recommendations (explore-exploit)                                    │
│  • Clinical trials (ethical allocation)                                 │
│  • Ad selection (CTR optimization)                                      │
│  • Hyperparameter tuning (Hyperband)                                    │
│                                                                          │
│  OFFLINE EVALUATION:                                                     │
│  ───────────────────                                                     │
│  • IPS: Unbiased but high variance                                      │
│  • Direct Method: Low variance but biased                               │
│  • Doubly Robust: Best of both worlds                                   │
│                                                                          │
│  PRACTICAL ADVICE:                                                       │
│  ────────────────                                                        │
│  1. Start with Thompson Sampling (simple, effective)                    │
│  2. Use LinUCB for linear contextual problems                           │
│  3. Consider IDS for structured/many-armed problems                     │
│  4. Use DR estimator for offline evaluation                             │
│  5. Log propensities for future policy evaluation                       │
│  6. Consider constraints: safety, fairness, budgets                     │
│  7. Monitor for non-stationarity                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘




Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
