Multi-Armed Bandits: Theory, Algorithms, and Applications

Comprehensive guide to multi-armed bandits from theory to practice. From regret bounds and UCB algorithms to Thompson Sampling, contextual bandits, and real-world applications in A/B testing, recommendations, and beyond.

22 min read

The Exploration-Exploitation Dilemma

Imagine you're at a casino with multiple slot machines (one-armed bandits). Each machine has an unknown payout probability. You want to maximize your total winnings over many plays. Should you keep playing the machine that's paid well so far (exploit), or try other machines that might be even better (explore)?

This is the exploration-exploitation dilemma—one of the most fundamental problems in decision-making under uncertainty. It appears everywhere:

  • A/B testing: Keep showing the winning variant, or test new ones?
  • Recommendations: Show items users will like, or discover new preferences?
  • Clinical trials: Give patients the best-known treatment, or test alternatives?
  • Ad selection: Display high-CTR ads, or explore new creatives?
  • Hyperparameter tuning: Exploit promising configurations, or explore new regions?

Multi-armed bandits (MABs) provide a principled framework for balancing exploration and exploitation.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE EXPLORATION-EXPLOITATION TRADEOFF                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PURE EXPLOITATION:                                                      │
│  ──────────────────                                                      │
│  Always play the arm with highest observed reward                       │
│                                                                          │
│  Problem: May get stuck on suboptimal arm                               │
│                                                                          │
│  Example: Arm A paid $1 on first pull, Arm B paid $0                    │
│  → Always pull A, never discover B actually has 90% win rate            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PURE EXPLORATION:                                                       │
│  ─────────────────                                                       │
│  Pull each arm equally regardless of rewards                            │
│                                                                          │
│  Problem: Wastes pulls on clearly inferior arms                         │
│                                                                          │
│  Example: After 1000 pulls, Arm A averages $0.90, Arm B averages $0.10 │
│  → Still pulling B 50% of the time, losing expected value               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTIMAL STRATEGY:                                                       │
│  ────────────────                                                        │
│  Explore enough to identify the best arm with high confidence           │
│  Then exploit that arm while occasionally checking others               │
│                                                                          │
│  The balance depends on:                                                │
│  • Time horizon (more time → more exploration affordable)               │
│  • Uncertainty (more uncertain → more exploration needed)               │
│  • Arm differences (similar arms → more exploration to distinguish)     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part I: Foundations and Problem Formulation

The Stochastic Multi-Armed Bandit Problem

Setting: You have K arms (actions). At each time step t = 1, 2, ..., T:

  1. You select an arm A_t ∈ {1, 2, ..., K}
  2. You receive a reward R_t drawn from the selected arm's distribution

Arm reward distributions: Each arm a has an unknown reward distribution ν_a with mean μ_a:

R_t \mid A_t = a \sim \nu_a, \quad \mathbb{E}[R_t \mid A_t = a] = \mu_a

Optimal arm: The arm with the highest expected reward:

a^* = \arg\max_{a \in \{1,...,K\}} \mu_a, \quad \mu^* = \mu_{a^*} = \max_a \mu_a

Goal: Maximize cumulative reward over T rounds:

\max \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]

Regret: The Performance Metric

Instead of maximizing reward directly, we typically minimize regret—the difference between what we earned and what we could have earned by always playing the optimal arm.

Cumulative regret:

\text{Regret}(T) = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \mathbb{E}\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right]

Interpretation: If the best arm has mean reward μ* = 0.8 and we played it every round for T = 1000 rounds, we'd expect 800 total reward. If our algorithm achieved expected reward 750, our regret is 50.

Gap-based regret decomposition:

Define the gap (or suboptimality) of arm a:

\Delta_a = \mu^* - \mu_a

Then regret decomposes as:

\text{Regret}(T) = \sum_{a=1}^{K} \Delta_a \cdot \mathbb{E}[N_a(T)]

where N_a(T) = \sum_{t=1}^{T} \mathbb{I}[A_t = a] is the number of times arm a was pulled.

Insight: To minimize regret, we need to minimize pulls of suboptimal arms, especially those with large gaps.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REGRET DECOMPOSITION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Example: 3 arms, T = 1000 rounds                                       │
│                                                                          │
│  Arm 1: μ₁ = 0.8 (optimal), Δ₁ = 0                                     │
│  Arm 2: μ₂ = 0.6, Δ₂ = 0.2                                             │
│  Arm 3: μ₃ = 0.3, Δ₃ = 0.5                                             │
│                                                                          │
│  Suppose our algorithm pulled:                                          │
│  N₁(T) = 900, N₂(T) = 80, N₃(T) = 20                                   │
│                                                                          │
│  Regret = Δ₁·N₁ + Δ₂·N₂ + Δ₃·N₃                                        │
│         = 0·900 + 0.2·80 + 0.5·20                                       │
│         = 0 + 16 + 10                                                   │
│         = 26                                                            │
│                                                                          │
│  REGRET SOURCES:                                                         │
│  ───────────────                                                         │
│  • Arm 2 contributed 16 regret (moderate gap, pulled often)             │
│  • Arm 3 contributed 10 regret (large gap, pulled rarely)               │
│  • Arm 1 contributed 0 regret (optimal arm)                             │
│                                                                          │
│  Good algorithms minimize E[Nₐ(T)] for suboptimal arms                  │
│  while pulling them enough to confirm they're suboptimal                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
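The arithmetic in the box above is easy to script; a minimal sketch of the gap decomposition, using the arm means and pull counts from the example:

```python
def regret(gaps, pulls):
    """Cumulative regret via the gap decomposition: sum over arms of gap * pulls."""
    return sum(g * n for g, n in zip(gaps, pulls))

# Arms from the example: means 0.8 (optimal), 0.6, 0.3 -> gaps 0, 0.2, 0.5
gaps = [0.0, 0.2, 0.5]
pulls = [900, 80, 20]          # N_a(T) with T = 1000
print(regret(gaps, pulls))     # 0*900 + 0.2*80 + 0.5*20 = 26.0
```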

Bayesian vs. Frequentist Regret

There are two fundamentally different ways to define and analyze regret:

Frequentist (Worst-Case) Regret:

The regret we've defined so far is frequentist—it measures performance for a fixed (but unknown) problem instance:

\text{Regret}_{\text{freq}}(T) = \mathbb{E}\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right]

The expectation is over the reward draws and the algorithm's internal randomness; the problem instance stays fixed. Lower bounds (like Lai-Robbins) hold for every problem instance.

Bayesian Regret:

In the Bayesian view, the problem instance itself is drawn from a prior P(θ):

\text{BayesRegret}(T) = \mathbb{E}_{\theta \sim P}\left[\mathbb{E}\left[\sum_{t=1}^{T} (\mu^*(\theta) - \mu_{A_t}(\theta))\right]\right]

The outer expectation is over problem instances. This measures average performance across problems.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FREQUENTIST vs BAYESIAN REGRET                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FREQUENTIST:                                                            │
│  ────────────                                                            │
│  "For ANY problem instance, my algorithm has low regret"                │
│                                                                          │
│  • Stronger guarantee (worst-case)                                      │
│  • UCB analysis is frequentist                                          │
│  • Lai-Robbins is a frequentist lower bound                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BAYESIAN:                                                               │
│  ─────────                                                               │
│  "On AVERAGE over problems from my prior, I have low regret"            │
│                                                                          │
│  • Weaker guarantee (average-case)                                      │
│  • Thompson Sampling analysis is often Bayesian                         │
│  • Can be much easier to prove                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RELATIONSHIP:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Low Bayesian regret does NOT imply low frequentist regret!             │
│                                                                          │
│  Example: An algorithm could do well on "typical" problems              │
│  but fail badly on adversarial problem instances.                       │
│                                                                          │
│  Thompson Sampling has both:                                             │
│  • O(√(KT ln K)) Bayesian regret (easy to prove)                       │
│  • O(K ln T / Δ) frequentist regret (harder to prove)                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why it matters:

  • Thompson Sampling: Bayesian regret bounds are comparatively easy to establish; the frequentist bounds of Agrawal & Goyal (2012) are more complex.
  • Prior mismatch: If your prior doesn't match reality, Bayesian guarantees may not hold.
  • Practical impact: For most applications, both perspectives give similar practical guidance.

Regret Lower Bounds

How well can any algorithm possibly do? The Lai-Robbins lower bound (1985) establishes a fundamental limit.

Theorem (Lai-Robbins): For any consistent policy and any bandit problem with arm distributions from a single-parameter exponential family:

\liminf_{T \to \infty} \frac{\text{Regret}(T)}{\ln T} \geq \sum_{a: \mu_a < \mu^*} \frac{\Delta_a}{\text{KL}(\nu_a, \nu^*)}

where \text{KL}(\nu_a, \nu^*) is the Kullback-Leibler divergence between arm a's distribution and the optimal arm's distribution.

For Bernoulli arms with means μ_a and μ*:

\text{KL}(\mu_a, \mu^*) = \mu_a \ln\frac{\mu_a}{\mu^*} + (1-\mu_a) \ln\frac{1-\mu_a}{1-\mu^*}

Simplified form (for small gaps, using \text{KL}(\mu, \mu + \epsilon) \approx \frac{\epsilon^2}{2\mu(1-\mu)}):

\text{Regret}(T) \geq \Omega\left(\sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)

Key insight: Regret must grow at least logarithmically with T. We cannot do better than O(\ln T) regret. Algorithms achieving this bound are called optimal.
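The lower-bound constant is computable for concrete instances. A small sketch for Bernoulli arms (the three-arm instance is an illustrative assumption, not part of the theorem):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mus):
    """Coefficient of ln T in the Lai-Robbins lower bound for Bernoulli arms:
    sum over suboptimal arms of gap / KL(mu_a, mu*)."""
    mu_star = max(mus)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in mus if mu < mu_star)

# Example instance: arms with means 0.8, 0.6, 0.3
print(lai_robbins_constant([0.8, 0.6, 0.3]))
```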


Problem Variants

Bounded rewards: R_t ∈ [0, 1] (or [a, b] more generally)

Bernoulli bandits: R_t ∈ {0, 1}, with P(R_t = 1 | A_t = a) = μ_a

Gaussian bandits: R_t \mid A_t = a \sim \mathcal{N}(\mu_a, \sigma^2)

Sub-Gaussian rewards: R_t − μ_{A_t} is σ-sub-Gaussian: \mathbb{E}[\exp(\lambda(R_t - \mu_{A_t}))] \leq \exp\left(\frac{\lambda^2 \sigma^2}{2}\right) \quad \forall \lambda

Sub-Gaussianity is the key assumption enabling concentration inequalities.
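As a quick sanity check of what that concentration buys, here is a Monte Carlo comparison of the empirical overshoot probability against the Hoeffding-style bound, assuming Bernoulli(0.5) rewards (the sample size and threshold are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)
mu, N, eps, trials = 0.5, 100, 0.1, 20000

# Fraction of trials where the empirical mean overshoots mu by at least eps
overshoot = sum(
    sum(random.random() < mu for _ in range(N)) / N - mu >= eps
    for _ in range(trials)
) / trials

hoeffding = math.exp(-2 * N * eps**2)   # exp(-2) ~ 0.135
print(overshoot, hoeffding)
assert overshoot <= hoeffding           # empirical tail sits below the bound
```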


Part II: Classical Algorithms

ε-Greedy

The simplest exploration strategy: exploit most of the time, explore randomly with probability ε.

Algorithm:

At each time t:

  1. With probability ε: pull a uniformly random arm
  2. With probability 1 − ε: pull the greedy arm arg max_a μ̂_a(t)

where \hat{\mu}_a(t) = \frac{1}{N_a(t)} \sum_{s < t: A_s = a} R_s is the empirical mean of arm a.

Regret analysis:

For fixed ε:

\mathbb{E}[N_a(T)] \geq \frac{\epsilon T}{K} \quad \text{for all arms } a

This means we pull every arm linearly in T, giving:

\text{Regret}(T) = \Omega(\epsilon T)

Linear regret! The algorithm never stops exploring suboptimal arms.

Decaying ε-greedy:

To achieve sublinear regret, decay ε over time:

\epsilon_t = \min\left(1, \frac{cK}{d^2 t}\right)

where c > 0 and d = min_{a: Δ_a > 0} Δ_a is the minimum gap.

This achieves O(K ln T / d) regret, but requires knowing the minimum gap d in advance—usually impractical.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ε-GREEDY ANALYSIS                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIXED ε = 0.1:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Explore probability: 10% of the time                                   │
│  Each suboptimal arm pulled: ≥ 0.1 × T / K times                        │
│                                                                          │
│  For T = 10,000, K = 5 arms:                                            │
│  Each arm pulled ≥ 200 times                                            │
│  If Arm 5 has Δ₅ = 0.5:                                                 │
│  Regret from Arm 5 alone ≥ 0.5 × 200 = 100                             │
│                                                                          │
│  Total regret grows linearly: Regret(T) = Θ(εT)                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DECAYING ε_t = c/(t):                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Early: ε large → lots of exploration                                   │
│  Late: ε small → mostly exploitation                                    │
│                                                                          │
│  Regret = O(K ln T) — much better!                                      │
│                                                                          │
│  But requires tuning c based on unknown problem parameters              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
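A minimal ε-greedy simulation on Bernoulli arms, illustrating fixed versus decaying schedules. The decaying schedule here uses a hypothetical constant c = 5 rather than the gap-tuned schedule from the text, since the gap is unknown in practice:

```python
import random

def eps_greedy(means, T, eps=0.1, decay=False, seed=0):
    """Run epsilon-greedy on Bernoulli arms; returns cumulative regret.

    decay=True uses eps_t = min(1, c*K / t) with a hypothetical c = 5,
    a practical stand-in for the gap-tuned schedule in the text."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    best, regret = max(means), 0.0
    for t in range(1, T + 1):
        e = min(1.0, 5 * K / t) if decay else eps
        if rng.random() < e or 0 in counts:
            a = rng.randrange(K)                                  # explore
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i])  # exploit
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1; sums[a] += r
        regret += best - means[a]
    return regret

print(eps_greedy([0.8, 0.6, 0.3], T=10_000))              # fixed eps: linear regret
print(eps_greedy([0.8, 0.6, 0.3], T=10_000, decay=True))  # decaying: much lower
```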

Upper Confidence Bound (UCB)

UCB algorithms embody the principle of optimism in the face of uncertainty: when uncertain about an arm's value, assume the best plausible case.

Key idea: Construct confidence intervals for each arm's mean. Pull the arm with the highest upper confidence bound.

UCB1 Algorithm

Upper confidence bound for arm a at time t:

\text{UCB}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}}

Algorithm: At each time t, pull:

A_t = \arg\max_a \text{UCB}_a(t) = \arg\max_a \left(\hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}}\right)

Intuition:

  • μ̂_a(t): Exploitation term (favor arms with high observed rewards)
  • √(2 ln t / N_a(t)): Exploration bonus (favor under-explored arms)

As N_a(t) increases, the bonus shrinks, and we rely more on the empirical mean.

Why this specific form?

The exploration bonus comes from Hoeffding's inequality. For bounded rewards in [0, 1] and N i.i.d. samples:

P\left(\hat{\mu} - \mu \geq \epsilon\right) \leq \exp(-2N\epsilon^2)

Setting \exp(-2N\epsilon^2) = t^{-4} (a small probability) and solving for ε:

\epsilon = \sqrt{\frac{2 \ln t}{N}}

So with high probability, the true mean is below \hat{\mu} + \sqrt{\frac{2 \ln t}{N}}.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    UCB1 CONFIDENCE INTERVALS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Time t = 100                                                           │
│                                                                          │
│  Arm 1: N₁ = 40, μ̂₁ = 0.70                                             │
│  UCB₁ = 0.70 + √(2 ln 100 / 40) = 0.70 + 0.48 = 1.18                   │
│                                                                          │
│  Arm 2: N₂ = 50, μ̂₂ = 0.65                                             │
│  UCB₂ = 0.65 + √(2 ln 100 / 50) = 0.65 + 0.43 = 1.08                   │
│                                                                          │
│  Arm 3: N₃ = 10, μ̂₃ = 0.40                                             │
│  UCB₃ = 0.40 + √(2 ln 100 / 10) = 0.40 + 0.96 = 1.36  ← HIGHEST!       │
│                                                                          │
│  VISUALIZATION:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Arm 1: |═══════════════════[────]        UCB = 1.18                   │
│                              ↑                                           │
│                           μ̂ = 0.70                                      │
│                                                                          │
│  Arm 2: |════════════════[────]           UCB = 1.08                   │
│                           ↑                                              │
│                        μ̂ = 0.65                                         │
│                                                                          │
│  Arm 3: |═════════[────────────────]      UCB = 1.36  ← SELECTED       │
│                   ↑                                                      │
│                μ̂ = 0.40                                                  │
│                                                                          │
│  Arm 3 selected despite lowest empirical mean because                   │
│  wide confidence interval (few samples) creates high UCB                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
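The selection rule in the box above translates directly to code; a compact UCB1 sketch on simulated Bernoulli arms (the arm means are illustrative):

```python
import math
import random

def ucb1(means, T, seed=0):
    """UCB1 on Bernoulli arms: play each arm once, then maximize
    mu_hat_a + sqrt(2 ln t / N_a). Returns (cumulative regret, pull counts)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    best, regret = max(means), 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                  # initial pass: each arm once
        else:
            a = max(range(K), key=lambda i:
                    sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1; sums[a] += r
        regret += best - means[a]
    return regret, counts

reg, pulls = ucb1([0.8, 0.6, 0.3], T=10_000)
print(reg, pulls)   # suboptimal arms get only O(ln T / gap^2) pulls
```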

UCB1 Regret Bound

Theorem: For UCB1 with K arms and rewards in [0, 1]:

\text{Regret}(T) \leq 8 \sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a=1}^{K} \Delta_a

Asymptotic form:

\text{Regret}(T) = O\left(\frac{K \ln T}{\Delta_{\min}}\right)

where \Delta_{\min} = \min_{a: \Delta_a > 0} \Delta_a.

Gap-dependent vs. gap-independent bounds:

The bound above is gap-dependent (it depends on the gaps Δ_a). A gap-independent bound:

\text{Regret}(T) = O\left(\sqrt{KT \ln T}\right)

This is useful when gaps are small or unknown.

Proof sketch (why UCB1 achieves logarithmic regret):

For a suboptimal arm a to be pulled at time t, its UCB must exceed the optimal arm's UCB:

\hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}} \geq \hat{\mu}^*(t) + \sqrt{\frac{2 \ln t}{N^*(t)}}

With high probability, \hat{\mu}_a(t) \leq \mu_a + \sqrt{\frac{2 \ln t}{N_a(t)}} and \hat{\mu}^*(t) \geq \mu^* - \sqrt{\frac{2 \ln t}{N^*(t)}}.

For the inequality to hold when the optimal arm is well-estimated:

\mu_a + 2\sqrt{\frac{2 \ln t}{N_a(t)}} \geq \mu^*

\sqrt{\frac{2 \ln t}{N_a(t)}} \geq \frac{\Delta_a}{2}

N_a(t) \leq \frac{8 \ln t}{\Delta_a^2}

So arm a is pulled at most O\left(\frac{\ln T}{\Delta_a^2}\right) times, giving regret contribution O\left(\frac{\ln T}{\Delta_a}\right).


UCB Variants

UCB2: Uses epochs of increasing length to reduce computation.

UCB-V (Variance-aware UCB):

If rewards have variance σ_a², tighter bounds are possible:

\text{UCB-V}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \hat{\sigma}_a^2(t) \ln t}{N_a(t)}} + \frac{3 \ln t}{N_a(t)}

where σ̂_a²(t) is the empirical variance.

KL-UCB: Uses the KL divergence for tighter bounds on Bernoulli rewards:

\text{KL-UCB}_a(t) = \max\left\{q : N_a(t) \cdot \text{KL}(\hat{\mu}_a(t), q) \leq \ln t + c \ln \ln t\right\}

KL-UCB is asymptotically optimal (it matches the Lai-Robbins lower bound).

MOSS (Minimax Optimal Strategy in the Stochastic case):

\text{MOSS}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{\max\left(0, \ln\frac{T}{K \cdot N_a(t)}\right)}{N_a(t)}}

Achieves the minimax-optimal O(\sqrt{KT}) regret.
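The KL-UCB index has no closed form, but since KL(μ̂, q) is increasing in q above μ̂, bisection finds it quickly. A sketch for Bernoulli rewards (the exploration constant c is set to 0 here, a common practical simplification):

```python
import math

def kl(p, q):
    """Bernoulli KL divergence, clamped away from 0 and 1 for stability."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, c=0):
    """Largest q >= mu_hat with n * KL(mu_hat, q) <= ln t + c ln ln t,
    found by bisection (KL(mu_hat, q) is increasing in q above mu_hat)."""
    target = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mu_hat, 1.0
    for _ in range(50):                      # bisection to high precision
        mid = (lo + hi) / 2
        if n * kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# With few samples the index sits far above the empirical mean...
print(kl_ucb_index(0.4, n=10, t=100))
# ...and tightens as evidence accumulates
print(kl_ucb_index(0.4, n=1000, t=100))
```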


Thompson Sampling

Thompson Sampling (TS) is a Bayesian approach: maintain a posterior distribution over each arm's mean, sample from posteriors, and play the arm with highest sample.

Algorithm:

  1. Prior: Start with a prior P(μ_a) for each arm (e.g., Beta(1, 1) for Bernoulli)
  2. At each time t:
    • Sample μ̃_a ~ P(μ_a | history) for each arm
    • Pull A_t = arg max_a μ̃_a
  3. Update: Update the pulled arm's posterior with the observed reward

For Bernoulli bandits with Beta prior:

Prior: μ_a ~ Beta(α_a, β_a), typically α_a = β_a = 1 (uniform)

After observing S_a successes and F_a failures on arm a:

\mu_a \mid \text{data} \sim \text{Beta}(\alpha_a + S_a, \beta_a + F_a)

Algorithm for Bernoulli Thompson Sampling:

  1. Initialize: α_a = β_a = 1 for all arms
  2. For t = 1, 2, ..., T:
    • Sample μ̃_a ~ Beta(α_a, β_a) for each arm
    • Pull A_t = arg max_a μ̃_a
    • Observe reward R_t ∈ {0, 1}
    • Update: α_{A_t} ← α_{A_t} + R_t, β_{A_t} ← β_{A_t} + (1 − R_t)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THOMPSON SAMPLING INTUITION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Arm 1: 8 successes, 2 failures → Beta(9, 3)                           │
│  Arm 2: 3 successes, 7 failures → Beta(4, 8)                           │
│  Arm 3: 1 success, 1 failure → Beta(2, 2)                              │
│                                                                          │
│  POSTERIOR DISTRIBUTIONS:                                                │
│  ─────────────────────────                                               │
│                                                                          │
│  P(μ)                                                                   │
│    │      Arm 2                     Arm 1                               │
│    │       ╱╲                         ╱╲                                │
│    │      ╱  ╲        Arm 3         ╱  ╲                               │
│    │     ╱    ╲        ╱╲          ╱    ╲                              │
│    │    ╱      ╲      ╱  ╲        ╱      ╲                             │
│    │   ╱        ╲    ╱    ╲      ╱        ╲                            │
│    │  ╱          ╲  ╱      ╲    ╱          ╲                           │
│    │─────────────────────────────────────────── μ                       │
│    0   0.2  0.3  0.4  0.5  0.6  0.7  0.8  1.0                          │
│                                                                          │
│  SAMPLING PROCESS:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Round 1: Sample μ̃₁ = 0.72, μ̃₂ = 0.28, μ̃₃ = 0.55 → Pull Arm 1        │
│  Round 2: Sample μ̃₁ = 0.68, μ̃₂ = 0.35, μ̃₃ = 0.71 → Pull Arm 3        │
│  Round 3: Sample μ̃₁ = 0.81, μ̃₂ = 0.22, μ̃₃ = 0.48 → Pull Arm 1        │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ───────────                                                             │
│  • Arm 1: Narrow posterior (many samples) → samples cluster near 0.75  │
│  • Arm 3: Wide posterior (few samples) → samples vary widely           │
│  • Wide posteriors mean occasional high samples → exploration!          │
│                                                                          │
│  Thompson Sampling naturally balances exploration (wide posteriors)     │
│  and exploitation (high posterior means)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
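The Beta-Bernoulli update above is only a few lines of code; a minimal Thompson Sampling sketch on simulated arms (the arm means are illustrative):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's Beta
    posterior, pull the argmax, update the conjugate posterior."""
    rng = random.Random(seed)
    K = len(means)
    alpha, beta = [1.0] * K, [1.0] * K        # uniform Beta(1, 1) priors
    best, regret = max(means), 0.0
    for _ in range(T):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = max(range(K), key=samples.__getitem__)
        r = 1 if rng.random() < means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        regret += best - means[a]
    return regret

print(thompson_bernoulli([0.8, 0.6, 0.3], T=10_000))
```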

For Gaussian bandits with known variance σ²:

Prior: \mu_a \sim \mathcal{N}(\mu_0, \sigma_0^2)

Posterior after N_a observations with sample mean X̄_a:

\mu_a \mid \text{data} \sim \mathcal{N}\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{N_a \bar{X}_a}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{N_a}{\sigma^2}}, \left(\frac{1}{\sigma_0^2} + \frac{N_a}{\sigma^2}\right)^{-1}\right)

With an uninformative prior (\sigma_0 \to \infty):

\mu_a \mid \text{data} \sim \mathcal{N}\left(\bar{X}_a, \frac{\sigma^2}{N_a}\right)
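The conjugate update can be sketched directly from the formulas above (the numbers in the usage example are illustrative):

```python
import math
import random

def gaussian_posterior(x_bar, n, sigma2, mu0=0.0, sigma0_2=1e6):
    """Posterior mean and variance for a Gaussian arm with known variance
    sigma2 and conjugate normal prior N(mu0, sigma0_2)."""
    precision = 1 / sigma0_2 + n / sigma2
    mean = (mu0 / sigma0_2 + n * x_bar / sigma2) / precision
    return mean, 1 / precision

# Near-uninformative prior: posterior approaches N(x_bar, sigma^2 / n)
m, v = gaussian_posterior(x_bar=0.7, n=100, sigma2=1.0)
print(m, v)   # approx 0.7 and 0.01

# A posterior draw, as Thompson Sampling would sample it
print(random.gauss(m, math.sqrt(v)))
```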

Thompson Sampling Regret Bound

Theorem (Agrawal & Goyal, 2012): For Thompson Sampling with Bernoulli rewards:

\mathbb{E}[\text{Regret}(T)] = O\left(\sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)

Thompson Sampling achieves the same asymptotic regret as UCB and matches the Lai-Robbins lower bound.

Advantages of Thompson Sampling:

  • Simple and intuitive
  • Naturally handles prior information
  • Often outperforms UCB in practice (better constants)
  • Extends easily to complex settings (contextual, combinatorial)

Disadvantages:

  • Requires specifying a prior
  • Computationally expensive for complex posteriors
  • Harder to analyze theoretically

Comparison: UCB vs. Thompson Sampling

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    UCB vs THOMPSON SAMPLING                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  │ Aspect              │ UCB                  │ Thompson Sampling │     │
│  ├─────────────────────┼──────────────────────┼───────────────────┤     │
│  │ Philosophy          │ Frequentist          │ Bayesian          │     │
│  │ Exploration         │ Deterministic        │ Randomized        │     │
│  │ Computation         │ Simple (max UCB)     │ Sampling required │     │
│  │ Prior knowledge     │ Hard to incorporate  │ Natural via prior │     │
│  │ Regret bound        │ O(K ln T / Δ)        │ O(K ln T / Δ)     │     │
│  │ Practical perf.     │ Good                 │ Often better      │     │
│  │ Analysis difficulty │ Easier               │ Harder            │     │
│  │ Delayed feedback    │ Straightforward      │ Straightforward   │     │
│  │ Batched decisions   │ Needs modification   │ Natural           │     │
│                                                                          │
│  EMPIRICAL COMPARISON (typical):                                         │
│  ───────────────────────────────                                         │
│                                                                          │
│  Regret                                                                  │
│    │                                                                     │
│    │    ε-greedy (linear)                                               │
│    │   ╱                                                                 │
│    │  ╱                                                                  │
│    │ ╱      UCB                                                         │
│    │╱      ╱                                                             │
│    │      ╱    Thompson Sampling                                        │
│    │     ╱    ╱                                                          │
│    │    ╱   ╱                                                            │
│    │   ╱  ╱                                                              │
│    │──╱─╱─────────────────────────────────── Time                       │
│                                                                          │
│  Thompson Sampling often has better constants (lower regret)            │
│  despite same asymptotic rate as UCB                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Information-Directed Sampling (IDS)

While UCB and Thompson Sampling are the most common approaches, Information-Directed Sampling offers a principled alternative that directly optimizes the exploration-exploitation tradeoff.

Key idea: At each round, choose the action that achieves the best tradeoff between expected regret and information gained about the optimal action.

Information ratio:

\Psi_t = \frac{(\mathbb{E}[\Delta_{A_t} | H_t])^2}{I_t(A^*; (A_t, R_t) | H_t)}

where:

  • \mathbb{E}[\Delta_{A_t} | H_t] = expected instantaneous regret
  • I_t(A^*; (A_t, R_t) | H_t) = mutual information between the optimal arm and the observation
  • H_t = history up to time t

IDS Algorithm:

At each time t, select a distribution \pi_t over arms minimizing:

\min_{\pi} \frac{(\sum_a \pi(a) \mathbb{E}[\Delta_a | H_t])^2}{\sum_a \pi(a) I_t(A^*; R_a | H_t)}

Then sample A_t \sim \pi_t.

Intuition:

  • Numerator: Squared expected regret (want low)
  • Denominator: Information gain (want high)
  • Minimizing the ratio balances both objectives
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    IDS vs UCB vs THOMPSON SAMPLING                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  UCB:                                                                    │
│  ────                                                                    │
│  "Be optimistic—assume uncertain arms are good"                         │
│  Exploration is implicit via confidence bounds                          │
│                                                                          │
│  THOMPSON SAMPLING:                                                      │
│  ─────────────────                                                       │
│  "Sample from beliefs, act on sample"                                   │
│  Exploration via posterior uncertainty                                  │
│                                                                          │
│  IDS:                                                                    │
│  ────                                                                    │
│  "Explicitly optimize regret vs. information tradeoff"                  │
│  Directly measures value of information                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN IDS HELPS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  • Many similar arms: IDS identifies which comparisons are informative  │
│  • Structured problems: IDS exploits information structure              │
│  • Sparse rewards: IDS values rare informative observations             │
│                                                                          │
│  Example: 100 arms, 99 have mean 0.5, one has mean 0.6                  │
│  - UCB/TS: Explore all arms roughly equally initially                   │
│  - IDS: Quickly identifies that most arms are similar,                  │
│         focuses on finding the unique good arm                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Regret bound:

For K-armed bandits with bounded information ratio \Psi^* = \sup_t \mathbb{E}[\Psi_t]:

\text{Regret}(T) \leq \sqrt{\Psi^* \cdot H(A^*) \cdot T}

where H(A^*) is the entropy of the optimal arm under the prior.

Key insight: \Psi^* \leq K/2 always, but for structured problems \Psi^* can be much smaller, leading to better regret than UCB/TS.

Practical considerations:

  • Computing mutual information can be expensive
  • Approximations needed for complex posteriors
  • Most beneficial for structured or many-armed problems
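For small Bernoulli problems, the information ratio can be approximated from posterior samples. Below is a minimal Python sketch of the variance-based IDS approximation, which replaces the mutual-information term with the variance of the conditional means (a standard lower bound); the Beta-posterior setup, function names, and constants are illustrative assumptions, not from the text.

```python
import numpy as np

def ids_action(alpha, beta, n_samples=2000, rng=None):
    """One variance-based IDS decision for Bernoulli arms with Beta posteriors.

    alpha, beta: per-arm Beta posterior parameters (length-K arrays).
    Picks the arm minimizing the deterministic ratio Delta_a^2 / v_a, where
    v_a = Var over A* of E[theta_a | A*] lower-bounds the information gain.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(alpha)
    theta = rng.beta(alpha, beta, size=(n_samples, K))   # posterior draws
    best = theta.argmax(axis=1)                          # optimal arm per draw

    p_star = np.bincount(best, minlength=K) / n_samples  # P(A* = a)
    mu = theta.mean(axis=0)                              # E[theta_a]
    rho = theta.max(axis=1).mean()                       # E[max_a theta_a]
    delta = rho - mu                                     # expected regret per arm

    # v_a = sum_{a'} P(A* = a') * (E[theta_a | A* = a'] - E[theta_a])^2
    v = np.zeros(K)
    for a_star in range(K):
        mask = best == a_star
        if mask.any():
            v += p_star[a_star] * (theta[mask].mean(axis=0) - mu) ** 2

    ratio = delta ** 2 / np.maximum(v, 1e-12)            # information ratio per arm
    return int(ratio.argmin())

# Quick demo on three hypothetical Bernoulli arms
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
alpha = np.ones(3); beta = np.ones(3)
counts = np.zeros(3, dtype=int)
for t in range(500):
    a = ids_action(alpha, beta, rng=rng)
    r = rng.random() < true_means[a]
    alpha[a] += r; beta[a] += 1 - r
    counts[a] += 1
```

Note that an arm the posterior is nearly sure about contributes almost no information, so its ratio explodes and IDS stops sampling it, which is exactly the behavior described above.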

Part III: Contextual Bandits

In many applications, we observe context (features) before making a decision. The optimal action depends on this context.

Problem Formulation

Setting: At each time t:

  1. Observe context x_t \in \mathcal{X} (e.g., user features)
  2. Select arm A_t \in \{1, ..., K\}
  3. Receive reward R_t where \mathbb{E}[R_t | x_t, A_t = a] = f_a(x_t)

Goal: Learn a policy \pi: \mathcal{X} \to \{1, ..., K\} that maximizes:

\mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \mathbb{E}\left[\sum_{t=1}^{T} f_{A_t}(x_t)\right]

Regret (compared to the best policy in hindsight):

\text{Regret}(T) = \mathbb{E}\left[\sum_{t=1}^{T} \max_a f_a(x_t) - f_{A_t}(x_t)\right]

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONTEXTUAL BANDIT SETTING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: News Article Recommendation                                    │
│  ────────────────────────────────────                                    │
│                                                                          │
│  Context x_t = [user_age, user_gender, time_of_day, device, ...]       │
│                                                                          │
│  Arms = {Sports, Politics, Tech, Entertainment, ...}                    │
│                                                                          │
│  Reward = 1 if user clicks, 0 otherwise                                 │
│                                                                          │
│  Goal: Learn which article category to show each user                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY CONTEXT MATTERS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  User A: young, male, evening, mobile                                   │
│  → Sports might be best                                                 │
│                                                                          │
│  User B: older, female, morning, desktop                                │
│  → Politics might be best                                               │
│                                                                          │
│  Without context: single "best" arm for everyone (suboptimal)           │
│  With context: personalized arm selection (better)                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMPARED TO STANDARD MAB:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Standard MAB:    A_t → R_t                                             │
│  Contextual MAB:  (x_t, A_t) → R_t                                      │
│                                                                          │
│  Standard MAB learns: μ_a (scalar per arm)                              │
│  Contextual MAB learns: f_a(x) (function per arm)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Linear Contextual Bandits (LinUCB)

Assume the expected reward is linear in context:

\mathbb{E}[R_t | x_t, A_t = a] = x_t^{\top} \theta_a^*

where \theta_a^* \in \mathbb{R}^d is an unknown parameter vector for arm a.

LinUCB Algorithm:

For each arm a, maintain:

  • A_a = I_d + \sum_{s: A_s = a} x_s x_s^{\top} (design matrix)
  • b_a = \sum_{s: A_s = a} R_s x_s (reward-weighted contexts)

Ridge regression estimate:

\hat{\theta}_a = A_a^{-1} b_a

Confidence bound:

For \theta_a^*, with probability \geq 1 - \delta:

\left\|\hat{\theta}_a - \theta_a^*\right\|_{A_a} \leq \beta_t(\delta)

where \|v\|_M = \sqrt{v^{\top} M v} and:

\beta_t(\delta) = \sqrt{\lambda} \|\theta_a^*\| + \sigma\sqrt{2 \ln(1/\delta) + d \ln(1 + t/(d\lambda))}

UCB for arm a given context x_t:

\text{UCB}_a(x_t) = x_t^{\top} \hat{\theta}_a + \alpha \sqrt{x_t^{\top} A_a^{-1} x_t}

Algorithm:

  1. Pull A_t = \arg\max_a \text{UCB}_a(x_t)
  2. Observe R_t
  3. Update A_{A_t} \leftarrow A_{A_t} + x_t x_t^{\top} and b_{A_t} \leftarrow b_{A_t} + R_t x_t

Regret bound:

\text{Regret}(T) = \tilde{O}\left(d\sqrt{KT}\right)

where \tilde{O} hides logarithmic factors.
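The bookkeeping above translates almost line-for-line into code. Here is a minimal sketch of disjoint LinUCB (one ridge model per arm, fixed α); the simulation at the bottom, with its two made-up parameter vectors, is purely illustrative.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        # A_a = I_d + sum x x^T, b_a = sum R x, as in the update rules above
        self.A = np.stack([np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta_hat = A_inv @ b_a              # ridge estimate
            bonus = np.sqrt(x @ A_inv @ x)       # uncertainty in direction x
            scores.append(x @ theta_hat + self.alpha * bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Demo (hypothetical): two arms whose rewards are linear in a 3-d context
rng = np.random.default_rng(0)
thetas = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bandit = LinUCB(n_arms=2, dim=3, alpha=1.0)
for _ in range(500):
    x = rng.normal(size=3)
    a = bandit.select(x)
    reward = x @ thetas[a] + 0.1 * rng.normal()
    bandit.update(a, x, reward)
```

For large d, maintaining A_a^{-1} incrementally (e.g., via the Sherman-Morrison update) avoids the per-step matrix inversion.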

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LinUCB GEOMETRIC INTUITION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Parameter space for arm a (2D example):                                │
│                                                                          │
│       θ₂                                                                 │
│        │                                                                 │
│        │     ╭─────────────╮                                            │
│        │    ╱               ╲    Confidence ellipsoid                   │
│        │   ╱    ●───────────────→ True θ*                              │
│        │  │     ↑            │                                          │
│        │  │   θ̂ (estimate)   │                                          │
│        │   ╲                ╱                                            │
│        │    ╲              ╱                                             │
│        │     ╰────────────╯                                             │
│        │                                                                 │
│        └────────────────────────── θ₁                                   │
│                                                                          │
│  Ellipsoid shape: A_a^{-1} (inverse of design matrix)                   │
│  Directions with many observations → narrow (confident)                 │
│  Directions with few observations → wide (uncertain)                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  UCB for context x:                                                      │
│                                                                          │
│  UCB_a(x) = x^T θ̂_a + α √(x^T A_a^{-1} x)                               │
│             ─────────   ─────────────────                                │
│             Exploitation    Exploration                                  │
│             (predicted      (uncertainty                                 │
│              reward)         in direction x)                            │
│                                                                          │
│  If x aligns with uncertain directions → large exploration bonus        │
│  If x aligns with well-explored directions → small exploration bonus    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Linear Thompson Sampling

For linear bandits:

Prior: \theta_a \sim \mathcal{N}(0, \lambda^{-1} I_d)

Posterior (Gaussian conjugate update):

\theta_a | \text{data} \sim \mathcal{N}\left(A_a^{-1} b_a, \sigma^2 A_a^{-1}\right)

Algorithm:

  1. Sample \tilde{\theta}_a \sim \mathcal{N}(\hat{\theta}_a, v^2 A_a^{-1}) for each arm
  2. Pull A_t = \arg\max_a x_t^{\top} \tilde{\theta}_a
  3. Update A_{A_t} and b_{A_t}

The parameter v controls exploration (often set to match the confidence width).
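A single decision step can be sketched as follows, reusing the same A_a, b_a statistics as LinUCB (the function name and the fixed v are illustrative choices):

```python
import numpy as np

def lin_ts_select(x, A_list, b_list, v=0.5, rng=None):
    """One Linear Thompson Sampling decision: sample theta_a from each arm's
    Gaussian posterior N(A_a^{-1} b_a, v^2 A_a^{-1}), then act greedily."""
    rng = np.random.default_rng() if rng is None else rng
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        theta_sample = rng.multivariate_normal(theta_hat, v**2 * A_inv)
        scores.append(x @ theta_sample)
    return int(np.argmax(scores))
```

After observing the reward, the update is identical to LinUCB's step 3, so the only difference between the two algorithms is how the exploration bonus enters: additively (UCB) versus through posterior noise (TS).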


Generalized Linear Bandits

When rewards are non-linear but belong to an exponential family:

\mathbb{E}[R_t | x_t, A_t = a] = g(x_t^{\top} \theta_a^*)

where g is a known link function (e.g., the sigmoid for logistic rewards).

Examples:

  • Logistic bandits: g(\eta) = \frac{1}{1 + e^{-\eta}} (click-through prediction)
  • Poisson bandits: g(\eta) = e^{\eta} (count data)

GLM-UCB and GLM-Thompson Sampling extend the linear approaches using maximum likelihood estimation and appropriate confidence bounds.


Neural Contextual Bandits

For complex contexts (images, text), linear models are insufficient. Neural networks can learn rich representations.

NeuralUCB / NeuralTS:

  1. Train a neural network f_\theta(x, a) to predict rewards
  2. Use neural tangent kernel (NTK) or ensemble methods for uncertainty
  3. Apply UCB or Thompson Sampling on top

Uncertainty estimation methods:

  1. Dropout at test time: Sample multiple predictions with dropout
  2. Ensemble: Train multiple networks, use variance
  3. Neural Tangent Kernel: Treat last layer as linear, apply LinUCB
  4. Bayesian neural networks: Maintain weight distributions

Regret bounds: Under certain assumptions (NTK regime), neural bandits achieve \tilde{O}(\sqrt{T}) regret.


Part IV: Advanced Topics

Adversarial Bandits

In adversarial bandits, rewards are not stochastic but chosen by an adversary who may know your algorithm.

Setting: At each time t:

  1. Adversary secretly assigns rewards r_{t,1}, ..., r_{t,K} to all arms
  2. You select arm A_t
  3. You receive and observe r_{t, A_t} (under "bandit feedback" you see only your own reward, not the other arms')

Regret (vs. the best fixed arm in hindsight):

\text{Regret}(T) = \max_{a \in \{1,...,K\}} \sum_{t=1}^{T} r_{t,a} - \mathbb{E}\left[\sum_{t=1}^{T} r_{t, A_t}\right]

EXP3 (Exponential-weight algorithm for Exploration and Exploitation):

Maintain weights w_{t,a} for each arm. Let W_t = \sum_a w_{t,a}.

Probability of pulling arm a:

p_{t,a} = (1 - \gamma) \frac{w_{t,a}}{W_t} + \frac{\gamma}{K}

where \gamma \in (0, 1] is the exploration parameter.

Weight update after observing reward r_t for arm A_t:

\hat{r}_{t,a} = \frac{r_t \mathbb{I}[A_t = a]}{p_{t,a}}

w_{t+1,a} = w_{t,a} \cdot \exp\left(\frac{\gamma \hat{r}_{t,a}}{K}\right)

Intuition: \hat{r}_{t,a} is an unbiased estimator of r_{t,a} (importance weighting). High-reward arms get higher weights.

EXP3 Regret Bound:

\mathbb{E}[\text{Regret}(T)] \leq 2\sqrt{KT \ln K}

This is minimax optimal (matches lower bound up to constants).
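The update rules translate directly into code. A sketch (the per-step rescaling of the weights is a numerical safeguard against overflow, not part of the textbook algorithm; `reward_fn` is a hypothetical stand-in for the adversary):

```python
import numpy as np

def exp3(reward_fn, K, T, gamma=0.1, rng=None):
    """EXP3 as described above; reward_fn(t, a) must return r_{t,a} in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    total = 0.0
    for t in range(T):
        # Mix the weight distribution with uniform exploration
        p = (1 - gamma) * w / w.sum() + gamma / K
        a = rng.choice(K, p=p)
        r = reward_fn(t, a)
        total += r
        pulls[a] += 1
        r_hat = r / p[a]                       # importance-weighted estimate
        w[a] *= np.exp(gamma * r_hat / K)      # exponential weight update
        w /= w.max()                           # rescale; leaves p unchanged
    return pulls, total
```

Because p_{t,a} \geq \gamma/K, the importance weights are bounded by K/\gamma, which keeps the estimator's variance (and the exponent above) under control.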

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STOCHASTIC vs ADVERSARIAL BANDITS                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STOCHASTIC BANDITS:                                                     │
│  ───────────────────                                                     │
│  Rewards: R_t ~ ν_a (fixed distribution per arm)                        │
│  Adversary: None (nature is i.i.d.)                                     │
│  Best algorithms: UCB, Thompson Sampling                                │
│  Regret: O(K ln T / Δ) or O(√(KT ln K))                                │
│                                                                          │
│  Exploits: Concentration (samples converge to mean)                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVERSARIAL BANDITS:                                                    │
│  ────────────────────                                                    │
│  Rewards: r_{t,a} chosen by adversary (possibly adaptive)               │
│  Adversary: Knows your algorithm, can be malicious                      │
│  Best algorithms: EXP3, EXP3.P                                          │
│  Regret: O(√(KT ln K))                                                  │
│                                                                          │
│  Exploits: Randomization (adversary can't predict your choice)          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY EXP3 IS RANDOMIZED:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  If you're deterministic, adversary sets:                               │
│  - Reward 0 for your chosen arm                                         │
│  - Reward 1 for all other arms                                          │
│  → Linear regret!                                                       │
│                                                                          │
│  Randomization prevents adversary from targeting you                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Best Arm Identification

Instead of minimizing regret (cumulative performance), identify the best arm with high confidence.

Fixed-confidence setting: Given confidence \delta, find \hat{a} such that:

P(\hat{a} \neq a^*) \leq \delta

using as few samples as possible.

Fixed-budget setting: Given budget T, find \hat{a} minimizing:

P(\hat{a} \neq a^*)

Successive Elimination:

  1. Start with all arms active: \mathcal{A} = \{1, ..., K\}
  2. Pull each active arm once per round
  3. Eliminate arms whose UCB is below the LCB of another arm
  4. Stop when one arm remains

Sample complexity (fixed-confidence):

\tau^* = O\left(\sum_{a=1}^{K} \frac{1}{\Delta_a^2} \ln\frac{K}{\delta}\right)

Lower bound (for any \delta-correct algorithm):

\mathbb{E}[\tau] \geq \sum_{a: \Delta_a > 0} \frac{2}{\Delta_a^2} \ln\frac{1}{2.4\delta}
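The elimination loop above fits in a few lines. A sketch with a Hoeffding-style confidence radius (the exact constant inside the logarithm is one common union-bound choice among several; the Bernoulli simulator is illustrative):

```python
import math
import numpy as np

def successive_elimination(means, delta=0.05, rng=None):
    """Fixed-confidence best-arm identification via successive elimination.
    `means` are the hidden Bernoulli parameters; returns (best_arm, rounds)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    active = list(range(K))
    sums = np.zeros(K)
    n = 0
    while len(active) > 1:
        n += 1
        for a in active:                     # one pull of every active arm
            sums[a] += rng.random() < means[a]
        mu_hat = sums / n
        # Hoeffding radius with a union bound over arms and rounds
        rad = math.sqrt(math.log(4 * K * n**2 / delta) / (2 * n))
        best = max(mu_hat[a] for a in active)
        # Keep an arm only while its UCB still reaches the leader's LCB
        active = [a for a in active if mu_hat[a] + rad >= best - rad]
    return active[0], n
```

Arms with large gaps \Delta_a are eliminated after roughly 1/\Delta_a^2 rounds, which is where the sample-complexity sum above comes from.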


Combinatorial Bandits

Select a set of arms (or a combinatorial structure) rather than a single arm.

Examples:

  • Shortest path: Arms are paths, reward is path cost
  • Matching: Arms are matchings, reward is matching value
  • Assortment: Select subset of products, reward is revenue

Semi-bandit feedback: Observe rewards of all selected arms (not just total).

CUCB (Combinatorial UCB):

For a super arm S \subseteq \{1, ..., K\}:

\text{UCB}(S) = \sum_{a \in S} \text{UCB}_a(t)

Select S_t = \arg\max_S \text{UCB}(S) subject to feasibility constraints.

Regret: \tilde{O}\left(\sqrt{mKT}\right) where m = size of the selected set.
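For the simplest feasible family, "any subset of size m", the argmax over super arms reduces to taking the m largest per-arm UCBs. A sketch under that assumption (the Bernoulli rewards and the exploration constant are illustrative):

```python
import math
import numpy as np

def cucb_top_m(means, m, T, rng=None):
    """CUCB with semi-bandit feedback when any m arms form a feasible super arm."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    n = np.zeros(K)          # pull counts
    s = np.zeros(K)          # reward sums
    for t in range(1, T + 1):
        # Per-arm UCBs; unexplored arms get +inf so they are tried first
        ucb = np.where(
            n > 0,
            s / np.maximum(n, 1) + np.sqrt(1.5 * math.log(t) / np.maximum(n, 1)),
            np.inf,
        )
        S = np.argsort(ucb)[-m:]             # super arm: the m largest UCBs
        r = rng.random(K) < means            # Bernoulli rewards for all arms
        for a in S:                          # semi-bandit: observe selected arms
            n[a] += 1
            s[a] += r[a]
    return n
```

For richer constraint families (paths, matchings), the same per-arm statistics feed a combinatorial oracle that maximizes the sum of UCBs over feasible sets.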


Batched and Delayed Feedback

Batched bandits: Make B decisions before observing any feedback.

  • Requires more exploration (can't adapt within a batch)
  • Regret: O(\sqrt{KT \cdot B}) with B batches

Delayed feedback: The reward for the action at time t is observed at time t + d_t.

  • Must account for "in-flight" decisions
  • Regret increases with delay: O(\sqrt{KT} + d_{\max} K)

Non-Stationary Bandits

Arm distributions change over time.

Abruptly changing: Distributions are piecewise stationary with S change points.

Discounted UCB: Weight recent observations more heavily:

\hat{\mu}_a^{(\gamma)}(t) = \frac{\sum_{s < t} \gamma^{t-s} R_s \mathbb{I}[A_s = a]}{\sum_{s < t} \gamma^{t-s} \mathbb{I}[A_s = a]}

Sliding window UCB: Only use observations from the last \tau rounds.

Regret: \tilde{O}\left(S^{1/3} K^{1/3} T^{2/3}\right) is achievable.
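Both estimators are short computations over the reward history. A sketch (function names are mine; a full algorithm would plug these means into the usual UCB index):

```python
import numpy as np

def discounted_mean(rewards, arms, arm, gamma=0.99):
    """Discounted empirical mean from the formula above: recent pulls of
    `arm` dominate, old ones decay geometrically."""
    rewards = np.asarray(rewards, dtype=float)
    arms = np.asarray(arms)
    t = len(rewards)
    w = gamma ** (t - np.arange(t))          # gamma^{t-s} for s = 0..t-1
    mask = arms == arm
    return (w * rewards * mask).sum() / (w * mask).sum()

def sliding_window_mean(rewards, arms, arm, tau=100):
    """Sliding-window mean: only the last tau rounds count."""
    rewards = np.asarray(rewards, dtype=float)[-tau:]
    arms = np.asarray(arms)[-tau:]
    mask = arms == arm
    return rewards[mask].mean()
```

If an arm's mean jumps halfway through the history, the plain average stays near the midpoint while both of these estimators track the new value, which is exactly why they handle change points.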


Bandit Variants

Beyond standard stochastic bandits, many important variants capture real-world constraints and feedback structures.

Restless Bandits

In restless bandits, arm states evolve even when not pulled—a critical feature for many applications.

Setting: Each arm a has a state S_a(t) that evolves according to a Markov chain:

  • If arm a is pulled: S_a(t+1) \sim P_1(\cdot | S_a(t))
  • If arm a is not pulled: S_a(t+1) \sim P_0(\cdot | S_a(t))

Reward depends on the state: R_t = r(S_{A_t}(t))

Examples:

  • Ad fatigue: Ad effectiveness decreases when shown (pulled) but recovers when not shown
  • Machine maintenance: Machines degrade when used, but also when idle
  • Communication channels: Channel quality fluctuates independently of whether you use it

Whittle index policy: Compute an "index" for each arm-state pair:

W_a(s) = \inf\{\lambda : \text{passive is optimal in state } s \text{ under subsidy } \lambda\}

Pull arms with highest Whittle indices. Asymptotically optimal when arms are "indexable."

Challenge: Computing Whittle indices requires solving per-arm MDPs. Full problem is PSPACE-hard.


Dueling Bandits

In dueling bandits, feedback is pairwise comparisons rather than absolute rewards.

Setting: At each round:

  1. Select two arms A_t, B_t
  2. Observe the comparison outcome: A_t \succ B_t or B_t \succ A_t

Preference model (Bradley-Terry-Luce):

P(a \succ b) = \frac{\mu_a}{\mu_a + \mu_b}

or more generally, a preference matrix P_{ab} = P(a \succ b).

Goal: Find the Condorcet winner—the arm that beats all others:

a^* : P_{a^* b} > 0.5 \quad \forall b \neq a^*

Regret: Measured in comparisons not involving the Condorcet winner.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DUELING BANDITS                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STANDARD BANDIT:          DUELING BANDIT:                              │
│  ────────────────          ───────────────                              │
│                                                                          │
│  Pull arm A                Compare A vs B                               │
│  Observe R_A = 0.7         Observe: A wins (or B wins)                  │
│                                                                          │
│  Information: Absolute     Information: Relative                        │
│  score of A                preference only                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • Ranking from user preferences ("Which result is better?")            │
│  • A/B testing with preference feedback                                 │
│  • Tournament design                                                    │
│  • Information retrieval evaluation                                     │
│  • RLHF for LLMs (comparing model outputs)                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Algorithms:

  • Interleaved Filter (IF): Eliminate arms beaten by current champion
  • Beat the Mean (BTM): Compare against average competitor
  • RUCB: UCB-style approach for dueling
  • Double Thompson Sampling: Sample preferences, duel top two

Regret bound: O(K \ln T / \Delta^2), where \Delta is the minimum preference gap.
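A heavily simplified sketch of the Double Thompson Sampling idea: sample a preference matrix from per-pair Beta posteriors, take the sampled Copeland winner as the first arm, then resample to pick its strongest challenger. The tie-breaking and challenger rule here are simplifications of the published algorithm, and the BTL simulator is illustrative.

```python
import numpy as np

def dts_duel(wins, rng):
    """Pick a pair (first, second) to duel. `wins[i, j]` counts i's wins over j."""
    # Sample P[i, j] ~ Beta(wins[i, j] + 1, wins[j, i] + 1)
    P = rng.beta(wins + 1, wins.T + 1)
    np.fill_diagonal(P, 0.5)
    copeland = (P > 0.5).sum(axis=1)          # how many arms each arm beats
    first = int(copeland.argmax())
    P2 = rng.beta(wins + 1, wins.T + 1)       # second, independent sample
    P2[first, :] = -1.0                       # challenger must differ from first
    second = int(P2[:, first].argmax())       # most likely to beat `first`
    return first, second

# Demo (hypothetical): BTL preferences with utilities mu = (1, 2, 4)
mu = np.array([1.0, 2.0, 4.0])
P_true = mu[:, None] / (mu[:, None] + mu[None, :])
rng = np.random.default_rng(0)
wins = np.zeros((3, 3))
firsts = np.zeros(3, dtype=int)
for _ in range(1500):
    i, j = dts_duel(wins, rng)
    if rng.random() < P_true[i, j]:
        wins[i, j] += 1
    else:
        wins[j, i] += 1
    firsts[i] += 1
```

As the win counts accumulate, the sampled Copeland winner concentrates on the Condorcet winner, so most duels end up involving it, matching the regret definition above.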


Sleeping Bandits

In sleeping bandits, not all arms are available at every round.

Setting: At each round t:

  1. Observe the available arm set \mathcal{A}_t \subseteq \{1, ..., K\}
  2. Select A_t \in \mathcal{A}_t
  3. Receive reward R_t

Examples:

  • Product recommendations: Not all products in stock
  • Ad serving: Some ads have budget constraints
  • Resource allocation: Some resources temporarily unavailable

Regret (against best policy):

\text{Regret}(T) = \sum_{t=1}^{T} \left(\max_{a \in \mathcal{A}_t} \mu_a - \mu_{A_t}\right)

Algorithms: Adapt UCB/TS to only consider available arms. Regret bounds depend on availability patterns.
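Adapting UCB is essentially a masking step; a minimal sketch of one decision (function name and the exploration constant are mine):

```python
import math
import numpy as np

def sleeping_ucb_choice(mu_hat, n_pulls, t, available):
    """UCB restricted to the currently available arms (sleeping bandit)."""
    ucb = mu_hat + np.sqrt(2 * math.log(t) / np.maximum(n_pulls, 1))
    ucb[n_pulls == 0] = np.inf                # try unexplored arms first
    masked = np.full_like(ucb, -np.inf)       # unavailable arms can't win
    masked[available] = ucb[available]
    return int(masked.argmax())
```

The subtlety is in the regret definition, not the index: the benchmark is the best available arm each round, so an algorithm is not penalized when the globally best arm is asleep.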


Mortal Bandits

In mortal bandits, arms can permanently disappear.

Setting: Each arm a has a lifetime L_a (possibly random). After L_a pulls, arm a is no longer available.

Examples:

  • Job candidates: Candidates accept other offers
  • Limited inventory: Products sell out
  • Time-sensitive opportunities: Deals expire

Challenge: Must balance learning about an arm vs. exploiting it before it dies.

Key insight: With short lifetimes, pure exploitation can be optimal (no time to learn). With long lifetimes, standard bandit algorithms apply.


Constrained and Safe Bandits

Many real applications require satisfying constraints beyond maximizing reward.

Bandits with Knapsack Constraints

Setting: Each arm pull consumes resources. Stop when budget exhausted.

Formulation:

  • Pulling arm a gives reward R_t and consumes resources C_t \in \mathbb{R}^d
  • Budget constraint: \sum_{t=1}^{\tau} C_{t,i} \leq B_i for each resource i
  • The stopping time \tau is reached when any budget is exhausted

Goal: Maximize total reward subject to budget constraints.

LP relaxation:

\max_{\pi} \sum_a \pi_a \mu_a \quad \text{s.t.} \quad \sum_a \pi_a c_{a,i} \leq B_i / T, \quad \sum_a \pi_a = 1

Algorithms:

  • Primal-dual methods: Maintain dual variables for constraints
  • UCB-BwK: UCB with budget-aware exploration
  • Thompson Sampling with budget: Sample, but respect constraints

Regret: \tilde{O}(\sqrt{T}) relative to the best fixed policy satisfying the constraints.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BANDITS WITH KNAPSACKS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: Ad Campaign with Budget                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  Arms = {Premium placement, Standard placement, Text ad}                │
│                                                                          │
│  Rewards (CTR):    Costs:                                               │
│  Premium: 0.08     Premium: $2.00                                       │
│  Standard: 0.05    Standard: $0.50                                      │
│  Text: 0.02        Text: $0.10                                          │
│                                                                          │
│  Budget: $1000                                                          │
│                                                                          │
│  NAIVE APPROACH:                                                         │
│  ───────────────                                                         │
│  Always pick Premium (highest CTR)                                      │
│  → 500 impressions, 40 clicks                                           │
│                                                                          │
│  OPTIMAL MIX:                                                            │
│  ────────────                                                            │
│  Mix of Standard and Text                                               │
│  → 2500+ impressions, 100+ clicks                                       │
│                                                                          │
│  Bandit must LEARN this while respecting budget!                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Safe Bandits

Safe bandits require that performance never drops below a safety threshold.

Setting: There exists a baseline arm a_0 with known performance \mu_0. Constraint:

\mu_{A_t} \geq \mu_0 - \epsilon \quad \text{with high probability}

Applications:

  • Medical treatment: New treatments must be at least as good as standard of care
  • Autonomous systems: Exploration must not cause accidents
  • Production systems: A/B tests shouldn't tank conversion rates

Safe UCB: Only pull arms whose lower confidence bound exceeds threshold:

\text{Pull } a \text{ only if } \hat{\mu}_a(t) - \sqrt{\frac{2 \ln t}{N_a(t)}} \geq \mu_0 - \epsilon

Challenge: Safe exploration is harder—must be confident an arm is safe before trying it.
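The eligibility rule above in code, as a sketch (a full algorithm would fall back to the baseline arm a_0 whenever no other arm qualifies; the function name is mine):

```python
import math
import numpy as np

def safe_ucb_eligible(mu_hat, n_pulls, t, mu0, eps):
    """Arms whose lower confidence bound clears the safety threshold mu0 - eps.

    mu_hat: empirical means; n_pulls: pull counts (assumed > 0 here).
    """
    lcb = mu_hat - np.sqrt(2 * math.log(t) / np.maximum(n_pulls, 1))
    return np.flatnonzero(lcb >= mu0 - eps)
```

Note the asymmetry this creates: an arm only becomes eligible after enough evidence, so the algorithm must "earn" the right to explore it, which is the hardness mentioned above.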


Conservative Bandits

A variant of safe bandits where cumulative performance must stay above baseline.

Constraint:

\sum_{s=1}^{t} R_s \geq (1 - \alpha) \cdot t \cdot \mu_0 \quad \forall t

Must never fall too far behind what baseline would have achieved.

Algorithm: Explore only when "safety budget" accumulated from exploitation allows it.


Fair Bandits

Fair bandits ensure equitable treatment across groups or arms.

Meritocratic fairness: Pull probability proportional to true quality:

P(A_t = a) \propto \mu_a

Group fairness: If arms belong to groups, ensure groups receive fair share of pulls:

\frac{N_{\text{group}_1}(T)}{N_{\text{group}_2}(T)} \approx \frac{|\text{group}_1|}{|\text{group}_2|}

Individual fairness: Similar arms should be pulled with similar frequency.

Tradeoff: Fairness constraints typically increase regret. The fair-bandit literature studies Pareto frontiers between fairness and efficiency.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAIRNESS IN BANDITS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: Job Candidate Screening                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  Arms = candidates, Reward = hiring success                             │
│                                                                          │
│  UNCONSTRAINED OPTIMAL:                                                  │
│  ──────────────────────                                                  │
│  Always interview highest-predicted candidates                          │
│  May systematically exclude certain groups                              │
│                                                                          │
│  FAIR BANDIT:                                                            │
│  ────────────                                                            │
│  Ensure demographic groups receive fair interview rates                 │
│  while still learning who are the best candidates                       │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • Loan approvals across demographics                                   │
│  • Content recommendation diversity                                     │
│  • Resource allocation across regions                                   │
│  • Ad serving to different user groups                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part V: Applications

A/B Testing and Adaptive Experiments

Traditional A/B testing:

  1. Split traffic 50/50 between A and B
  2. Run for fixed duration
  3. Analyze results, declare winner

Problems:

  • Waste: Shows inferior variant to 50% of users
  • Fixed duration: May stop too early or too late
  • No adaptation: Can't respond to early signals

Bandit-based testing:

Replace fixed allocation with bandit algorithm:

  • Thompson Sampling allocates more traffic to better variants
  • Early stopping when confident
  • Reduced regret (fewer users see bad variants)

Multi-armed bandit approach:

\text{Regret}_{\text{A/B}} = T/2 \cdot \Delta \quad \text{(fixed allocation)}

\text{Regret}_{\text{Bandit}} = O(\ln T / \Delta) \quad \text{(Thompson Sampling)}

Caution: Bandit testing complicates statistical analysis (non-uniform allocation, adaptive stopping rules). Hybrid approaches can maintain valid inference while still adapting allocation.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    A/B TESTING vs BANDIT TESTING                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL A/B TEST:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Traffic allocation over time:                                          │
│                                                                          │
│  100% │ ████████████████████████████████████████                        │
│   50% │ ████████████████████  Variant A (50%)                           │
│       │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  Variant B (50%)                           │
│    0% │──────────────────────────────────────── Time                    │
│       Start                                  End                         │
│                                                                          │
│  If B is worse: 50% of users get bad experience throughout              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BANDIT-BASED TEST (Thompson Sampling):                                  │
│  ──────────────────────────────────────                                  │
│                                                                          │
│  Traffic allocation over time:                                          │
│                                                                          │
│  100% │ ████████████████████████████████████████                        │
│       │ ███████████████████████████████████████  A dominates            │
│   50% │ ████████████████████                                            │
│       │ ▓▓▓▓▓▓▓▓▓▓                              B fades out             │
│    0% │──────────────────────────────────────── Time                    │
│       Start      ↑                           End                         │
│                  B found inferior                                        │
│                                                                          │
│  Adapts to evidence: worse variants get less traffic over time          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  REGRET COMPARISON (Δ = 0.05 difference, T = 10,000 users):             │
│                                                                          │
│  A/B: Regret = 0.5 × 0.05 × 10,000 = 250                               │
│  Bandit: Regret ≈ 20-50 (logarithmic in T)                             │
│                                                                          │
│  5-10x improvement!                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
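The adaptive allocation shown above can be simulated with Beta-Bernoulli Thompson Sampling. This is a toy sketch (the rates, horizon, and seed are illustrative):

```python
import random

def thompson_ab_test(true_rates, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling over two variants (toy simulation):
    sample a conversion rate per variant from its posterior, show the larger."""
    rng = random.Random(seed)
    successes = [1, 1]  # Beta(1, 1) priors
    failures = [1, 1]
    pulls = [0, 0]
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(2)]
        arm = 0 if samples[0] >= samples[1] else 1
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_ab_test(true_rates=[0.05, 0.10], horizon=5000)
print(pulls)  # the better variant (index 1) receives most of the traffic
```

Running this, the inferior variant's share of traffic fades as its posterior concentrates below the winner's, which is exactly the behavior the diagram depicts.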

Recommendation Systems

Cold-start problem: New items/users have no interaction data.

Bandits provide principled exploration:

  • New items get exploration bonus (uncertainty)
  • System learns preferences while maintaining engagement

Explore-exploit in recommendations:

\text{Score}(u, i) = \text{Predicted\_Rating}(u, i) + \alpha \cdot \text{Uncertainty}(u, i)
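This scoring rule can be sketched in a few lines (the item names and numbers are made up for illustration):

```python
def recommend(predicted, uncertainty, alpha=1.0):
    """Optimistic scoring (sketch): rank items by predicted rating plus an
    exploration bonus; both dicts map item id -> float."""
    scores = {i: predicted[i] + alpha * uncertainty[i] for i in predicted}
    return max(scores, key=scores.get)

predicted = {"known_hit": 4.2, "new_item": 3.8}
uncertainty = {"known_hit": 0.1, "new_item": 0.9}
print(recommend(predicted, uncertainty, alpha=1.0))  # → new_item (4.7 vs 4.3)
print(recommend(predicted, uncertainty, alpha=0.0))  # → known_hit (pure exploitation)
```

The bonus lets a promising but uncertain new item win the slot; with α = 0 the system always exploits.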

Cascade bandits for ranked lists:

The user examines items from the top until satisfied. Model this as a bandit over rankings.

Conversational recommendations:

Each question is an arm. Bandit selects questions to efficiently narrow preferences.


Clinical Trials

Adaptive clinical trials: Allocate more patients to more promising treatments.

Ethical motivation: Minimize patients receiving inferior treatment.

Response-adaptive randomization:

P(\text{assign to treatment } a) \propto \text{Success\_Rate}_a + \text{Exploration\_Bonus}_a

Challenges:

  • Regulatory requirements for valid inference
  • Delayed outcomes (treatment effects take time)
  • Non-stationarity (patient population changes)

Gittins index (optimal for discounted infinite-horizon):

Compute an index \nu_a(s) for each arm's state s, and pull the arm with the highest index:

\nu_a(s) = \sup_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t R_t \mid S_0 = s\right]}{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t \mid S_0 = s\right]}


Online Advertising

Ad creative selection (covered in our advertising post):

  • Arms = ad creative variants
  • Context = user features
  • Reward = click (or conversion)

Bid optimization:

  • Arms = bid levels
  • Reward = profit = value - cost
  • Challenge: cost depends on auction outcome

Budget-constrained bandits:

Standard bandits assume unlimited pulls. With a budget B and a cost c_t per pull:

\sum_{t=1}^{T} c_t \leq B

Knapsack bandits: Select arms subject to budget, maximize value.


Hyperparameter Optimization

Successive Halving:

  1. Start with n random configurations
  2. Train each for some iterations
  3. Keep top half, double training iterations
  4. Repeat until one remains

Hyperband: Run Successive Halving with different initial budgets.
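The four steps above can be sketched directly (the toy objective below is illustrative; in practice `evaluate` would train a model for `budget` iterations):

```python
def successive_halving(configs, evaluate, budget=1):
    """Successive Halving (sketch): score every config on the current budget,
    keep the top half, double the budget, repeat until one config remains.
    `evaluate(config, budget)` returns a score; higher is better."""
    remaining = list(configs)
    while len(remaining) > 1:
        scores = {c: evaluate(c, budget) for c in remaining}
        remaining.sort(key=lambda c: scores[c], reverse=True)
        remaining = remaining[: max(1, len(remaining) // 2)]
        budget *= 2
    return remaining[0]

# Toy objective: measured quality approaches the config's true quality as budget grows.
best = successive_halving(
    configs=[0.2, 0.5, 0.7, 0.9],
    evaluate=lambda c, b: c * (1 - 1.0 / (b + 1)),
)
print(best)  # → 0.9
```

Halving the pool while doubling the budget keeps total compute roughly constant per round, which is what makes the bracket structure of Hyperband cheap to run.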

Bayesian Optimization (related to bandits):

Model the hyperparameter → performance mapping as a Gaussian Process, then pick the next configuration via an acquisition function such as Expected Improvement:

\text{EI}(x) = \mathbb{E}[\max(0, f(x) - f^*)]

or UCB:

\text{UCB}(x) = \mu(x) + \beta \sigma(x)


Resource Allocation

Dynamic pricing: Set prices to maximize revenue while learning demand.

  • Arms = price levels
  • Reward = revenue = price × demand(price)
  • Challenge: demand function unknown

Network routing: Route traffic to minimize congestion while learning link costs.

Crowdsourcing: Assign tasks to workers while learning worker quality.


Offline Policy Evaluation

Before deploying a new bandit policy, we often want to evaluate it using historical data collected by a different policy. This is offline policy evaluation (OPE).

Setting:

  • Historical data: \{(x_t, A_t, R_t)\}_{t=1}^{n} collected by a logging policy \pi_0
  • Target policy: \pi (the policy we want to evaluate)
  • Goal: Estimate V(\pi) = \mathbb{E}_{A \sim \pi(x)}[R(x, A)]

Challenge: We only observe rewards for the actions taken by \pi_0, not \pi.

Inverse Propensity Scoring (IPS)

Key idea: Reweight samples to correct for distribution mismatch.

IPS estimator:

\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} R_t

Intuition: If \pi would have taken action A_t twice as often as \pi_0, count that sample twice.

Properties:

  • Unbiased: \mathbb{E}[\hat{V}_{\text{IPS}}] = V(\pi)
  • High variance: When \pi and \pi_0 differ significantly, importance weights can be huge

Variance reduction: Clip importance weights or use self-normalized estimator:

\hat{V}_{\text{SNIPS}}(\pi) = \frac{\sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} R_t}{\sum_{t=1}^{n} \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)}}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INVERSE PROPENSITY SCORING                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Logging policy π₀: Shows Ad A 80%, Ad B 20%                            │
│  Target policy π:   Shows Ad A 30%, Ad B 70%                            │
│                                                                          │
│  Historical data (1000 samples):                                        │
│  - Ad A shown 800 times, avg reward 0.05                                │
│  - Ad B shown 200 times, avg reward 0.08                                │
│                                                                          │
│  NAIVE ESTIMATE of π:                                                    │
│  ────────────────────                                                    │
│  0.3 × 0.05 + 0.7 × 0.08 = 0.071                                        │
│  (Wrong! Uses biased sample means)                                      │
│                                                                          │
│  IPS ESTIMATE:                                                           │
│  ─────────────                                                           │
│  For Ad A samples: weight = 0.3/0.8 = 0.375                             │
│  For Ad B samples: weight = 0.7/0.2 = 3.5                               │
│                                                                          │
│  Weighted average accounts for sampling bias                            │
│                                                                          │
│  PROBLEM: Ad B weight is 3.5 → high variance!                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
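The worked example in the box can be reproduced in a few lines. This sketch is context-free for brevity (in practice the weights are π(a|x)/π₀(a|x)); all names are illustrative:

```python
def ips_estimate(logs, target_prob):
    """IPS and self-normalized IPS (sketch). `logs` holds
    (action, reward, logging_prob) tuples; `target_prob(a)` gives π(a)."""
    weighted = [(target_prob(a) / p0, r) for a, r, p0 in logs]
    ips = sum(w * r for w, r in weighted) / len(logs)
    snips = sum(w * r for w, r in weighted) / sum(w for w, _ in weighted)
    return ips, snips

# Reproduces the box: π₀ shows A 80% / B 20%, target π shows A 30% / B 70%.
logs = [("A", 0.05, 0.8)] * 800 + [("B", 0.08, 0.2)] * 200
ips, snips = ips_estimate(logs, {"A": 0.3, "B": 0.7}.get)
print(round(ips, 4), round(snips, 4))  # → 0.071 0.071
```

In this deterministic toy the weights sum exactly to n, so IPS and SNIPS coincide; with noisy per-sample rewards the self-normalized version trades a small bias for lower variance.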

Direct Method (DM)

Key idea: Build a reward model, use it to predict counterfactual outcomes.

Direct Method estimator:

\hat{V}_{\text{DM}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \sum_a \pi(a | x_t) \hat{r}(x_t, a)

where \hat{r}(x, a) is a learned reward model.

Properties:

  • Low variance: No importance weighting
  • Biased: If reward model is wrong, estimate is wrong

Tradeoff: DM has low variance but potential bias; IPS has no bias but high variance.


Doubly Robust (DR) Estimator

Key idea: Combine IPS and DM to get the best of both worlds.

DR estimator:

\hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \left[\sum_a \pi(a | x_t) \hat{r}(x_t, a) + \frac{\pi(A_t | x_t)}{\pi_0(A_t | x_t)} (R_t - \hat{r}(x_t, A_t))\right]

Decomposition:

  • First term: Direct method estimate
  • Second term: IPS correction for reward model errors

Properties:

  • Doubly robust: Unbiased if either \hat{r} is correct or \pi_0 is known
  • Lower variance than IPS: When \hat{r} is good, corrections are small
  • Asymptotically efficient: Achieves optimal variance under regularity conditions

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    OFFLINE POLICY EVALUATION COMPARISON                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  │ Method          │ Bias      │ Variance  │ Requirement            │  │
│  ├─────────────────┼───────────┼───────────┼────────────────────────┤  │
│  │ Direct Method   │ Possible  │ Low       │ Good reward model      │  │
│  │ IPS             │ None      │ High      │ Known π₀               │  │
│  │ Doubly Robust   │ Reduced   │ Medium    │ Either works           │  │
│                                                                          │
│  PRACTICAL GUIDANCE:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. If π and π₀ are similar → IPS works well                           │
│  2. If you have a good reward model → DM may suffice                   │
│  3. Generally → Use DR for robustness                                  │
│  4. Always → Log propensities π₀(a|x) for future evaluation!           │
│                                                                          │
│  CRITICAL: Design logging policies with evaluation in mind              │
│  - Ensure π₀(a|x) > 0 for all actions π might take                     │
│  - Some exploration in logging enables better evaluation                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
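A sketch of the DR estimator, combining the direct-method term with the IPS residual correction. This toy is context-free and all names are illustrative; with a perfect reward model the correction vanishes and DR reduces to the direct method:

```python
def dr_estimate(logs, target_probs, reward_model):
    """Doubly robust estimator (sketch). `logs`: (x, a, r, p0) tuples;
    `target_probs(x)`: dict of π(a|x); `reward_model(x, a)`: learned r̂(x, a)."""
    total = 0.0
    for x, a, r, p0 in logs:
        pi = target_probs(x)
        dm_term = sum(pi[b] * reward_model(x, b) for b in pi)  # direct-method term
        correction = pi[a] / p0 * (r - reward_model(x, a))     # IPS residual term
        total += dm_term + correction
    return total / len(logs)

# Context-free toy matching the IPS example: a perfect reward model
# makes every correction zero, so the estimate equals the DM value.
logs = [(None, "A", 0.05, 0.8)] * 800 + [(None, "B", 0.08, 0.2)] * 200
est = dr_estimate(logs,
                  lambda x: {"A": 0.3, "B": 0.7},
                  lambda x, a: {"A": 0.05, "B": 0.08}[a])
print(round(est, 4))  # → 0.071
```

When the reward model is imperfect, the correction term re-injects the logged rewards with importance weights, which is what restores unbiasedness.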

Offline Policy Learning

Beyond evaluation, we can learn policies from logged data.

Counterfactual Risk Minimization (CRM):

\hat{\pi} = \arg\min_{\pi} -\hat{V}_{\text{IPS}}(\pi) + \lambda \cdot \text{complexity}(\pi)

Challenges:

  • Optimizing over policies is harder than evaluating one
  • Propensity clipping needed for stability
  • Sample complexity depends on logging policy overlap

Practical approach: Use DR estimator with variance regularization:

\hat{\pi} = \arg\min_{\pi} -\hat{V}_{\text{DR}}(\pi) + \lambda \cdot \text{Var}(\hat{V}_{\text{DR}}(\pi))


Part VI: Theoretical Foundations

Concentration Inequalities

Hoeffding's inequality (bounded random variables):

For X_1, \ldots, X_n independent with X_i \in [a_i, b_i] and \bar{X} = \frac{1}{n}\sum_i X_i:

P\left(\bar{X} - \mathbb{E}[\bar{X}] \geq \epsilon\right) \leq \exp\left(-\frac{2n^2\epsilon^2}{\sum_i (b_i - a_i)^2}\right)

For X_i \in [0, 1]:

P\left(\bar{X} - \mu \geq \epsilon\right) \leq \exp(-2n\epsilon^2)
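A quick Monte Carlo sanity check of the [0, 1] case (the parameters n = 100, ε = 0.1, and the Bernoulli(0.5) distribution are illustrative choices):

```python
import math
import random

def upper_tail_rate(n, epsilon, trials=20000, seed=0):
    """Estimate P(sample mean - mu >= epsilon) for X_i i.i.d. Bernoulli(0.5)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if mean - 0.5 >= epsilon:
            hits += 1
    return hits / trials

rate = upper_tail_rate(n=100, epsilon=0.1)
bound = math.exp(-2 * 100 * 0.1 ** 2)  # Hoeffding bound exp(-2nε²) ≈ 0.135
print(rate <= bound)  # → True: the empirical tail respects the bound
```

The empirical tail probability comes out well below the bound, as expected: Hoeffding is distribution-free and therefore conservative for any specific distribution.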

Chernoff bound (Bernoulli random variables):

For X_1, \ldots, X_n i.i.d. Bernoulli(p):

P\left(\bar{X} \geq p + \epsilon\right) \leq \exp(-n \cdot \text{KL}(p + \epsilon, p))

Sub-Gaussian concentration:

If X is \sigma-sub-Gaussian (a class that includes bounded and Gaussian random variables):

P(X - \mathbb{E}[X] \geq t) \leq \exp\left(-\frac{t^2}{2\sigma^2}\right)

For the sample mean of n i.i.d. \sigma-sub-Gaussian variables:

P\left(\bar{X} - \mu \geq t\right) \leq \exp\left(-\frac{nt^2}{2\sigma^2}\right)


Martingale Methods

Azuma-Hoeffding inequality:

For a martingale (M_t) with bounded increments |M_t - M_{t-1}| \leq c_t:

P(M_T - M_0 \geq \epsilon) \leq \exp\left(-\frac{\epsilon^2}{2\sum_t c_t^2}\right)

Application to bandits: Cumulative reward is a martingale (conditioned on arm selections).


Information-Theoretic Lower Bounds

Change-of-measure argument:

For two bandit instances \nu and \nu' differing only in arm a, and any [0, 1]-valued statistic S of the observations:

\sum_{t=1}^{T} \mathbb{E}_\nu[\mathbb{I}[A_t = a]] \cdot \text{KL}(\nu_a, \nu'_a) \geq \text{kl}(\mathbb{E}_\nu[S], \mathbb{E}_{\nu'}[S])

where \text{kl}(p, q) = p \ln(p/q) + (1-p)\ln((1-p)/(1-q)) is the binary KL divergence.

Lai-Robbins bound derivation:

Choosing \nu' to make arm a optimal and applying the inequality above gives:

\mathbb{E}[N_a(T)] \geq \frac{\ln T}{\text{KL}(\nu_a, \nu^*)} (1 - o(1))

Multiplying by Δa\Delta_a and summing gives the regret lower bound.


Finite-Time Analysis

High-probability regret bounds:

With probability at least 1 - \delta:

\text{Regret}(T) \leq O\left(\sqrt{KT \ln(KT/\delta)}\right)

Anytime algorithms: Regret bounds hold for all T simultaneously (not just for a pre-specified T).

Instance-dependent vs. worst-case bounds:

  • Instance-dependent: O\left(\sum_a \frac{\ln T}{\Delta_a}\right) — better when gaps are large
  • Worst-case: O(\sqrt{KT}) — better when gaps are small

Algorithms like MOSS and UCB-V achieve both simultaneously.


Part VII: Connections to Other Fields

Reinforcement Learning

Bandits are one-step RL: a single state, one action chosen per round, immediate reward.

Relationship:

  │ Aspect      │ Bandits   │ Full RL                │
  ├─────────────┼───────────┼────────────────────────┤
  │ States      │ 1         │ Many                   │
  │ Actions     │ K arms    │ Action space per state │
  │ Horizon     │ T steps   │ T or infinite          │
  │ Transitions │ None      │ P(s' | s, a)           │
  │ Reward      │ R(a)      │ R(s, a)                │

Bandits in RL:

  • Exploration in Q-learning uses bandit ideas
  • UCB for tree search (UCT in Monte Carlo Tree Search)
  • Thompson Sampling for posterior-based exploration

Bayesian Optimization

Global optimization of expensive black-box functions:

x^* = \arg\max_{x \in \mathcal{X}} f(x)

where f is expensive to evaluate.

Connection to bandits:

  • Arms = points in \mathcal{X}
  • Infinite arms (continuous space)
  • A Gaussian Process models f
  • Acquisition functions ≈ UCB

Acquisition functions:

  • UCB: \alpha(x) = \mu(x) + \beta \sigma(x)
  • Expected Improvement: \alpha(x) = \mathbb{E}[\max(0, f(x) - f^*)]
  • Thompson Sampling: Sample \tilde{f} \sim GP, then optimize \tilde{f}

Online Learning

Prediction with expert advice:

  • N experts give predictions each round
  • Learner combines predictions
  • Regret against best expert

Hedge algorithm (full information: like EXP3, but the learner observes every expert's loss):

w_{t+1,i} = w_{t,i} \exp(-\eta \ell_{t,i})

Regret: O(\sqrt{T \ln N})
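A minimal Hedge sketch (η and the loss sequence below are illustrative):

```python
import math

def hedge(loss_matrix, eta=0.5):
    """Hedge / exponential weights (sketch): each round, play the weighted
    mixture of experts, then multiply each weight by exp(-eta * loss)."""
    n_experts = len(loss_matrix[0])
    weights = [1.0] * n_experts
    cum_loss = 0.0
    for losses in loss_matrix:
        total = sum(weights)
        probs = [w / total for w in weights]
        cum_loss += sum(p * l for p, l in zip(probs, losses))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return cum_loss, probs

# Toy run: expert 0 always incurs loss 0, expert 1 always loss 1.
cum_loss, probs = hedge([[0.0, 1.0]] * 20)
print(probs[0] > 0.99)  # → True: mass concentrates on the better expert
```

Because every expert's loss is observed, no explicit exploration term is needed; EXP3 adds importance weighting precisely to cope with seeing only the played arm's loss.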

Online convex optimization:

  • Actions = points in convex set
  • Loss = convex function (revealed after action)
  • Gradient descent achieves O(\sqrt{T}) regret

Bandits = online learning with bandit feedback (only see own loss).


Game Theory

Regret minimization ↔ Nash equilibrium:

In repeated games, if all players use no-regret algorithms, play converges to Nash equilibrium.

Fictitious play: Best response to empirical opponent distribution (similar to greedy).

Multi-agent bandits: Multiple agents compete for same arms—game-theoretic considerations arise.


Summary and Key Takeaways

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-ARMED BANDIT SUMMARY                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM:                                                                │
│  ────────                                                                │
│  Balance exploration (learning) with exploitation (earning)             │
│                                                                          │
│  FUNDAMENTAL LIMITS:                                                     │
│  ───────────────────                                                     │
│  • Lai-Robbins: Regret ≥ Ω(∑ ln T / KL(νₐ, ν*))                        │
│  • Minimax: Regret ≥ Ω(√(KT))                                          │
│  • Bayesian vs Frequentist regret: different guarantees                │
│                                                                          │
│  KEY ALGORITHMS:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  │ Algorithm         │ Regret         │ Key Idea              │         │
│  ├───────────────────┼────────────────┼───────────────────────┤         │
│  │ ε-greedy (fixed)  │ O(εT)          │ Random exploration    │         │
│  │ ε-greedy (decay)  │ O(K ln T / Δ)  │ Decreasing ε          │         │
│  │ UCB1              │ O(K ln T / Δ)  │ Optimism under uncert │         │
│  │ KL-UCB            │ Optimal        │ KL-based confidence   │         │
│  │ Thompson Sampling │ O(K ln T / Δ)  │ Posterior sampling    │         │
│  │ IDS               │ O(√(Ψ·H·T))   │ Info-regret tradeoff  │         │
│  │ EXP3 (adversarial)│ O(√(KT ln K)) │ Exponential weights   │         │
│                                                                          │
│  CONTEXTUAL BANDITS:                                                     │
│  ───────────────────                                                     │
│  • LinUCB: Linear rewards, O(d√(KT)) regret                            │
│  • Neural bandits: Complex contexts, use NNs for representation         │
│                                                                          │
│  BANDIT VARIANTS:                                                        │
│  ────────────────                                                        │
│  • Restless: Arms evolve when not pulled                                │
│  • Dueling: Pairwise comparison feedback                                │
│  • Sleeping/Mortal: Arms not always available                           │
│  • Constrained: Knapsacks, safety, fairness                             │
│                                                                          │
│  APPLICATIONS:                                                           │
│  ─────────────                                                           │
│  • A/B testing (adaptive experiments)                                   │
│  • Recommendations (explore-exploit)                                    │
│  • Clinical trials (ethical allocation)                                 │
│  • Ad selection (CTR optimization)                                      │
│  • Hyperparameter tuning (Hyperband)                                    │
│                                                                          │
│  OFFLINE EVALUATION:                                                     │
│  ───────────────────                                                     │
│  • IPS: Unbiased but high variance                                      │
│  • Direct Method: Low variance but biased                               │
│  • Doubly Robust: Best of both worlds                                   │
│                                                                          │
│  PRACTICAL ADVICE:                                                       │
│  ────────────────                                                        │
│  1. Start with Thompson Sampling (simple, effective)                    │
│  2. Use LinUCB for linear contextual problems                           │
│  3. Consider IDS for structured/many-armed problems                     │
│  4. Use DR estimator for offline evaluation                             │
│  5. Log propensities for future policy evaluation                       │
│  6. Consider constraints: safety, fairness, budgets                     │
│  7. Monitor for non-stationarity                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘




Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
