Multi-Armed Bandits: Theory, Algorithms, and Applications
Comprehensive guide to multi-armed bandits from theory to practice. From regret bounds and UCB algorithms to Thompson Sampling, contextual bandits, and real-world applications in A/B testing, recommendations, and beyond.
The Exploration-Exploitation Dilemma
Imagine you're at a casino with multiple slot machines (one-armed bandits). Each machine has an unknown payout probability. You want to maximize your total winnings over many plays. Should you keep playing the machine that's paid well so far (exploit), or try other machines that might be even better (explore)?
This is the exploration-exploitation dilemma—one of the most fundamental problems in decision-making under uncertainty. It appears everywhere:
- A/B testing: Keep showing the winning variant, or test new ones?
- Recommendations: Show items users will like, or discover new preferences?
- Clinical trials: Give patients the best-known treatment, or test alternatives?
- Ad selection: Display high-CTR ads, or explore new creatives?
- Hyperparameter tuning: Exploit promising configurations, or explore new regions?
Multi-armed bandits (MABs) provide a principled framework for balancing exploration and exploitation.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE EXPLORATION-EXPLOITATION TRADEOFF │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PURE EXPLOITATION: │
│ ────────────────── │
│ Always play the arm with highest observed reward │
│ │
│ Problem: May get stuck on suboptimal arm │
│ │
│ Example: Arm A paid $1 on first pull, Arm B paid $0 │
│ → Always pull A, never discover B actually has 90% win rate │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PURE EXPLORATION: │
│ ───────────────── │
│ Pull each arm equally regardless of rewards │
│ │
│ Problem: Wastes pulls on clearly inferior arms │
│ │
│ Example: After 1000 pulls, Arm A averages $0.90, Arm B averages $0.10 │
│ → Still pulling B 50% of the time, losing expected value │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTIMAL STRATEGY: │
│ ──────────────── │
│ Explore enough to identify the best arm with high confidence │
│ Then exploit that arm while occasionally checking others │
│ │
│ The balance depends on: │
│ • Time horizon (more time → more exploration affordable) │
│ • Uncertainty (more uncertain → more exploration needed) │
│ • Arm differences (similar arms → more exploration to distinguish) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part I: Foundations and Problem Formulation
The Stochastic Multi-Armed Bandit Problem
Setting: You have K arms (actions). At each time step t = 1, 2, …, T:
- You select an arm A_t ∈ {1, …, K}
- You receive a reward R_t ~ ν_{A_t} drawn from the selected arm's distribution
Arm reward distributions: Each arm a has an unknown reward distribution ν_a with mean μ_a:
  μ_a = E[R | A = a]
Optimal arm: The arm with highest expected reward:
  a* = argmax_a μ_a,  μ* = max_a μ_a
Goal: Maximize cumulative reward over T rounds:
  E[ Σ_{t=1}^T R_t ]
Regret: The Performance Metric
Instead of maximizing reward directly, we typically minimize regret—the difference between what we earned and what we could have earned by always playing the optimal arm.
Cumulative regret:
  Regret(T) = T·μ* − E[ Σ_{t=1}^T R_t ]
Interpretation: If the best arm has mean reward μ* and we played it every round for T rounds, we'd expect total reward T·μ*. If our algorithm achieved expected reward S, our regret is T·μ* − S.
Gap-based regret decomposition:
Define the gap (or suboptimality) of arm a:
  Δ_a = μ* − μ_a
Then regret decomposes as:
  Regret(T) = Σ_a Δ_a · E[N_a(T)]
where N_a(T) is the number of times arm a was pulled up to time T.
Insight: To minimize regret, we need to minimize pulls of suboptimal arms, especially those with large gaps.
┌─────────────────────────────────────────────────────────────────────────┐
│ REGRET DECOMPOSITION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Example: 3 arms, T = 1000 rounds │
│ │
│ Arm 1: μ₁ = 0.8 (optimal), Δ₁ = 0 │
│ Arm 2: μ₂ = 0.6, Δ₂ = 0.2 │
│ Arm 3: μ₃ = 0.3, Δ₃ = 0.5 │
│ │
│ Suppose our algorithm pulled: │
│ N₁(T) = 900, N₂(T) = 80, N₃(T) = 20 │
│ │
│ Regret = Δ₁·N₁ + Δ₂·N₂ + Δ₃·N₃ │
│ = 0·900 + 0.2·80 + 0.5·20 │
│ = 0 + 16 + 10 │
│ = 26 │
│ │
│ REGRET SOURCES: │
│ ─────────────── │
│ • Arm 2 contributed 16 regret (moderate gap, pulled often) │
│ • Arm 3 contributed 10 regret (large gap, pulled rarely) │
│ • Arm 1 contributed 0 regret (optimal arm) │
│ │
│ Good algorithms minimize E[Nₐ(T)] for suboptimal arms │
│ while pulling them enough to confirm they're suboptimal │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Bayesian vs. Frequentist Regret
There are two fundamentally different ways to define and analyze regret:
Frequentist (Worst-Case) Regret:
The regret we've defined so far is frequentist—it measures performance for a fixed (but unknown) problem instance ν = (ν_1, …, ν_K):
  Regret(T; ν) = T·μ*(ν) − E[ Σ_{t=1}^T R_t ]
The expectation is only over the algorithm's (and the rewards') randomness. Lower bounds (like Lai-Robbins) hold for every problem instance.
Bayesian Regret:
In the Bayesian view, the problem instance itself is drawn from a prior Q:
  BayesRegret(T) = E_{ν ~ Q}[ Regret(T; ν) ]
The outer expectation is over problem instances. This measures average performance across problems.
┌─────────────────────────────────────────────────────────────────────────┐
│ FREQUENTIST vs BAYESIAN REGRET │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FREQUENTIST: │
│ ──────────── │
│ "For ANY problem instance, my algorithm has low regret" │
│ │
│ • Stronger guarantee (worst-case) │
│ • UCB analysis is frequentist │
│ • Lai-Robbins is a frequentist lower bound │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BAYESIAN: │
│ ───────── │
│ "On AVERAGE over problems from my prior, I have low regret" │
│ │
│ • Weaker guarantee (average-case) │
│ • Thompson Sampling analysis is often Bayesian │
│ • Can be much easier to prove │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RELATIONSHIP: │
│ ───────────── │
│ │
│ Low Bayesian regret does NOT imply low frequentist regret! │
│ │
│ Example: An algorithm could do well on "typical" problems │
│ but fail badly on adversarial problem instances. │
│ │
│ Thompson Sampling has both: │
│ • O(√(KT ln K)) Bayesian regret (easy to prove) │
│ • O(K ln T / Δ) frequentist regret (harder to prove) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why it matters:
- Thompson Sampling: Bayesian regret analyses (in the style of Russo & Van Roy, 2014) are short and elegant; the finite-time frequentist bounds of Agrawal & Goyal (2012) are considerably more intricate.
- Prior mismatch: If your prior doesn't match reality, Bayesian guarantees may not hold.
- Practical impact: For most applications, both perspectives give similar practical guidance.
Regret Lower Bounds
How well can any algorithm possibly do? The Lai-Robbins lower bound (1985) establishes a fundamental limit.
Theorem (Lai-Robbins): For any consistent policy and any bandit problem with arm distributions from a single-parameter exponential family, every suboptimal arm a satisfies:
  liminf_{T→∞} E[N_a(T)] / ln T ≥ 1 / KL(ν_a, ν_{a*})
where KL(ν_a, ν_{a*}) is the Kullback-Leibler divergence between arm a's distribution and the optimal arm's distribution. Consequently:
  Regret(T) ≥ (1 − o(1)) · Σ_{a: Δ_a > 0} (Δ_a / KL(ν_a, ν_{a*})) · ln T
For Bernoulli arms with means μ_a and μ*:
  KL(μ_a, μ*) = μ_a ln(μ_a/μ*) + (1 − μ_a) ln((1 − μ_a)/(1 − μ*))
Simplified form (for small gaps, using KL(μ_a, μ*) ≈ Δ_a² / (2 μ*(1 − μ*))):
  Regret(T) = Ω( Σ_{a: Δ_a > 0} ln T / Δ_a )
Key insight: Regret must grow at least logarithmically with T. We cannot do better than Ω(ln T) regret. Algorithms achieving this bound are called asymptotically optimal.
Problem Variants
Bounded rewards: R_t ∈ [0, 1] (or R_t ∈ [a, b] more generally)
Bernoulli bandits: R_t ∈ {0, 1}, R_t ~ Bernoulli(μ_a)
Gaussian bandits: R_t ~ N(μ_a, σ²)
Sub-Gaussian rewards: R − μ_a is σ-sub-Gaussian:
  E[ exp(λ(R − μ_a)) ] ≤ exp(λ²σ²/2)  for all λ ∈ ℝ
Sub-Gaussianity is the key assumption enabling concentration inequalities such as Hoeffding's.
Part II: Classical Algorithms
ε-Greedy
The simplest exploration strategy: exploit most of the time, explore uniformly at random with probability ε.
Algorithm:
At each time t:
- With probability ε: pull a uniformly random arm
- With probability 1 − ε: pull A_t = argmax_a μ̂_a(t)
where μ̂_a(t) is the empirical mean of arm a.
Regret analysis:
For fixed ε:
  E[N_a(T)] ≥ εT/K  for every arm a
This means we pull every arm linearly in T, giving:
  Regret(T) = Ω(εT)
Linear regret! The algorithm never stops exploring suboptimal arms.
Decaying ε-greedy:
To achieve sublinear regret, decay ε over time:
  ε_t = min(1, cK / (d²t))
where c > 0 is a tuning constant and d ≤ min_{a: Δ_a > 0} Δ_a is the minimum gap.
This achieves O(K ln T) regret (with gap-dependent constants), but requires knowing the minimum gap in advance—usually impractical.
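As a concrete sketch, here is fixed-ε greedy on simulated Bernoulli arms. The arm means, horizon, ε, and seed below are illustrative choices, not values from the text:

```python
import random

def epsilon_greedy(true_means, T, epsilon=0.1, seed=0):
    """Fixed-epsilon greedy on Bernoulli arms; returns pull counts per arm."""
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K
    sums = [0.0] * K
    for _ in range(T):
        if rng.random() < epsilon or 0 in counts:
            a = rng.randrange(K)                 # explore: uniform random arm
        else:
            means = [s / n for s, n in zip(sums, counts)]
            a = means.index(max(means))          # exploit: highest empirical mean
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts
```

Because ε never decays, every arm keeps collecting roughly εT/K exploratory pulls, which is exactly the source of the linear regret discussed above.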
┌─────────────────────────────────────────────────────────────────────────┐
│ ε-GREEDY ANALYSIS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FIXED ε = 0.1: │
│ ────────────── │
│ │
│ Explore probability: 10% of the time │
│ Each suboptimal arm pulled: ≥ 0.1 × T / K times │
│ │
│ For T = 10,000, K = 5 arms: │
│ Each arm pulled ≥ 200 times │
│ If Arm 5 has Δ₅ = 0.5: │
│ Regret from Arm 5 alone ≥ 0.5 × 200 = 100 │
│ │
│ Total regret grows linearly: Regret(T) = Θ(εT) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DECAYING ε_t = c/(t): │
│ ───────────────────── │
│ │
│ Early: ε large → lots of exploration │
│ Late: ε small → mostly exploitation │
│ │
│ Regret = O(K ln T) — much better! │
│ │
│ But requires tuning c based on unknown problem parameters │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Upper Confidence Bound (UCB)
UCB algorithms embody the principle of optimism in the face of uncertainty: when uncertain about an arm's value, assume the best plausible case.
Key idea: Construct confidence intervals for each arm's mean. Pull the arm with the highest upper confidence bound.
UCB1 Algorithm
Upper confidence bound for arm a at time t:
  UCB_a(t) = μ̂_a(t) + √( 2 ln t / N_a(t) )
Algorithm: At each time t, pull:
  A_t = argmax_a UCB_a(t)
Intuition:
- μ̂_a(t): Exploitation term (favor arms with high observed rewards)
- √(2 ln t / N_a(t)): Exploration bonus (favor under-explored arms)
As N_a(t) increases, the bonus shrinks, and we rely more on the empirical mean.
Why this specific form?
The exploration bonus comes from Hoeffding's inequality. For bounded rewards in [0, 1] and n i.i.d. samples:
  P( μ̂ − μ ≥ ε ) ≤ exp(−2nε²)
Setting δ = exp(−2nε²) (small probability) and solving for ε:
  ε = √( ln(1/δ) / (2n) )
Choosing δ = t⁻⁴ gives ε = √(2 ln t / n). So with high probability, the true mean μ_a is below μ̂_a(t) + √(2 ln t / N_a(t)).
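The rule is short enough to implement directly. A minimal UCB1 sketch on simulated Bernoulli arms (arm means, horizon, and seed are illustrative):

```python
import math
import random

def ucb1(true_means, T, seed=0):
    """UCB1 on Bernoulli arms: pull argmax of mu_hat + sqrt(2 ln t / N_a)."""
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                            # initialize: pull each arm once
        else:
            ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                   for i in range(K)]
            a = ucb.index(max(ucb))
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts
```

Note that the exploration bonus is deterministic: given the same history, UCB1 always pulls the same arm.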
┌─────────────────────────────────────────────────────────────────────────┐
│ UCB1 CONFIDENCE INTERVALS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Time t = 100 │
│ │
│ Arm 1: N₁ = 40, μ̂₁ = 0.70 │
│ UCB₁ = 0.70 + √(2 ln 100 / 40) = 0.70 + 0.48 = 1.18 │
│ │
│ Arm 2: N₂ = 50, μ̂₂ = 0.65 │
│ UCB₂ = 0.65 + √(2 ln 100 / 50) = 0.65 + 0.43 = 1.08 │
│ │
│ Arm 3: N₃ = 10, μ̂₃ = 0.40 │
│ UCB₃ = 0.40 + √(2 ln 100 / 10) = 0.40 + 0.96 = 1.36 ← HIGHEST! │
│ │
│ VISUALIZATION: │
│ ────────────── │
│ │
│ Arm 1: |═══════════════════[────] UCB = 1.18 │
│ ↑ │
│ μ̂ = 0.70 │
│ │
│ Arm 2: |════════════════[────] UCB = 1.08 │
│ ↑ │
│ μ̂ = 0.65 │
│ │
│ Arm 3: |═════════[────────────────] UCB = 1.36 ← SELECTED │
│ ↑ │
│ μ̂ = 0.40 │
│ │
│ Arm 3 selected despite lowest empirical mean because │
│ wide confidence interval (few samples) creates high UCB │
│ │
└─────────────────────────────────────────────────────────────────────────┘
UCB1 Regret Bound
Theorem: For UCB1 with K arms and rewards in [0, 1]:
  Regret(T) ≤ Σ_{a: Δ_a > 0} (8 ln T / Δ_a) + (1 + π²/3) Σ_a Δ_a
Asymptotic form:
  Regret(T) = O( K ln T / Δ_min )
where Δ_min = min_{a: Δ_a > 0} Δ_a.
Gap-dependent vs. gap-independent bounds:
The bound above is gap-dependent (depends on the Δ_a). A gap-independent bound:
  Regret(T) = O( √(KT ln T) )
This is useful when gaps are small or unknown.
Proof sketch (why UCB1 achieves logarithmic regret):
For a suboptimal arm a to be pulled at time t, its UCB must exceed the optimal arm's UCB:
  μ̂_a(t) + √(2 ln t / N_a(t)) ≥ μ̂_{a*}(t) + √(2 ln t / N_{a*}(t))
With high probability, μ̂_a(t) ≤ μ_a + √(2 ln t / N_a(t)) and μ̂_{a*}(t) + √(2 ln t / N_{a*}(t)) ≥ μ*.
For the inequality to hold when the optimal arm is well-estimated:
  μ_a + 2√(2 ln t / N_a(t)) ≥ μ*  ⟹  N_a(t) ≤ 8 ln t / Δ_a²
So arm a is pulled at most O(ln T / Δ_a²) times, giving regret contribution Δ_a · O(ln T / Δ_a²) = O(ln T / Δ_a).
UCB Variants
UCB2: Uses epochs of increasing length to reduce computation.
UCB-V (Variance-aware UCB):
If rewards have variance σ_a², tighter bounds are possible:
  UCB-V_a(t) = μ̂_a(t) + √( 2 σ̂_a²(t) ln t / N_a(t) ) + 3 ln t / N_a(t)
where σ̂_a²(t) is the empirical variance of arm a.
KL-UCB: Uses the KL-divergence for tighter bounds on Bernoulli rewards:
  UCB_a(t) = max{ q ∈ [0, 1] : N_a(t) · KL(μ̂_a(t), q) ≤ ln t + c ln ln t }
KL-UCB is asymptotically optimal (matches Lai-Robbins lower bound).
MOSS (Minimax Optimal Strategy in the Stochastic case):
Achieves the minimax-optimal O(√(KT)) regret, removing the √(ln T) factor from UCB1's gap-independent bound.
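The KL-UCB index has no closed form, but since KL(μ̂, q) is increasing in q for q ≥ μ̂, it can be computed by bisection. A sketch (the constant c = 3 and the iteration count are illustrative choices):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, c=3.0):
    """Largest q in [mu_hat, 1] with n * KL(mu_hat, q) <= ln t + c ln ln t."""
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0 + 1e-9))
    lo, hi = mu_hat, 1.0
    for _ in range(50):                  # bisection: KL grows as q moves up
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the index tightens toward μ̂ as the pull count n grows, mirroring the shrinking Hoeffding bonus in UCB1 but adapted to the Bernoulli geometry.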
Thompson Sampling
Thompson Sampling (TS) is a Bayesian approach: maintain a posterior distribution over each arm's mean, sample from posteriors, and play the arm with highest sample.
Algorithm:
- Prior: Start with a prior π_a over each arm's mean (e.g., Beta(1, 1) for Bernoulli)
- At each time t:
- Sample θ̃_a from the posterior of each arm a
- Pull A_t = argmax_a θ̃_a
- Update: Update the posterior of arm A_t with the observed reward R_t
For Bernoulli bandits with Beta prior:
Prior: Beta(α_a, β_a), typically α_a = β_a = 1 (uniform)
After observing S_a successes and F_a failures on arm a:
  Posterior: Beta(α_a + S_a, β_a + F_a)
Algorithm for Bernoulli Thompson Sampling:
- Initialize: α_a = β_a = 1 for all arms
- For t = 1, …, T:
- Sample θ̃_a ~ Beta(α_a, β_a) for each arm
- Pull A_t = argmax_a θ̃_a
- Observe reward R_t ∈ {0, 1}
- Update: α_{A_t} ← α_{A_t} + R_t,  β_{A_t} ← β_{A_t} + (1 − R_t)
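The update rules above translate directly into code. A minimal Beta-Bernoulli Thompson Sampling sketch (arm means, horizon, and seed are illustrative):

```python
import random

def thompson_bernoulli(true_means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling from a uniform Beta(1, 1) prior."""
    rng = random.Random(seed)
    K = len(true_means)
    alpha = [1.0] * K            # 1 + observed successes per arm
    beta = [1.0] * K             # 1 + observed failures per arm
    pulls = [0] * K
    for _ in range(T):
        # Sample one plausible mean per arm from its posterior; play the argmax.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = samples.index(max(samples))
        r = 1 if rng.random() < true_means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls
```

Unlike UCB1, the arm choice is randomized: exploration happens exactly when a wide posterior produces an unusually high sample.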
┌─────────────────────────────────────────────────────────────────────────┐
│ THOMPSON SAMPLING INTUITION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Arm 1: 8 successes, 2 failures → Beta(9, 3) │
│ Arm 2: 3 successes, 7 failures → Beta(4, 8) │
│ Arm 3: 1 success, 1 failure → Beta(2, 2) │
│ │
│ POSTERIOR DISTRIBUTIONS: │
│ ───────────────────────── │
│ │
│ P(μ) │
│ │ Arm 2 Arm 1 │
│ │ ╱╲ ╱╲ │
│ │ ╱ ╲ Arm 3 ╱ ╲ │
│ │ ╱ ╲ ╱╲ ╱ ╲ │
│ │ ╱ ╲ ╱ ╲ ╱ ╲ │
│ │ ╱ ╲ ╱ ╲ ╱ ╲ │
│ │ ╱ ╲ ╱ ╲ ╱ ╲ │
│ │─────────────────────────────────────────── μ │
│ 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1.0 │
│ │
│ SAMPLING PROCESS: │
│ ───────────────── │
│ │
│ Round 1: Sample μ̃₁ = 0.72, μ̃₂ = 0.28, μ̃₃ = 0.55 → Pull Arm 1 │
│ Round 2: Sample μ̃₁ = 0.68, μ̃₂ = 0.35, μ̃₃ = 0.71 → Pull Arm 3 │
│ Round 3: Sample μ̃₁ = 0.81, μ̃₂ = 0.22, μ̃₃ = 0.48 → Pull Arm 1 │
│ │
│ KEY INSIGHT: │
│ ─────────── │
│ • Arm 1: Narrow posterior (many samples) → samples cluster near 0.75 │
│ • Arm 3: Wide posterior (few samples) → samples vary widely │
│ • Wide posteriors mean occasional high samples → exploration! │
│ │
│ Thompson Sampling naturally balances exploration (wide posteriors) │
│ and exploitation (high posterior means) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
For Gaussian bandits with known variance σ²:
Prior: μ_a ~ N(μ_0, σ_0²)
Posterior after n observations of arm a with sample mean μ̂_a:
  μ_a | data ~ N( (μ_0/σ_0² + n μ̂_a/σ²) / (1/σ_0² + n/σ²),  1 / (1/σ_0² + n/σ²) )
With an uninformative prior (σ_0² → ∞):
  μ_a | data ~ N( μ̂_a, σ²/n )
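The conjugate update can be checked numerically. A small helper implementing the posterior formula above (the argument values in the usage are illustrative):

```python
def gaussian_posterior(mu0, var0, sample_mean, n, var):
    """Conjugate Normal update for one arm with known reward variance `var`.

    Prior N(mu0, var0); after n observations with the given sample mean,
    returns the posterior mean and posterior variance.
    """
    precision = 1.0 / var0 + n / var        # precisions add up
    post_mean = (mu0 / var0 + n * sample_mean / var) / precision
    return post_mean, 1.0 / precision
```

With a nearly flat prior (huge var0), the posterior collapses to N(μ̂, σ²/n), matching the uninformative-prior case above.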
Thompson Sampling Regret Bound
Theorem (Agrawal & Goyal, 2012): For Thompson Sampling with Bernoulli rewards and any ε > 0:
  Regret(T) ≤ (1 + ε) Σ_{a: Δ_a > 0} ( Δ_a ln T / KL(μ_a, μ*) ) + O(K/ε²)
Thompson Sampling achieves the same asymptotic regret as UCB and matches the Lai-Robbins lower bound.
Advantages of Thompson Sampling:
- Simple and intuitive
- Naturally handles prior information
- Often outperforms UCB in practice (better constants)
- Extends easily to complex settings (contextual, combinatorial)
Disadvantages:
- Requires specifying a prior
- Computationally expensive for complex posteriors
- Harder to analyze theoretically
Comparison: UCB vs. Thompson Sampling
┌─────────────────────────────────────────────────────────────────────────┐
│ UCB vs THOMPSON SAMPLING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ │ Aspect │ UCB │ Thompson Sampling │ │
│ ├─────────────────────┼──────────────────────┼───────────────────┤ │
│ │ Philosophy │ Frequentist │ Bayesian │ │
│ │ Exploration │ Deterministic │ Randomized │ │
│ │ Computation │ Simple (max UCB) │ Sampling required │ │
│ │ Prior knowledge │ Hard to incorporate │ Natural via prior │ │
│ │ Regret bound │ O(K ln T / Δ) │ O(K ln T / Δ) │ │
│ │ Practical perf. │ Good │ Often better │ │
│ │ Analysis difficulty │ Easier │ Harder │ │
│ │ Delayed feedback │ Straightforward │ Straightforward │ │
│ │ Batched decisions │ Needs modification │ Natural │ │
│ │
│ EMPIRICAL COMPARISON (typical): │
│ ─────────────────────────────── │
│ │
│ Regret │
│ │ │
│ │ ε-greedy (linear) │
│ │ ╱ │
│ │ ╱ │
│ │ ╱ UCB │
│ │╱ ╱ │
│ │ ╱ Thompson Sampling │
│ │ ╱ ╱ │
│ │ ╱ ╱ │
│ │ ╱ ╱ │
│ │──╱─╱─────────────────────────────────── Time │
│ │
│ Thompson Sampling often has better constants (lower regret) │
│ despite same asymptotic rate as UCB │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Information-Directed Sampling (IDS)
While UCB and Thompson Sampling are the most common approaches, Information-Directed Sampling offers a principled alternative that directly optimizes the exploration-exploitation tradeoff.
Key idea: At each round, choose the action that achieves the best tradeoff between expected regret and information gained about the optimal action.
Information ratio:
  Ψ_t(π) = Δ_t(π)² / g_t(π)
where:
- Δ_t(π) = E_π[ μ* − μ_{A_t} | H_t ] = expected instantaneous regret
- g_t(π) = I_t( A*; (A_t, R_t) ) = mutual information between the optimal arm A* and the next observation
- H_t = history up to time t
IDS Algorithm:
At each time t, select the distribution π_t over arms minimizing the information ratio:
  π_t = argmin_π Δ_t(π)² / g_t(π)
Then sample A_t ~ π_t.
Intuition:
- Numerator: Squared expected regret (want low)
- Denominator: Information gain (want high)
- Minimizing the ratio balances both objectives
┌─────────────────────────────────────────────────────────────────────────┐
│ IDS vs UCB vs THOMPSON SAMPLING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ UCB: │
│ ──── │
│ "Be optimistic—assume uncertain arms are good" │
│ Exploration is implicit via confidence bounds │
│ │
│ THOMPSON SAMPLING: │
│ ───────────────── │
│ "Sample from beliefs, act on sample" │
│ Exploration via posterior uncertainty │
│ │
│ IDS: │
│ ──── │
│ "Explicitly optimize regret vs. information tradeoff" │
│ Directly measures value of information │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHEN IDS HELPS: │
│ ─────────────── │
│ │
│ • Many similar arms: IDS identifies which comparisons are informative │
│ • Structured problems: IDS exploits information structure │
│ • Sparse rewards: IDS values rare informative observations │
│ │
│ Example: 100 arms, 99 have mean 0.5, one has mean 0.6 │
│ - UCB/TS: Explore all arms roughly equally initially │
│ - IDS: Quickly identifies that most arms are similar, │
│ focuses on finding the unique good arm │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Regret bound:
For K-armed bandits with bounded information ratio Ψ_t ≤ Ψ̄:
  BayesRegret(T) ≤ √( Ψ̄ · H(A*) · T )
where H(A*) is the entropy of the optimal arm under the prior.
Key insight: Ψ̄ ≤ K/2 always, but for structured problems Ψ̄ can be much smaller, leading to better regret than UCB/TS.
Practical considerations:
- Computing mutual information can be expensive
- Approximations needed for complex posteriors
- Most beneficial for structured or many-armed problems
Part III: Contextual Bandits
In many applications, we observe context (features) before making a decision. The optimal action depends on this context.
Problem Formulation
Setting: At each time t:
- Observe context x_t ∈ X (e.g., user features in ℝ^d)
- Select arm A_t ∈ {1, …, K}
- Receive reward R_t where E[R_t | x_t, A_t = a] = f_a(x_t)
Goal: Learn a policy π: X → {1, …, K} that maximizes:
  E[ Σ_{t=1}^T f_{π(x_t)}(x_t) ]
Regret (compared to best policy in hindsight):
  Regret(T) = E[ Σ_{t=1}^T ( max_a f_a(x_t) − f_{A_t}(x_t) ) ]
┌─────────────────────────────────────────────────────────────────────────┐
│ CONTEXTUAL BANDIT SETTING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXAMPLE: News Article Recommendation │
│ ──────────────────────────────────── │
│ │
│ Context x_t = [user_age, user_gender, time_of_day, device, ...] │
│ │
│ Arms = {Sports, Politics, Tech, Entertainment, ...} │
│ │
│ Reward = 1 if user clicks, 0 otherwise │
│ │
│ Goal: Learn which article category to show each user │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY CONTEXT MATTERS: │
│ ──────────────────── │
│ │
│ User A: young, male, evening, mobile │
│ → Sports might be best │
│ │
│ User B: older, female, morning, desktop │
│ → Politics might be best │
│ │
│ Without context: single "best" arm for everyone (suboptimal) │
│ With context: personalized arm selection (better) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPARED TO STANDARD MAB: │
│ ───────────────────────── │
│ │
│ Standard MAB: A_t → R_t │
│ Contextual MAB: (x_t, A_t) → R_t │
│ │
│ Standard MAB learns: μ_a (scalar per arm) │
│ Contextual MAB learns: f_a(x) (function per arm) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Linear Contextual Bandits (LinUCB)
Assume the expected reward is linear in the context:
  E[R_t | x_t, A_t = a] = x_t^T θ_a*
where θ_a* ∈ ℝ^d is an unknown parameter vector for arm a.
LinUCB Algorithm:
For each arm a, maintain:
- A_a = λI + Σ_{s: A_s = a} x_s x_s^T (design matrix)
- b_a = Σ_{s: A_s = a} R_s x_s (reward-weighted contexts)
Ridge regression estimate:
  θ̂_a = A_a^{-1} b_a
Confidence bound:
For any context x, with probability 1 − δ:
  | x^T θ̂_a − x^T θ_a* | ≤ α √( x^T A_a^{-1} x )
where α = O(√(ln(T/δ))) scales the confidence width.
UCB for arm a given context x_t:
  UCB_a(x_t) = x_t^T θ̂_a + α √( x_t^T A_a^{-1} x_t )
Algorithm:
- Pull A_t = argmax_a UCB_a(x_t)
- Observe R_t
- Update A_{A_t} ← A_{A_t} + x_t x_t^T and b_{A_t} ← b_{A_t} + R_t x_t
Regret bound:
  Regret(T) = Õ( d √T )
where Õ hides logarithmic factors.
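A compact sketch of disjoint LinUCB with per-arm ridge regression. The two-arm noiseless linear environment in the usage, and the choices α = 1 and λ = 1, are illustrative assumptions:

```python
import numpy as np

def linucb(contexts, reward_fn, K, d, alpha=1.0, lam=1.0):
    """Disjoint LinUCB: one ridge regression (A_a, b_a) per arm, UCB choice."""
    A = [lam * np.eye(d) for _ in range(K)]   # design matrices A_a
    b = [np.zeros(d) for _ in range(K)]       # reward-weighted contexts b_a
    chosen = []
    for x in contexts:
        scores = []
        for a in range(K):
            A_inv = np.linalg.inv(A[a])
            theta_hat = A_inv @ b[a]          # ridge estimate for arm a
            scores.append(x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x))
        a = int(np.argmax(scores))
        r = reward_fn(x, a)
        A[a] += np.outer(x, x)                # rank-one design-matrix update
        b[a] += r * x
        chosen.append(a)
    return chosen
```

The term x^T A^{-1} x is exactly the ellipsoid width in the direction of x: contexts aligned with little-explored directions receive a large exploration bonus.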
┌─────────────────────────────────────────────────────────────────────────┐
│ LinUCB GEOMETRIC INTUITION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Parameter space for arm a (2D example): │
│ │
│ θ₂ │
│ │ │
│ │ ╭─────────────╮ │
│ │ ╱ ╲ Confidence ellipsoid │
│ │ ╱ ●───────────────→ True θ* │
│ │ │ ↑ │ │
│ │ │ θ̂ (estimate) │ │
│ │ ╲ ╱ │
│ │ ╲ ╱ │
│ │ ╰────────────╯ │
│ │ │
│ └────────────────────────── θ₁ │
│ │
│ Ellipsoid shape: A_a^{-1} (inverse of design matrix) │
│ Directions with many observations → narrow (confident) │
│ Directions with few observations → wide (uncertain) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ UCB for context x: │
│ │
│ UCB_a(x) = x^T θ̂_a + α √(x^T A_a^{-1} x) │
│ ───────── ───────────────── │
│ Exploitation Exploration │
│ (predicted (uncertainty │
│ reward) in direction x) │
│ │
│ If x aligns with uncertain directions → large exploration bonus │
│ If x aligns with well-explored directions → small exploration bonus │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Linear Thompson Sampling
For linear bandits:
Prior: θ_a ~ N(0, λ^{-1} I)
Posterior (Gaussian conjugate update):
  θ_a | data ~ N( θ̂_a, v² A_a^{-1} )
with θ̂_a = A_a^{-1} b_a as in LinUCB.
Algorithm:
- Sample θ̃_a ~ N( θ̂_a, v² A_a^{-1} ) for each arm
- Pull A_t = argmax_a x_t^T θ̃_a
- Update A_{A_t} ← A_{A_t} + x_t x_t^T and b_{A_t} ← b_{A_t} + R_t x_t
The parameter v controls exploration (often set to match the LinUCB confidence width α).
Generalized Linear Bandits
When rewards are non-linear but belong to an exponential family:
  E[R_t | x_t, A_t = a] = μ( x_t^T θ_a )
where μ is a known link function (e.g., the sigmoid for logistic rewards).
Examples:
- Logistic bandits: μ(z) = 1/(1 + e^{−z}) (click-through prediction)
- Poisson bandits: μ(z) = e^z (count data)
GLM-UCB and GLM-Thompson Sampling extend the linear approaches using maximum likelihood estimation and appropriate confidence bounds.
Neural Contextual Bandits
For complex contexts (images, text), linear models are insufficient. Neural networks can learn rich representations.
NeuralUCB / NeuralTS:
- Train neural network to predict rewards
- Use neural tangent kernel (NTK) or ensemble methods for uncertainty
- Apply UCB or Thompson Sampling on top
Uncertainty estimation methods:
- Dropout at test time: Sample multiple predictions with dropout
- Ensemble: Train multiple networks, use variance
- Neural Tangent Kernel: Treat last layer as linear, apply LinUCB
- Bayesian neural networks: Maintain weight distributions
Regret bounds: Under certain assumptions (the NTK regime), neural bandits achieve Õ(√T) regret, with constants depending on an effective dimension of the kernel.
Part IV: Advanced Topics
Adversarial Bandits
In adversarial bandits, rewards are not stochastic but chosen by an adversary who may know your algorithm.
Setting: At each time t:
- Adversary secretly assigns rewards r_{t,a} ∈ [0, 1] to all arms
- You select arm A_t
- You receive r_{t,A_t}, and in "bandit feedback" you observe only your own reward
Regret (vs. best fixed arm in hindsight):
  Regret(T) = max_a Σ_{t=1}^T r_{t,a} − E[ Σ_{t=1}^T r_{t,A_t} ]
EXP3 (Exponential-weight algorithm for Exploration and Exploitation):
Maintain weights w_a(t) for each arm. Let W_t = Σ_a w_a(t).
Probability of pulling arm a:
  p_a(t) = (1 − γ) w_a(t)/W_t + γ/K
where γ ∈ (0, 1] is the exploration parameter.
Weight update after observing reward r_{t,A_t} for arm A_t:
  r̂_{t,a} = r_{t,a} / p_a(t) if a = A_t, else 0
  w_a(t+1) = w_a(t) · exp( γ r̂_{t,a} / K )
Intuition: r̂_{t,a} is an unbiased estimator of r_{t,a} (importance weighting). High-reward arms get higher weights.
EXP3 Regret Bound:
  Regret(T) ≤ 2√(e − 1) · √(TK ln K) = O( √(TK ln K) )
This is minimax optimal up to logarithmic factors (the lower bound is Ω(√(KT))).
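A minimal EXP3 sketch. The adversary here is just a fixed reward table, and γ = 0.1 is an illustrative choice rather than the theoretically tuned value:

```python
import math
import random

def exp3(reward_table, gamma=0.1, seed=0):
    """EXP3: exponential weights with importance-weighted reward estimates.

    reward_table[t][a] is the adversary's reward in [0, 1] for arm a at
    round t. Returns the total reward collected.
    """
    rng = random.Random(seed)
    K = len(reward_table[0])
    w = [1.0] * K
    total = 0.0
    for row in reward_table:
        W = sum(w)
        p = [(1 - gamma) * w[a] / W + gamma / K for a in range(K)]
        a = rng.choices(range(K), weights=p)[0]   # randomized play is essential
        r = row[a]
        total += r
        r_hat = r / p[a]                  # importance weight: unbiased estimate
        w[a] *= math.exp(gamma * r_hat / K)   # unplayed arms get estimate 0
    return total
```

The γ/K mixing term keeps every probability bounded away from zero, which both caps the variance of r̂ and denies the adversary a deterministic target.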
┌─────────────────────────────────────────────────────────────────────────┐
│ STOCHASTIC vs ADVERSARIAL BANDITS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STOCHASTIC BANDITS: │
│ ─────────────────── │
│ Rewards: R_t ~ ν_a (fixed distribution per arm) │
│ Adversary: None (nature is i.i.d.) │
│ Best algorithms: UCB, Thompson Sampling │
│ Regret: O(K ln T / Δ) or O(√(KT ln K)) │
│ │
│ Exploits: Concentration (samples converge to mean) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ADVERSARIAL BANDITS: │
│ ──────────────────── │
│ Rewards: r_{t,a} chosen by adversary (possibly adaptive) │
│ Adversary: Knows your algorithm, can be malicious │
│ Best algorithms: EXP3, EXP3.P │
│ Regret: O(√(KT ln K)) │
│ │
│ Exploits: Randomization (adversary can't predict your choice) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY EXP3 IS RANDOMIZED: │
│ ─────────────────────── │
│ │
│ If you're deterministic, adversary sets: │
│ - Reward 0 for your chosen arm │
│ - Reward 1 for all other arms │
│ → Linear regret! │
│ │
│ Randomization prevents adversary from targeting you │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Best Arm Identification
Instead of minimizing regret (cumulative performance), identify the best arm with high confidence.
Fixed-confidence setting: Given confidence δ, find â such that:
  P( â = a* ) ≥ 1 − δ
using as few samples as possible.
Fixed-budget setting: Given budget T, find â minimizing:
  P( â ≠ a* )
Successive Elimination:
- Start with all arms active: S ← {1, …, K}
- Pull each active arm once per round
- Eliminate arms whose UCB is below the LCB of some other active arm
- Stop when one arm remains
Sample complexity (fixed-confidence):
  O( Σ_{a ≠ a*} (1/Δ_a²) ln(K/δ) )  (up to log-log factors)
Lower bound (any δ-correct algorithm):
  Ω( Σ_{a ≠ a*} (1/Δ_a²) ln(1/δ) )
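A sketch of Successive Elimination on Bernoulli arms. The Hoeffding-style radius uses one crude union-bound constant among several valid choices, and δ and the arm means in the usage are illustrative:

```python
import math
import random

def successive_elimination(true_means, delta=0.05, seed=0, max_rounds=100000):
    """Pull all active arms round-robin; drop any arm whose UCB falls
    below the best active LCB. Returns the surviving arm index."""
    rng = random.Random(seed)
    K = len(true_means)
    active = set(range(K))
    counts = [0] * K
    sums = [0.0] * K
    for _ in range(max_rounds):
        if len(active) == 1:
            break
        for a in list(active):                 # one pull per active arm
            counts[a] += 1
            sums[a] += 1.0 if rng.random() < true_means[a] else 0.0
        # Hoeffding radius with a union bound over arms and rounds.
        rad = {a: math.sqrt(math.log(4 * K * counts[a] ** 2 / delta)
                            / (2 * counts[a]))
               for a in active}
        best_lcb = max(sums[a] / counts[a] - rad[a] for a in active)
        active = {a for a in active if sums[a] / counts[a] + rad[a] >= best_lcb}
    return active.pop()
```

Note the objective: the algorithm happily spends many pulls on near-optimal arms to separate them, since only the final identification matters, not cumulative reward.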
Combinatorial Bandits
Select a set of arms (or a combinatorial structure) rather than a single arm.
Examples:
- Shortest path: Arms are paths, reward is path cost
- Matching: Arms are matchings, reward is matching value
- Assortment: Select subset of products, reward is revenue
Semi-bandit feedback: Observe rewards of all selected arms (not just total).
CUCB (Combinatorial UCB):
For a super arm S ⊆ {1, …, K}:
  UCB_S(t) = Σ_{a ∈ S} UCB_a(t)
Select S_t = argmax_S UCB_S(t) subject to feasibility constraints.
Regret: O( mK ln T / Δ_min ), where m = size of the selected set.
Batched and Delayed Feedback
Batched bandits: Commit to a batch of decisions before observing any feedback.
- Requires more exploration (can't adapt within a batch)
- Regret: near-optimal O(√(KT))-style regret is achievable with only O(ln ln T) adaptively chosen batches
Delayed feedback: The reward for the action at time t is observed at time t + τ.
- Must account for "in-flight" decisions
- Regret increases additively with the delay, roughly O(√(KT)) + O(K·τ)
Non-Stationary Bandits
Arm distributions change over time.
Abruptly changing: Distributions are piecewise stationary with C change points.
Discounted UCB: Weight recent observations more heavily:
  μ̂_a(t) = Σ_{s ≤ t} γ^{t−s} R_s 1{A_s = a} / Σ_{s ≤ t} γ^{t−s} 1{A_s = a},  0 < γ < 1
Sliding window UCB: Only use observations from the last τ rounds.
Regret: Õ(√(CT)) is achievable with C change points.
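The discounted empirical mean at the heart of Discounted UCB can be maintained with a simple O(1)-per-step recurrence; a minimal sketch:

```python
def discounted_mean(rewards, gamma):
    """Discounted empirical mean: sum(gamma^(t-s) * r_s) / sum(gamma^(t-s)).

    Processes rewards oldest to newest; each step discounts the running
    numerator and denominator by gamma, so recent rewards weigh more.
    """
    num = den = 0.0
    for r in rewards:
        num = gamma * num + r
        den = gamma * den + 1.0
    return num / den
```

If an arm's payout collapses partway through, the discounted mean tracks the recent regime while the plain average stays stuck halfway, which is exactly why the discounted statistic handles non-stationarity.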
Bandit Variants
Beyond standard stochastic bandits, many important variants capture real-world constraints and feedback structures.
Restless Bandits
In restless bandits, arm states evolve even when not pulled—a critical feature for many applications.
Setting: Each arm a has a state s_a(t) that evolves according to a Markov chain:
- If arm a is pulled: s_a(t+1) ~ P_a^active( · | s_a(t) )
- If arm a is not pulled: s_a(t+1) ~ P_a^passive( · | s_a(t) )
Reward depends on the state: R_t = r( s_{A_t}(t) )
Examples:
- Ad fatigue: Ad effectiveness decreases when shown (pulled) but recovers when not shown
- Machine maintenance: Machines degrade when used, but also when idle
- Communication channels: Channel quality fluctuates independently of whether you use it
Whittle index policy: For each arm-state pair (a, s), compute the Whittle index λ_a(s): the smallest subsidy for passivity that makes not pulling arm a as attractive as pulling it in state s.
Pull the arms with the highest Whittle indices. The policy is asymptotically optimal when arms are "indexable."
Challenge: Computing Whittle indices requires solving per-arm MDPs. Full problem is PSPACE-hard.
Dueling Bandits
In dueling bandits, feedback is pairwise comparisons rather than absolute rewards.
Setting: At each round:
- Select two arms (A_t, B_t)
- Observe the comparison outcome: A_t ≻ B_t or B_t ≻ A_t
Preference model (Bradley-Terry-Luce):
  P( a ≻ b ) = μ_a / (μ_a + μ_b)
or more generally, a preference matrix P with entries P_{ab} = P(a ≻ b).
Goal: Find the Condorcet winner—the arm a* that beats all others:
  P( a* ≻ b ) > 1/2  for all b ≠ a*
Regret: Measured in comparisons not involving the Condorcet winner, e.g. Σ_t (Δ_{A_t} + Δ_{B_t})/2 with Δ_a = P(a* ≻ a) − 1/2.
┌─────────────────────────────────────────────────────────────────────────┐
│ DUELING BANDITS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD BANDIT: DUELING BANDIT: │
│ ──────────────── ─────────────── │
│ │
│ Pull arm A Compare A vs B │
│ Observe R_A = 0.7 Observe: A wins (or B wins) │
│ │
│ Information: Absolute Information: Relative │
│ score of A preference only │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPLICATIONS: │
│ ───────────── │
│ • Ranking from user preferences ("Which result is better?") │
│ • A/B testing with preference feedback │
│ • Tournament design │
│ • Information retrieval evaluation │
│ • RLHF for LLMs (comparing model outputs) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Algorithms:
- Interleaved Filter (IF): Eliminate arms beaten by current champion
- Beat the Mean (BTM): Compare against average competitor
- RUCB: UCB-style approach for dueling
- Double Thompson Sampling: Sample preferences, duel top two
Regret bound: O(K ln T / Δ_min), where Δ_min = min_{b ≠ a*} ( P(a* ≻ b) − 1/2 ) is the minimum preference gap.
Sleeping Bandits
In sleeping bandits, not all arms are available at every round.
Setting: At each round t:
- Observe the available arm set S_t ⊆ {1, …, K}
- Select A_t ∈ S_t
- Receive reward R_t ~ ν_{A_t}
Examples:
- Product recommendations: Not all products in stock
- Ad serving: Some ads have budget constraints
- Resource allocation: Some resources temporarily unavailable
Regret (against the best policy, which always plays the best available arm):
  Regret(T) = E[ Σ_{t=1}^T ( max_{a ∈ S_t} μ_a − μ_{A_t} ) ]
Algorithms: Adapt UCB/TS to only consider available arms. Regret bounds depend on availability patterns.
Mortal Bandits
In mortal bandits, arms can permanently disappear.
Setting: Each arm a has a lifetime L_a (possibly random). After L_a pulls (or rounds), arm a is no longer available.
Examples:
- Job candidates: Candidates accept other offers
- Limited inventory: Products sell out
- Time-sensitive opportunities: Deals expire
Challenge: Must balance learning about an arm vs. exploiting it before it dies.
Key insight: With short lifetimes, pure exploitation can be optimal (no time to learn). With long lifetimes, standard bandit algorithms apply.
Constrained and Safe Bandits
Many real applications require satisfying constraints beyond maximizing reward.
Bandits with Knapsack Constraints
Setting: Each arm pull consumes resources. Stop when budget exhausted.
Formulation:
- Pulling arm a yields reward r_a and consumes a resource vector c_a ∈ [0, 1]^d
- Budget constraint: total consumption of resource j must stay below B_j, for each resource j
- The process stops at the first time τ when any budget is exhausted
Goal: Maximize total reward subject to budget constraints.
LP relaxation (over expected per-round allocations x_a ≥ 0):
  max Σ_a x_a μ_a  subject to  Σ_a x_a c_{a,j} ≤ B_j / T for each resource j,  Σ_a x_a ≤ 1
Algorithms:
- Primal-dual methods: Maintain dual variables for constraints
- UCB-BwK: UCB with budget-aware exploration
- Thompson Sampling with budget: Sample, but respect constraints
Regret: sublinear Õ(√(KT))-type bounds, relative to the best fixed policy satisfying the constraints.
┌─────────────────────────────────────────────────────────────────────────┐
│ BANDITS WITH KNAPSACKS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXAMPLE: Ad Campaign with Budget │
│ ──────────────────────────────── │
│ │
│ Arms = {Premium placement, Standard placement, Text ad} │
│ │
│ Rewards (CTR): Costs: │
│ Premium: 0.08 Premium: $2.00 │
│ Standard: 0.05 Standard: $0.50 │
│ Text: 0.02 Text: $0.10 │
│ │
│ Budget: $1000 │
│ │
│ NAIVE APPROACH: │
│ ─────────────── │
│ Always pick Premium (highest CTR) │
│ → 500 impressions, 40 clicks │
│ │
│ OPTIMAL MIX: │
│ ──────────── │
│ Mix of Standard and Text │
│ → 2500+ impressions, 100+ clicks │
│ │
│ Bandit must LEARN this while respecting budget! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Safe Bandits
Safe bandits require that performance never drops below a safety threshold.
Setting: There exists a baseline arm a_0 with known performance μ_0. Constraint (per round):
  E[ μ_{A_t} ] ≥ (1 − α) μ_0  for all t
for a tolerance parameter α ≥ 0.
Applications:
- Medical treatment: New treatments must be at least as good as standard of care
- Autonomous systems: Exploration must not cause accidents
- Production systems: A/B tests shouldn't tank conversion rates
Safe UCB: Only pull arms whose lower confidence bound exceeds the threshold:
  LCB_a(t) = μ̂_a(t) − √(2 ln t / N_a(t)) ≥ (1 − α) μ_0
falling back to the baseline arm otherwise.
Challenge: Safe exploration is harder—must be confident an arm is safe before trying it.
Conservative Bandits
A variant of safe bandits where cumulative performance must stay above baseline.
Constraint (cumulative):
  Σ_{s=1}^t E[ R_s ] ≥ (1 − α) · t · μ_0  for all t
Must never fall too far behind what baseline would have achieved.
Algorithm: Explore only when "safety budget" accumulated from exploitation allows it.
Fair Bandits
Fair bandits ensure equitable treatment across groups or arms.
Meritocratic fairness: Pull probability tracks true quality—a worse arm is never pulled with higher probability than a better arm:
  μ_a ≥ μ_b  ⟹  P(A_t = a) ≥ P(A_t = b)
Group fairness: If arms belong to groups G_1, …, G_m, ensure each group receives a fair share of pulls:
  Σ_{a ∈ G_j} N_a(T) ≥ c_j · T  for specified shares c_j
Individual fairness: Similar arms should be pulled with similar frequency.
Tradeoff: Fairness constraints typically increase regret. The fair-bandit literature studies Pareto frontiers between fairness and efficiency.
┌─────────────────────────────────────────────────────────────────────────┐
│ FAIRNESS IN BANDITS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXAMPLE: Job Candidate Screening │
│ ──────────────────────────────── │
│ │
│ Arms = candidates, Reward = hiring success │
│ │
│ UNCONSTRAINED OPTIMAL: │
│ ────────────────────── │
│ Always interview highest-predicted candidates │
│ May systematically exclude certain groups │
│ │
│ FAIR BANDIT: │
│ ──────────── │
│ Ensure demographic groups receive fair interview rates │
│ while still learning who are the best candidates │
│ │
│ APPLICATIONS: │
│ ───────────── │
│ • Loan approvals across demographics │
│ • Content recommendation diversity │
│ • Resource allocation across regions │
│ • Ad serving to different user groups │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part V: Applications
A/B Testing and Adaptive Experiments
Traditional A/B testing:
- Split traffic 50/50 between A and B
- Run for fixed duration
- Analyze results, declare winner
Problems:
- Waste: Shows inferior variant to 50% of users
- Fixed duration: May stop too early or too late
- No adaptation: Can't respond to early signals
Bandit-based testing:
Replace fixed allocation with bandit algorithm:
- Thompson Sampling allocates more traffic to better variants
- Early stopping when confident
- Reduced regret (fewer users see bad variants)
Multi-armed bandit approach:
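As a sketch of what the allocation loop might look like — a hypothetical two-variant Beta-Bernoulli setup with made-up CTRs, not any particular production system:

```python
import random

random.seed(0)
true_ctr = {"A": 0.10, "B": 0.05}   # hypothetical ground truth
wins = {"A": 1, "B": 1}             # Beta(1, 1) priors
losses = {"A": 1, "B": 1}
shown = {"A": 0, "B": 0}

for _ in range(10_000):
    # Thompson Sampling: draw a plausible CTR per variant, show the argmax.
    samples = {v: random.betavariate(wins[v], losses[v]) for v in true_ctr}
    v = max(samples, key=samples.get)
    shown[v] += 1
    if random.random() < true_ctr[v]:
        wins[v] += 1
    else:
        losses[v] += 1

print(shown)  # traffic concentrates on the stronger variant A
```

Unlike a fixed 50/50 split, the vast majority of the 10,000 impressions end up on the better variant, which is where the regret savings in the box below come from.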
Caution: Bandit testing complicates statistical analysis (non-uniform allocation, stopping rules). Hybrid approaches maintain valid inference.
┌─────────────────────────────────────────────────────────────────────────┐
│ A/B TESTING vs BANDIT TESTING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL A/B TEST: │
│ ───────────────────── │
│ │
│ Traffic allocation over time: │
│ │
│ 100% │ ████████████████████████████████████████ │
│ 50% │ ████████████████████ Variant A (50%) │
│ │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ Variant B (50%) │
│ 0% │──────────────────────────────────────── Time │
│ Start End │
│ │
│ If B is worse: 50% of users get bad experience throughout │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BANDIT-BASED TEST (Thompson Sampling): │
│ ────────────────────────────────────── │
│ │
│ Traffic allocation over time: │
│ │
│ 100% │ ████████████████████████████████████████ │
│ │ ███████████████████████████████████████ A dominates │
│ 50% │ ████████████████████ │
│ │ ▓▓▓▓▓▓▓▓▓▓ B fades out │
│ 0% │──────────────────────────────────────── Time │
│ Start ↑ End │
│ B found inferior │
│ │
│ Adapts to evidence: worse variants get less traffic over time │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ REGRET COMPARISON (Δ = 0.05 difference, T = 10,000 users): │
│ │
│ A/B: Regret = 0.5 × 0.05 × 10,000 = 250 │
│ Bandit: Regret ≈ 20-50 (logarithmic in T) │
│ │
│ 5-10x improvement! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Recommendation Systems
Cold-start problem: New items/users have no interaction data.
Bandits provide principled exploration:
- New items get exploration bonus (uncertainty)
- System learns preferences while maintaining engagement
Explore-exploit in recommendations:
Cascade bandits for ranked lists:
User examines items from the top until satisfied. Model this as a bandit over rankings.
Conversational recommendations:
Each question is an arm. Bandit selects questions to efficiently narrow preferences.
Clinical Trials
Adaptive clinical trials: Allocate more patients to more promising treatments.
Ethical motivation: Minimize patients receiving inferior treatment.
Response-adaptive randomization: assign patient $t$ to treatment $a$ with probability tied to the current evidence that $a$ is best — e.g., Thompson Sampling's posterior probability that $a$ is optimal.
Challenges:
- Regulatory requirements for valid inference
- Delayed outcomes (treatment effects take time)
- Non-stationarity (patient population changes)
Gittins index (optimal for the discounted infinite-horizon Bayesian problem):
$G(s) = \sup_{\tau > 0} \dfrac{\mathbb{E}\big[\sum_{t=0}^{\tau-1} \gamma^t r_t \mid s_0 = s\big]}{\mathbb{E}\big[\sum_{t=0}^{\tau-1} \gamma^t \mid s_0 = s\big]}$
Compute the index for each arm's current state $s_a$; pull the arm with the highest index.
Online Advertising
Ad creative selection (covered in our advertising post):
- Arms = ad creative variants
- Context = user features
- Reward = click (or conversion)
Bid optimization:
- Arms = bid levels
- Reward = profit = value - cost
- Challenge: cost depends on auction outcome
Budget-constrained bandits:
Standard bandits assume unlimited pulls. With a budget constraint $B$ on total cost:
Knapsack bandits: Select arms subject to budget, maximize value.
Hyperparameter Optimization
Successive Halving:
- Start with random configurations
- Train each for some iterations
- Keep top half, double training iterations
- Repeat until one remains
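The loop above is only a few lines. A sketch with a stand-in scoring function (in a real tuner, `train_and_score` would train the configuration for the given iteration budget — both names here are hypothetical):

```python
def successive_halving(configs, train_and_score, min_iters=1):
    """Repeatedly score all surviving configs, keep the top half, double the budget.

    configs: list of hyperparameter settings.
    train_and_score(config, iters) -> validation score (higher is better).
    """
    iters = min_iters
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: train_and_score(c, iters), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]  # keep top half
        iters *= 2                                      # double training iterations
    return survivors[0]

# Toy example: the "score" peaks at config value 0.7 (made-up objective).
best = successive_halving(
    [0.1, 0.3, 0.5, 0.7, 0.9],
    lambda c, iters: -(c - 0.7) ** 2,   # iters unused in this toy score
)
print(best)  # 0.7
```

The key efficiency property: bad configurations are eliminated after cheap, short runs, so the total budget concentrates on promising ones.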
Hyperband: Run Successive Halving with different initial budgets.
Bayesian Optimization (related to bandits):
Model the hyperparameter → performance mapping as a Gaussian Process. Use an acquisition function such as Expected Improvement:
$\alpha_{\mathrm{EI}}(x) = \mathbb{E}\big[\max(f(x) - f^{+}, 0)\big]$
or UCB:
$\alpha_{\mathrm{UCB}}(x) = \mu(x) + \beta\,\sigma(x)$
Resource Allocation
Dynamic pricing: Set prices to maximize revenue while learning demand.
- Arms = price levels
- Reward = revenue = price × demand(price)
- Challenge: demand function unknown
Network routing: Route traffic to minimize congestion while learning link costs.
Crowdsourcing: Assign tasks to workers while learning worker quality.
Offline Policy Evaluation
Before deploying a new bandit policy, we often want to evaluate it using historical data collected by a different policy. This is offline policy evaluation (OPE).
Setting:
- Historical data: $\{(x_i, a_i, r_i)\}_{i=1}^{n}$, collected by a logging policy $\pi_0$
- Target policy: $\pi$ (the policy we want to evaluate)
- Goal: Estimate $V(\pi) = \mathbb{E}_{x}\,\mathbb{E}_{a \sim \pi(\cdot \mid x)}[r(x, a)]$
Challenge: We only observe rewards for actions taken by $\pi_0$, not $\pi$.
Inverse Propensity Scoring (IPS)
Key idea: Reweight samples to correct for distribution mismatch.
IPS estimator:
$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i$
Intuition: If $\pi$ would have taken action $a_i$ twice as often as $\pi_0$ did, count that sample twice.
Properties:
- Unbiased: $\mathbb{E}[\hat{V}_{\mathrm{IPS}}(\pi)] = V(\pi)$, provided $\pi_0(a \mid x) > 0$ wherever $\pi(a \mid x) > 0$
- High variance: When $\pi$ and $\pi_0$ differ significantly, the importance weights can be huge
Variance reduction: Clip the importance weights, or use the self-normalized estimator $\hat{V}_{\mathrm{SNIPS}}(\pi) = \frac{\sum_i w_i r_i}{\sum_i w_i}$ with $w_i = \pi(a_i \mid x_i)/\pi_0(a_i \mid x_i)$:
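Both estimators are one-liners. A sketch, assuming logs are stored as `(context, action, reward, logged_propensity)` tuples — the propensity logging this section recommends:

```python
def ips(logs, pi):
    """Vanilla IPS: unbiased, but variance blows up when weights get large.

    logs: list of (x, a, r, p0), where p0 = pi0(a | x) was recorded at logging time.
    pi(a, x): target policy's probability of taking action a in context x.
    """
    return sum(pi(a, x) / p0 * r for x, a, r, p0 in logs) / len(logs)

def snips(logs, pi):
    """Self-normalized IPS: slightly biased, but much lower variance."""
    weights = [pi(a, x) / p0 for x, a, _, p0 in logs]
    total = sum(w * r for w, (_, _, r, _) in zip(weights, logs))
    return total / sum(weights)

# Toy log: two rounds under a uniform logging policy.
logs = [(None, "A", 1.0, 0.5), (None, "B", 0.0, 0.5)]
pi = lambda a, x: {"A": 0.8, "B": 0.2}[a]
print(ips(logs, pi), snips(logs, pi))
```

SNIPS divides by the realized weight mass instead of $n$, trading a small bias for a large variance reduction; on the toy log above the two estimates coincide.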
┌─────────────────────────────────────────────────────────────────────────┐
│ INVERSE PROPENSITY SCORING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXAMPLE: │
│ ──────── │
│ │
│ Logging policy π₀: Shows Ad A 80%, Ad B 20% │
│ Target policy π: Shows Ad A 30%, Ad B 70% │
│ │
│ Historical data (1000 samples): │
│ - Ad A shown 800 times, avg reward 0.05 │
│ - Ad B shown 200 times, avg reward 0.08 │
│ │
│ NAIVE ESTIMATE of π: │
│ ──────────────────── │
│ 0.3 × 0.05 + 0.7 × 0.08 = 0.071 │
│ (Wrong! Uses biased sample means) │
│ │
│ IPS ESTIMATE: │
│ ───────────── │
│ For Ad A samples: weight = 0.3/0.8 = 0.375 │
│ For Ad B samples: weight = 0.7/0.2 = 3.5 │
│ │
│ Weighted average accounts for sampling bias │
│ │
│ PROBLEM: Ad B weight is 3.5 → high variance! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Direct Method (DM)
Key idea: Build a reward model, use it to predict counterfactual outcomes.
Direct Method estimator:
$\hat{V}_{\mathrm{DM}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a} \pi(a \mid x_i)\, \hat{r}(x_i, a)$
where $\hat{r}(x, a)$ is a learned reward model.
Properties:
- Low variance: No importance weighting
- Biased: If reward model is wrong, estimate is wrong
Tradeoff: DM has low variance but potential bias; IPS has no bias but high variance.
Doubly Robust (DR) Estimator
Key idea: Combine IPS and DM to get the best of both worlds.
DR estimator:
$\hat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \sum_{a} \pi(a \mid x_i)\, \hat{r}(x_i, a) + \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \big( r_i - \hat{r}(x_i, a_i) \big) \Big]$
Decomposition:
- First term: Direct method estimate
- Second term: IPS correction for reward model errors
Properties:
- Doubly robust: Unbiased if either the reward model $\hat{r}$ is correct or the propensities $\pi_0$ are known
- Lower variance than IPS: When $\hat{r}$ is good, the correction terms are small
- Asymptotically efficient: Achieves optimal variance under regularity conditions
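A sketch of the estimator under the same assumed `(context, action, reward, logged_propensity)` log format, with a hypothetical learned reward model `r_hat(x, a)`:

```python
def doubly_robust(logs, pi, actions, r_hat):
    """DR = direct-method baseline + IPS correction on the model's residuals.

    logs: list of (x, a, r, p0); pi(a, x): target policy probabilities;
    actions: the full action set; r_hat(x, a): learned reward model.
    """
    total = 0.0
    for x, a, r, p0 in logs:
        dm = sum(pi(b, x) * r_hat(x, b) for b in actions)   # model-based estimate
        correction = pi(a, x) / p0 * (r - r_hat(x, a))      # fixes model error on the logged action
        total += dm + correction
    return total / len(logs)

# Toy check: one logged round with a reward model that happens to be exact.
logs = [(None, "A", 1.0, 0.5)]
pi = lambda a, x: {"A": 0.6, "B": 0.4}[a]
r_hat = lambda x, a: 1.0 if a == "A" else 0.0
print(doubly_robust(logs, pi, ["A", "B"], r_hat))
```

If `r_hat` is perfect the correction term vanishes and DR reduces to the direct method; if `r_hat` is useless but the logged propensities are correct, the correction keeps the estimate unbiased.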
┌─────────────────────────────────────────────────────────────────────────┐
│ OFFLINE POLICY EVALUATION COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ │ Method │ Bias │ Variance │ Requirement │ │
│ ├─────────────────┼───────────┼───────────┼────────────────────────┤ │
│ │ Direct Method │ Possible │ Low │ Good reward model │ │
│ │ IPS │ None │ High │ Known π₀ │ │
│ │ Doubly Robust │ Reduced │ Medium │ Either works │ │
│ │
│ PRACTICAL GUIDANCE: │
│ ─────────────────── │
│ │
│ 1. If π and π₀ are similar → IPS works well │
│ 2. If you have a good reward model → DM may suffice │
│ 3. Generally → Use DR for robustness │
│ 4. Always → Log propensities π₀(a|x) for future evaluation! │
│ │
│ CRITICAL: Design logging policies with evaluation in mind │
│ - Ensure π₀(a|x) > 0 for all actions π might take │
│ - Some exploration in logging enables better evaluation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Offline Policy Learning
Beyond evaluation, we can learn policies from logged data.
Counterfactual Risk Minimization (CRM):
$\hat{\pi} = \arg\max_{\pi \in \Pi}\; \hat{V}_{\mathrm{IPS}}(\pi) - \lambda \sqrt{\widehat{\mathrm{Var}}\big[\hat{V}_{\mathrm{IPS}}(\pi)\big] / n}$
i.e., maximize the estimated value minus a variance penalty.
Challenges:
- Optimizing over policies is harder than evaluating one
- Propensity clipping needed for stability
- Sample complexity depends on logging policy overlap
Practical approach: Use the DR estimator with variance regularization:
$\hat{\pi} = \arg\max_{\pi}\; \hat{V}_{\mathrm{DR}}(\pi) - \lambda \sqrt{\widehat{\mathrm{Var}}\big[\hat{V}_{\mathrm{DR}}(\pi)\big] / n}$
Part VI: Theoretical Foundations
Concentration Inequalities
Hoeffding's inequality (bounded random variables):
For independent $X_1, \dots, X_n$ with $X_i \in [a_i, b_i]$ and $S_n = \sum_{i=1}^{n} X_i$:
$\Pr\big( S_n - \mathbb{E}[S_n] \ge t \big) \le \exp\!\Big( \frac{-2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \Big)$
For $X_i \in [0, 1]$ and the sample mean $\bar{X}_n$:
$\Pr\big( \bar{X}_n - \mu \ge \epsilon \big) \le e^{-2n\epsilon^2}$
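The bound is straightforward to sanity-check numerically — a quick simulation with a fair coin, $n = 100$, $\epsilon = 0.1$ (the seed and trial count are arbitrary):

```python
import math
import random

random.seed(42)
n, eps, trials = 100, 0.1, 20_000
bound = math.exp(-2 * n * eps ** 2)   # Hoeffding: exp(-2 n eps^2) ~ 0.135

exceed = 0
for _ in range(trials):
    mean = sum(random.random() < 0.5 for _ in range(n)) / n  # fair-coin sample mean
    if mean - 0.5 >= eps:
        exceed += 1

print(f"empirical tail {exceed / trials:.3f} <= bound {bound:.3f}")
```

For a fair coin the true tail probability is roughly 0.03, comfortably below the bound of about 0.135 — Hoeffding is valid but loose here, which is why the KL-based Chernoff form below yields tighter confidence intervals (and why KL-UCB beats UCB1).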
Chernoff bound (Bernoulli random variables):
For i.i.d. Bernoulli($p$) samples:
$\Pr\big( \bar{X}_n \ge p + \epsilon \big) \le e^{-n\, \mathrm{kl}(p + \epsilon,\, p)}$
where $\mathrm{kl}(q, p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ is the binary KL divergence.
Sub-Gaussian concentration:
If $X$ is $\sigma$-sub-Gaussian (this includes bounded random variables and Gaussians):
$\mathbb{E}\big[ e^{\lambda (X - \mathbb{E}X)} \big] \le e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda$
For the sample mean of $n$ i.i.d. $\sigma$-sub-Gaussian variables:
$\Pr\big( \bar{X}_n - \mu \ge \epsilon \big) \le e^{-n\epsilon^2 / (2\sigma^2)}$
Martingale Methods
Azuma-Hoeffding inequality:
For a martingale $(M_t)$ with bounded increments $|M_t - M_{t-1}| \le c_t$:
$\Pr\big( M_n - M_0 \ge t \big) \le \exp\!\Big( \frac{-t^2}{2 \sum_{s=1}^{n} c_s^2} \Big)$
Application to bandits: the cumulative reward minus the sum of conditional mean rewards is a martingale (with respect to the history of arm selections), so Azuma-Hoeffding bounds its deviations.
Information-Theoretic Lower Bounds
Change-of-measure argument:
For two bandit instances $\nu$ and $\nu'$ differing only in arm $a$, the divergence decomposition gives, for any $\{0,1\}$-valued statistic $Z$ of the observations:
$\mathbb{E}_{\nu}[N_a(T)] \cdot \mathrm{KL}(\nu_a, \nu'_a) \ge \mathrm{kl}\big( \mathbb{E}_{\nu}[Z],\, \mathbb{E}_{\nu'}[Z] \big)$
where $\mathrm{kl}$ is the binary KL divergence.
Lai-Robbins bound derivation:
By choosing $\nu'$ so that arm $a$ becomes optimal and applying the inequality above, we get:
$\liminf_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\ln T} \ge \frac{1}{\mathrm{KL}(\nu_a, \nu^*)}$
Multiplying by the gap $\Delta_a$ and summing over suboptimal arms gives the regret lower bound.
Finite-Time Analysis
High-probability regret bounds:
With probability at least $1 - \delta$:
$R_T = O\big( \sqrt{KT \ln(K/\delta)} \big)$
Anytime algorithms: Regret bounds hold for all horizons $T$ simultaneously (not just a pre-specified $T$).
Instance-dependent vs. worst-case bounds:
- Instance-dependent: $O\big( \sum_{a : \Delta_a > 0} \ln T / \Delta_a \big)$ — better when gaps are large
- Worst-case: $O\big( \sqrt{KT} \big)$ — better when gaps are small
Algorithms like MOSS and UCB-V achieve both simultaneously.
Part VII: Connections to Other Fields
Reinforcement Learning
Bandits are one-step RL (single state, single action, immediate reward).
Relationship:
| Aspect | Bandits | Full RL |
|---|---|---|
| States | 1 | Many |
| Actions | K arms | Action space per state |
| Horizon | T steps | T or infinite |
| Transitions | None | $P(s' \mid s, a)$ |
| Reward | Immediate | Delayed, cumulative |
Bandits in RL:
- Exploration in Q-learning uses bandit ideas
- UCB for tree search (UCT in Monte Carlo Tree Search)
- Thompson Sampling for posterior-based exploration
Bayesian Optimization
Global optimization of expensive black-box functions:
$\max_{x \in \mathcal{X}} f(x)$, where $f$ is expensive to evaluate.
Connection to bandits:
- Arms = points in the search space $\mathcal{X}$
- Infinite arms (continuous space)
- Gaussian Process models
- Acquisition functions ≈ UCB
Acquisition functions:
- UCB: $\alpha(x) = \mu(x) + \beta\,\sigma(x)$
- Expected Improvement: $\alpha(x) = \mathbb{E}\big[\max(f(x) - f^{+}, 0)\big]$
- Thompson Sampling: Sample $\tilde{f}$ from the GP posterior, optimize $\tilde{f}$
Online Learning
Prediction with expert advice:
- $K$ experts give predictions each round
- Learner combines predictions
- Regret against best expert
Hedge algorithm (full information — like EXP3, but all expert losses are observed): maintain weights $w_{t+1}(i) \propto w_t(i)\, e^{-\eta\, \ell_t(i)}$ and predict with the weighted mixture.
Regret: $O\big( \sqrt{T \ln K} \big)$
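A sketch of the exponential-weights update (the toy loss sequence and the standard $\eta = \sqrt{\ln K / T}$ tuning are illustrative choices):

```python
import math

def hedge(losses, eta):
    """Run Hedge on a T x K matrix of expert losses in [0, 1].

    Returns the final weight distribution over the K experts.
    """
    k = len(losses[0])
    weights = [1.0] * k
    for round_losses in losses:
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
        z = sum(weights)
        weights = [w / z for w in weights]   # normalize to a distribution
    return weights

# Toy: expert 0 is consistently better.
losses = [[0.1, 0.9]] * 50
eta = math.sqrt(math.log(2) / 50)
w = hedge(losses, eta)
print(w)  # nearly all weight on expert 0
```

After 50 rounds almost all weight sits on the better expert. EXP3 is this same update applied to importance-weighted loss estimates built from bandit feedback, which is what makes it work when only the chosen arm's loss is observed.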
Online convex optimization:
- Actions = points in convex set
- Loss = convex function (revealed after action)
- Online gradient descent achieves $O(\sqrt{T})$ regret
Bandits = online learning with bandit feedback (only see own loss).
Game Theory
Regret minimization ↔ Nash equilibrium:
In repeated games, if all players use no-regret algorithms, play converges to Nash equilibrium.
Fictitious play: Best response to empirical opponent distribution (similar to greedy).
Multi-agent bandits: Multiple agents compete for same arms—game-theoretic considerations arise.
Summary and Key Takeaways
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-ARMED BANDIT SUMMARY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: │
│ ──────── │
│ Balance exploration (learning) with exploitation (earning) │
│ │
│ FUNDAMENTAL LIMITS: │
│ ─────────────────── │
│ • Lai-Robbins: Regret ≥ Ω(∑ ln T / KL(νₐ, ν*)) │
│ • Minimax: Regret ≥ Ω(√(KT)) │
│ • Bayesian vs Frequentist regret: different guarantees │
│ │
│ KEY ALGORITHMS: │
│ ─────────────── │
│ │
│ │ Algorithm │ Regret │ Key Idea │ │
│ ├───────────────────┼────────────────┼───────────────────────┤ │
│ │ ε-greedy (fixed) │ O(εT) │ Random exploration │ │
│ │ ε-greedy (decay) │ O(K ln T / Δ) │ Decreasing ε │ │
│ │ UCB1 │ O(K ln T / Δ) │ Optimism under uncert │ │
│ │ KL-UCB │ Optimal │ KL-based confidence │ │
│ │ Thompson Sampling │ O(K ln T / Δ) │ Posterior sampling │ │
│ │ IDS │ O(√(Ψ·H·T)) │ Info-regret tradeoff │ │
│ │ EXP3 (adversarial)│ O(√(KT ln K)) │ Exponential weights │ │
│ │
│ CONTEXTUAL BANDITS: │
│ ─────────────────── │
│ • LinUCB: Linear rewards, O(d√(KT)) regret │
│ • Neural bandits: Complex contexts, use NNs for representation │
│ │
│ BANDIT VARIANTS: │
│ ──────────────── │
│ • Restless: Arms evolve when not pulled │
│ • Dueling: Pairwise comparison feedback │
│ • Sleeping/Mortal: Arms not always available │
│ • Constrained: Knapsacks, safety, fairness │
│ │
│ APPLICATIONS: │
│ ───────────── │
│ • A/B testing (adaptive experiments) │
│ • Recommendations (explore-exploit) │
│ • Clinical trials (ethical allocation) │
│ • Ad selection (CTR optimization) │
│ • Hyperparameter tuning (Hyperband) │
│ │
│ OFFLINE EVALUATION: │
│ ─────────────────── │
│ • IPS: Unbiased but high variance │
│ • Direct Method: Low variance but biased │
│ • Doubly Robust: Best of both worlds │
│ │
│ PRACTICAL ADVICE: │
│ ──────────────── │
│ 1. Start with Thompson Sampling (simple, effective) │
│ 2. Use LinUCB for linear contextual problems │
│ 3. Consider IDS for structured/many-armed problems │
│ 4. Use DR estimator for offline evaluation │
│ 5. Log propensities for future policy evaluation │
│ 6. Consider constraints: safety, fairness, budgets │
│ 7. Monitor for non-stationarity │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Sources
Foundational Papers:
- Lai & Robbins (1985): Asymptotically Efficient Adaptive Allocation Rules
- Auer, Cesa-Bianchi & Fischer (2002): Finite-time Analysis of the Multiarmed Bandit Problem
- Thompson (1933): On the Likelihood that One Unknown Probability Exceeds Another
- Auer et al. (2002): The Nonstochastic Multiarmed Bandit Problem
Thompson Sampling:
- Agrawal & Goyal (2012): Analysis of Thompson Sampling for the Multi-armed Bandit Problem
- Russo et al. (2018): A Tutorial on Thompson Sampling
Contextual Bandits:
- Li et al. (2010): A Contextual-Bandit Approach to Personalized News Article Recommendation (LinUCB)
- Agrawal & Goyal (2013): Thompson Sampling for Contextual Bandits
Neural Bandits:
- Zhou et al. (2020): Neural Contextual Bandits with UCB-based Exploration
- Zhang et al. (2020): Neural Thompson Sampling
Applications:
- Scott (2010): A Modern Bayesian Look at the Multi-armed Bandit
- Chapelle & Li (2011): An Empirical Evaluation of Thompson Sampling
Information-Directed Sampling:
- Russo & Van Roy (2014): Learning to Optimize via Information-Directed Sampling
- Russo & Van Roy (2018): Information-Directed Sampling and Thompson Sampling
Bandit Variants:
- Whittle (1988): Restless Bandits: Activity Allocation in a Changing World
- Yue et al. (2012): The K-armed Dueling Bandits Problem
- Kleinberg et al. (2010): Regret Bounds for Sleeping Experts and Bandits
Constrained and Safe Bandits:
- Badanidiyuru et al. (2013): Bandits with Knapsacks
- Sui et al. (2015): Safe Exploration for Optimization with Gaussian Processes
- Joseph et al. (2016): Fairness in Learning: Classic and Contextual Bandits
- Wu et al. (2016): Conservative Bandits
Offline Policy Evaluation:
- Horvitz & Thompson (1952): A Generalization of Sampling Without Replacement
- Dudík et al. (2011): Doubly Robust Policy Evaluation and Learning
- Swaminathan & Joachims (2015): Counterfactual Risk Minimization