Machine Learning for Advertising: CTR Prediction, Ad Ranking, and Bidding Systems
Comprehensive guide to ML systems powering digital advertising. From logistic regression to deep CTR models, user behavior sequences to multi-task learning, and real-time bidding optimization—understand the algorithms behind the $600B+ ad industry.
The Scale of Advertising ML
Digital advertising is a $600+ billion industry, and machine learning is its backbone. Every time you see an ad online, dozens of ML models have executed in milliseconds: predicting whether you'll click, estimating conversion probability, optimizing bids, and ranking thousands of candidate ads.
This isn't just recommendation systems with a different name. Advertising ML has unique challenges:
┌─────────────────────────────────────────────────────────────────────────┐
│ ADVERTISING ML vs GENERAL RECOMMENDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERAL RECSYS: │
│ ─────────────── │
│ Goal: Maximize user engagement/satisfaction │
│ Items: Products, videos, songs (relatively stable catalog) │
│ Feedback: Implicit (views, clicks) or explicit (ratings) │
│ Latency: 100-500ms acceptable │
│ Stakes: Poor recommendations → user leaves │
│ │
│ ADVERTISING ML: │
│ ─────────────── │
│ Goal: Maximize revenue while maintaining user experience │
│ Items: Ads (constantly changing, millions of advertisers) │
│ Feedback: Sparse (CTR ~1-3%), delayed conversions │
│ Latency: <10-50ms required (real-time bidding) │
│ Stakes: Wrong predictions → lose money (pay per impression/click) │
│ │
│ UNIQUE CHALLENGES: │
│ ───────────────── │
│ • Feature sparsity: Billions of feature combinations │
│ • Class imbalance: 97-99% negative examples │
│ • Multi-stakeholder: Users, advertisers, platform │
│ • Position bias: Top positions get more clicks regardless of relevance │
│ • Delayed feedback: Conversions may happen days later │
│ • Adversarial dynamics: Click fraud, bid manipulation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
This post covers the complete advertising ML stack, from foundational CTR prediction to advanced user modeling and real-time bidding optimization.
Part I: Foundations of CTR Prediction
The CTR Prediction Problem
Click-Through Rate (CTR) prediction is the cornerstone of advertising ML. Given a user u, an ad a, and a context c (time, device, page), predict the probability that the user will click:

pCTR = P(click = 1 | u, a, c)
This probability directly determines:
- Ad ranking: Higher predicted CTR → higher position
- Bid optimization: Expected value per impression = bid × pCTR (for cost-per-click campaigns)
- Revenue: Platform typically charges per click (CPC) or per impression (CPM)
┌─────────────────────────────────────────────────────────────────────────┐
│ CTR PREDICTION IN THE AD STACK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User visits page │
│ │ │
│ ▼ │
│ ┌─────────────┐ Candidate ads ┌─────────────────────┐ │
│ │ Ad │ ─────────────────────►│ CTR Prediction │ │
│ │ Request │ (1000s of ads) │ Model │ │
│ └─────────────┘ └──────────┬──────────┘ │
│ │ │
│ P(click) for each ad │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Ranking Function │ │
│ │ │ │
│ │ Score = f(pCTR, │ │
│ │ bid, │ │
│ │ quality) │ │
│ └──────────┬──────────┘ │
│ │ │
│ Top K ads │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Ad Displayed │ │
│ └─────────────────────┘ │
│ │
│ The entire pipeline must complete in <50ms │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Feature Engineering: The Foundation
Before diving into models, understand that advertising ML is fundamentally about feature interactions. A user who is "male, age 25-34, interested in sports" seeing an ad for "Nike running shoes" on a "sports news website" at "7pm on weekday" has a very different click probability than any individual feature would suggest.
The features in CTR prediction are typically:
┌─────────────────────────────────────────────────────────────────────────┐
│ FEATURE CATEGORIES IN CTR PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ USER FEATURES: │
│ ────────────── │
│ • Demographics: age, gender, location, language │
│ • Behavioral: past clicks, purchases, browsing history │
│ • Contextual: device, OS, browser, time of day │
│ • Aggregated: click rate on category, avg session duration │
│ │
│ AD FEATURES: │
│ ──────────── │
│ • Creative: ad ID, advertiser ID, campaign ID │
│ • Content: category, keywords, landing page domain │
│ • Historical: ad CTR, conversion rate, quality score │
│ • Bid: bid amount, budget remaining, campaign age │
│ │
│ CONTEXT FEATURES: │
│ ───────────────── │
│ • Publisher: site ID, page category, content keywords │
│ • Position: ad slot, above/below fold │
│ • Temporal: hour, day of week, season, holidays │
│ • Request: referrer, search query (if search ads) │
│ │
│ CROSS FEATURES (manually engineered): │
│ ───────────────────────────────────── │
│ • user_gender × ad_category │
│ • user_age × hour_of_day │
│ • device_type × ad_format │
│ • user_interest × ad_keyword │
│ │
│ SCALE: Typically 10^6 to 10^9 sparse features after one-hot encoding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The key insight: most features are categorical and high-cardinality. User ID alone might have billions of values. When one-hot encoded, the feature vector becomes extremely sparse but extremely high-dimensional.
Part II: Evolution of CTR Models
Stage 1: Logistic Regression (The Baseline)
The journey begins with logistic regression, still used in production at many companies for its simplicity and interpretability.
Model formulation:

ŷ = P(click) = σ(w_0 + Σ_i w_i x_i)

where σ(z) = 1 / (1 + e^(-z)) is the sigmoid function, x_i are features, and w_i are learned weights.

Loss function (binary cross-entropy):

L = -(1/N) Σ_n [ y_n log ŷ_n + (1 - y_n) log(1 - ŷ_n) ]
┌─────────────────────────────────────────────────────────────────────────┐
│ LOGISTIC REGRESSION FOR CTR │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: Sparse feature vector x ∈ {0,1}^n (one-hot encoded) │
│ │
│ Example (simplified): │
│ ───────────────────── │
│ user_gender=male: [1, 0] (2 dims) │
│ user_age=25-34: [0, 0, 1, 0, 0] (5 dims) │
│ ad_category=sports: [0, 0, 0, 1, 0] (5 dims) │
│ device=mobile: [0, 1] (2 dims) │
│ │
│ Concatenated: x = [1,0,0,0,1,0,0,0,0,0,1,0,0,1] │
│ │
│ Prediction: │
│ ─────────── │
│ z = w₀ + w₁·1 + w₅·1 + w₁₁·1 + w₁₄·1 │
│ = bias + w_male + w_age25-34 + w_sports + w_mobile │
│ │
│ ŷ = σ(z) = P(click) │
│ │
│ LIMITATION: Only captures first-order effects │
│ ─────────── │
│ Cannot model: "males interested in sports click more on sports ads" │
│ This requires explicit feature crosses: x_male × x_sports │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why LR works despite its simplicity:
- Interpretability: Weight w_i directly shows feature importance
- Scalability: Can train on billions of examples with SGD
- Sparsity: Most weights are zero (L1 regularization)
- Online learning: Weights can be updated incrementally
Why LR is insufficient:
- Requires manual feature engineering for interactions
- Cannot learn non-linear patterns
- Feature crosses explode combinatorially: O(n²) for pairwise, O(n^k) for k-way
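A minimal sketch of sparse logistic regression for CTR, using the hashing trick so only the weights of active features are touched per SGD step. The feature strings, dimensions, and learning rate are illustrative, not from any production system.

```python
import numpy as np

def hash_features(raw, n_dims=2**20):
    """Hash raw 'field=value' strings into sparse indices (hashing trick).
    Collisions are tolerated at this scale; only active indices are stored."""
    return [hash(f) % n_dims for f in raw]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SparseLR:
    """Logistic regression over sparse one-hot features, trained with SGD."""
    def __init__(self, n_dims=2**20, lr=0.1):
        self.w = np.zeros(n_dims)
        self.b = 0.0
        self.lr = lr

    def predict(self, idx):
        # With one-hot inputs, the logit is just a sum of active weights.
        return sigmoid(self.b + self.w[idx].sum())

    def update(self, idx, y):
        p = self.predict(idx)
        g = p - y                      # gradient of log loss w.r.t. the logit
        self.b -= self.lr * g
        self.w[idx] -= self.lr * g     # update only the active weights

# Toy usage: clicks correlate with sports ads on a sports site.
model = SparseLR()
pos = hash_features(["user_gender=male", "ad_category=sports", "site=sports_news"])
neg = hash_features(["user_gender=male", "ad_category=cooking", "site=sports_news"])
for _ in range(200):
    model.update(pos, 1)
    model.update(neg, 0)
p_pos, p_neg = model.predict(pos), model.predict(neg)
```

Note how an update touches only three weights plus the bias — this is what makes LR trainable on billions of examples.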
Stage 2: Polynomial/Feature Cross Models
To capture interactions, we can explicitly model feature crosses:
Degree-2 Polynomial:

ŷ = σ( w_0 + Σ_i w_i x_i + Σ_{i<j} w_ij x_i x_j )

The problem: for n features, we now have n(n-1)/2 pairwise interaction terms. With millions of sparse features, this is computationally infeasible and leads to severe overfitting (most pairs never co-occur in the training data).
Stage 3: Factorization Machines (FM)
The breakthrough: Instead of learning a separate weight w_ij for each feature pair, learn a latent vector v_i for each feature and model interactions as dot products.
FM formulation (Rendle, 2010):

ŷ = w_0 + Σ_i w_i x_i + Σ_{i<j} <v_i, v_j> x_i x_j

where <v_i, v_j> = Σ_f v_if · v_jf is the dot product of the two latent vectors.
┌─────────────────────────────────────────────────────────────────────────┐
│ FACTORIZATION MACHINES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY INSIGHT: Factorize the interaction weight matrix │
│ ─────────────────────────────────────────────────── │
│ │
│ Instead of: W_ij (n² parameters for pairwise interactions) │
│ │
│ Learn: V ∈ ℝ^(n×k) where k << n │
│ W_ij ≈ <v_i, v_j> = Σ v_if · v_jf │
│ │
│ Parameter reduction: │
│ ─────────────────── │
│ Full interactions: O(n²) → With FM: O(n·k) │
│ │
│ Example: n = 10⁶ features, k = 64 │
│ Full: 10¹² parameters (impossible) │
│ FM: 64 × 10⁶ = 6.4 × 10⁷ parameters (tractable) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERALIZATION POWER: │
│ ──────────────────── │
│ │
│ Even if (feature_i, feature_j) never co-occur in training data, │
│ FM can estimate their interaction through the latent vectors: │
│ │
│ v_i learned from (feature_i, feature_k) co-occurrences │
│ v_j learned from (feature_j, feature_k) co-occurrences │
│ → <v_i, v_j> provides reasonable interaction estimate │
│ │
│ This is similar to how matrix factorization in RecSys handles │
│ user-item pairs never seen in training. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Efficient computation (the FM trick):

The naive computation of pairwise interactions is O(n²k). But the FM term can be computed in O(nk):

Σ_{i<j} <v_i, v_j> x_i x_j = ½ Σ_f [ (Σ_i v_if x_i)² - Σ_i v_if² x_i² ]

Derivation:

Starting from the identity that the sum over all ordered pairs counts each unordered pair twice and includes the diagonal i = j:

Σ_{i<j} <v_i, v_j> x_i x_j = ½ [ Σ_i Σ_j <v_i, v_j> x_i x_j - Σ_i <v_i, v_i> x_i² ]

Expanding the dot products and rearranging the summation over f:

= ½ Σ_f [ (Σ_i v_if x_i)(Σ_j v_jf x_j) - Σ_i v_if² x_i² ]
= ½ Σ_f [ (Σ_i v_if x_i)² - Σ_i v_if² x_i² ]

This reformulation enables linear-time computation—critical for real-time serving.
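The FM trick can be checked numerically. This sketch (random dense features, assumed shapes) compares the O(nk) reformulation against the naive O(n²k) double loop:

```python
import numpy as np

def fm_pairwise(X, V):
    """FM pairwise term via the O(n*k) identity:
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]"""
    s = X @ V                    # (batch, k): sum_i v_i x_i per factor
    s2 = (X**2) @ (V**2)         # (batch, k): sum_i v_i^2 x_i^2 per factor
    return 0.5 * (s**2 - s2).sum(axis=1)

def fm_pairwise_naive(X, V):
    """Naive O(n^2 * k) double loop over feature pairs, for verification."""
    n = X.shape[1]
    out = np.zeros(X.shape[0])
    for i in range(n):
        for j in range(i + 1, n):
            out += (V[i] @ V[j]) * X[:, i] * X[:, j]
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))     # 4 examples, 10 features
V = rng.normal(size=(10, 3))     # latent vectors, k = 3
assert np.allclose(fm_pairwise(X, V), fm_pairwise_naive(X, V))
```

The two matrix products in `fm_pairwise` are exactly why FM serves in milliseconds even with millions of features, since sparse inputs reduce the sums to active features only.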
Stage 4: Field-aware Factorization Machines (FFM)
Limitation of FM: The same latent vector v_i is used for feature i regardless of which feature it interacts with.
FFM insight (Juan et al., 2016): Different interactions may require different representations. A user's "sports interest" should interact differently with "ad category" versus "time of day."
FFM formulation:

ŷ = w_0 + Σ_i w_i x_i + Σ_{i<j} <v_{i,f(j)}, v_{j,f(i)}> x_i x_j

where f(j) denotes the field of feature j, and v_{i,f(j)} is feature i's latent vector for interacting with field f(j).
┌─────────────────────────────────────────────────────────────────────────┐
│ FM vs FFM: FIELD-AWARE INTERACTIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FIELDS: Groups of related features │
│ ────────────────────────────────── │
│ │
│ Field 1 (User): user_id, user_age, user_gender │
│ Field 2 (Ad): ad_id, ad_category, advertiser_id │
│ Field 3 (Context): hour, day_of_week, device │
│ Field 4 (Publisher): site_id, page_category │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FM: Each feature has ONE latent vector │
│ ──────────────────────────────────── │
│ │
│ user_age=25-34: v_age = [0.1, 0.3, -0.2, ...] │
│ │
│ Interaction with ad_category: <v_age, v_category> │
│ Interaction with hour: <v_age, v_hour> │
│ (Same v_age used for both!) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FFM: Each feature has latent vector PER FIELD │
│ ───────────────────────────────────────────── │
│ │
│ user_age=25-34: │
│ v_age,Ad = [0.1, 0.3, -0.2, ...] (for Ad field) │
│ v_age,Context = [0.4, -0.1, 0.5, ...] (for Context field) │
│ v_age,Pub = [-0.2, 0.2, 0.1, ...] (for Publisher field) │
│ │
│ Interaction with ad_category: <v_age,Ad, v_category,User> │
│ Interaction with hour: <v_age,Context, v_hour,User> │
│ (Different vectors for different fields!) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARAMETER COUNT: │
│ ──────────────── │
│ FM: n × k │
│ FFM: n × F × k (F = number of fields) │
│ │
│ Tradeoff: More expressive but more parameters and slower training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
FFM won several Kaggle CTR prediction competitions and became a standard baseline in the industry.
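The field-aware lookup is the only mechanical difference from FM. A small sketch, assuming one active (all-ones) feature per pair and hypothetical field assignments:

```python
import numpy as np

def ffm_score(active, V, fields):
    """FFM pairwise term for one example with binary active features.
    V[i] holds feature i's per-field latent vectors, shape (n_fields, k).
    fields[i] is feature i's own field. Feature i interacting with feature j
    uses V[i][fields[j]] -- its vector *for j's field* -- and vice versa."""
    total = 0.0
    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            i, j = active[a], active[b]
            total += V[i][fields[j]] @ V[j][fields[i]]
    return total

rng = np.random.default_rng(1)
n_feat, n_fields, k = 6, 3, 4
V = rng.normal(size=(n_feat, n_fields, k))   # n x F x k parameters, as above
fields = [0, 0, 1, 1, 2, 2]                  # two features per field
score = ffm_score([0, 2, 4], V, fields)      # one active feature per field
```

Storing n × F × k vectors instead of n × k is the memory cost of field awareness noted in the diagram.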
Part III: Deep Learning for CTR Prediction
The Deep Learning Revolution
Around 2016, deep learning entered CTR prediction. The key insight: neural networks can automatically learn feature interactions without manual engineering.
Wide & Deep Learning (Google, 2016)
Google's Wide & Deep architecture combines memorization (wide) with generalization (deep).
Motivation:
- Memorization: Learning specific feature co-occurrences from history
- "Users who installed app X often install app Y"
- Requires feature engineering but captures precise patterns
- Generalization: Learning transferable patterns
- "Users interested in fitness apps like health-related apps"
- DNNs learn general representations but may over-generalize
Architecture:

P(click | x) = σ( w_wide^T [x, φ(x)] + w_deep^T a^(lf) + b )

where:
- x = raw features
- φ(x) = cross-product transformations (manual feature crosses)
- a^(lf) = activations of the final hidden layer of the deep network
┌─────────────────────────────────────────────────────────────────────────┐
│ WIDE & DEEP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Output │ │
│ │ σ(·) │ │
│ └──────┬───────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ ┌───────┴───────┐ ┌───────┴───────┐ │
│ │ WIDE │ │ DEEP │ │
│ │ (Linear) │ │ (DNN) │ │
│ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │
│ ┌───────┴───────┐ ┌───────┴───────┐ │
│ │ Raw Features │ │ Hidden │ │
│ │ + │ │ Layers │ │
│ │ Cross Features│ │ (ReLU) │ │
│ │ (manual) │ └───────┬───────┘ │
│ └───────┬───────┘ │ │
│ │ ┌───────┴───────┐ │
│ │ │ Embedding │ │
│ │ │ Layer │ │
│ │ └───────┬───────┘ │
│ │ │ │
│ └───────────┬───────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ Sparse Features │ │
│ │ (categorical IDs) │ │
│ └───────────────────────┘ │
│ │
│ WIDE COMPONENT: Memorization │
│ ───────────────────────────── │
│ • Linear model on raw + crossed features │
│ • Crossed features like: installed_app × impression_app │
│ • Captures specific, frequent patterns │
│ │
│ DEEP COMPONENT: Generalization │
│ ────────────────────────────── │
│ • Embeddings for categorical features │
│ • Multiple hidden layers with ReLU │
│ • Learns dense representations that generalize │
│ │
│ JOINT TRAINING: Both components trained end-to-end │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key insight: The wide component still requires manual feature crosses. Can we automate this?
DeepFM (Huawei, 2017)
DeepFM replaces the wide component's manual crosses with a Factorization Machine, achieving automatic feature interaction learning at both low and high orders.
Architecture:

ŷ = σ( y_linear + y_FM + y_DNN )

where:
- y_linear = Σ_i w_i x_i: first-order term
- y_FM = Σ_{i<j} <v_i, v_j> x_i x_j: second-order FM interactions
- y_DNN: output of the deep component

All three components read from the same shared embedding table:
┌─────────────────────────────────────────────────────────────────────────┐
│ DeepFM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Output │ │
│ │ σ(·) │ │
│ └──────┬───────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ ADD │ │
│ └───────────┬───────────┘ │
│ ┌────────────────┼────────────────┐ │
│ │ │ │ │
│ ┌───────┴───────┐ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ 1st Order │ │ 2nd Order │ │ Deep │ │
│ │ (Linear) │ │ (FM) │ │ (DNN) │ │
│ └───────┬───────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ │ │ ┌──────┴──────┐ │
│ │ │ │ Hidden │ │
│ │ │ │ Layers │ │
│ │ │ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┴───────────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ SHARED Embeddings │ ← KEY INNOVATION │
│ └───────────┬───────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ Sparse Features │ │
│ └───────────────────────┘ │
│ │
│ KEY INNOVATIONS: │
│ ──────────────── │
│ 1. FM replaces manual feature crosses (automatic 2nd-order) │
│ 2. DNN captures higher-order interactions │
│ 3. SHARED embeddings between FM and DNN │
│ - Reduces parameters │
│ - FM and DNN reinforce each other │
│ 4. No pre-training required (end-to-end) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why shared embeddings matter:
Each feature's embedding vector v_i serves dual purposes:
- In FM: Direct dot-product interactions
- In DNN: Concatenated as input for higher-order learning
This parameter sharing creates a synergy: the FM component provides explicit 2nd-order signals that help the DNN converge faster, while the DNN's gradients improve the embeddings used by FM.
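A forward-pass sketch of DeepFM in NumPy, with illustrative sizes and random weights (not trained): the FM term and the DNN both consume the single shared embedding table E.

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, n_fields, k = 1000, 4, 8

# SHARED embedding table: used by both the FM term and the DNN input
E = rng.normal(scale=0.1, size=(n_features, k))
w = rng.normal(scale=0.1, size=n_features)        # 1st-order weights
W1 = rng.normal(scale=0.1, size=(n_fields * k, 16))
W2 = rng.normal(scale=0.1, size=16)

def deepfm_forward(feat_ids):
    """feat_ids: one active feature id per field."""
    emb = E[feat_ids]                              # (n_fields, k), shared lookup
    first_order = w[feat_ids].sum()
    s = emb.sum(axis=0)
    fm = 0.5 * (s**2 - (emb**2).sum(axis=0)).sum() # FM trick over active fields
    h = np.maximum(emb.reshape(-1) @ W1, 0.0)      # DNN on the SAME embeddings
    dnn = h @ W2
    logit = first_order + fm + dnn
    return 1.0 / (1.0 + np.exp(-logit))

p = deepfm_forward([3, 250, 512, 999])             # hypothetical active ids
```

In training, gradients from both the FM and DNN branches flow into `E`, which is the synergy described above.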
Deep & Cross Network (DCN) (Google, 2017)
DCN introduces an elegant cross network that explicitly models feature interactions of arbitrary order without the combinatorial explosion.
Cross Layer formulation:

x_{l+1} = x_0 · (x_l^T w_l) + b_l + x_l

where:
- x_0 = input features (stacked embeddings + dense features)
- x_l = output of cross layer l
- w_l, b_l = learnable parameters of layer l
┌─────────────────────────────────────────────────────────────────────────┐
│ CROSS NETWORK: HOW IT WORKS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Cross Layer Operation: │
│ ────────────────────── │
│ │
│ x_{l+1} = x_0 · (x_l^T · w_l) + b_l + x_l │
│ = x_0 · (scalar) + b_l + x_l │
│ │
│ Breakdown: │
│ ────────── │
│ 1. x_l^T · w_l → scalar (dot product) │
│ 2. x_0 · scalar → feature-weighted x_0 │
│ 3. + x_l → residual connection │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTERACTION ORDER ANALYSIS: │
│ ─────────────────────────── │
│ │
│ Layer 0: x_0 = [x_1, x_2, x_3] (1st order features) │
│ │
│ Layer 1: x_1 = x_0 · (x_0^T w_0) + x_0 │
│ Contains: x_1, x_2, x_3 (1st order) │
│ x_1², x_1x_2, x_1x_3... (2nd order) │
│ │
│ Layer 2: x_2 = x_0 · (x_1^T w_1) + x_1 │
│ Contains: 1st, 2nd order (from x_1) │
│ 3rd order (x_0 × 2nd order terms) │
│ │
│ Layer L: Contains interactions up to order (L+1) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARAMETER EFFICIENCY: │
│ ───────────────────── │
│ │
│ Each cross layer: d parameters (w_l) + d parameters (b_l) = 2d │
│ L cross layers: 2Ld parameters │
│ │
│ Compare to polynomial: d^(L+1) parameters for order-(L+1) interactions │
│ │
│ Example: d=1000, L=3 │
│ Cross Network: 6,000 parameters │
│ Full polynomial: 10^12 parameters (impossible!) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
DCN-v2 (2020): The original DCN's cross layer reweights x_0 by a single scalar (x_l^T w_l), which is equivalent to a rank-1 interaction matrix. DCN-v2 generalizes to a full weight matrix:

x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l

where ⊙ is the element-wise product and W_l ∈ ℝ^(d×d).

This increases expressiveness at the cost of more parameters, with a practical compromise using the low-rank decomposition W_l = U_l V_l^T, where U_l, V_l ∈ ℝ^(d×r) with r ≪ d.
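Both cross-layer variants are a few lines each. A sketch with random parameters and an assumed input width d = 8:

```python
import numpy as np

def cross_layer_v1(x0, xl, w, b):
    """Original DCN: x_{l+1} = x0 * (xl . w) + b + xl.
    (xl . w) is a scalar, so x0 is reweighted as a whole."""
    return x0 * (xl @ w) + b + xl

def cross_layer_v2(x0, xl, W, b):
    """DCN-v2: x_{l+1} = x0 ⊙ (W xl + b) + xl.
    Full-rank W gives a per-dimension gate on x0."""
    return x0 * (W @ xl + b) + xl

rng = np.random.default_rng(3)
d = 8
x0 = rng.normal(size=d)
x = x0
for _ in range(3):                     # 3 cross layers → up to 4th-order terms
    x = cross_layer_v2(x0, x, rng.normal(scale=0.1, size=(d, d)),
                       rng.normal(scale=0.1, size=d))
x_v1 = cross_layer_v1(x0, x0, rng.normal(size=d), np.zeros(d))
```

Note the residual `+ xl` in both variants: lower-order interactions are always preserved as layers stack.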
xDeepFM (Microsoft, 2018)
xDeepFM introduces the Compressed Interaction Network (CIN) to learn explicit, bounded-degree feature interactions at the vector level (not bit level).
Key insight: In DeepFM's DNN, interactions happen at the bit level (individual embedding dimensions). CIN operates at the vector level (entire embedding vectors), which is more interpretable and controllable.
CIN formulation:

X^k_{h,*} = Σ_{i=1}^{H_{k-1}} Σ_{j=1}^{m} W^k_{h,i,j} · (X^{k-1}_{i,*} ⊙ X^0_{j,*})

where:
- X^0 ∈ ℝ^(m×D): input feature embeddings (m fields, D dimensions)
- X^k ∈ ℝ^(H_k×D): output of layer k (H_k feature maps)
- ⊙: Hadamard (element-wise) product
┌─────────────────────────────────────────────────────────────────────────┐
│ CIN: COMPRESSED INTERACTION NETWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: m field embeddings, each D-dimensional │
│ │
│ X^0 = [e_1, e_2, ..., e_m] ∈ ℝ^(m × D) │
│ │
│ LAYER k: Compute interactions with original embeddings │
│ ───────────────────────────────────────────────────── │
│ │
│ Step 1: Outer product along embedding dimension │
│ │
│ Z^k = X^(k-1) ⊗ X^0 ∈ ℝ^(H_{k-1} × m × D) │
│ │
│ Each Z^k_{i,j} = X^(k-1)_i ⊙ X^0_j (Hadamard product) │
│ │
│ Step 2: Compress along the H_{k-1} × m dimensions │
│ │
│ X^k_h = Σ_i Σ_j W^k_{h,i,j} · Z^k_{i,j} │
│ │
│ Output: X^k ∈ ℝ^(H_k × D) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTERACTION ORDER: │
│ ────────────────── │
│ │
│ Layer 1: X^1 involves X^0 ⊗ X^0 → 2nd order interactions │
│ Layer 2: X^2 involves X^1 ⊗ X^0 → 3rd order interactions │
│ Layer k: Contains exactly (k+1)-order interactions │
│ │
│ Unlike DNN where interaction order is implicit and unbounded, │
│ CIN gives explicit control over maximum interaction degree. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OUTPUT: Sum pooling over each layer │
│ │
│ p^k = Σ_h Σ_d X^k_{h,d} (scalar per layer) │
│ │
│ y_CIN = [p^1, p^2, ..., p^T] (T layers) │
│ │
│ Final: Concatenate with linear + DNN outputs │
│ │
└─────────────────────────────────────────────────────────────────────────┘
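The CIN layer is compact when written with einsum. A sketch with small assumed sizes (m fields, D-dim embeddings, H_k feature maps) and random weights:

```python
import numpy as np

def cin_layer(X_prev, X0, W):
    """One CIN layer. X_prev: (H_prev, D), X0: (m, D), W: (H_next, H_prev, m).
    Z[i, j] = X_prev[i] ⊙ X0[j]; each output feature map is a weighted sum
    (the 'compression') of all H_prev * m Hadamard products."""
    Z = X_prev[:, None, :] * X0[None, :, :]        # (H_prev, m, D)
    return np.einsum('hij,ijd->hd', W, Z)          # (H_next, D)

rng = np.random.default_rng(4)
m, D, H1 = 5, 6, 4
X0 = rng.normal(size=(m, D))                           # field embeddings
X1 = cin_layer(X0, X0, rng.normal(size=(H1, m, m)))    # 2nd-order maps
X2 = cin_layer(X1, X0, rng.normal(size=(3, H1, m)))    # 3rd-order maps
pooled = [X1.sum(), X2.sum()]                          # sum pooling per layer
```

Each layer multiplies against X^0 again, which is how the interaction order grows by exactly one per layer.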
AutoInt (2019)
AutoInt applies multi-head self-attention to learn feature interactions, treating each feature embedding as a token.
Architecture: each interacting layer is multi-head self-attention over the M feature embeddings. For head h:

α^(h)_{i,j} = softmax_j( <W_Q^(h) e_i, W_K^(h) e_j> )

ẽ_i^(h) = Σ_j α^(h)_{i,j} W_V^(h) e_j

where W_Q, W_K, W_V are learned query, key, and value projections; head outputs are concatenated and combined with a residual connection.
┌─────────────────────────────────────────────────────────────────────────┐
│ AutoInt: ATTENTION FOR FEATURE INTERACTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INTUITION: Treat feature embeddings like tokens in a transformer │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Input: M feature embeddings [e_1, e_2, ..., e_M] │
│ │
│ e_1 e_2 e_3 e_4 │
│ (user_age) (ad_cat) (hour) (device) │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention │ │
│ │ │ │
│ │ α_11 α_12 α_13 α_14 │ │
│ │ α_21 α_22 α_23 α_24 (attention matrix)│ │
│ │ α_31 α_32 α_33 α_34 │ │
│ │ α_41 α_42 α_43 α_44 │ │
│ │ │ │
│ │ α_ij = how much feature i attends to j │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ẽ_1 ẽ_2 ẽ_3 ẽ_4 │
│ (contextualized embeddings) │
│ │
│ WHAT ATTENTION LEARNS: │
│ ────────────────────── │
│ │
│ High α_12: "user_age" strongly interacts with "ad_category" │
│ High α_34: "hour" strongly interacts with "device" │
│ │
│ Unlike FM (all pairs weighted equally by dot product), │
│ attention learns WHICH interactions matter for each example. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STACKING: Multiple attention layers = higher-order interactions │
│ │
│ Layer 1: ẽ^1 = Attn(e, e, e) → 2nd order │
│ Layer 2: ẽ^2 = Attn(ẽ^1, ẽ^1, ẽ^1) → 3rd order │
│ Layer L: up to (L+1)-order interactions │
│ │
│ Residual connections: ẽ^l = ẽ^(l-1) + Attn(...) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Advantages of attention-based interaction:
- Dynamic: Attention weights depend on the specific input (unlike FM's fixed weights)
- Interpretable: Can visualize which features interact
- Efficient: Self-attention is well-optimized on modern hardware
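A single-head interacting layer in NumPy (random projections, assumed shapes; AutoInt itself uses multiple heads and learned weights):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interacting_layer(E, Wq, Wk, Wv):
    """One single-head AutoInt-style layer: features attend to each other.
    E: (M, d) feature embeddings. Returns contextualized embeddings with a
    residual connection, plus the (M, M) attention matrix."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T)           # A[i, j]: how much feature i attends to j
    return np.maximum(E + A @ V, 0.0), A

rng = np.random.default_rng(5)
M, d = 4, 8                        # e.g. user_age, ad_cat, hour, device
E = rng.normal(size=(M, d))
out, A = interacting_layer(E, *(rng.normal(scale=0.3, size=(d, d)) for _ in range(3)))
```

Inspecting `A` after training is what gives AutoInt its interpretability: large entries mark the feature pairs the model considers interacting.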
FiBiNET (Sina Weibo, 2019)
FiBiNET introduces SENET-like attention to dynamically reweight feature importance before interaction.
Key innovation: Not all features are equally important for every prediction. FiBiNET learns to squeeze (aggregate) and excite (reweight) features.
SENET Layer:

z_i = mean(e_i)  (squeeze: average pooling per field)

a = σ( W_2 · ReLU(W_1 · z) )  (excite: two FC layers)

v_i = a_i · e_i  (reweight)
┌─────────────────────────────────────────────────────────────────────────┐
│ FiBiNET: FEATURE IMPORTANCE LEARNING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Not all features equally important for all predictions │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Example: Predicting click for "luxury watch" ad │
│ │
│ Important features: user_income, user_age, user_interests │
│ Less important: hour_of_day, browser_type │
│ │
│ But for "fast food" ad: │
│ Important features: hour_of_day, user_location │
│ Less important: user_income │
│ │
│ SENET learns to dynamically reweight based on the input! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ARCHITECTURE: │
│ │
│ Input embeddings: E = [e_1, e_2, ..., e_f] (f fields) │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SQUEEZE: Global average pooling per field │ │
│ │ │ │
│ │ z_i = mean(e_i) → z = [z_1, z_2, ..., z_f] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ EXCITE: Two FC layers with reduction ratio r │ │
│ │ │ │
│ │ s = W_1 · z (f → f/r) │ │
│ │ s = ReLU(s) │ │
│ │ a = W_2 · s (f/r → f) │ │
│ │ a = sigmoid(a) (importance weights) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ REWEIGHT: Scale embeddings by importance │ │
│ │ │ │
│ │ v_i = a_i · e_i → V = [v_1, v_2, ..., v_f] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Bilinear interaction on reweighted embeddings │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
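The squeeze-excite-reweight steps fit in one small function. A sketch with assumed sizes (f fields, reduction ratio r) and random weights:

```python
import numpy as np

def senet_reweight(E, W1, W2):
    """SENET block over field embeddings E (f, d):
    squeeze (mean per field) → excite (bottleneck FC layers) → reweight."""
    z = E.mean(axis=1)                         # squeeze: (f,) field summaries
    s = np.maximum(W1 @ z, 0.0)                # excite, reduce f → f/r
    a = 1.0 / (1.0 + np.exp(-(W2 @ s)))        # expand back; weights in (0, 1)
    return a[:, None] * E, a                   # reweight each field's embedding

rng = np.random.default_rng(6)
f, d, r = 8, 16, 2
E = rng.normal(size=(f, d))
V, a = senet_reweight(E, rng.normal(size=(f // r, f)),
                      rng.normal(size=(f, f // r)))
```

Because `a` is computed from the current input, the same field (say, hour_of_day) can be amplified for one ad and suppressed for another, as in the luxury-watch vs fast-food example.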
Part IV: User Behavior Sequence Modeling
The Behavior Sequence Problem
So far, we've treated user features as static. But in advertising, user behavior history is critical. A user who just searched for "running shoes" is much more likely to click on a Nike ad than their static demographic profile suggests.
Challenge: How do we model sequences of past behaviors to predict future ad clicks?
┌─────────────────────────────────────────────────────────────────────────┐
│ USER BEHAVIOR IN AD PREDICTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User's recent behavior sequence: │
│ ──────────────────────────────── │
│ │
│ t-5: Viewed "Nike Air Max" product page │
│ t-4: Searched "best running shoes 2024" │
│ t-3: Clicked ad for "Adidas Ultraboost" │
│ t-2: Read article "Marathon Training Guide" │
│ t-1: Added "Running Socks" to cart │
│ t: Current ad impression: "Nike Running Shoes" │
│ │
│ QUESTION: How does this history inform P(click)? │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ NAIVE APPROACH: Aggregate all behaviors │
│ ─────────────────────────────────────── │
│ │
│ user_embedding = mean([e_nike, e_search, e_adidas, e_article, e_socks])│
│ │
│ Problems: │
│ • All behaviors weighted equally │
│ • Recent behaviors not prioritized │
│ • Relationship to target ad ignored │
│ │
│ BETTER: Weight behaviors by relevance to current ad │
│ ───────────────────────────────────────────────── │
│ │
│ For "Nike Running Shoes" ad: │
│ • "Nike Air Max" view: HIGH relevance (same brand + category) │
│ • "Adidas Ultraboost" click: MEDIUM relevance (competitor) │
│ • "Marathon Training" read: MEDIUM relevance (related interest) │
│ • "Running Socks" cart: LOW relevance (different product type) │
│ │
│ This is the core idea behind DIN, DIEN, and related models. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Deep Interest Network (DIN) (Alibaba, 2018)
DIN introduces target-aware attention over user behavior sequences. Instead of treating all past behaviors equally, DIN computes attention weights based on relevance to the target ad.
Key insight: User interests are diverse and locally activated. When predicting clicks on a "Nike shoe" ad, past behaviors related to sports/shoes should matter more than unrelated behaviors.
Attention mechanism:

v_U = Σ_{j=1}^{T} a(e_j, e_A) · e_j

where:
- e_j = embedding of behavior j
- e_A = embedding of the target ad A
- a(·,·) = attention (activation) function

Attention function (activation unit): a small MLP over the concatenation [e_j, e_j ⊙ e_A, e_j − e_A, e_A], outputting a single relevance score.
┌─────────────────────────────────────────────────────────────────────────┐
│ DIN: DEEP INTEREST NETWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TARGET AD: Nike Running Shoes (embedding e_A) │
│ │
│ USER BEHAVIOR HISTORY: │
│ ────────────────────── │
│ │
│ e_1 e_2 e_3 e_4 e_5 │
│ (Nike Air) (Search) (Adidas) (Article) (Socks) │
│ │ │ │ │ │ │
│ └───────────┴───────────┴───────────┴───────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ Attention Unit │ │
│ │ a(e_j, e_A) │← Target ad e_A │
│ └───────────┬───────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ a_1=0.6 a_2=0.3 a_3=0.4 ... │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ a_1·e_1 + a_2·e_2 + a_3·e_3 + ... │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ User Interest │ │
│ │ Representation │ │
│ │ v_U │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ Concatenate with other │ │
│ │ features → MLP → P(click) │ │
│ └─────────────────────────────┘ │
│ │
│ KEY PROPERTIES: │
│ ─────────────── │
│ 1. Attention weights NOT normalized (sum ≠ 1) │
│ - Allows varying total interest intensity │
│ - User with strong interest → larger ||v_U|| │
│ │
│ 2. Activation unit uses both similarity AND difference │
│ - [e_j, e_A]: raw features │
│ - [e_j ⊙ e_A]: element-wise product (similarity) │
│ - [e_j - e_A]: difference (captures contrast) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why unnormalized attention?
In standard attention (e.g., transformers), weights sum to 1. DIN deliberately avoids normalization:
- Normalized: User with 10 relevant items and user with 1 relevant item produce similar magnitude outputs
- Unnormalized: User with more relevant items has larger interest representation, capturing interest intensity
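A sketch of DIN-style pooling with a toy one-layer activation unit (the real model uses a deeper MLP; all shapes here are illustrative). The final assertion-style usage shows the intensity property: duplicating the history doubles the interest vector, which softmax-normalized attention would not do.

```python
import numpy as np

def din_attention_pool(behaviors, target, mlp):
    """DIN-style interest pooling: score each behavior against the target ad
    with an activation unit, then take an UNNORMALIZED weighted sum."""
    feats = np.concatenate(
        [behaviors,
         behaviors * target,                    # elementwise similarity
         behaviors - target,                    # contrast
         np.broadcast_to(target, behaviors.shape)], axis=1)
    a = mlp(feats)                              # (T,) relevance scores, no softmax
    return a @ behaviors                        # v_U grows with relevant history

rng = np.random.default_rng(7)
d, T = 8, 5
W = rng.normal(scale=0.2, size=4 * d)
mlp = lambda f: np.maximum(f @ W, 0.0)          # toy one-layer activation unit
behaviors = rng.normal(size=(T, d))             # past behavior embeddings
target = rng.normal(size=d)                     # target ad embedding
v_user = din_attention_pool(behaviors, target, mlp)
doubled = din_attention_pool(np.vstack([behaviors, behaviors]), target, mlp)
```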
Deep Interest Evolution Network (DIEN) (Alibaba, 2019)
DIEN extends DIN by modeling the temporal evolution of user interests, not just their static representation.
Key insight: User interests evolve over time. The sequence [search shoes → view Nike → view Adidas → buy Nike] tells a story of interest development that static attention misses.
Two-layer architecture:
- Interest Extractor Layer: GRU captures sequential patterns
- Interest Evolving Layer: Attention-augmented GRU focuses on target-relevant evolution
┌─────────────────────────────────────────────────────────────────────────┐
│ DIEN: INTEREST EVOLUTION NETWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: INTEREST EXTRACTOR (GRU) │
│ ───────────────────────────────── │
│ │
│ b_1 → b_2 → b_3 → b_4 → b_5 (behavior sequence) │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ │
│ │GRU│→│GRU│→│GRU│→│GRU│→│GRU│ │
│ └──┘ └──┘ └──┘ └──┘ └──┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ h_1 h_2 h_3 h_4 h_5 (hidden states = interest states) │
│ │
│ Auxiliary loss: Predict next behavior from h_t │
│ L_aux = -Σ [log P(b_{t+1} | h_t) + log P(b'_t | h_t)] │
│ (positive + negative samples) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 2: INTEREST EVOLVING (AUGRU) │
│ ────────────────────────────────── │
│ │
│ h_1 h_2 h_3 h_4 h_5 │
│ │ │ │ │ │ │
│ │ ┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐ │
│ │ │Attn ││Attn ││Attn ││Attn │ ← attention w.r.t. target ad │
│ │ └──┬──┘└──┬──┘└──┬──┘└──┬──┘ │
│ │ a_1│ a_2│ a_3│ a_4│ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌────┐┌────┐┌────┐┌────┐┌────┐ │
│ │AUGRU││AUGRU││AUGRU││AUGRU││AUGRU│ │
│ └────┘└────┘└────┘└────┘└────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ h'_1 h'_2 h'_3 h'_4 h'_5 │
│ │ │
│ ▼ │
│ Final interest state h'_T │
│ (used for prediction) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ AUGRU (Attention Update GRU): │
│ ───────────────────────────── │
│ │
│ Standard GRU: │
│ ũ_t = σ(W_u · [h_{t-1}, i_t] + b_u) (update gate) │
│ h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, i_t]) (candidate) │
│ h_t = (1 - ũ_t) ⊙ h_{t-1} + ũ_t ⊙ h̃_t │
│ │
│ AUGRU modification: │
│ u'_t = a_t · ũ_t (attention-scaled update) │
│ h'_t = (1 - u'_t) ⊙ h'_{t-1} + u'_t ⊙ h̃_t │
│ │
│ Effect: Low attention a_t → small update → ignore this behavior │
│ High attention a_t → normal update → incorporate behavior │
│ │
└─────────────────────────────────────────────────────────────────────────┘
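The AUGRU update can be sketched in a few lines (random parameters, assumed dimensions; biases zero for brevity). With attention a = 0 the state is provably unchanged, which is the "ignore this behavior" effect described in the diagram.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def augru_step(h_prev, x, a, P):
    """One AUGRU step: a standard GRU update whose update gate is scaled by
    the attention score a ∈ [0, 1] w.r.t. the target ad."""
    xh = np.concatenate([h_prev, x])
    u = sigmoid(P['Wu'] @ xh + P['bu'])            # update gate
    r = sigmoid(P['Wr'] @ xh + P['br'])            # reset gate
    h_cand = np.tanh(P['Wh'] @ np.concatenate([r * h_prev, x]) + P['bh'])
    u = a * u                                      # AUGRU: attention-scaled gate
    return (1.0 - u) * h_prev + u * h_cand

rng = np.random.default_rng(8)
d = 6
P = {k: rng.normal(scale=0.3, size=(d, 2 * d)) for k in ('Wu', 'Wr', 'Wh')}
P.update({k: np.zeros(d) for k in ('bu', 'br', 'bh')})
h = np.zeros(d)
x = rng.normal(size=d)                             # one behavior's interest state
h_ignored = augru_step(h, x, a=0.0, P=P)           # a = 0 → state unchanged
h_used = augru_step(h, x, a=1.0, P=P)              # a = 1 → plain GRU update
```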
Auxiliary loss for interest extraction:

The auxiliary loss ensures hidden states actually capture user interests by requiring them to predict the next behavior:

L_aux = -(1/N) Σ_t [ log σ(h_t · e_{b_{t+1}}) + log(1 - σ(h_t · e_{b'_{t+1}})) ]

where b'_{t+1} is a negative sample (an item the user didn't interact with).
DSIN: Deep Session Interest Network (Alibaba, 2019)
DSIN recognizes that user behavior naturally clusters into sessions. Within a session, behaviors are highly related; across sessions, interests may differ significantly.
Architecture:
- Session Division: Split behavior sequence into sessions (e.g., by time gaps)
- Intra-Session Interest: Self-attention within each session
- Inter-Session Interest: Bi-LSTM across sessions to capture evolution
- Session Interest Activation: Attention with target ad
┌─────────────────────────────────────────────────────────────────────────┐
│ DSIN: SESSION-BASED INTEREST MODELING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RAW BEHAVIOR SEQUENCE: │
│ ────────────────────── │
│ [b1, b2, b3] | gap | [b4, b5] | gap | [b6, b7, b8, b9] │
│ └──Session 1──┘ └Session 2┘ └───Session 3────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 1: INTRA-SESSION (Self-Attention per session) │
│ ─────────────────────────────────────────────────── │
│ │
│ Session 1: [b1, b2, b3] → Self-Attention → Interest I_1 │
│ Session 2: [b4, b5] → Self-Attention → Interest I_2 │
│ Session 3: [b6, b7, b8, b9] → Self-Attention → Interest I_3 │
│ │
│ Self-attention captures relationships within session: │
│ "viewed Nike, then searched shoes, then viewed Nike sizes" │
│ → coherent shopping intent │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 2: INTER-SESSION (Bi-LSTM across sessions) │
│ ──────────────────────────────────────────────── │
│ │
│ I_1 ───→ I_2 ───→ I_3 (forward LSTM) │
│ I_1 ←─── I_2 ←─── I_3 (backward LSTM) │
│ │
│ Captures interest evolution across sessions: │
│ "First explored options, then compared prices, then ready to buy" │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 3: SESSION ACTIVATION (Target-aware attention) │
│ ──────────────────────────────────────────────────── │
│ │
│ [I_1, I_2, I_3] × Attention(target_ad) → weighted sum │
│ │
│ Recent relevant session may be more important than old relevant one │
│ │
└─────────────────────────────────────────────────────────────────────────┘
SIM: Search-based Interest Model (Alibaba, 2020)
For users with very long behavior histories (thousands of items), attention over the full sequence is too slow. SIM introduces a two-stage approach: first retrieve relevant behaviors, then apply attention.
Architecture:
- General Search Unit (GSU): Fast retrieval of top-K relevant behaviors
- Exact Search Unit (ESU): Precise attention over retrieved behaviors
GSU (soft search):
Score each past behavior against the target ad (score_i = e_i^T W e_target) and select the top-K behaviors with the highest relevance scores.
GSU (hard search):
Use category/brand matching to retrieve candidates, e.g., "all behaviors in same category as target ad."
┌─────────────────────────────────────────────────────────────────────────┐
│ SIM: HANDLING LONG BEHAVIOR SEQUENCES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: User has 10,000 past behaviors │
│ ───────────────────────────────────── │
│ │
│ DIN/DIEN: Attention over 10,000 items → O(10,000) per inference │
│ → Too slow for real-time serving (<10ms budget) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SIM SOLUTION: Two-stage retrieval + attention │
│ ───────────────────────────────────────────── │
│ │
│ All behaviors (10,000) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ STAGE 1: General Search Unit (GSU) │ │
│ │ │ │
│ │ Option A: Soft search │ │
│ │ score_i = e_i^T W e_target │ │
│ │ Keep top-K (e.g., K=100) │ │
│ │ │ │
│ │ Option B: Hard search │ │
│ │ Filter by category/brand match │ │
│ │ (Very fast, O(1) with index) │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ Relevant behaviors (100) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ STAGE 2: Exact Search Unit (ESU) │ │
│ │ │ │
│ │ Full attention (like DIN/DIEN) │ │
│ │ over the K retrieved behaviors │ │
│ │ │ │
│ │ Time encoding: add position info │ │
│ │ for temporal awareness │ │
│ └────────────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ User interest vector │
│ │
│ COMPLEXITY REDUCTION: │
│ ───────────────────── │
│ Original: O(N) attention = O(10,000) │
│ SIM: O(K) attention = O(100) → 100x speedup! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
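The two-stage flow above can be sketched in a few lines of NumPy. This is an illustrative soft-search GSU followed by a scaled dot-product ESU; the projection matrix `W` and all dimensions are stand-ins, not SIM's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_behaviors, k = 16, 10_000, 100

behaviors = rng.normal(size=(n_behaviors, d))   # user's full history embeddings
target = rng.normal(size=d)                     # target ad embedding
W = rng.normal(scale=0.1, size=(d, d))          # learned relevance projection (GSU)

# Stage 1 (GSU, soft search): cheap relevance score, keep top-K
scores = behaviors @ (W @ target)               # score_i = e_i^T W e_target
top_k = np.argpartition(scores, -k)[-k:]
retrieved = behaviors[top_k]                    # (100, d)

# Stage 2 (ESU): exact attention over only the K retrieved behaviors
logits = retrieved @ target / np.sqrt(d)
attn = np.exp(logits - logits.max())
attn /= attn.sum()
interest = attn @ retrieved                     # user interest vector, (d,)
print(interest.shape)
```

The expensive softmax attention now runs over 100 items instead of 10,000, which is the complexity reduction shown in the diagram.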
Part V: Multi-Task Learning for Advertising
The Multi-Objective Problem
In advertising, we care about multiple outcomes:
- Click: Did user click the ad? (CTR)
- Conversion: Did user complete a purchase? (CVR)
- Engagement: Did user spend time on landing page?
- Long-term: Did user become a repeat customer?
These objectives are related but distinct. Multi-task learning (MTL) enables joint modeling.
┌─────────────────────────────────────────────────────────────────────────┐
│ WHY MULTI-TASK LEARNING? │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OPTION 1: Separate models │
│ ───────────────────────── │
│ │
│ CTR Model: Features → DNN_1 → P(click) │
│ CVR Model: Features → DNN_2 → P(conversion) │
│ │
│ Problems: │
│ • No shared learning (features learned independently) │
│ • Inconsistent predictions (CTR and CVR may disagree) │
│ • Sample selection bias for CVR (only see conversions after clicks) │
│ • 2x serving cost │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 2: Multi-task model │
│ ────────────────────────── │
│ │
│ Features → Shared Layers → Task-Specific Heads → [P(click), P(conv)] │
│ │
│ Benefits: │
│ • Shared representations learn general patterns │
│ • Transfer learning between tasks │
│ • Single model for serving │
│ • Auxiliary tasks regularize main task │
│ │
│ Challenges: │
│ • Negative transfer (tasks may conflict) │
│ • Gradient balancing between tasks │
│ • Different task difficulties │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Shared-Bottom Architecture
The simplest MTL approach: share bottom layers, use task-specific towers on top.
Architecture:

h = f_shared(x),   y_k = f_k(h)   for each task k

Loss:

L = Σ_k w_k · L_k(y_k, label_k)

where w_k balances task importance.
┌─────────────────────────────────────────────────────────────────────────┐
│ SHARED-BOTTOM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Task 1 Task 2 Task 3 │
│ (CTR) (CVR) (LTV) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Tower 1 │ │Tower 2 │ │Tower 3 │ │
│ │ (MLP) │ │ (MLP) │ │ (MLP) │ │
│ └────┬───┘ └────┬───┘ └────┬───┘ │
│ │ │ │ │
│ └──────────────┼─────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Shared │ │
│ │ Bottom │ │
│ │ (MLP) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Embedding │ │
│ │ Layer │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Input │ │
│ │ Features │ │
│ └─────────────┘ │
│ │
│ LIMITATION: Assumes all tasks benefit from same representation │
│ → Negative transfer when tasks conflict │
│ │
└─────────────────────────────────────────────────────────────────────────┘
MMOE: Multi-gate Mixture-of-Experts (Google, 2018)
MMOE addresses negative transfer by learning task-specific combinations of shared experts.
Key idea: Instead of one shared bottom, use multiple expert networks and let each task decide which experts to use.
Architecture:

y_k = f_k( Σ_i g_k(x)_i · e_i(x) )

where:
- e_i(x) = expert i's output
- g_k(x) = task k's gating weights
- f_k = task k's tower
┌─────────────────────────────────────────────────────────────────────────┐
│ MMOE: MIXTURE OF EXPERTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Task 1 Task 2 Task 3 │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Tower 1 │ │Tower 2 │ │Tower 3 │ │
│ └────┬───┘ └────┬───┘ └────┬───┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Gate 1 │ │Gate 2 │ │Gate 3 │ │
│ │[.3,.5,.2] │[.6,.2,.2] │[.1,.3,.6] │
│ └────┬───┘ └────┬───┘ └────┬───┘ │
│ │ │ │ │
│ │ │ │ │
│ └──────────────┼─────────────┘ │
│ weighted sum │
│ ┌──────────────┼─────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Expert 1│ │Expert 2│ │Expert 3│ │
│ │ (MLP) │ │ (MLP) │ │ (MLP) │ │
│ └────┬───┘ └────┬───┘ └────┬───┘ │
│ │ │ │ │
│ └──────────────┼─────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Input │ │
│ └─────────────┘ │
│ │
│ KEY INSIGHT: │
│ ─────────── │
│ • Task 1 uses mostly Expert 2 (weights [.3, .5, .2]) │
│ • Task 3 uses mostly Expert 3 (weights [.1, .3, .6]) │
│ • Tasks can specialize while still sharing some computation │
│ │
│ GATING MECHANISM: │
│ ───────────────── │
│ g_k(x) = softmax(W_k · x) │
│ │
│ Gate is input-dependent: different inputs may use different expert │
│ combinations even for the same task │
│ │
└─────────────────────────────────────────────────────────────────────────┘
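A minimal forward pass makes the gating explicit. Everything here is a toy stand-in (single-layer experts, random weights); it shows only the structure: each task mixes the same expert outputs with its own input-dependent gate:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_exp, n_experts, n_tasks = 8, 4, 3, 2

W_exp = rng.normal(scale=0.3, size=(n_experts, d_exp, d_in))    # expert nets
W_gate = rng.normal(scale=0.3, size=(n_tasks, n_experts, d_in)) # per-task gates
w_tower = rng.normal(scale=0.3, size=(n_tasks, d_exp))          # task towers

def mmoe_forward(x):
    experts = np.tanh(W_exp @ x)             # (n_experts, d_exp): shared experts
    outs = []
    for k in range(n_tasks):
        g = softmax(W_gate[k] @ x)           # g_k(x): task k's expert weights
        mixed = g @ experts                  # task-specific mixture, (d_exp,)
        outs.append(1 / (1 + np.exp(-w_tower[k] @ mixed)))  # e.g. pCTR, pCVR
    return np.array(outs)

x = rng.normal(size=d_in)
p = mmoe_forward(x)
print(p)   # one probability per task
```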
PLE: Progressive Layered Extraction (Tencent, 2020)
PLE extends MMOE by separating task-specific experts from shared experts, and progressively refining representations across layers.
Key insight: Some knowledge should be task-specific from the start, not just at the tower level.
Architecture:

At each layer l, task k's representation is a gated mixture of its own experts and the shared experts:

h_k^(l+1) = g_k^(l)(h_k^(l)) · [ E_k^(l), E_s^(l) ]

where:
- E_k = task-specific experts for task k
- E_s = shared experts
- g_k = gating network selecting from both
┌─────────────────────────────────────────────────────────────────────────┐
│ PLE: PROGRESSIVE LAYERED EXTRACTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Task A Tower Task B Tower │
│ │ │ │
│ ┌──────────┴───────────────────┴──────────┐ │
│ │ │ │
│ ▼ ▼ │
│ EXTRACTION LAYER 2: │
│ ────────────────────────────────────────────────────────────────── │
│ │
│ Task A Shared Task B │
│ Experts Experts Experts │
│ ┌───┐┌───┐ ┌───┐┌───┐ ┌───┐┌───┐ │
│ │EA1││EA2│ │ES1││ES2│ │EB1││EB2│ │
│ └─┬─┘└─┬─┘ └─┬─┘└─┬─┘ └─┬─┘└─┬─┘ │
│ │ │ │ │ │ │ │
│ └────┴────────┴────┴─────────┴────┘ │
│ │ │ │
│ Gate A selects Gate B selects │
│ from all 6 from all 6 │
│ │ │ │
│ ─────────┴────────────────┴───────────────────────────────────────── │
│ │
│ EXTRACTION LAYER 1: │
│ ────────────────────────────────────────────────────────────────── │
│ │
│ Task A Shared Task B │
│ Experts Experts Experts │
│ ┌───┐┌───┐ ┌───┐┌───┐ ┌───┐┌───┐ │
│ │EA1││EA2│ │ES1││ES2│ │EB1││EB2│ │
│ └─┬─┘└─┬─┘ └─┬─┘└─┬─┘ └─┬─┘└─┬─┘ │
│ │ │ │ │ │ │ │
│ └────┴────────┴────┴─────────┴────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ │ Input │ │
│ └───────────┘ │
│ │
│ COMPARISON WITH MMOE: │
│ ───────────────────── │
│ │
│ MMOE: │
│ • All experts are shared │
│ • Task-specific learning only in towers │
│ │
│ PLE: │
│ • Explicit task-specific experts at each layer │
│ • Progressive refinement through multiple extraction layers │
│ • Better handles conflicting tasks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
ESMM: Entire Space Multi-Task Model (Alibaba, 2018)
ESMM addresses the sample selection bias problem in CVR prediction.
The problem:
- CVR (conversion rate) = P(conversion | click)
- Training data: Only users who clicked
- Deployment: Predict for all impressions (including non-clickers)
This is sample selection bias: the training distribution differs from the deployment distribution.
ESMM's solution: Model the entire sample space using the decomposition:

P(click ∩ conversion | impression) = P(click | impression) × P(conversion | click, impression)
pCTCVR = pCTR × pCVR

or equivalently:

pCVR = pCTCVR / pCTR
┌─────────────────────────────────────────────────────────────────────────┐
│ ESMM: SAMPLE SELECTION BIAS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROBLEM: │
│ ──────────── │
│ │
│ All Impressions (1M) │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ Clicks (30K) No Click (970K) │
│ │ │ │
│ ┌───────┴───────┐ × │
│ │ │ (no conversion │
│ Conversion (1K) No Conv (29K) data here!) │
│ │
│ Traditional CVR model: │
│ • Trained on 30K clicks only │
│ • Deployed on 1M impressions │
│ • Distribution mismatch! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ESMM SOLUTION: │
│ ────────────── │
│ │
│ Model CTCVR (click AND convert) over entire impression space: │
│ │
│ P(click ∩ conversion) = P(click) × P(conversion | click) │
│ CTCVR = CTR × CVR │
│ │
│ ARCHITECTURE: │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ CTR Tower │ │ CVR Tower │ │
│ │ pCTR │ │ pCVR │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ │ ┌──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────┐ │
│ │ pCTCVR = pCTR × pCVR │ ← Multiplied, not concatenated │
│ └────────────────────────┘ │
│ │
│ TRAINING: │
│ ───────── │
│ • CTR: supervised on all impressions (click/no-click labels) │
│ • CTCVR: supervised on all impressions (conversion labels) │
│ • CVR: NO direct supervision—learned implicitly! │
│ │
│ Loss = L_CTR(pCTR, click_label) + L_CTCVR(pCTCVR, conversion_label) │
│ │
│ BENEFIT: │
│ ──────── │
│ • CVR implicitly trained on ALL samples (via CTCVR supervision) │
│ • No sample selection bias │
│ • CTR and CVR share embeddings (transfer learning) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mathematical insight:
Since pCTCVR = pCTR × pCVR, the gradient flows through both towers:

∂L_CTCVR / ∂θ_CVR = (∂L_CTCVR / ∂pCTCVR) × pCTR × (∂pCVR / ∂θ_CVR)

The CVR tower receives gradients weighted by pCTR, which naturally emphasizes training on samples likely to click: exactly what we want for CVR estimation.
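The two-headed loss is simple to write down. A minimal NumPy sketch, using simulated tower outputs in place of the two DNNs (all numbers here are synthetic):

```python
import numpy as np

def esmm_loss(p_ctr, p_cvr, click, conv, eps=1e-8):
    """ESMM loss over the entire impression space: supervise pCTR with click
    labels and pCTCVR = pCTR * pCVR with conversion labels. pCVR itself gets
    no direct label -- its gradient flows through the product."""
    p_ctcvr = p_ctr * p_cvr
    bce = lambda p, y: -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.mean(bce(p_ctr, click) + bce(p_ctcvr, conv))

rng = np.random.default_rng(1)
n = 1000
p_ctr = rng.uniform(0.01, 0.2, n)       # stand-ins for the CTR tower output
p_cvr = rng.uniform(0.01, 0.3, n)       # stand-ins for the CVR tower output
click = rng.binomial(1, p_ctr)
conv = click * rng.binomial(1, p_cvr)   # conversions only possible after clicks
print(esmm_loss(p_ctr, p_cvr, click, conv))
```

Note that both loss terms are computed on every impression, which is exactly how ESMM avoids the selection bias of training CVR on clicks only.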
Part VI: Calibration and Position Bias
Why Calibration Matters
A well-calibrated model means: when you predict 10% CTR, approximately 10% of those impressions should result in clicks.
Definition (calibration):

P(y = 1 | p̂(x) = p) = p   for all p ∈ [0, 1]

In words: among all predictions with value p, the actual positive rate should be p.
Why calibration matters in advertising:
- Revenue optimization: overestimated pCTR → overpay for impressions; underestimated pCTR → lose valuable impressions
- Budget pacing: advertisers set daily budgets assuming predicted CTRs are accurate
- Auction dynamics: second-price auctions assume truthful bidding based on accurate value estimates
┌─────────────────────────────────────────────────────────────────────────┐
│ CALIBRATION IN AD SYSTEMS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WELL-CALIBRATED MODEL: │
│ ────────────────────── │
│ │
│ Predicted CTR │ Actual CTR │ Calibration │
│ ──────────────┼────────────┼─────────────── │
│ 0.01 │ 0.010 │ Perfect │
│ 0.05 │ 0.052 │ Close │
│ 0.10 │ 0.098 │ Close │
│ 0.20 │ 0.195 │ Close │
│ │
│ POORLY-CALIBRATED MODEL: │
│ ──────────────────────── │
│ │
│ Predicted CTR │ Actual CTR │ Problem │
│ ──────────────┼────────────┼─────────────── │
│ 0.01 │ 0.005 │ Overconfident │
│ 0.05 │ 0.030 │ Overconfident │
│ 0.10 │ 0.150 │ Underconfident │
│ 0.20 │ 0.250 │ Underconfident │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CALIBRATION PLOT (Reliability Diagram): │
│ │
│ Actual CTR │
│ │ ╱ │
│ 0.3 │ ╱ │
│ │ ╱ ● (well-calibrated) │
│ 0.2 │ ╱ ● │
│ │ ╱ ● │
│ 0.1 │ ╱● │
│ │ ╱● │
│ 0 │─────────╱───────────────────────── │
│ 0 0.1 0.2 0.3 Predicted CTR │
│ │
│ Perfect calibration: points lie on diagonal │
│ Above diagonal: underconfident (actual > predicted) │
│ Below diagonal: overconfident (actual < predicted) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Calibration Methods
Platt Scaling:
Learn a post-hoc logistic transformation of the raw score s:

p_calibrated = σ(a · s + b)

where a, b are learned on a validation set.
Isotonic Regression:
Non-parametric: learn a monotonic step function mapping raw scores to calibrated probabilities.
Temperature Scaling:
Divide the logit z by a learned temperature T:

p_calibrated = σ(z / T)

where T > 1 softens predictions (less confident), T < 1 sharpens them.
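Platt scaling is just a tiny logistic regression on the validation scores. A self-contained sketch with simulated data (gradient descent stands in for a library fit; the "overconfident model" setup is synthetic):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a*s + b) on a validation set by gradient descent on
    the log loss (a minimal stand-in for a library logistic regression)."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(a * scores + b)))
        grad = p - labels
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(0)
# Simulate an overconfident model: true log-odds are half the raw scores
scores = rng.normal(size=20_000) * 2
labels = rng.binomial(1, 1 / (1 + np.exp(-0.5 * scores)))
a, b = fit_platt(scores, labels)
print(a, b)  # a recovers roughly 0.5, undoing the overconfidence
```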
Position Bias
The problem: Ads in higher positions get more clicks regardless of relevance, simply because users see them first.
Observed CTR decomposition:

P(click | ad, position) = P(examine | position) × P(click | ad, examined)

where:
- P(examine | position) = probability user sees the ad (decreases with position)
- P(click | ad, examined) = true relevance (what we want to estimate)
┌─────────────────────────────────────────────────────────────────────────┐
│ POSITION BIAS IN AD CLICKS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ USER'S ATTENTION PATTERN: │
│ ───────────────────────── │
│ │
│ Position 1: ████████████████████ P(examine) = 1.0 │
│ Position 2: ██████████████████ P(examine) = 0.85 │
│ Position 3: ████████████████ P(examine) = 0.70 │
│ Position 4: ██████████████ P(examine) = 0.55 │
│ Position 5: ████████████ P(examine) = 0.40 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE BIAS PROBLEM: │
│ ───────────────── │
│ │
│ Scenario: Two ads with same true relevance (0.10 click prob if seen) │
│ │
│ Ad A in position 1: Observed CTR = 1.0 × 0.10 = 0.10 │
│ Ad B in position 5: Observed CTR = 0.4 × 0.10 = 0.04 │
│ │
│ Naive model: "Ad A is 2.5x better than Ad B" ← WRONG! │
│ Reality: They're equally good; position caused the difference │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS MATTERS FOR TRAINING: │
│ ────────────────────────────── │
│ │
│ Training data: Historical clicks with position information │
│ │
│ If we ignore position bias: │
│ • Ads that historically appeared in top positions → overestimated CTR │
│ • Ads that historically appeared in bottom → underestimated CTR │
│ • Rich get richer (biased ads keep getting top positions) │
│ │
│ We need to estimate TRUE relevance, not position-confounded CTR │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Debiasing Approaches
1. Position as Feature:
Simply add position as an input feature during training, but set it to a reference position (e.g., position 1) during inference.
2. Propensity Weighting (Inverse Propensity Scoring):
Weight training samples inversely by their examination probability:

L_IPS = Σ_i (1 / P(examine | position_i)) · L(ŷ_i, y_i)

This upweights samples from low positions (which are harder to click).
3. Position Bias Models (PAL):
Model examination and relevance separately:

P(click | x, position) = P(examine | position) × P(click | x, examined)

At inference, use only the relevance component.
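The propensity-weighting idea fits in a few lines. Here the per-position examination probabilities are taken from the illustrative diagram above; in practice they would be estimated via result randomization or an EM-style position-bias model:

```python
import numpy as np

# Examination propensities by position (illustrative values from the diagram)
propensity = {1: 1.0, 2: 0.85, 3: 0.70, 4: 0.55, 5: 0.40}

def ips_weight(position):
    """Inverse-propensity weight: clicks observed at low positions count
    more, since the user was less likely to even examine the ad there."""
    return 1.0 / propensity[position]

def weighted_logloss(p, y, position, eps=1e-8):
    """One sample's contribution to the IPS-weighted training loss."""
    w = ips_weight(position)
    return -w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# A click at position 5 is weighted 2.5x a click at position 1
print(ips_weight(5) / ips_weight(1))  # 2.5
```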
Part VII: Real-Time Bidding (RTB)
The RTB Ecosystem
Real-Time Bidding is how most display/video ads are bought and sold. When a user loads a webpage:
- Publisher sends bid request to ad exchange (user info, context)
- Exchange broadcasts to multiple Demand-Side Platforms (DSPs)
- DSPs evaluate their advertisers' campaigns and submit bids
- Exchange runs auction, winner's ad is displayed
- Total time: <100ms
┌─────────────────────────────────────────────────────────────────────────┐
│ RTB: REAL-TIME BIDDING FLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ USER LOADS PAGE │
│ │ │
│ ▼ (1) Bid Request (~10ms) │
│ ┌─────────────┐ │
│ │ Publisher │ │
│ │ (SSP) │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ (2) Broadcast to DSPs │
│ ┌─────────────┐ │
│ │ Ad Exchange │ │
│ └──────┬──────┘ │
│ │ │
│ ┌────┴────┬─────────┬─────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │DSP 1│ │DSP 2│ │DSP 3│ │DSP 4│ (3) Each DSP decides bid (~50ms) │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │
│ │ ┌─────┴────────┴────────┴─────┐ │
│ │ │ For each campaign: │ │
│ │ │ 1. Predict pCTR, pCVR │ │
│ │ │ 2. Calculate expected value │ │
│ │ │ 3. Apply bidding strategy │ │
│ │ │ 4. Check budget constraints │ │
│ │ └─────────────────────────────┘ │
│ │ │
│ ▼ (4) Submit bids │
│ ┌─────────────┐ │
│ │ Ad Exchange │ (5) Run auction (usually 2nd price) │
│ └──────┬──────┘ │
│ │ │
│ ▼ (6) Winner's ad served │
│ ┌─────────────┐ │
│ │ User sees │ │
│ │ ad │ │
│ └─────────────┘ │
│ │
│ TOTAL LATENCY BUDGET: ~100ms │
│ • Network round-trip: ~20ms │
│ • DSP processing: ~50ms │
│ • Exchange auction: ~10ms │
│ • Ad rendering: ~20ms │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Bid Optimization
The bidding problem: Given predicted CTR, conversion rate, and campaign constraints, what bid maximizes value?
Basic formulation (for CPA campaigns):

bid = pCTR × pCVR × target_CPA

where target CPA is the advertiser's desired cost per acquisition.
Auction dynamics:
In a second-price auction, you pay the second-highest bid plus a small increment. Bidding your true value is then the dominant strategy:

bid* = expected impression value = pCTR × pCVR × target_CPA

But real auctions have complications: budget constraints, pacing requirements, competition dynamics.
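The base expected-value bid is a one-liner worth stating explicitly (function name and example numbers are illustrative):

```python
def base_bid(p_ctr: float, p_cvr: float, target_cpa: float) -> float:
    """Expected value of an impression for a CPA campaign: the advertiser
    pays target_CPA per conversion, and an impression converts with
    probability pCTR * pCVR."""
    return p_ctr * p_cvr * target_cpa

# e.g. 2% click prob, 5% conversion-given-click, $50 target CPA
print(base_bid(0.02, 0.05, 50.0))  # 0.05 -> bid $0.05 per impression
```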
Budget Pacing
Problem: Advertiser has daily budget but wants to spread impressions throughout the day (not exhaust budget in the morning).
Pacing strategies:
- Probabilistic throttling: bid on only a fraction of requests
- Bid shading: reduce bids by a pacing multiplier
- PID controller: dynamically adjust based on spend rate vs. target rate
┌─────────────────────────────────────────────────────────────────────────┐
│ BUDGET PACING OVER A DAY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Spend ($) │
│ │ │
│ $100│───────────────────────────────────────────── Budget limit │
│ │ ╱ │
│ $80│ ╱ │
│ │ ╱ ← Well-paced spend │
│ $60│ ╱ │
│ │ ╱╱╱╱╱╱ │
│ $40│ ╱╱╱╱╱╱ ← Uniform pacing target │
│ │ ╱╱╱╱╱╱ │
│ $20│ ╱╱╱╱╱ │
│ │╱ │
│ $0├─────────────────────────────────────────────────────── │
│ 0 4 8 12 16 20 24 Hour │
│ │
│ WITHOUT PACING: │
│ ─────────────── │
│ │
│ Spend ($) │
│ │ │
│ $100│─────────█████████████████████████████████████ Budget exhausted │
│ │ █ by 2pm! │
│ $80│ █ │
│ │ █ │
│ $60│ █ │
│ │ █ │
│ $40│ █ │
│ │ █ │
│ $20│ █ │
│ │█ │
│ $0├─────────────────────────────────────────────────────── │
│ 0 4 8 12 16 20 24 Hour │
│ │
│ Problem: Miss all evening traffic (often high-value!) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PID Controller for pacing:

adjustment_t = K_p · e_t + K_i · Σ_{τ≤t} e_τ + K_d · (e_t − e_{t−1})

where e_t = target_spend_rate_t − actual_spend_rate_t.
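A minimal PID pacer can be sketched as a small class. The gains and the mapping from adjustment to bid multiplier are illustrative, not tuned values:

```python
class PacingPID:
    """PID controller on the spend-rate error e_t = target_rate - actual_rate.
    Output is a bid multiplier; gains here are illustrative."""
    def __init__(self, kp=0.5, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_rate, actual_rate):
        error = target_rate - actual_rate
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjustment = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Map the adjustment to a positive bid multiplier
        return max(0.1, 1.0 + adjustment)

pid = PacingPID()
# Spending twice as fast as the uniform target -> multiplier drops below 1
m = pid.update(target_rate=1.0, actual_rate=2.0)
print(m)  # 0.35 with these gains
```

Overspending pushes the multiplier below 1 (shading bids down), underspending pushes it above 1, which is exactly the behavior the pacing curves above describe.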
Game Theory of Ad Auctions
Ad auctions are strategic games where bidders compete for impressions. Understanding game-theoretic foundations is essential for optimal bidding.
Auction Formats in Digital Advertising:
┌─────────────────────────────────────────────────────────────────────────┐
│ AD AUCTION MECHANISMS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FIRST-PRICE AUCTION: │
│ ──────────────────── │
│ Winner pays their bid │
│ │
│ Bids: [$1.00, $0.80, $0.60] │
│ Winner: Bidder 1 ($1.00) │
│ Payment: $1.00 │
│ │
│ Strategy: Shade bid below true value to increase profit margin │
│ Equilibrium: Complex, depends on beliefs about competitors │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SECOND-PRICE AUCTION (Vickrey): │
│ ─────────────────────────────── │
│ Winner pays second-highest bid │
│ │
│ Bids: [$1.00, $0.80, $0.60] │
│ Winner: Bidder 1 ($1.00) │
│ Payment: $0.80 (second price) │
│ │
│ Strategy: Bid true value (dominant strategy!) │
│ Equilibrium: Truthful bidding is optimal regardless of competitors │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERALIZED SECOND-PRICE (GSP): │
│ ─────────────────────────────── │
│ Multiple slots, each winner pays next-highest bid │
│ │
│ Bids: [$1.00, $0.80, $0.60] for 2 slots │
│ Slot 1: Bidder 1, pays $0.80 │
│ Slot 2: Bidder 2, pays $0.60 │
│ │
│ Strategy: NOT truthful! Bid shading is rational │
│ Equilibrium: Multiple equilibria exist │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Vickrey-Clarke-Groves (VCG) Mechanism:
The VCG mechanism extends second-price auctions to multiple items with the property that truthful bidding is a dominant strategy.
Payment rule:

p_i = Σ_{j≠i} v_j(A*_{−i}) − Σ_{j≠i} v_j(A*)

where:
- A* = optimal allocation with bidder i
- A*_{−i} = optimal allocation without bidder i
- v_j(A) = bidder j's value under allocation A
Intuition: Bidder i pays the externality they impose on others: the reduction in others' total value caused by i's presence.
Why VCG matters: Under VCG, bidding your true value is always optimal, regardless of what others do. This simplifies bidder strategy and improves auction efficiency.
Nash Equilibrium in GSP Auctions:
GSP (used by Google Ads for years) does NOT have truthful bidding as equilibrium. Instead, bidders shade bids strategically.
Symmetric Nash Equilibrium bid:
For the bidder in position k with value v_k, the (lowest envy-free) symmetric equilibrium bids satisfy the recursion:

b_k · α_{k−1} = v_k · (α_{k−1} − α_k) + b_{k+1} · α_k

where α_k = click-through rate for position k (position 1 has highest CTR).
Key insight: Bid shading increases with position quality difference. If position 1 gets 10x more clicks than position 2, competition for position 1 is fierce, and bid shading is minimal.
┌─────────────────────────────────────────────────────────────────────────┐
│ EQUILIBRIUM ANALYSIS: GSP vs VCG │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Example: 3 bidders, 2 slots │
│ Values: v₁ = $10, v₂ = $8, v₃ = $4 │
│ CTRs: α₁ = 1.0 (slot 1), α₂ = 0.5 (slot 2) │
│ │
│ TRUTHFUL BIDDING (VCG): │
│ ─────────────────────── │
│ Bids = Values: [$10, $8, $4] │
│ Allocation: Bidder 1 → Slot 1, Bidder 2 → Slot 2 │
│ │
│ VCG Payments: │
│ p₁ = (v₂·α₁ + v₃·α₂) - (v₂·α₂ + v₃·0) = ($8·1 + $4·0.5) - ($8·0.5) │
│ = $10 - $4 = $6 │
│ p₂ = (v₃·α₂) - (v₃·0) = $2 - $0 = $2 │
│ │
│ GSP EQUILIBRIUM (with bid shading): │
│ ──────────────────────────────────── │
│  Equilibrium bids: b₁ = $10, b₂ = $6, b₃ = $4                           │
│  Expected payments: p₁ = $6 (= b₂·α₁), p₂ = $2 (= b₃·α₂)                │
│ │
│ COMPARISON: │
│ ─────────── │
│ │ Mechanism │ Revenue │ Efficiency │ Strategy Complexity │ │
│ ├───────────┼─────────┼────────────┼─────────────────────┤ │
│ │ VCG │ $8 │ Optimal │ Simple (truthful) │ │
│ │ GSP │ $8 │ Optimal │ Complex (shade) │ │
│ │ First-Prc │ Varies │ Optimal │ Complex (shade) │ │
│ │
│ Revenue Equivalence Theorem: Under certain conditions, all │
│ auction formats yield the same expected revenue! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
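The VCG payments in the worked example can be checked mechanically. This brute-force sketch (fine for a handful of bidders, not a production auction) computes the welfare-maximizing allocation with and without each winner:

```python
from itertools import permutations

def optimal_allocation(values, ctrs, bidders):
    """Welfare-maximizing assignment of bidders to slots (brute force)."""
    best, best_assign = 0.0, ()
    for assign in permutations(bidders, len(ctrs)):
        welfare = sum(values[b] * a for b, a in zip(assign, ctrs))
        if welfare > best:
            best, best_assign = welfare, assign
    return best_assign

def vcg_payments(values, ctrs):
    bidders = list(values)
    assign = optimal_allocation(values, ctrs, bidders)
    payments = {}
    for i in assign:
        others = [b for b in bidders if b != i]
        # Others' total value if bidder i were absent
        without_i = sum(values[b] * a for b, a in
                        zip(optimal_allocation(values, ctrs, others), ctrs))
        # Others' total value in the actual allocation
        with_i = sum(values[b] * a for b, a in zip(assign, ctrs) if b != i)
        payments[i] = without_i - with_i   # externality imposed by i
    return assign, payments

# The 3-bidder, 2-slot example above: v = (10, 8, 4), CTRs (1.0, 0.5)
values = {1: 10.0, 2: 8.0, 3: 4.0}
assign, pay = vcg_payments(values, [1.0, 0.5])
print(assign, pay)  # (1, 2) {1: 6.0, 2: 2.0}
```

This reproduces p₁ = $6 and p₂ = $2 from the diagram.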
Reserve Prices and Optimal Auction Design:
Platforms set reserve prices to extract more revenue from high-value bidders.
Myerson's optimal reserve (for values uniform on [0, v̄]):

r* = v̄ / 2

More generally, r* solves r* = (1 − F(r*)) / f(r*), i.e., the point where the bidder's virtual value crosses zero.
Revenue impact:
Setting r* > 0 sacrifices some auctions (no winner) but extracts more from winners.
Modern auction trends:
- First-price auctions: Google and others moved from GSP to first-price (2019-2021)
- Header bidding: Multiple exchanges compete simultaneously
- Unified auctions: Combine direct deals with programmatic
Reinforcement Learning for Bid Optimization
Bidding is fundamentally a sequential decision problem: each bid affects budget, win rate, and future opportunities. RL provides a principled framework.
MDP Formulation for Bidding:
State space S:

s_t = (B_t, T − t, pCTR, pCVR, user/context features)

Action space A:

a_t = bid ∈ [0, bid_max] (continuous)

Or discretized:

a_t ∈ {b_1, b_2, …, b_n}

Or bid multiplier:

a_t = m, with actual bid = m × base_bid
Transition dynamics :
- If win: budget decreases by payment, conversion may occur
- If lose: budget unchanged, opportunity lost
- Time always advances
Reward function :
R(s_t, a_t) =
┌ conversion_value - payment if win and convert
│ -payment if win and no convert
└ 0 if lose
For CPA-goal campaigns, the reward can instead count conversions, with spend handled as a budget/CPA constraint (see the constrained-RL discussion below).
┌─────────────────────────────────────────────────────────────────────────┐
│ RL BIDDING: MDP FORMULATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STATE at time t: │
│ ──────────────── │
│ s_t = [ │
│ B_t, # Budget remaining ($) │
│ T - t, # Time remaining (hours) │
│ pCTR, # Predicted click probability │
│ pCVR, # Predicted conversion probability │
│ user_embed, # User features (dense) │
│ context, # Page, device, hour, etc. │
│ win_rate_t, # Recent win rate │
│ spend_rate_t # Current spend velocity │
│ ] │
│ │
│ ACTION: │
│ ─────── │
│ a_t = bid_multiplier ∈ {0.5, 0.75, 1.0, 1.25, 1.5, 2.0} │
│ actual_bid = a_t × base_bid │
│ base_bid = pCTR × pCVR × target_CPA │
│ │
│ TRANSITION: │
│ ─────────── │
│ │
│ ┌─────────┐ win (p=w(bid)) ┌─────────────────────┐ │
│ │ Bid │ ──────────────────► │ B_{t+1} = B_t - cost │ │
│ │ a_t │ │ reward = value - cost│ │
│ └────┬────┘ └─────────────────────┘ │
│ │ │
│ │ lose (p=1-w(bid)) │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ B_{t+1} = B_t │ │
│ │ reward = 0 │ │
│ │ (opportunity lost) │ │
│ └─────────────────────┘ │
│ │
│ OBJECTIVE: │
│ ────────── │
│ Maximize: E[Σ γ^t R_t] subject to Σ cost_t ≤ B (budget constraint) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Value Function and Q-Learning:
State-value function:

V^π(s) = E_π[ Σ_{k≥0} γ^k R_{t+k} | s_t = s ]

Action-value function:

Q^π(s, a) = E_π[ Σ_{k≥0} γ^k R_{t+k} | s_t = s, a_t = a ]

Bellman optimality equation:

Q*(s, a) = E[ R + γ · max_{a'} Q*(s', a') | s, a ]

Deep Q-Network (DQN) for bidding:
Approximate Q* with a neural network Q_θ, minimizing:

L(θ) = E[ ( R + γ · max_{a'} Q_{θ⁻}(s', a') − Q_θ(s, a) )² ]

where θ⁻ = target network parameters (updated periodically for stability).
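Before reaching for a deep network, the MDP can be checked on a toy simulator with tabular Q-learning over discrete bid multipliers. Everything below (the Uniform(0, 2) competitor, the horizon, the budget buckets, the reward numbers) is an illustrative stand-in for a real auction environment:

```python
import numpy as np

rng = np.random.default_rng(0)
multipliers = [0.5, 1.0, 1.5]          # discrete bid actions
T, B0 = 20, 10.0                       # horizon (auctions) and initial budget
gamma = 1.0

def simulate_auction(bid):
    """Toy second-price auction: competing bid ~ Uniform(0, 2)."""
    competitor = rng.uniform(0, 2)
    win = bid > competitor
    cost = competitor if win else 0.0   # pay the second price
    value = 1.5 if win else 0.0         # assumed value of a won impression
    return win, cost, value

def budget_bucket(b):
    return min(4, int(b / B0 * 5))      # coarse budget state (0..4)

Q = np.zeros((T, 5, len(multipliers)))  # Q[t, budget_bucket, action]
alpha, eps = 0.1, 0.2
for episode in range(5000):
    b = B0
    for t in range(T):
        s = budget_bucket(b)
        a = rng.integers(len(multipliers)) if rng.uniform() < eps else int(np.argmax(Q[t, s]))
        bid = min(multipliers[a], b)    # never bid more than remaining budget
        win, cost, value = simulate_auction(bid)
        r = value - cost
        b -= cost
        s2 = budget_bucket(b)
        target = r if t == T - 1 else r + gamma * Q[t + 1, s2].max()
        Q[t, s, a] += alpha * (target - Q[t, s, a])
print(Q[0, 4])  # learned action values with a full budget at t=0
```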
Policy Gradient Methods:
For continuous bid spaces, policy gradient methods work better than Q-learning.
Policy parametrization:

a ~ π_θ(a | s), e.g. a Gaussian over bid multipliers: π_θ(a | s) = N(μ_θ(s), σ_θ(s)²)

Policy gradient theorem:

∇_θ J(θ) = E_π[ ∇_θ log π_θ(a | s) · Q^π(s, a) ]
Actor-Critic for bidding:
┌─────────────────────────────────────────────────────────────────────────┐
│ ACTOR-CRITIC BIDDING AGENT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ State │ │
│ │ (features) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌─────────────┴─────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ ACTOR │ │ CRITIC │ │
│ │ π_θ(a|s) │ │ V_φ(s) │ │
│ │ │ │ │ │
│ │ Policy Net │ │ Value Net │ │
│ └───────┬────────┘ └───────┬────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌───────────┐ │
│ │ Bid │ │ Baseline │ │
│ │ Action │ │ V(s) │ │
│ └────┬────┘ └─────┬─────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Auction │ │ │
│ │ Result │ │ │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Advantage: A = R + γV(s') - V(s)│ │
│ └─────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ Actor update: Critic update: │
│ θ ← θ + α∇log π(a|s)·A φ ← φ - β∇(V(s) - target)² │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Handling Budget Constraints (Constrained RL):
Budget constraints make this a Constrained MDP (CMDP):

maximize E[Σ_t γ^t R_t]   subject to   E[Σ_t cost_t] ≤ B

Lagrangian relaxation:

L(π, λ) = E[Σ_t γ^t R_t] − λ · (E[Σ_t cost_t] − B)

Solve via dual gradient descent: update π to maximize L(π, λ), update λ ≥ 0 to minimize it.
Practical constraint handling:
- Penalty shaping: Add a penalty term −λ · cost_t directly to the reward
- Budget as state: Include remaining budget in state, learn budget-aware policy
- Post-hoc projection: Clip actions to respect constraints
Offline RL for Bidding:
Online RL exploration can be costly (bad bids lose money). Offline RL learns from historical auction logs.
Challenge: Distribution shift—learned policy may choose bids never seen in data.
Conservative Q-Learning (CQL):

L_CQL(θ) = L_DQN(θ) + α · E_s[ log Σ_a exp(Q_θ(s, a)) − E_{a~data}[Q_θ(s, a)] ]

The penalty term discourages high Q-values for out-of-distribution actions.
Multi-Agent Considerations:
All bidders are simultaneously learning and adapting, creating a multi-agent RL problem.
Challenges:
- Non-stationarity: Other bidders' policies change over time
- Partial observability: Can't see competitors' states or strategies
- Credit assignment: Win/loss depends on others' bids
Approaches:
- Opponent modeling: Estimate competitors' bidding strategies
- Robust RL: Optimize for worst-case competitor behavior
- Mean-field approximation: Model aggregate competition as a distribution
- Regret minimization: Guarantee no-regret against arbitrary competitors
Bid Landscape Forecasting
To optimize bids, we need to understand the competitive landscape: what bids are needed to win at different rates?
Win rate function:

w(b) = P(win | bid = b) = P(max competing bid < b)

This is typically modeled as:
- Log-normal: w(b) = Φ( (ln b − μ) / σ )
- Empirical: Learn from historical auction data
Optimal bidding with win rate:

b* = argmax_b  w(b) · (value − E[cost | win, b])

For second-price auctions, E[cost | win, b] = E[market price | market price < b].
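A small numeric sketch ties these pieces together under an assumed log-normal market price (μ, σ, and the grid integration are illustrative choices). It evaluates the second-price expected surplus across bids and recovers truthful bidding as the optimum:

```python
import math
import numpy as np

def lognormal_win_rate(bid, mu=0.0, sigma=0.5):
    """w(b) = P(market price < b) under a log-normal market-price model."""
    if bid <= 0:
        return 0.0
    return 0.5 * (1 + math.erf((math.log(bid) - mu) / (sigma * math.sqrt(2))))

def expected_surplus(bid, value, mu=0.0, sigma=0.5, n_grid=4000):
    """Second-price expected surplus E[(value - price) * 1{price < bid}],
    computed by summing over a grid of the log-normal price density."""
    prices = np.linspace(1e-3, value * 3, n_grid)
    pdf = (np.exp(-(np.log(prices) - mu) ** 2 / (2 * sigma ** 2))
           / (prices * sigma * math.sqrt(2 * math.pi)))
    step = prices[1] - prices[0]
    return float(np.sum((value - prices) * pdf * (prices < bid)) * step)

value = 2.0
bids = np.linspace(0.1, 4.0, 40)
best = max(bids, key=lambda b: expected_surplus(b, value))
print(best)  # surplus peaks at the true value: truthful bidding is optimal
```

Bidding below value forgoes profitable wins; bidding above it starts winning auctions where the price exceeds the value, so the surplus curve peaks exactly at the true value.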
Part VIII: Production Considerations
Latency Requirements
Ad systems have extreme latency requirements:
| Component | Budget |
|---|---|
| Total end-to-end | <100ms |
| Feature retrieval | <10ms |
| Model inference | <10ms |
| Ranking logic | <5ms |
| Network overhead | ~50ms |
Techniques for low-latency inference:
- Model distillation: Train small "student" model to mimic large "teacher"
- Quantization: INT8 or even INT4 inference
- Pruning: Remove unimportant weights
- Caching: Precompute user/item embeddings
- Cascade ranking: Cheap model filters 10K→100 candidates, expensive model ranks 100
Feature Store Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ FEATURE STORE FOR AD SERVING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OFFLINE PIPELINE (batch): │
│ ───────────────────────── │
│ │
│ Raw Data → Feature Engineering → Feature Store (offline) │
│ (Spark/Flink) (daily/hourly) (Hive, S3) │
│ │
│ Features: User historical stats, item aggregates, long-term behavior │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ONLINE PIPELINE (real-time): │
│ ──────────────────────────── │
│ │
│ Events → Stream Processing → Feature Store (online) │
│ (Kafka) (Flink) (Redis, DynamoDB) │
│ │
│ Features: Real-time counts, recent clicks, session features │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SERVING PATH: │
│ ───────────── │
│ │
│ Bid Request → Feature Retrieval → Model Inference → Bid Response │
│ │ │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ Online Store Offline Store │
│ (Redis: <1ms) (preloaded cache) │
│ │
│ FEATURE FRESHNESS REQUIREMENTS: │
│ ──────────────────────────────── │
│ │
│ • User embedding: Updated daily (offline OK) │
│ • User recent clicks: Updated real-time (online required) │
│ • Ad historical CTR: Updated hourly (near-line) │
│ • Context features: Computed at request time │
│ │
└─────────────────────────────────────────────────────────────────────────┘
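The serving-path merge of online and offline features can be sketched with dicts standing in for Redis and the preloaded offline cache (store layout, keys, and defaults are all hypothetical):

```python
# Hypothetical stores: dicts stand in for Redis (online) and a preloaded offline cache.
OFFLINE_STORE = {"user:42": {"user_embedding_v": [0.1, 0.3], "lifetime_clicks": 120}}
ONLINE_STORE = {"user:42": {"recent_clicks_1h": 3, "session_depth": 7}}

DEFAULTS = {"recent_clicks_1h": 0, "session_depth": 0, "lifetime_clicks": 0}

def get_features(user_id, request_context):
    """Assemble the feature vector for one bid request.

    Offline features (daily batch) are overlaid with online features
    (streaming counters), then request-time context is added last."""
    key = f"user:{user_id}"
    features = dict(DEFAULTS)
    features.update(OFFLINE_STORE.get(key, {}))   # slow-moving aggregates
    features.update(ONLINE_STORE.get(key, {}))    # real-time counters win on conflict
    features.update(request_context)              # computed at request time
    return features

f = get_features(42, {"hour_of_day": 21, "device": "mobile"})
print(f["recent_clicks_1h"], f["lifetime_clicks"])  # 3 120
```

Unknown users fall back to the defaults, which is the cold-start path in this sketch.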
Model Update and A/B Testing
Continuous training pipeline:
- Daily retraining: Incorporate yesterday's clicks/conversions
- Incremental updates: Online learning on streaming data
- Shadow deployment: New model runs alongside production, compare metrics
- Gradual rollout: 1% → 5% → 20% → 50% → 100% traffic
A/B testing considerations:
- Interference: Users in treatment may affect control (competition for same inventory)
- Delayed conversions: Need to wait days/weeks for full conversion data
- Novelty effects: New models may appear better initially due to exploration
- Metric selection: CTR? Revenue? Long-term user satisfaction?
Part IX: Advanced Topics
Delayed Feedback Modeling
Conversions often happen hours or days after clicks. How do we train when labels are incomplete?
Approaches:
- Attribution window: Only count conversions within X days of click
- Importance weighting: Weight older samples higher (more complete labels)
- Elapsed-time modeling: treat time since click as part of the model, not just a label cutoff
Delayed feedback model (Chapelle, 2014):
Jointly model the conversion probability and the delay distribution, so recent unconverted clicks are treated as "not converted yet" rather than as true negatives.
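A minimal sketch of this joint likelihood, assuming an exponential delay distribution in the spirit of Chapelle (2014); the parameterization here is simplified for illustration:

```python
import math

def delayed_feedback_nll(p, lam, samples):
    """Negative log-likelihood of a delayed feedback model.

    p   : predicted conversion probability for the click
    lam : rate of the exponential delay distribution
    samples: list of (converted, t) where t is the observed delay if
             converted, else the elapsed time since the click."""
    nll = 0.0
    for converted, t in samples:
        if converted:
            # Converted after delay t: p * lam * exp(-lam * t)
            nll -= math.log(p) + math.log(lam) - lam * t
        else:
            # Not converted *yet* after elapsed time t:
            # either a true negative, or a conversion still to come.
            nll -= math.log((1.0 - p) + p * math.exp(-lam * t))
    return nll

# A click that converted after 2 days vs. one still unconverted after 5 days
print(delayed_feedback_nll(0.3, 0.5, [(True, 2.0), (False, 5.0)]))
```

As elapsed time grows with no conversion, the unconverted term approaches the plain negative log-likelihood of a true negative, which is exactly the desired behavior.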
Fraud Detection
Click fraud costs advertisers billions annually. Detection approaches:
- Anomaly detection: Unusual click patterns, timing, sources
- Behavioral modeling: Bots have different behavior than humans
- IP/device fingerprinting: Identify fraudulent traffic sources
- Conversion modeling: Fraudulent clicks rarely convert
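These signals are often combined into a simple score before heavier models run. A toy sketch, with every threshold and weight invented for illustration:

```python
def fraud_score(clicks, conversions, avg_interclick_seconds):
    """Toy heuristic fraud score in [0, 1] combining three weak signals.

    Thresholds and weights here are illustrative, not from any real system."""
    signals = 0.0
    if clicks > 100 and conversions == 0:
        signals += 0.5          # meaningful volume with zero conversions
    if avg_interclick_seconds < 2.0:
        signals += 0.3          # inhumanly fast clicking cadence
    if clicks > 1000:
        signals += 0.2          # raw volume anomaly
    return min(signals, 1.0)

print(fraud_score(clicks=500, conversions=0, avg_interclick_seconds=1.2))
print(fraud_score(clicks=40, conversions=3, avg_interclick_seconds=45.0))
```

Real systems replace these hand-set rules with learned models, but the shape—many weak signals fused into one triage score—carries over.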
Privacy-Preserving Advertising
With increasing privacy regulations (GDPR, CCPA) and deprecation of third-party cookies:
- Federated learning: Train models without centralizing user data
- Differential privacy: Add noise to prevent individual identification
- On-device prediction: Run models locally on user devices
- Cohort-based targeting: Target groups, not individuals (Google's Topics API)
Part X: Generative AI and LLM-Powered Advertising
The emergence of Large Language Models is transforming advertising beyond traditional CTR prediction. GenAI impacts every stage of the advertising pipeline: creative generation, audience understanding, personalization, and optimization.
The GenAI Advertising Stack
┌─────────────────────────────────────────────────────────────────────────┐
│ GENAI IN ADVERTISING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL PIPELINE: │
│ ───────────────────── │
│ │
│ Advertiser → Fixed Creative → Targeting Rules → CTR Model → User │
│ (one ad copy) (demographics) (predict) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENAI-ENHANCED PIPELINE: │
│ ──────────────────────── │
│ │
│ Advertiser → LLM Creative → Semantic → Neural → User │
│ Generation Audience Ranking │
│ (1000s of Understanding + LLM │
│ variations) (intent, context) Personalization │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENAI TOUCHPOINTS: │
│ │
│ 1. CREATIVE GENERATION │
│ • Ad copy variations │
│ • Headline optimization │
│ • Image generation (DALL-E, Midjourney) │
│ • Video script generation │
│ │
│ 2. AUDIENCE UNDERSTANDING │
│ • Intent classification from search queries │
│ • Semantic user profiling │
│ • Contextual page understanding │
│ • Conversation-based preference elicitation │
│ │
│ 3. PERSONALIZATION │
│ • Dynamic ad copy adaptation │
│ • Real-time message tailoring │
│ • Conversational ad experiences │
│ │
│ 4. OPTIMIZATION │
│ • LLM-as-judge for ad quality │
│ • Automated A/B test analysis │
│ • Campaign strategy recommendations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
LLM-Powered Creative Generation
Traditional advertising requires human copywriters to create ad variations. LLMs can generate thousands of variations automatically, enabling true personalization at scale.
The Creative Generation Problem:
Given:
- Product/service description
- Brand guidelines and tone
- Target audience segment
- Platform constraints (character limits, format)
Generate: Optimized ad copy that maximizes engagement
Multi-Armed Bandit for Creative Selection:
With LLM-generated variations (potentially thousands), we need efficient exploration to find winners without wasting budget on poor performers. The Upper Confidence Bound (UCB) algorithm balances exploitation (showing best-performing creatives) with exploration (testing uncertain ones):
Select the creative $i$ that maximizes:

$$\text{UCB}_i = \hat{\mu}_i + c \sqrt{\frac{\ln t}{n_i}}$$

where:
- $\hat{\mu}_i$ = estimated CTR for creative $i$ (exploitation term)
- $n_i$ = times creative $i$ has been shown; $t$ = total impressions so far
- $c$ = exploration constant (typically 1.0-2.0)
- $c\sqrt{\ln t / n_i}$ = exploration bonus (decreases as we test creative $i$ more)
Intuition: The exploration bonus is large when $n_i$ is small (we haven't tested this creative much, so we're uncertain). As we show the creative more, the bonus shrinks and the algorithm relies more on observed performance. This prevents premature convergence to locally optimal creatives while still exploiting known winners.
Practical considerations:
- Cold start: New LLM-generated creatives start with high exploration bonus
- Batch updates: In practice, update periodically (hourly/daily) rather than per-impression
- Contextual bandits: Extend to $\hat{\mu}_i(x)$, where $x$ is the user context—different creatives may perform better for different users
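The UCB selection rule above can be sketched as a small simulation; the creative names and "true" CTRs are invented purely so the bandit has something to find:

```python
import math
import random

def ucb_select(stats, t, c=1.5):
    """Pick the creative with the highest UCB score.

    stats: {creative_id: (clicks, impressions)}; t: total impressions so far."""
    def ucb(cid):
        clicks, n = stats[cid]
        if n == 0:
            return float("inf")              # untested creatives get priority
        return clicks / n + c * math.sqrt(math.log(t) / n)
    return max(stats, key=ucb)

random.seed(1)
# Hypothetical true CTRs; the bandit never sees these directly.
true_ctr = {"urgency": 0.05, "social_proof": 0.03, "tech": 0.01}
stats = {cid: (0, 0) for cid in true_ctr}
for t in range(1, 5001):
    cid = ucb_select(stats, t)
    click = random.random() < true_ctr[cid]
    clicks, n = stats[cid]
    stats[cid] = (clicks + click, n + 1)

# The better creative should accumulate the most impressions
print(max(stats, key=lambda c: stats[c][1]))
```

Note how the weakest creative still gets some traffic—UCB never fully stops exploring, which is what protects against a lucky early streak.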
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM CREATIVE GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: │
│ ────── │
│ Product: "Running shoes with carbon fiber plate" │
│ Brand: Nike │
│ Audience: Competitive runners, 25-40 │
│ Platform: Google Search (30 char headline, 90 char description) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LLM GENERATION (with constraints): │
│ ────────────────────────────────── │
│ │
│ Variation 1 (Performance focus): │
│ Headline: "Shave Minutes Off Your PR" │
│ Description: "Carbon-plated running shoes engineered for speed. │
│ Free shipping on orders over $100." │
│ │
│ Variation 2 (Technology focus): │
│ Headline: "Carbon Fiber Technology" │
│ Description: "Experience the same tech as Olympic marathoners. │
│ Shop the new collection today." │
│ │
│ Variation 3 (Social proof): │
│ Headline: "Worn by World Champions" │
│ Description: "Join 100,000+ runners who improved their times. │
│ Rated 4.9 stars by elite athletes." │
│ │
│ Variation 4 (Urgency): │
│ Headline: "Limited Edition Colors" │
│ Description: "Race-day ready carbon shoes. Only 500 pairs left. │
│ Order now for guaranteed delivery." │
│ │
│ ... (100s more variations) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SELECTION: │
│ ────────── │
│ │
│ 1. LLM-as-Judge filters low-quality/off-brand variations │
│ 2. Multi-armed bandit explores promising variations │
│ 3. CTR model provides exploitation signal │
│ 4. Best performers get more budget │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Quality Control with LLM-as-Judge:
Not all generated creatives are good. Use a judge model to filter before deployment:
Multi-dimensional scoring (each 1-5 scale):
| Dimension | What It Measures | Failure Examples |
|---|---|---|
| Brand alignment | Tone, voice, values match brand | Luxury brand with casual slang |
| Policy compliance | No prohibited claims | Unsubstantiated health claims |
| Clarity | Message is understandable | Confusing or ambiguous copy |
| Persuasiveness | Compelling call-to-action | Weak or missing CTA |
| Grammar | Correct language | Typos, awkward phrasing |
| Factual accuracy | Claims are true | Wrong prices, features |
Composite score:

$$\text{score} = \min\Big(s_{\text{policy}},\; s_{\text{factual}},\; \tfrac{1}{4}\textstyle\sum_{d \in \{\text{brand, clarity, persuasion, grammar}\}} s_d\Big)$$

The $\min(\cdot)$ ensures policy/factual violations are hard blockers regardless of other qualities.
Implementation options:
- Prompt-based: Describe scoring criteria in prompt, ask LLM to rate
- Fine-tuned judge: Train classifier on human-labeled creative quality data
- Ensemble: Multiple judges vote, require consensus for approval
- Hybrid: LLM pre-filter + human review for borderline cases
Threshold tuning:
- High threshold (4.5+): Fewer creatives pass, higher average quality, less variety
- Low threshold (3.5+): More creatives pass, more variety, some quality risk
- Adaptive threshold: Start high for new campaigns, lower as you build trust
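The hard-blocker composite can be sketched in a few lines; the dimension keys and example scores are illustrative:

```python
def composite_score(scores):
    """Composite creative quality score (1-5 scale per dimension).

    Policy compliance and factual accuracy act as hard blockers via min(.);
    the remaining dimensions are averaged."""
    hard = min(scores["policy"], scores["factual"])
    soft_keys = ["brand", "clarity", "persuasion", "grammar"]
    soft = sum(scores[k] for k in soft_keys) / len(soft_keys)
    return min(hard, soft)

good = {"policy": 5, "factual": 5, "brand": 4, "clarity": 5, "persuasion": 4, "grammar": 5}
bad = {"policy": 1, "factual": 5, "brand": 5, "clarity": 5, "persuasion": 5, "grammar": 5}
print(composite_score(good))  # 4.5
print(composite_score(bad))   # 1 -- policy violation blocks regardless of polish
```

Against a 4.0 threshold the first creative ships and the second is rejected, no matter how well it scores elsewhere.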
Semantic Audience Understanding
Traditional targeting uses demographic segments (age, gender, location). LLMs enable semantic targeting based on intent and context.
Intent Understanding from Search Queries:
Beyond simple classification, LLMs extract nuanced intent:
┌─────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC INTENT EXTRACTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Query: "best running shoes for marathon under $200" │
│ │
│ TRADITIONAL KEYWORD MATCHING: │
│ ───────────────────────────── │
│ Keywords: [running, shoes, marathon, $200] │
│ Match ads containing these keywords │
│ │
│ LLM SEMANTIC UNDERSTANDING: │
│ ─────────────────────────── │
│ { │
│ "intent": "transactional", │
│ "product_category": "performance_running_shoes", │
│ "use_case": "marathon_racing", │
│ "experience_level": "intermediate_to_advanced", │
│ "price_sensitivity": "high", │
│ "price_ceiling": 200, │
│ "decision_stage": "comparison_shopping", │
│ "implicit_needs": [ │
│ "durability_for_long_distance", │
│ "energy_return", │
│ "lightweight" │
│ ], │
│ "likely_follow_up_interests": [ │
│ "running_socks", │
│ "hydration_gear", │
│ "marathon_training_plans" │
│ ] │
│ } │
│ │
│ TARGETING IMPLICATIONS: │
│ ─────────────────────── │
│ • Show mid-tier shoes (not budget, not premium) │
│ • Emphasize marathon-specific features │
│ • Highlight value proposition (quality at price point) │
│ • Cross-sell complementary gear │
│ │
└─────────────────────────────────────────────────────────────────────────┘
User Embedding from Behavior + LLM:
Combine traditional collaborative filtering embeddings with LLM-derived semantic embeddings:
$$e_{\text{user}} = \alpha \, e_{\text{behavioral}} + (1 - \alpha) \, e_{\text{semantic}}$$

where:
- $e_{\text{behavioral}}$ = embedding from click/purchase history (DIN/DIEN style)
- $e_{\text{semantic}}$ = embedding from LLM understanding of the user's content consumption
- $\alpha$ = blending weight (typically 0.3-0.7, tuned via validation)
When to use different $\alpha$ values:
| Scenario | Recommended $\alpha$ | Rationale |
|---|---|---|
| User has rich click history | 0.7-0.8 | Trust behavioral signals |
| New/cold-start user | 0.2-0.3 | Rely on semantic understanding |
| High-intent queries | 0.4-0.5 | Balance both signals |
| Content-heavy domains (news, articles) | 0.3-0.4 | LLM understands content better |
Implementation approaches:
- Late fusion: Compute $e_{\text{behavioral}}$ and $e_{\text{semantic}}$ separately, combine at serving time
- Early fusion: Concatenate behavioral and semantic features, let the model learn the combination
- Learned fusion: Train a small network to predict the optimal $\alpha$ per user
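A minimal late-fusion sketch; normalizing both inputs first means $\alpha$ controls relative weight rather than raw magnitude (the embeddings here are toy values):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v

def fuse_user_embedding(e_behavioral, e_semantic, alpha=0.5):
    """Late fusion: blend behavioral and LLM-semantic user embeddings.

    Both inputs are L2-normalized first so alpha is a pure mixing weight."""
    eb, es = normalize(e_behavioral), normalize(e_semantic)
    fused = [alpha * b + (1 - alpha) * s for b, s in zip(eb, es)]
    return normalize(fused)

# Cold-start user: lean on the semantic signal (low alpha)
e = fuse_user_embedding([0.0, 0.1], [0.9, 0.2], alpha=0.25)
print(len(e))  # 2
```

With $\alpha = 1$ this degrades gracefully to the pure behavioral embedding, matching the rich-history row of the table above.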
Contextual Page Understanding:
Instead of simple page categorization (sports, news, entertainment), LLMs understand page content semantically:
How it works:
- Extract page text (title, headings, body, metadata)
- Pass through LLM encoder (e.g., fine-tuned BERT, or frozen GPT embeddings)
- Resulting embedding captures semantic meaning, not just category
This enables:
- Brand safety: Understand if content discusses sensitive topics (violence, controversy) even without keyword matches
- Contextual relevance: Match "marathon training tips" article to running shoe ads even if "shoes" isn't mentioned
- Sentiment alignment: Place upbeat ads on positive content, avoid juxtaposition issues
- Topic nuance: Distinguish "Apple (company)" from "apple (fruit)" for targeting
Dynamic Personalization at Serving Time
The most transformative application: personalize ad creative in real-time based on user context.
Personalization Hierarchy:
┌─────────────────────────────────────────────────────────────────────────┐
│ PERSONALIZATION DEPTH LEVELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 0: No Personalization │
│ ────────────────────────────── │
│ Same ad shown to everyone │
│ "Buy Nike Running Shoes Today" │
│ │
│ LEVEL 1: Segment-Based (Traditional) │
│ ───────────────────────────────────── │
│ Different ads per demographic segment │
│ Male 25-34: "Dominate Your Next Race" │
│ Female 25-34: "Run Your Personal Best" │
│ │
│ LEVEL 2: Behavioral (DIN/DIEN era) │
│ ────────────────────────────────── │
│ Ad selection based on user history │
│ User viewed marathons → Show marathon shoe ads │
│ User viewed trails → Show trail shoe ads │
│ │
│ LEVEL 3: LLM Dynamic Personalization │
│ ───────────────────────────────────── │
│ Ad CONTENT adapted in real-time │
│ │
│ User A (searched "Boston Marathon qualifying times"): │
│ "Qualify for Boston: Shoes Trusted by 50,000+ BQ Runners" │
│ │
│ User B (searched "couch to 5k beginner"): │
│ "Start Your Running Journey: Comfort-First Design for New Runners" │
│ │
│ User C (browsing running injury articles): │
│ "Run Pain-Free: Engineered Support for Injury Prevention" │
│ │
│ Same product, completely different messaging! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Real-Time Personalization Architecture:
LLM inference is too slow for real-time bidding (<50ms budgets). Solutions:
- Pre-computation: Generate top-K personalized variants offline, select at serving time
- Template + Slot Filling: LLM generates templates, fast model fills slots
- Cached Personas: Pre-compute ads for user personas, map users to personas
- Speculative Generation: Generate during page load, before ad request
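The pre-computation approach can be sketched end to end: classify the user into a persona by embedding similarity, then look up a creative generated offline. Persona names, centroids, and ad copy below are all invented for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical persona centroids, precomputed offline
PERSONA_CENTROIDS = {
    "beginner_runner": [0.9, 0.1, 0.0],
    "competitive_runner": [0.1, 0.9, 0.1],
    "injury_recovery": [0.0, 0.1, 0.9],
}

# Creative cache: (product, persona) -> pre-generated ad copy
CREATIVE_CACHE = {
    ("pegasus", "beginner_runner"): "Start Your Running Journey: Comfort-First Design",
    ("pegasus", "competitive_runner"): "Shave Minutes Off Your PR",
    ("pegasus", "injury_recovery"): "Run Pain-Free: Engineered Support",
}

def serve_ad(product, user_embedding):
    """Serving path: persona classification by similarity (~1ms in practice),
    then a cache lookup -- no LLM call in the hot path."""
    persona = max(PERSONA_CENTROIDS,
                  key=lambda p: cosine(user_embedding, PERSONA_CENTROIDS[p]))
    return CREATIVE_CACHE[(product, persona)]

print(serve_ad("pegasus", [0.8, 0.2, 0.1]))
```

All LLM cost is paid offline per (product, persona) pair; the online path is two cheap operations.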
┌─────────────────────────────────────────────────────────────────────────┐
│ REAL-TIME PERSONALIZATION ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OFFLINE (Batch): │
│ ──────────────── │
│ │
│ For each (product, persona) pair: │
│ LLM generates N ad variations │
│ Store in Creative Cache │
│ │
│ Personas: {beginner_runner, competitive_runner, injury_recovery, │
│ casual_fitness, marathon_focused, trail_enthusiast, ...} │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ONLINE (Real-time): │
│ ─────────────────── │
│ │
│ 1. User request arrives │
│ │ │
│ ▼ │
│ 2. ┌─────────────────┐ │
│ │ Persona Classifier│ (Fast: embedding similarity, ~1ms) │
│ │ user → persona │ │
│ └────────┬──────────┘ │
│ │ │
│ ▼ │
│ 3. ┌─────────────────┐ │
│ │ Creative Cache │ (Lookup: product × persona, ~1ms) │
│ │ Lookup │ │
│ └────────┬──────────┘ │
│ │ │
│ ▼ │
│ 4. ┌─────────────────┐ │
│ │ CTR Model │ (Score personalized creative, ~5ms) │
│ │ Prediction │ │
│ └────────┬──────────┘ │
│ │ │
│ ▼ │
│ 5. Return personalized ad │
│ │
│ Total latency: <10ms (no LLM inference in critical path!) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
LLM-Enhanced CTR Prediction
Beyond creative generation, LLMs can directly improve CTR prediction models.
Feature Augmentation with LLM Embeddings:
Traditional CTR models use sparse ID features. Add dense semantic features from LLMs:
$$\hat{y} = \sigma\big(f([x_{\text{sparse}};\; e_{\text{LLM}}(\text{ad text});\; e_{\text{LLM}}(\text{page content})])\big)$$

where $e_{\text{LLM}}(\cdot)$ are embeddings from an LLM encoder.
Benefits:
- Cold-start handling: New ads have semantic embeddings even without click history
- Generalization: Similar products share similar embeddings
- Cross-domain transfer: Knowledge transfers across ad categories
Semantic Similarity for Candidate Retrieval:
Use LLM embeddings for initial candidate retrieval:

$$\text{candidates} = \operatorname*{top-}K_{a \,\in\, \text{ads}} \; \langle e_{\text{query}},\, e_a \rangle$$
Then apply traditional CTR models for final ranking.
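A brute-force sketch of that retrieval step (a production system would serve this from an approximate nearest-neighbor index; the embeddings below are toy two-dimensional values):

```python
import heapq

def retrieve_top_k(query_embedding, ad_embeddings, k=2):
    """Candidate retrieval by inner product against precomputed ad embeddings."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return heapq.nlargest(
        k, ad_embeddings,
        key=lambda ad: dot(query_embedding, ad_embeddings[ad]))

# Hypothetical embeddings where semantically related items are close
ads = {
    "marathon_racing_shoes": [0.9, 0.1],
    "trail_boots": [0.1, 0.9],
    "energy_gels": [0.7, 0.3],
}
print(retrieve_top_k([1.0, 0.0], ads))  # ['marathon_racing_shoes', 'energy_gels']
```

The query never needs to share any keyword with the ad—proximity in embedding space is the whole matching criterion.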
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM-ENHANCED TWO-TOWER RETRIEVAL │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ QUERY TOWER: AD TOWER: │
│ ───────────── ───────── │
│ │
│ User Query Ad Content │
│ + User History + Ad Metadata │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ LLM Encoder │ │ LLM Encoder │ │
│ │ (shared) │ │ (shared) │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Projection │ │ Projection │ │
│ │ Layer │ │ Layer │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ e_query e_ad │
│ │ │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ score = <e_query, e_ad> │
│ │ │
│ ▼ │
│ Top-K candidates → CTR ranking │
│ │
│ LLM BENEFITS: │
│ ───────────── │
│ • "marathon training" query matches "26.2 mile race shoes" ad │
│ • "gift for runner dad" matches "men's premium running shoes" │
│ • Semantic understanding beyond keyword matching │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Conversational Advertising
LLMs enable a new paradigm: conversational ad experiences where users interact with ads through dialogue.
Use Cases:
- Product Discovery: "I need shoes for my first marathon. What do you recommend?"
- Objection Handling: "Why are these so expensive?" → Explain value proposition
- Personalized Recommendations: Multi-turn dialogue to understand needs
- Purchase Assistance: Guide through size selection, shipping options
Conversational Ad Architecture:
┌─────────────────────────────────────────────────────────────────────────┐
│ CONVERSATIONAL AD EXPERIENCE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Ad: │
│ ─────────────── │
│ ┌─────────────────────────────────────────┐ │
│ │ Nike ZoomX Vaporfly - $250 │ │
│ │ The fastest marathon shoe ever. │ │
│ │ [Shop Now] │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Conversational Ad: │
│ ────────────────── │
│ ┌─────────────────────────────────────────┐ │
│ │ Nike Running Assistant │ │
│ │ │ │
│ │ User: "Is this shoe good for a │ │
│ │ beginner marathon runner?" │ │
│ │ │ │
│ │ Nike: "Great question! The Vaporfly │ │
│ │ is our elite racing shoe, designed │ │
│ │ for experienced runners chasing PRs.│ │
│ │ For your first marathon, I'd │ │
│ │ recommend the Pegasus 41 - it's │ │
│ │ more cushioned for training miles │ │
│ │ and race day comfort. Would you │ │
│ │ like to see it?" │ │
│ │ │ │
│ │ [See Pegasus 41] [Tell me more] │ │
│ │ [Compare both] │ │
│ └─────────────────────────────────────────┘ │
│ │
│ BENEFITS: │
│ ───────── │
│ • Higher engagement (dialogue > static ad) │
│ • Better matching (understand actual needs) │
│ • Trust building (honest recommendations) │
│ • Data collection (explicit preference signals) │
│ │
│ CHALLENGES: │
│ ─────────── │
│ • Latency (LLM inference per turn) │
│ • Brand safety (LLM may say wrong things) │
│ • Cost (compute per conversation) │
│ • Measurement (how to attribute conversions) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Measurement and Attribution for Conversational Ads:
Conversational ads create new measurement challenges—how do you attribute value across a multi-turn dialogue?
One formulation discounts rewards by recency:

$$V = \sum_{t=1}^{T} \gamma^{\,T-t}\, r_t$$

where:
- $r_t$ = reward signal at turn $t$ (click, add-to-cart, purchase)
- $\gamma \in (0, 1]$ = discount factor (earlier turns get less credit)
- $T$ = total conversation turns
Attribution approaches:
| Model | Description | When to Use |
|---|---|---|
| Last-touch | Credit final turn before conversion | Simple, but ignores discovery value |
| First-touch | Credit conversation initiation | Values engagement, ignores persuasion |
| Linear | Equal credit to all turns | Fair, but doesn't capture turn importance |
| Position-based | 40% first, 40% last, 20% middle | Balances discovery and conversion |
| Data-driven | ML model learns credit assignment | Best accuracy, requires volume |
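The position-based row of the table can be sketched directly; the 40/40/20 split is the convention named above, with the edge cases handled explicitly:

```python
def position_based_credit(num_turns, first=0.4, last=0.4):
    """Position-based (40/40/20) attribution across conversation turns.

    Returns a list of credit shares summing to 1.0."""
    if num_turns == 1:
        return [1.0]
    if num_turns == 2:
        return [0.5, 0.5]
    middle = (1.0 - first - last) / (num_turns - 2)
    return [first] + [middle] * (num_turns - 2) + [last]

credits = position_based_credit(5)
print(credits)  # first and last turn each get 0.4; middle turns split 0.2
```

Swapping in a learned credit model (the data-driven row) keeps this interface but replaces the fixed shares with per-turn predictions.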
Key metrics for conversational ads:
- Engagement rate: % of users who respond to first message
- Conversation depth: Average turns per conversation
- Resolution rate: % of conversations ending in desired action
- Deflection rate: % of users who abandon mid-conversation
- Cost per conversation: Total LLM compute / conversations
- Incremental lift: Conversion rate vs. static ad control group
LLM-Based Campaign Optimization
Beyond individual ads, LLMs can optimize entire campaigns.
Automated A/B Test Analysis:
LLMs can:
- Identify statistically significant results (accounting for multiple comparisons)
- Explain WHY certain variants won (not just that they won)
- Suggest follow-up tests based on observed patterns
- Detect Simpson's paradox and other statistical pitfalls
- Identify segment-level winners that differ from overall winners
What makes LLM analysis different from traditional dashboards:
Traditional: "Variant B has 5% higher CTR with p<0.05"
LLM analysis: "Variant B outperformed because its 'limited time' messaging created urgency. However, this effect was concentrated in mobile users during evening hours—desktop users showed no significant difference. Consider: (1) testing urgency messaging specifically for mobile evening campaigns, (2) investigating why desktop users didn't respond (perhaps they need more product details before urgency appeals work)."
Budget Allocation Recommendations:
LLMs analyze cross-channel performance and recommend budget shifts. Key capabilities:
- Diminishing returns detection: "Search is hitting saturation—incremental CPA increasing. Consider shifting 20% to display prospecting."
- Opportunity identification: "Competitor X reduced spend on 'running shoes' keywords—bid landscape favorable for expansion."
- Goal alignment: "Current allocation optimizes for clicks, but your stated goal is conversions. Recommend shifting budget from awareness to consideration campaigns."
- Seasonality anticipation: "Marathon season approaching—recommend 30% budget increase for running category starting week 8."
Audience Expansion with LLM Reasoning:
Traditional lookalike audiences use statistical similarity. LLMs add semantic reasoning about WHY an audience works, enabling more thoughtful expansion.
Given a high-performing audience segment, LLM reasons about similar segments:
High-performing segment: "Marathon runners who clicked on nutrition ads"
LLM reasoning:
"This audience responds well because they're health-conscious athletes
focused on performance optimization. Similar audiences might include:
1. Triathletes (similar endurance focus)
2. CrossFit enthusiasts (performance-oriented)
3. Cycling enthusiasts (endurance athletes)
4. Health app power users (quantified-self mindset)
Recommendation: Test expansion to triathlon audiences first,
as they have the closest intent profile."
Personalization Ethics and Guardrails
LLM-powered personalization raises important ethical considerations.
Risks:
- Manipulation: Hyper-personalized messaging could exploit psychological vulnerabilities
- Filter bubbles: Users only see ads reinforcing existing preferences
- Privacy: Deep personalization requires extensive data collection
- Deception: AI-generated content may mislead users about what's human vs. machine
Guardrails:
┌─────────────────────────────────────────────────────────────────────────┐
│ PERSONALIZATION GUARDRAILS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CONTENT FILTERS: │
│ ──────────────── │
│ • No health claims without substantiation │
│ • No urgency manipulation ("Only 1 left!" when false) │
│ • No exploitation of negative emotions │
│ • No discrimination based on protected characteristics │
│ │
│ TRANSPARENCY REQUIREMENTS: │
│ ────────────────────────── │
│ • Disclose AI-generated content │
│ • Explain why ad was shown (ad preferences) │
│ • Allow users to opt out of personalization │
│ │
│ TECHNICAL CONTROLS: │
│ ─────────────────── │
│ • LLM output classifiers for harmful content │
│ • Human review for new personalization strategies │
│ • A/B test ethics review board │
│ • Audit trails for personalization decisions │
│ │
│ REGULATORY COMPLIANCE: │
│ ────────────────────── │
│ • GDPR: Data minimization, right to explanation │
│ • CCPA: Opt-out rights, disclosure requirements │
│ • FTC: Truth in advertising, endorsement disclosure │
│ • Industry self-regulation (NAI, DAA) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Future: Agentic Advertising
The next frontier: AI agents that autonomously manage advertising campaigns.
Agentic Capabilities:
- Autonomous Budget Management: Agent monitors performance and reallocates budget without human intervention
- Creative Evolution: Agent generates, tests, and iterates on ad creative continuously
- Competitive Response: Agent detects competitor actions and adjusts strategy
- Cross-Channel Orchestration: Agent coordinates messaging across search, social, display, email
Architecture:
$$\text{Action}_t = \pi_{\text{LLM}}(\text{State}_t,\, \text{Tools},\, \text{Constraints})$$

where State includes: current performance, budget status, market conditions, competitive landscape.
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENTIC ADVERTISING SYSTEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────┐ │
│ │ Advertising Agent │ │
│ │ (LLM + Tools) │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Creative │ │ Budget │ │ Audience │ │
│ │ Generator │ │ Optimizer │ │ Expander │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Ad │ │ Bid │ │ Targeting │ │
│ │ Platform │ │ Management │ │ Rules │ │
│ │ APIs │ │ APIs │ │ APIs │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ AGENT WORKFLOW: │
│ ─────────────── │
│ │
│ 1. Monitor: Continuously track campaign KPIs │
│ 2. Analyze: Identify underperforming segments/creatives │
│ 3. Plan: Decide on optimization actions │
│ 4. Execute: Make changes via platform APIs │
│ 5. Learn: Update strategy based on results │
│ │
│ HUMAN OVERSIGHT: │
│ ──────────────── │
│ • Budget limits (agent can't exceed authorized spend) │
│ • Approval gates (major strategy changes need human OK) │
│ • Alert thresholds (unusual patterns trigger human review) │
│ • Audit logs (all agent actions recorded) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
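The monitor→analyze→plan portion of the workflow, with the budget guardrail enforced in code rather than delegated to the LLM, might be sketched as follows (all campaign names, KPIs, and thresholds are hypothetical):

```python
def agent_step(state, max_daily_spend=1000.0):
    """One monitor -> analyze -> plan cycle of a campaign agent.

    The guardrail lives outside the policy: the agent can propose
    budget increases, but code strips them once spend is exhausted."""
    actions = []
    # Monitor + analyze: compare each campaign's CPA to target
    for campaign, kpis in state["campaigns"].items():
        if kpis["cpa"] > state["target_cpa"] * 1.5:
            actions.append(("pause", campaign))
        elif kpis["cpa"] < state["target_cpa"] * 0.7:
            actions.append(("increase_budget", campaign))
    # Guardrail: no budget increases past the authorized daily spend
    if state["spend_today"] >= max_daily_spend:
        actions = [(a, c) for a, c in actions if a != "increase_budget"]
        actions.append(("alert_human", "daily budget exhausted"))
    return actions

state = {
    "target_cpa": 20.0,
    "spend_today": 1200.0,
    "campaigns": {
        "search_brand": {"cpa": 12.0},         # efficient, would get more budget
        "display_prospecting": {"cpa": 45.0},  # inefficient, gets paused
    },
}
print(agent_step(state))
```

In a full system the planning step would be an LLM call with tool access; the point of the sketch is that spend limits and alerting remain deterministic code paths the model cannot override.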
Summary: The Modern Ad ML Stack
┌─────────────────────────────────────────────────────────────────────────┐
│ MODERN AD ML ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SERVING LAYER │ │
│ │ • Low-latency inference (<10ms) │ │
│ │ • Model cascade (filter → rank) │ │
│ │ • Feature store integration │ │
│ │ • A/B testing framework │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────┼─────────────────────────────────────┐ │
│ │ MODEL LAYER │ │
│ │ │ │
│ │ CTR Model: CVR Model: Bid Model: │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ DeepFM/DCN │ │ ESMM-style │ │ Bid Shading │ │ │
│ │ │ + DIN/DIEN │ │ Multi-task │ │ + Pacing │ │ │
│ │ │ behavior │ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Multi-task Framework: PLE / MMOE │ │
│ │ Calibration: Platt scaling / Isotonic regression │ │
│ │ Position bias: PAL / Propensity weighting │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────┼─────────────────────────────────────┐ │
│ │ FEATURE LAYER │ │
│ │ │ │
│ │ Online Store (Redis): Offline Store (Hive): │ │
│ │ • Real-time counts • User embeddings │ │
│ │ • Session features • Historical aggregates │ │
│ │ • Recent behaviors • Item statistics │ │
│ │ │ │
│ │ Feature Engineering: Categorical encoding, crosses, embeddings │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────┼─────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ │ │
│ │ • Click/impression logs │ │
│ │ • Conversion tracking (with delayed attribution) │ │
│ │ • User behavior sequences │ │
│ │ • Fraud detection signals │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Sources
CTR Prediction Models:
- Factorization Machines (Rendle, 2010)
- Field-aware Factorization Machines (Juan et al., 2016)
- Wide & Deep Learning (Google, 2016)
- DeepFM Paper (Huawei, 2017)
- Deep & Cross Network (Google, 2017)
- DCN-v2 (Google, 2020)
- xDeepFM Paper (Microsoft, 2018)
- AutoInt Paper (2019)
- FiBiNET Paper (Sina Weibo, 2019)
User Behavior Modeling:
- Deep Interest Network (Alibaba, 2018)
- DIEN Paper (Alibaba, 2019)
- DSIN Paper (Alibaba, 2019)
- SIM Paper (Alibaba, 2020)
Multi-Task Learning:
Industry Systems:
GenAI for Advertising: