
Machine Learning for Advertising: CTR Prediction, Ad Ranking, and Bidding Systems

Comprehensive guide to ML systems powering digital advertising. From logistic regression to deep CTR models, user behavior sequences to multi-task learning, and real-time bidding optimization—understand the algorithms behind the $600B+ ad industry.


The Scale of Advertising ML

Digital advertising is a $600+ billion industry, and machine learning is its backbone. Every time you see an ad online, dozens of ML models have executed in milliseconds: predicting whether you'll click, estimating conversion probability, optimizing bids, and ranking thousands of candidate ads.

This isn't just recommendation systems with a different name. Advertising ML has unique challenges:

┌─────────────────────────────────────────────────────────────────────────┐
│           ADVERTISING ML vs GENERAL RECOMMENDATION                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GENERAL RECSYS:                                                         │
│  ───────────────                                                         │
│  Goal: Maximize user engagement/satisfaction                            │
│  Items: Products, videos, songs (relatively stable catalog)             │
│  Feedback: Implicit (views, clicks) or explicit (ratings)               │
│  Latency: 100-500ms acceptable                                          │
│  Stakes: Poor recommendations → user leaves                             │
│                                                                          │
│  ADVERTISING ML:                                                         │
│  ───────────────                                                         │
│  Goal: Maximize revenue while maintaining user experience               │
│  Items: Ads (constantly changing, millions of advertisers)              │
│  Feedback: Sparse (CTR ~1-3%), delayed conversions                      │
│  Latency: <10-50ms required (real-time bidding)                         │
│  Stakes: Wrong predictions → lose money (pay per impression/click)      │
│                                                                          │
│  UNIQUE CHALLENGES:                                                      │
│  ─────────────────                                                       │
│  • Feature sparsity: Billions of feature combinations                   │
│  • Class imbalance: 97-99% negative examples                            │
│  • Multi-stakeholder: Users, advertisers, platform                      │
│  • Position bias: Top positions get more clicks regardless of relevance │
│  • Delayed feedback: Conversions may happen days later                  │
│  • Adversarial dynamics: Click fraud, bid manipulation                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

This post covers the complete advertising ML stack, from foundational CTR prediction to advanced user modeling and real-time bidding optimization.


Part I: Foundations of CTR Prediction

The CTR Prediction Problem

Click-Through Rate (CTR) prediction is the cornerstone of advertising ML. Given a user $u$, an ad $a$, and a context $c$ (time, device, page), predict the probability that the user will click:

$$P(\text{click} = 1 \mid u, a, c)$$

This probability directly determines:

  • Ad ranking: Higher predicted CTR → higher position
  • Bid optimization: Expected value $= P(\text{click}) \times \text{bid}$
  • Revenue: Platform typically charges per click (CPC) or per impression (CPM)
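To make the ranking role of pCTR concrete, here is a minimal sketch that orders hypothetical CPC candidates by expected value per impression; the ad IDs, predicted CTRs, and bids are all invented numbers:

```python
# Hypothetical candidates: (ad_id, predicted CTR, CPC bid in dollars).
candidates = [
    ("ad_a", 0.031, 0.50),
    ("ad_b", 0.012, 1.40),
    ("ad_c", 0.045, 0.30),
]

def expected_value(p_click: float, bid: float) -> float:
    """Expected revenue per impression under CPC pricing: P(click) * bid."""
    return p_click * bid

# Rank by expected value, highest first.
ranked = sorted(candidates, key=lambda ad: expected_value(ad[1], ad[2]),
                reverse=True)
# The highest bid does not automatically win: pCTR and bid trade off.
```

Note that ad_c has the highest pCTR and ad_b the highest bid, yet the expected-value ordering can differ from both; real ranking functions also fold in quality scores, as the diagram below shows.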
┌─────────────────────────────────────────────────────────────────────────┐
│                    CTR PREDICTION IN THE AD STACK                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User visits page                                                        │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────┐    Candidate ads     ┌─────────────────────┐           │
│  │   Ad       │ ─────────────────────►│   CTR Prediction    │           │
│  │  Request   │    (1000s of ads)     │      Model          │           │
│  └─────────────┘                      └──────────┬──────────┘           │
│                                                   │                      │
│                                         P(click) for each ad             │
│                                                   │                      │
│                                                   ▼                      │
│                                       ┌─────────────────────┐           │
│                                       │   Ranking Function   │           │
│                                       │                      │           │
│                                       │  Score = f(pCTR,     │           │
│                                       │           bid,       │           │
│                                       │           quality)   │           │
│                                       └──────────┬──────────┘           │
│                                                   │                      │
│                                           Top K ads                      │
│                                                   │                      │
│                                                   ▼                      │
│                                       ┌─────────────────────┐           │
│                                       │    Ad Displayed     │           │
│                                       └─────────────────────┘           │
│                                                                          │
│  The entire pipeline must complete in <50ms                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Feature Engineering: The Foundation

Before diving into models, understand that advertising ML is fundamentally about feature interactions. A user who is "male, age 25-34, interested in sports" seeing an ad for "Nike running shoes" on a "sports news website" at "7pm on weekday" has a very different click probability than any individual feature would suggest.

The features in CTR prediction are typically:

┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE CATEGORIES IN CTR PREDICTION                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER FEATURES:                                                          │
│  ──────────────                                                          │
│  • Demographics: age, gender, location, language                        │
│  • Behavioral: past clicks, purchases, browsing history                 │
│  • Contextual: device, OS, browser, time of day                         │
│  • Aggregated: click rate on category, avg session duration             │
│                                                                          │
│  AD FEATURES:                                                            │
│  ────────────                                                            │
│  • Creative: ad ID, advertiser ID, campaign ID                          │
│  • Content: category, keywords, landing page domain                     │
│  • Historical: ad CTR, conversion rate, quality score                   │
│  • Bid: bid amount, budget remaining, campaign age                      │
│                                                                          │
│  CONTEXT FEATURES:                                                       │
│  ─────────────────                                                       │
│  • Publisher: site ID, page category, content keywords                  │
│  • Position: ad slot, above/below fold                                  │
│  • Temporal: hour, day of week, season, holidays                        │
│  • Request: referrer, search query (if search ads)                      │
│                                                                          │
│  CROSS FEATURES (manually engineered):                                   │
│  ─────────────────────────────────────                                   │
│  • user_gender × ad_category                                            │
│  • user_age × hour_of_day                                               │
│  • device_type × ad_format                                              │
│  • user_interest × ad_keyword                                           │
│                                                                          │
│  SCALE: Typically 10^6 to 10^9 sparse features after one-hot encoding  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The key insight: most features are categorical and high-cardinality. User ID alone might have billions of values. When one-hot encoded, the feature vector becomes extremely sparse but extremely high-dimensional.
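A standard way to cope with this dimensionality is the hashing trick: instead of materializing a billion-dimensional one-hot vector, each `field=value` string is hashed into a fixed number of buckets, and only the active indices are stored. A minimal sketch, with an illustrative bucket count and invented feature names (production systems typically use a fast non-cryptographic hash such as MurmurHash rather than MD5):

```python
import hashlib

def bucket(feature: str, n_buckets: int = 2 ** 20) -> int:
    """Deterministically hash a 'field=value' string into one of n_buckets slots."""
    digest = hashlib.md5(feature.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical ad request: four categorical features become four active
# indices in an implicit one-hot vector of ~10^6 dimensions.
request = {
    "user_gender": "male",
    "user_age": "25-34",
    "ad_category": "sports",
    "device": "mobile",
}
active = sorted(bucket(f"{field}={value}") for field, value in request.items())
```

Collisions are accepted as noise: with $2^{20}$ buckets and a handful of active features per request, they are rare enough that model quality is barely affected, while the parameter table stays bounded.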


Part II: Evolution of CTR Models

Stage 1: Logistic Regression (The Baseline)

The journey begins with logistic regression, still used in production at many companies for its simplicity and interpretability.

Model formulation:

$$\hat{y} = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i\right)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, $x_i$ are features, and $w_i$ are learned weights.

Loss function (binary cross-entropy):

$$\mathcal{L} = -\frac{1}{N}\sum_{j=1}^{N}\left[y_j \log(\hat{y}_j) + (1-y_j)\log(1-\hat{y}_j)\right]$$

┌─────────────────────────────────────────────────────────────────────────┐
│                    LOGISTIC REGRESSION FOR CTR                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Input: Sparse feature vector x ∈ {0,1}^n (one-hot encoded)             │
│                                                                          │
│  Example (simplified):                                                   │
│  ─────────────────────                                                   │
│  user_gender=male:     [1, 0]           (2 dims)                        │
│  user_age=25-34:       [0, 0, 1, 0, 0]  (5 dims)                        │
│  ad_category=sports:   [0, 0, 0, 1, 0]  (5 dims)                        │
│  device=mobile:        [0, 1]           (2 dims)                        │
│                                                                          │
│  Concatenated: x = [1,0,0,0,1,0,0,0,0,0,1,0,0,1]                        │
│                                                                          │
│  Prediction:                                                             │
│  ───────────                                                             │
│  z = w₀ + w₁·1 + w₃·1 + w₁₁·1 + w₁₄·1                                  │
│    = bias + w_male + w_age25-34 + w_sports + w_mobile                   │
│                                                                          │
│  ŷ = σ(z) = P(click)                                                    │
│                                                                          │
│  LIMITATION: Only captures first-order effects                          │
│  ───────────                                                             │
│  Cannot model: "males interested in sports click more on sports ads"    │
│  This requires explicit feature crosses: x_male × x_sports              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why LR works despite its simplicity:

  1. Interpretability: Weight $w_i$ directly shows feature importance
  2. Scalability: Can train on billions of examples with SGD
  3. Sparsity: Most weights are zero (L1 regularization)
  4. Online learning: Weights can be updated incrementally

Why LR is insufficient:

  • Requires manual feature engineering for interactions
  • Cannot learn non-linear patterns
  • Feature crosses explode combinatorially: $O(n^2)$ for pairwise, $O(n^k)$ for $k$-way
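A minimal numpy sketch of online LR training on sparse one-hot features. The setup is a toy: a single invented feature pattern (arbitrary indices 7 and 42) that clicks roughly 80% of the time, so the model's prediction for that pattern should settle near 0.8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, active, y, lr=0.05):
    """One online SGD update on a sparse example.
    active: indices of the one-hot features that are 1 for this impression."""
    z = b + w[active].sum()      # dot product with a one-hot vector
    p = sigmoid(z)               # predicted click probability
    g = p - y                    # dL/dz for binary cross-entropy
    w[active] -= lr * g          # only active weights are touched
    return w, b - lr * g

n_features = 1_000
w, b = np.zeros(n_features), 0.0
rng = np.random.default_rng(0)
# Toy impression stream: the same pattern clicks ~80% of the time.
for _ in range(5_000):
    y = 1.0 if rng.random() < 0.8 else 0.0
    w, b = sgd_step(w, b, np.array([7, 42]), y)

p_hat = sigmoid(b + w[np.array([7, 42])].sum())   # approaches ~0.8
```

Because the gradient only touches the active weights, each update costs $O(\text{active features})$ rather than $O(n)$, which is what makes training on billions of impressions feasible.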

Stage 2: Polynomial/Feature Cross Models

To capture interactions, we can explicitly model feature crosses:

Degree-2 Polynomial:

$$\hat{y} = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij} x_i x_j\right)$$

The problem: for $n$ features, we now have $\frac{n(n-1)}{2}$ pairwise interaction terms. With millions of sparse features, this is computationally infeasible and leads to severe overfitting (most pairs never co-occur in training data).


Stage 3: Factorization Machines (FM)

The breakthrough: Instead of learning a weight $w_{ij}$ for each feature pair, learn a latent vector $\mathbf{v}_i \in \mathbb{R}^k$ for each feature and model interactions as dot products.

FM formulation (Rendle, 2010):

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

where $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{if} v_{jf}$ is the dot product.

┌─────────────────────────────────────────────────────────────────────────┐
│                    FACTORIZATION MACHINES                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEY INSIGHT: Factorize the interaction weight matrix                   │
│  ───────────────────────────────────────────────────                    │
│                                                                          │
│  Instead of:  W_ij (n² parameters for pairwise interactions)            │
│                                                                          │
│  Learn:       V ∈ ℝ^(n×k) where k << n                                  │
│               W_ij ≈ <v_i, v_j> = Σ v_if · v_jf                         │
│                                                                          │
│  Parameter reduction:                                                    │
│  ───────────────────                                                     │
│  Full interactions: O(n²)  →  With FM: O(n·k)                           │
│                                                                          │
│  Example: n = 10⁶ features, k = 64                                      │
│  Full: 10¹² parameters (impossible)                                     │
│  FM:   64 × 10⁶ = 6.4 × 10⁷ parameters (tractable)                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENERALIZATION POWER:                                                   │
│  ────────────────────                                                    │
│                                                                          │
│  Even if (feature_i, feature_j) never co-occur in training data,        │
│  FM can estimate their interaction through the latent vectors:          │
│                                                                          │
│  v_i learned from (feature_i, feature_k) co-occurrences                 │
│  v_j learned from (feature_j, feature_k) co-occurrences                 │
│  → <v_i, v_j> provides reasonable interaction estimate                  │
│                                                                          │
│  This is similar to how matrix factorization in RecSys handles          │
│  user-item pairs never seen in training.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Efficient computation (the FM trick):

The naive computation of pairwise interactions is $O(kn^2)$. But FM can be computed in $O(kn)$:

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 - \sum_{i=1}^{n} v_{if}^2 x_i^2\right]$$

Derivation:

Starting from the identity:

$$\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n} a_i^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} a_i a_j$$

Let $a_i = v_{if} x_i$:

$$\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 = \sum_{i=1}^{n} v_{if}^2 x_i^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} v_{if} v_{jf} x_i x_j$$

Rearranging and summing over $f$:

$$\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 - \sum_{i=1}^{n} v_{if}^2 x_i^2\right]$$

This reformulation enables linear-time computation—critical for real-time serving.
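The identity is easy to verify numerically. A short numpy sketch comparing the naive pairwise sum against the linear-time form (the dimensions and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 8
V = rng.normal(scale=0.1, size=(n, k))   # one latent vector per feature (rows)
x = rng.random(n)                        # feature values (dense here for clarity)

# Naive O(k n^2): explicit sum over all pairs i < j of <v_i, v_j> x_i x_j.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# FM trick, O(k n): 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
s = V.T @ x                              # shape (k,): per-factor weighted sums
fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

assert np.isclose(naive, fast)
```

For sparse one-hot inputs the sums run only over the active features, so a prediction costs $O(k \cdot \text{active features})$, which is why FMs fit comfortably inside real-time serving budgets.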


Stage 4: Field-aware Factorization Machines (FFM)

Limitation of FM: The same latent vector $\mathbf{v}_i$ is used regardless of which feature it interacts with.

FFM insight (Juan et al., 2016): Different interactions may require different representations. A user's "sports interest" should interact differently with "ad category" versus "time of day."

FFM formulation:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_{i,f_j}, \mathbf{v}_{j,f_i} \rangle x_i x_j$$

where $f_j$ denotes the field of feature $j$, and $\mathbf{v}_{i,f}$ is feature $i$'s latent vector for interacting with field $f$.

┌─────────────────────────────────────────────────────────────────────────┐
│                    FM vs FFM: FIELD-AWARE INTERACTIONS                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIELDS: Groups of related features                                      │
│  ──────────────────────────────────                                      │
│                                                                          │
│  Field 1 (User):     user_id, user_age, user_gender                     │
│  Field 2 (Ad):       ad_id, ad_category, advertiser_id                  │
│  Field 3 (Context):  hour, day_of_week, device                          │
│  Field 4 (Publisher): site_id, page_category                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FM: Each feature has ONE latent vector                                  │
│  ────────────────────────────────────                                    │
│                                                                          │
│  user_age=25-34:  v_age = [0.1, 0.3, -0.2, ...]                         │
│                                                                          │
│  Interaction with ad_category:    <v_age, v_category>                   │
│  Interaction with hour:           <v_age, v_hour>                       │
│  (Same v_age used for both!)                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FFM: Each feature has latent vector PER FIELD                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  user_age=25-34:                                                         │
│    v_age,Ad      = [0.1, 0.3, -0.2, ...]  (for Ad field)                │
│    v_age,Context = [0.4, -0.1, 0.5, ...]  (for Context field)           │
│    v_age,Pub     = [-0.2, 0.2, 0.1, ...]  (for Publisher field)         │
│                                                                          │
│  Interaction with ad_category:    <v_age,Ad, v_category,User>           │
│  Interaction with hour:           <v_age,Context, v_hour,User>          │
│  (Different vectors for different fields!)                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PARAMETER COUNT:                                                        │
│  ────────────────                                                        │
│  FM:  n × k                                                              │
│  FFM: n × F × k  (F = number of fields)                                 │
│                                                                          │
│  Tradeoff: More expressive but more parameters and slower training      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

FFM won several Kaggle CTR prediction competitions and became a standard baseline in the industry.
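A toy numpy sketch of the field-aware interaction term, with three invented features in three fields; note how each feature looks up the latent vector keyed by the *other* feature's field:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
fields = ["User", "Ad", "Context"]
# One latent vector per (feature, target field): each feature stores a
# (n_fields, k) block. Feature names are illustrative.
latent = {
    "user_age=25-34":     rng.normal(size=(len(fields), k)),
    "ad_category=sports": rng.normal(size=(len(fields), k)),
    "hour=19":            rng.normal(size=(len(fields), k)),
}
field_of = {"user_age=25-34": 0, "ad_category=sports": 1, "hour=19": 2}

def ffm_interactions(active):
    """Sum of <v_{i,f_j}, v_{j,f_i}> over all active feature pairs
    (binary features, so the x_i x_j factor is 1)."""
    feats = list(active)
    total = 0.0
    for a in range(len(feats)):
        for b in range(a + 1, len(feats)):
            fi, fj = feats[a], feats[b]
            # Each feature uses the vector specific to the other's field.
            total += latent[fi][field_of[fj]] @ latent[fj][field_of[fi]]
    return total

score = ffm_interactions(latent.keys())
```

The pairwise loop is quadratic in the number of *active* features (typically a few dozen per request), not in the full feature space, so FFM remains servable despite the $n \times F \times k$ parameter count.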


Part III: Deep Learning for CTR Prediction

The Deep Learning Revolution

Around 2016, deep learning entered CTR prediction. The key insight: neural networks can automatically learn feature interactions without manual engineering.

Wide & Deep Learning (Google, 2016)

Google's Wide & Deep architecture combines memorization (wide) with generalization (deep).

Motivation:

  • Memorization: Learning specific feature co-occurrences from history
    • "Users who installed app X often install app Y"
    • Requires feature engineering but captures precise patterns
  • Generalization: Learning transferable patterns
    • "Users interested in fitness apps like health-related apps"
    • DNNs learn general representations but may over-generalize

Architecture:

$$\hat{y} = \sigma\left(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T \mathbf{a}^{(L)} + b\right)$$

where:

  • $\mathbf{x}$ = raw features
  • $\phi(\mathbf{x})$ = cross-product transformations (manual feature crosses)
  • $\mathbf{a}^{(L)}$ = final hidden layer of the deep network
┌─────────────────────────────────────────────────────────────────────────┐
│                    WIDE & DEEP ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         ┌──────────────┐                                │
│                         │   Output     │                                │
│                         │   σ(·)       │                                │
│                         └──────┬───────┘                                │
│                                │                                         │
│                    ┌───────────┴───────────┐                            │
│                    │                       │                             │
│            ┌───────┴───────┐       ┌───────┴───────┐                    │
│            │     WIDE      │       │     DEEP      │                    │
│            │   (Linear)    │       │    (DNN)      │                    │
│            └───────┬───────┘       └───────┬───────┘                    │
│                    │                       │                             │
│            ┌───────┴───────┐       ┌───────┴───────┐                    │
│            │ Raw Features  │       │    Hidden     │                    │
│            │      +        │       │    Layers     │                    │
│            │ Cross Features│       │   (ReLU)      │                    │
│            │ (manual)      │       └───────┬───────┘                    │
│            └───────┬───────┘               │                             │
│                    │               ┌───────┴───────┐                    │
│                    │               │   Embedding   │                    │
│                    │               │    Layer      │                    │
│                    │               └───────┬───────┘                    │
│                    │                       │                             │
│                    └───────────┬───────────┘                            │
│                                │                                         │
│                    ┌───────────┴───────────┐                            │
│                    │    Sparse Features    │                            │
│                    │   (categorical IDs)   │                            │
│                    └───────────────────────┘                            │
│                                                                          │
│  WIDE COMPONENT: Memorization                                           │
│  ─────────────────────────────                                          │
│  • Linear model on raw + crossed features                               │
│  • Crossed features like: installed_app × impression_app                │
│  • Captures specific, frequent patterns                                 │
│                                                                          │
│  DEEP COMPONENT: Generalization                                         │
│  ──────────────────────────────                                         │
│  • Embeddings for categorical features                                  │
│  • Multiple hidden layers with ReLU                                     │
│  • Learns dense representations that generalize                         │
│                                                                          │
│  JOINT TRAINING: Both components trained end-to-end                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key insight: The wide component still requires manual feature crosses. Can we automate this?


DeepFM (Huawei, 2017)

DeepFM replaces the wide component's manual crosses with a Factorization Machine, achieving automatic feature interaction learning at both low and high orders.

Architecture:

$$\hat{y} = \sigma\left(y_{\text{FM}} + y_{\text{DNN}}\right)$$

where:

$$y_{\text{FM}} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

$$y_{\text{DNN}} = \mathbf{w}^T \mathbf{a}^{(L)} + b$$

┌─────────────────────────────────────────────────────────────────────────┐
│                    DeepFM ARCHITECTURE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                              ┌──────────────┐                           │
│                              │   Output     │                           │
│                              │   σ(·)       │                           │
│                              └──────┬───────┘                           │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │         ADD           │                       │
│                         └───────────┬───────────┘                       │
│                    ┌────────────────┼────────────────┐                  │
│                    │                │                │                   │
│            ┌───────┴───────┐ ┌──────┴──────┐ ┌──────┴──────┐           │
│            │   1st Order   │ │  2nd Order  │ │    Deep     │           │
│            │   (Linear)    │ │    (FM)     │ │   (DNN)     │           │
│            └───────┬───────┘ └──────┬──────┘ └──────┬──────┘           │
│                    │                │               │                    │
│                    │                │        ┌──────┴──────┐            │
│                    │                │        │   Hidden    │            │
│                    │                │        │   Layers    │            │
│                    │                │        └──────┬──────┘            │
│                    │                │               │                    │
│                    └────────────────┴───────────────┘                   │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │    SHARED Embeddings  │  ← KEY INNOVATION    │
│                         └───────────┬───────────┘                       │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │   Sparse Features     │                       │
│                         └───────────────────────┘                       │
│                                                                          │
│  KEY INNOVATIONS:                                                        │
│  ────────────────                                                        │
│  1. FM replaces manual feature crosses (automatic 2nd-order)            │
│  2. DNN captures higher-order interactions                              │
│  3. SHARED embeddings between FM and DNN                                │
│     - Reduces parameters                                                │
│     - FM and DNN reinforce each other                                   │
│  4. No pre-training required (end-to-end)                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why shared embeddings matter:

The embedding vector $\mathbf{v}_i$ serves dual purposes:

  1. In FM: Direct dot-product interactions $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$
  2. In DNN: Concatenated as input for higher-order learning

This parameter sharing creates a synergy: the FM component provides explicit 2nd-order signals that help the DNN converge faster, while the DNN's gradients improve the embeddings used by FM.
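A compact numpy sketch of a DeepFM forward pass, under simplifying assumptions (binary one-hot features, one active feature per field; all layer sizes, the single hidden layer, and the random initialization are illustrative). The point to notice is that one embedding table `V` feeds both the FM term and the MLP input:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, k, hidden = 100, 8, 16
m = 4                                   # active features per example (one per field)

w0, w = 0.0, rng.normal(scale=0.01, size=n_features)   # 1st-order weights
V = rng.normal(scale=0.01, size=(n_features, k))       # SHARED embeddings
W1 = rng.normal(scale=0.1, size=(m * k, hidden))       # DNN hidden layer
W2 = rng.normal(scale=0.1, size=hidden)                # DNN output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deepfm_forward(active):
    """active: indices of the m one-hot features for one example."""
    emb = V[active]                                    # (m, k), used by BOTH parts
    # FM part: 1st order + pairwise interactions via the O(kn) identity.
    s = emb.sum(axis=0)
    y_fm = w0 + w[active].sum() + 0.5 * (s @ s - (emb ** 2).sum())
    # Deep part: the same embeddings, concatenated into the MLP.
    h = np.maximum(emb.reshape(-1) @ W1, 0.0)          # ReLU hidden layer
    y_dnn = h @ W2
    return sigmoid(y_fm + y_dnn)

p = deepfm_forward(np.array([3, 27, 55, 91]))          # hypothetical indices
```

In training, gradients from both heads flow into the same rows of `V`, which is the mechanism behind the synergy described above.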


Deep & Cross Network (DCN) (Google, 2017)

DCN introduces an elegant cross network that explicitly models feature interactions of arbitrary order without the combinatorial explosion.

Cross Layer formulation:

$$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l$$

where:

  • $\mathbf{x}_0$ = input features
  • $\mathbf{x}_l$ = output of layer $l$
  • $\mathbf{w}_l, \mathbf{b}_l$ = learnable parameters
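A minimal numpy sketch of stacking cross layers (dimensions and random weights are arbitrary; biases are zeroed for clarity). Each layer multiplies the original input $\mathbf{x}_0$ by a scalar projection of the current state and adds a residual:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    return x0 * (xl @ w) + b + xl      # (xl @ w) is a scalar

rng = np.random.default_rng(7)
d = 6
x0 = rng.random(d)
ws = [rng.normal(size=d) for _ in range(3)]
bs = [np.zeros(d) for _ in range(3)]

x = x0
for l in range(3):   # each layer raises the maximum interaction order by one
    x = cross_layer(x0, x, ws[l], bs[l])
# After 3 cross layers, x contains polynomial terms of x0 up to degree 4,
# at a cost of only 2*d parameters per layer.
```

This is the appeal of the cross network: interaction order grows linearly with depth while the parameter count stays $O(d)$ per layer, avoiding the $O(n^2)$ blow-up of explicit crosses.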
┌─────────────────────────────────────────────────────────────────────────┐
│                    CROSS NETWORK: HOW IT WORKS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Cross Layer Operation:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  x_{l+1} = x_0 · (x_l^T · w_l) + b_l + x_l                              │
│          = x_0 · (scalar) + b_l + x_l                                   │
│                                                                          │
│  Breakdown:                                                              │
│  ──────────                                                              │
│  1. x_l^T · w_l  →  scalar (dot product)                                │
│  2. x_0 · scalar →  feature-weighted x_0                                │
│  3. + x_l        →  residual connection                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INTERACTION ORDER ANALYSIS:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Layer 0: x_0 = [x_1, x_2, x_3]  (1st order features)                   │
│                                                                          │
│  Layer 1: x_1 = x_0 · (x_0^T w_0) + x_0                                 │
│           Contains: x_1, x_2, x_3           (1st order)                 │
│                     x_1², x_1x_2, x_1x_3... (2nd order)                 │
│                                                                          │
│  Layer 2: x_2 = x_0 · (x_1^T w_1) + x_1                                 │
│           Contains: 1st, 2nd order (from x_1)                           │
│                     3rd order (x_0 × 2nd order terms)                   │
│                                                                          │
│  Layer L: Contains interactions up to order (L+1)                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PARAMETER EFFICIENCY:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Each cross layer: d parameters (w_l) + d parameters (b_l) = 2d         │
│  L cross layers: 2Ld parameters                                          │
│                                                                          │
│  Compare to polynomial: d^(L+1) parameters for order-(L+1) interactions │
│                                                                          │
│  Example: d=1000, L=3                                                   │
│  Cross Network: 6,000 parameters                                        │
│  Full polynomial: 10^12 parameters (impossible!)                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
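The cross layer is short enough to write out directly. The following NumPy sketch (dimensions chosen for illustration) stacks a few layers and confirms the 2Ld parameter count from the diagram.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 6, 3                      # feature dim, number of cross layers

def cross_layer(x0, xl, w, b):
    # x_{l+1} = x0 * (xl^T w) + b + xl ; (xl @ w) is a scalar,
    # so the layer costs O(d) and adds only 2d parameters (w and b)
    return x0 * (xl @ w) + b + xl

x0 = rng.normal(size=d)
ws = [rng.normal(size=d) for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]

x = x0
for w, b in zip(ws, bs):
    x = cross_layer(x0, x, w, b)

n_params = sum(w.size + b.size for w, b in zip(ws, bs))
print(n_params)   # 2 * L * d = 36
```

Each pass re-multiplies by `x0`, which is how the interaction order climbs by one per layer while the parameter count stays linear in d.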

DCN-v2 (2020): The original DCN's cross layer uses rank-1 weight matrices (\mathbf{x}_0 \mathbf{x}_l^T). DCN-v2 generalizes to full-rank:

\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (\mathbf{W}_l \mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l

where \odot is the element-wise product and \mathbf{W}_l \in \mathbb{R}^{d \times d}.

This increases expressiveness at the cost of more parameters; a practical compromise uses the low-rank decomposition \mathbf{W}_l = \mathbf{U}_l \mathbf{V}_l^T where \mathbf{U}_l, \mathbf{V}_l \in \mathbb{R}^{d \times r}.
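A quick NumPy sketch of the low-rank DCN-v2 layer (sizes illustrative): applying `V.T` then `U` never materializes the d×d matrix, dropping the per-layer cost from O(d²) to O(dr) while computing the same function as the full-rank layer with \mathbf{W}_l = \mathbf{U}_l\mathbf{V}_l^T.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 1000, 32                       # feature dim, low rank (r << d)

def cross_layer_v2_lowrank(x0, xl, U, V, b):
    # DCN-v2 with W_l = U V^T: x_{l+1} = x0 ⊙ (U (V^T xl) + b) + xl
    # Two thin matvecs instead of one d x d matvec.
    return x0 * (U @ (V.T @ xl) + b) + xl

U = rng.normal(0, 0.01, (d, r))
V = rng.normal(0, 0.01, (d, r))
b = np.zeros(d)
x0 = rng.normal(size=d)
out = cross_layer_v2_lowrank(x0, x0, U, V, b)   # first cross layer (xl = x0)
```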


xDeepFM (Microsoft, 2018)

xDeepFM introduces the Compressed Interaction Network (CIN) to learn explicit, bounded-degree feature interactions at the vector level (not bit level).

Key insight: In DeepFM's DNN, interactions happen at the bit level (individual embedding dimensions). CIN operates at the vector level (entire embedding vectors), which is more interpretable and controllable.

CIN formulation:

X^k_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left(X^{k-1}_{i,*} \circ X^0_{j,*}\right)

where:

  • X^0 \in \mathbb{R}^{m \times D}: input feature embeddings (m features, D dimensions)
  • X^k \in \mathbb{R}^{H_k \times D}: output of layer k (H_k feature maps)
  • \circ: Hadamard (element-wise) product
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CIN: COMPRESSED INTERACTION NETWORK                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: m field embeddings, each D-dimensional                          │
│                                                                          │
│         X^0 = [e_1, e_2, ..., e_m]  ∈ ℝ^(m × D)                        │
│                                                                          │
│  LAYER k: Compute interactions with original embeddings                  │
│  ─────────────────────────────────────────────────────                   │
│                                                                          │
│  Step 1: Outer product along embedding dimension                        │
│                                                                          │
│          Z^k = X^(k-1) ⊗ X^0  ∈ ℝ^(H_{k-1} × m × D)                    │
│                                                                          │
│          Each Z^k_{i,j} = X^(k-1)_i ⊙ X^0_j  (Hadamard product)        │
│                                                                          │
│  Step 2: Compress along the H_{k-1} × m dimensions                      │
│                                                                          │
│          X^k_h = Σ_i Σ_j W^k_{h,i,j} · Z^k_{i,j}                        │
│                                                                          │
│          Output: X^k ∈ ℝ^(H_k × D)                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INTERACTION ORDER:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Layer 1: X^1 involves X^0 ⊗ X^0 → 2nd order interactions              │
│  Layer 2: X^2 involves X^1 ⊗ X^0 → 3rd order interactions              │
│  Layer k: Contains exactly (k+1)-order interactions                     │
│                                                                          │
│  Unlike DNN where interaction order is implicit and unbounded,          │
│  CIN gives explicit control over maximum interaction degree.            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OUTPUT: Sum pooling over each layer                                     │
│                                                                          │
│          p^k = Σ_h Σ_d X^k_{h,d}  (scalar per layer)                    │
│                                                                          │
│          y_CIN = [p^1, p^2, ..., p^T]  (T layers)                       │
│                                                                          │
│  Final: Concatenate with linear + DNN outputs                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
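The two CIN steps map cleanly onto `einsum`. This is a minimal sketch of one layer plus sum pooling, with illustrative sizes; the `id,jd->ijd` call is the Hadamard outer product and `hij,ijd->hd` is the compression.

```python
import numpy as np

rng = np.random.default_rng(3)
m, D, H1 = 5, 8, 4                  # fields, embedding dim, feature maps in layer 1

def cin_layer(X_prev, X0, W):
    # Step 1: Z[i, j, d] = X_prev[i, d] * X0[j, d]  (Hadamard along embedding dim)
    Z = np.einsum('id,jd->ijd', X_prev, X0)
    # Step 2: compress the (H_prev x m) interaction grid into H_k feature maps:
    # X_next[h, d] = sum_{i,j} W[h, i, j] * Z[i, j, d]
    return np.einsum('hij,ijd->hd', W, Z)

X0 = rng.normal(size=(m, D))
W1 = rng.normal(size=(H1, m, m))
X1 = cin_layer(X0, X0, W1)          # 2nd-order interactions, shape (H1, D)

p1 = X1.sum()                       # sum pooling → one scalar per layer
```

Feeding `X1` and `X0` back into `cin_layer` with a new weight tensor gives the 3rd-order layer, and so on, keeping interactions at the vector level throughout.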

AutoInt (2019)

AutoInt applies multi-head self-attention to learn feature interactions, treating each feature embedding as a token.

Architecture:

\tilde{\mathbf{e}}_m^{(h)} = \sum_{k=1}^{M} \alpha_{m,k}^{(h)} \left(\mathbf{W}_V^{(h)} \mathbf{e}_k\right)

where:

\alpha_{m,k}^{(h)} = \frac{\exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k))}{\sum_{l=1}^{M} \exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_l))}

\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k) = \langle \mathbf{W}_Q^{(h)} \mathbf{e}_m, \mathbf{W}_K^{(h)} \mathbf{e}_k \rangle

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AutoInt: ATTENTION FOR FEATURE INTERACTION            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INTUITION: Treat feature embeddings like tokens in a transformer       │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Input: M feature embeddings [e_1, e_2, ..., e_M]                       │
│                                                                          │
│       e_1         e_2         e_3         e_4                           │
│    (user_age)  (ad_cat)    (hour)     (device)                          │
│        │           │           │           │                             │
│        ▼           ▼           ▼           ▼                             │
│  ┌─────────────────────────────────────────────────┐                    │
│  │           Multi-Head Self-Attention             │                    │
│  │                                                 │                    │
│  │   α_11  α_12  α_13  α_14                        │                    │
│  │   α_21  α_22  α_23  α_24     (attention matrix)│                    │
│  │   α_31  α_32  α_33  α_34                        │                    │
│  │   α_41  α_42  α_43  α_44                        │                    │
│  │                                                 │                    │
│  │   α_ij = how much feature i attends to j       │                    │
│  └─────────────────────────────────────────────────┘                    │
│        │           │           │           │                             │
│        ▼           ▼           ▼           ▼                             │
│      ẽ_1         ẽ_2         ẽ_3         ẽ_4                           │
│  (contextualized embeddings)                                             │
│                                                                          │
│  WHAT ATTENTION LEARNS:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  High α_12: "user_age" strongly interacts with "ad_category"            │
│  High α_34: "hour" strongly interacts with "device"                     │
│                                                                          │
│  Unlike FM (all pairs weighted equally by dot product),                 │
│  attention learns WHICH interactions matter for each example.           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STACKING: Multiple attention layers = higher-order interactions        │
│                                                                          │
│  Layer 1: ẽ^1 = Attn(e, e, e)      → 2nd order                         │
│  Layer 2: ẽ^2 = Attn(ẽ^1, ẽ^1, ẽ^1) → 3rd order                        │
│  Layer L: up to (L+1)-order interactions                                │
│                                                                          │
│  Residual connections: ẽ^l = ẽ^(l-1) + Attn(...)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Advantages of attention-based interaction:

  1. Dynamic: Attention weights depend on the specific input (unlike FM's fixed weights)
  2. Interpretable: Can visualize which features interact
  3. Efficient: Self-attention is well-optimized on modern hardware
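A single interacting layer, single head, can be sketched in a few lines of NumPy (sizes illustrative; the head dimension is set equal to the embedding dimension so the residual add works without a projection):

```python
import numpy as np

rng = np.random.default_rng(4)
M, d, d_k = 4, 8, 8                 # number of fields, embedding dim, head dim (= d here)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interacting_layer(E, WQ, WK, WV):
    # psi(e_m, e_k) = <WQ e_m, WK e_k>; alpha = row-wise softmax; residual add
    Q, K, V = E @ WQ.T, E @ WK.T, E @ WV.T
    alpha = softmax(Q @ K.T, axis=-1)      # (M, M) attention matrix
    return E + alpha @ V                   # contextualized embeddings

E = rng.normal(size=(M, d))
WQ, WK, WV = (rng.normal(0, 0.1, (d_k, d)) for _ in range(3))
E_tilde = interacting_layer(E, WQ, WK, WV)
```

Stacking the call L times yields interactions up to order L+1, as in the diagram; row `m` of `alpha` is exactly the \alpha_{m,k} distribution from the formula above.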

FiBiNET (Sina Weibo, 2019)

FiBiNET introduces SENET-like attention to dynamically reweight feature importance before interaction.

Key innovation: Not all features are equally important for every prediction. FiBiNET learns to squeeze (aggregate) and excite (reweight) features.

SENET Layer:

z_i = F_{sq}(\mathbf{e}_i) = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)} (squeeze: average pooling per field)

\mathbf{A} = F_{ex}(\mathbf{z}) = \sigma_2(\mathbf{W}_2 \cdot \sigma_1(\mathbf{W}_1 \cdot \mathbf{z})) (excite: two FC layers)

\mathbf{V} = F_{scale}(\mathbf{A}, \mathbf{E}) = [\mathbf{a}_1 \cdot \mathbf{e}_1, ..., \mathbf{a}_f \cdot \mathbf{e}_f] (reweight)

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FiBiNET: FEATURE IMPORTANCE LEARNING                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Not all features equally important for all predictions        │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Example: Predicting click for "luxury watch" ad                        │
│                                                                          │
│  Important features: user_income, user_age, user_interests              │
│  Less important: hour_of_day, browser_type                              │
│                                                                          │
│  But for "fast food" ad:                                                │
│  Important features: hour_of_day, user_location                         │
│  Less important: user_income                                            │
│                                                                          │
│  SENET learns to dynamically reweight based on the input!               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ARCHITECTURE:                                                           │
│                                                                          │
│  Input embeddings: E = [e_1, e_2, ..., e_f]  (f fields)                 │
│                                                                          │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  SQUEEZE: Global average pooling per field           │               │
│  │                                                      │               │
│  │  z_i = mean(e_i)  →  z = [z_1, z_2, ..., z_f]       │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  EXCITE: Two FC layers with reduction ratio r        │               │
│  │                                                      │               │
│  │  s = W_1 · z       (f → f/r)                        │               │
│  │  s = ReLU(s)                                        │               │
│  │  a = W_2 · s       (f/r → f)                        │               │
│  │  a = sigmoid(a)    (importance weights)             │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  REWEIGHT: Scale embeddings by importance            │               │
│  │                                                      │               │
│  │  v_i = a_i · e_i   →  V = [v_1, v_2, ..., v_f]      │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  Bilinear interaction on reweighted embeddings       │               │
│  └──────────────────────────────────────────────────────┘               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
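The squeeze-excite-reweight pipeline fits in one function. A minimal NumPy sketch, with illustrative sizes and random weights standing in for the trained FC layers:

```python
import numpy as np

rng = np.random.default_rng(5)
f, k, r = 6, 8, 2                   # fields, embedding dim, reduction ratio

def senet_layer(E, W1, W2):
    z = E.mean(axis=1)                                      # squeeze: one scalar per field, (f,)
    a = 1 / (1 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))     # excite: f -> f/r -> f, sigmoid weights
    return a[:, None] * E                                   # reweight each field's embedding

E = rng.normal(size=(f, k))
W1 = rng.normal(0, 0.1, (f // r, f))
W2 = rng.normal(0, 0.1, (f, f // r))
V = senet_layer(E, W1, W2)          # same shape as E, importance-scaled
```

The bottleneck (`f → f/r → f`) keeps the excite network tiny while forcing it to summarize cross-field importance, mirroring the original SENET design for channels.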

Part IV: User Behavior Sequence Modeling

The Behavior Sequence Problem

So far, we've treated user features as static. But in advertising, user behavior history is critical. A user who just searched for "running shoes" is much more likely to click on a Nike ad than their static demographic profile suggests.

Challenge: How do we model sequences of past behaviors to predict future ad clicks?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    USER BEHAVIOR IN AD PREDICTION                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User's recent behavior sequence:                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  t-5: Viewed "Nike Air Max" product page                                │
│  t-4: Searched "best running shoes 2024"                                │
│  t-3: Clicked ad for "Adidas Ultraboost"                                │
│  t-2: Read article "Marathon Training Guide"                            │
│  t-1: Added "Running Socks" to cart                                     │
│  t:   Current ad impression: "Nike Running Shoes"                       │
│                                                                          │
│  QUESTION: How does this history inform P(click)?                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  NAIVE APPROACH: Aggregate all behaviors                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  user_embedding = mean([e_nike, e_search, e_adidas, e_article, e_socks])│
│                                                                          │
│  Problems:                                                               │
│  • All behaviors weighted equally                                       │
│  • Recent behaviors not prioritized                                     │
│  • Relationship to target ad ignored                                    │
│                                                                          │
│  BETTER: Weight behaviors by relevance to current ad                    │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  For "Nike Running Shoes" ad:                                           │
│  • "Nike Air Max" view: HIGH relevance (same brand + category)          │
│  • "Adidas Ultraboost" click: MEDIUM relevance (competitor)             │
│  • "Marathon Training" read: MEDIUM relevance (related interest)        │
│  • "Running Socks" cart: LOW relevance (different product type)         │
│                                                                          │
│  This is the core idea behind DIN, DIEN, and related models.           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Deep Interest Network (DIN) (Alibaba, 2018)

DIN introduces target-aware attention over user behavior sequences. Instead of treating all past behaviors equally, DIN computes attention weights based on relevance to the target ad.

Key insight: User interests are diverse and locally activated. When predicting clicks on a "Nike shoe" ad, past behaviors related to sports/shoes should matter more than unrelated behaviors.

Attention mechanism:

\mathbf{v}_U = f(\mathbf{v}_A) = \sum_{j=1}^{N} a(\mathbf{e}_j, \mathbf{e}_A) \cdot \mathbf{e}_j

where:

  • \mathbf{e}_j = embedding of behavior j
  • \mathbf{e}_A = embedding of target ad
  • a(\cdot, \cdot) = attention function

Attention function (activation unit):

a(\mathbf{e}_j, \mathbf{e}_A) = \mathbf{w}^T \cdot \text{PReLU}\left(\mathbf{W} \cdot [\mathbf{e}_j, \mathbf{e}_A, \mathbf{e}_j \odot \mathbf{e}_A, \mathbf{e}_j - \mathbf{e}_A] + \mathbf{b}\right)

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIN: DEEP INTEREST NETWORK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TARGET AD: Nike Running Shoes (embedding e_A)                          │
│                                                                          │
│  USER BEHAVIOR HISTORY:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│     e_1          e_2          e_3          e_4          e_5             │
│   (Nike Air)  (Search)    (Adidas)    (Article)    (Socks)              │
│       │           │           │           │           │                  │
│       └───────────┴───────────┴───────────┴───────────┘                  │
│                               │                                          │
│                   ┌───────────┴───────────┐                              │
│                   │   Attention Unit       │                             │
│                   │   a(e_j, e_A)         │← Target ad e_A              │
│                   └───────────┬───────────┘                              │
│                               │                                          │
│           ┌───────────────────┼───────────────────┐                      │
│           │                   │                   │                      │
│        a_1=0.6            a_2=0.3            a_3=0.4  ...                │
│           │                   │                   │                      │
│           ▼                   ▼                   ▼                      │
│        a_1·e_1 +          a_2·e_2 +          a_3·e_3 + ...              │
│                               │                                          │
│                               ▼                                          │
│                    ┌─────────────────┐                                  │
│                    │  User Interest  │                                  │
│                    │  Representation │                                  │
│                    │      v_U        │                                  │
│                    └────────┬────────┘                                  │
│                             │                                            │
│                             ▼                                            │
│               ┌─────────────────────────────┐                           │
│               │  Concatenate with other     │                           │
│               │  features → MLP → P(click)  │                           │
│               └─────────────────────────────┘                           │
│                                                                          │
│  KEY PROPERTIES:                                                         │
│  ───────────────                                                         │
│  1. Attention weights NOT normalized (sum ≠ 1)                          │
│     - Allows varying total interest intensity                           │
│     - User with strong interest → larger ||v_U||                        │
│                                                                          │
│  2. Activation unit uses both similarity AND difference                 │
│     - [e_j, e_A]: raw features                                          │
│     - [e_j ⊙ e_A]: element-wise product (similarity)                   │
│     - [e_j - e_A]: difference (captures contrast)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why unnormalized attention?

In standard attention (e.g., transformers), weights sum to 1. DIN deliberately avoids normalization:

  • Normalized: User with 10 relevant items and user with 1 relevant item produce similar magnitude outputs
  • Unnormalized: User with more relevant items has larger interest representation, capturing interest intensity
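The activation unit and unnormalized pooling can be sketched as follows in NumPy (all sizes and weights illustrative; a two-layer MLP stands in for the paper's activation unit):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, H = 5, 8, 16                  # history length, embedding dim, hidden units

def prelu(x, alpha=0.25):
    return np.where(x > 0, x, alpha * x)

def din_pool(E_hist, e_ad, W, b, w):
    # Activation-unit input per behavior: [e_j, e_A, e_j ⊙ e_A, e_j - e_A]
    feats = np.concatenate(
        [E_hist, np.tile(e_ad, (len(E_hist), 1)), E_hist * e_ad, E_hist - e_ad],
        axis=1)                                  # (N, 4d)
    a = prelu(feats @ W.T + b) @ w               # (N,) raw weights — note: NO softmax
    return a @ E_hist                            # weighted sum = interest vector v_U

E_hist = rng.normal(size=(N, d))
e_ad = rng.normal(size=d)
W = rng.normal(0, 0.1, (H, 4 * d)); b = np.zeros(H); w = rng.normal(0, 0.1, H)
v_U = din_pool(E_hist, e_ad, W, b, w)
```

Because the weights `a` are never normalized, adding more relevant behaviors grows ||v_U|| instead of diluting it, which is the intensity-preserving property described above.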

Deep Interest Evolution Network (DIEN) (Alibaba, 2019)

DIEN extends DIN by modeling the temporal evolution of user interests, not just their static representation.

Key insight: User interests evolve over time. The sequence [search shoes → view Nike → view Adidas → buy Nike] tells a story of interest development that static attention misses.

Two-layer architecture:

  1. Interest Extractor Layer: GRU captures sequential patterns
  2. Interest Evolving Layer: Attention-augmented GRU focuses on target-relevant evolution
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIEN: INTEREST EVOLUTION NETWORK                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LAYER 1: INTEREST EXTRACTOR (GRU)                                      │
│  ─────────────────────────────────                                       │
│                                                                          │
│  b_1 → b_2 → b_3 → b_4 → b_5   (behavior sequence)                      │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐                                          │
│  │GRU│→│GRU│→│GRU│→│GRU│→│GRU│                                          │
│  └──┘  └──┘  └──┘  └──┘  └──┘                                          │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  h_1    h_2    h_3    h_4    h_5   (hidden states = interest states)    │
│                                                                          │
│  Auxiliary loss: Predict next behavior from h_t                         │
│  L_aux = -Σ [log σ(h_t·e_{b_{t+1}}) + log(1 - σ(h_t·e_{b'_t}))]          │
│          (positive + negative samples)                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 2: INTEREST EVOLVING (AUGRU)                                     │
│  ──────────────────────────────────                                      │
│                                                                          │
│  h_1    h_2    h_3    h_4    h_5                                        │
│   │      │      │      │      │                                          │
│   │   ┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐                                     │
│   │   │Attn ││Attn ││Attn ││Attn │  ← attention w.r.t. target ad       │
│   │   └──┬──┘└──┬──┘└──┬──┘└──┬──┘                                     │
│   │   a_1│   a_2│   a_3│   a_4│                                         │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  ┌────┐┌────┐┌────┐┌────┐┌────┐                                        │
│  │AUGRU││AUGRU││AUGRU││AUGRU││AUGRU│                                    │
│  └────┘└────┘└────┘└────┘└────┘                                        │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  h'_1   h'_2   h'_3   h'_4   h'_5                                       │
│                               │                                          │
│                               ▼                                          │
│                    Final interest state h'_T                            │
│                    (used for prediction)                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  AUGRU (Attention Update GRU):                                          │
│  ─────────────────────────────                                           │
│                                                                          │
│  Standard GRU:                                                          │
│    ũ_t = σ(W_u · [h_{t-1}, i_t] + b_u)     (update gate)               │
│    h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, i_t])  (candidate)                │
│    h_t = (1 - ũ_t) ⊙ h_{t-1} + ũ_t ⊙ h̃_t                              │
│                                                                          │
│  AUGRU modification:                                                     │
│    u'_t = a_t · ũ_t                         (attention-scaled update)  │
│    h'_t = (1 - u'_t) ⊙ h'_{t-1} + u'_t ⊙ h̃_t                          │
│                                                                          │
│  Effect: Low attention a_t → small update → ignore this behavior       │
│          High attention a_t → normal update → incorporate behavior      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Auxiliary loss for interest extraction:

The auxiliary loss ensures hidden states \mathbf{h}_t actually capture user interests by requiring them to predict the next behavior:

\mathcal{L}_{aux} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\left[\log \sigma(\mathbf{h}_t^i \cdot \mathbf{e}_{b_{t+1}}^i) + \log(1 - \sigma(\mathbf{h}_t^i \cdot \mathbf{e}_{b'_t}^i))\right]

where b'_t is a negative sample (an item the user did not interact with).
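The AUGRU modification above is easy to state in code. Below is a minimal numpy sketch of a single AUGRU step (the `augru_step` name, the weight shapes, and the omitted bias terms are illustrative choices, not from the paper): with attention a_t = 0 the hidden state passes through unchanged, which is exactly the "ignore this behavior" effect described in the diagram.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augru_step(h_prev, x_t, a_t, params):
    """One AUGRU step: the attention score a_t rescales the update gate,
    so low-attention behaviors barely change the hidden state."""
    Wu, Wr, Wh = params["Wu"], params["Wr"], params["Wh"]
    concat = np.concatenate([h_prev, x_t])
    u = sigmoid(Wu @ concat)                                # update gate
    r = sigmoid(Wr @ concat)                                # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    u_scaled = a_t * u                                      # attention-scaled update (AUGRU)
    return (1 - u_scaled) * h_prev + u_scaled * h_cand

rng = np.random.default_rng(0)
d = 4
params = {k: rng.normal(size=(d, 2 * d)) * 0.1 for k in ("Wu", "Wr", "Wh")}
h = np.zeros(d)
x = rng.normal(size=d)

h_ignored = augru_step(h, x, a_t=0.0, params=params)  # zero attention: state unchanged
h_used = augru_step(h, x, a_t=1.0, params=params)     # full attention: normal GRU update
```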


DSIN: Deep Session Interest Network (Alibaba, 2019)

DSIN recognizes that user behavior naturally clusters into sessions. Within a session, behaviors are highly related; across sessions, interests may differ significantly.

Architecture:

  1. Session Division: Split behavior sequence into sessions (e.g., by time gaps)
  2. Intra-Session Interest: Self-attention within each session
  3. Inter-Session Interest: Bi-LSTM across sessions to capture evolution
  4. Session Interest Activation: Attention with target ad
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DSIN: SESSION-BASED INTEREST MODELING                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAW BEHAVIOR SEQUENCE:                                                  │
│  ──────────────────────                                                  │
│  [b1, b2, b3] | gap | [b4, b5] | gap | [b6, b7, b8, b9]                │
│  └──Session 1──┘     └Session 2┘     └───Session 3────┘                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 1: INTRA-SESSION (Self-Attention per session)                    │
│  ───────────────────────────────────────────────────                    │
│                                                                          │
│  Session 1: [b1, b2, b3] → Self-Attention → Interest I_1                │
│  Session 2: [b4, b5]     → Self-Attention → Interest I_2                │
│  Session 3: [b6, b7, b8, b9] → Self-Attention → Interest I_3            │
│                                                                          │
│  Self-attention captures relationships within session:                  │
│  "viewed Nike, then searched shoes, then viewed Nike sizes"             │
│  → coherent shopping intent                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 2: INTER-SESSION (Bi-LSTM across sessions)                       │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│    I_1 ───→ I_2 ───→ I_3      (forward LSTM)                           │
│    I_1 ←─── I_2 ←─── I_3      (backward LSTM)                          │
│                                                                          │
│  Captures interest evolution across sessions:                           │
│  "First explored options, then compared prices, then ready to buy"      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 3: SESSION ACTIVATION (Target-aware attention)                   │
│  ────────────────────────────────────────────────────                    │
│                                                                          │
│    [I_1, I_2, I_3] × Attention(target_ad) → weighted sum                │
│                                                                          │
│  Recent relevant session may be more important than old relevant one    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
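DSIN's first stage, session division, is simple to sketch. The helper below (a hypothetical `split_sessions` with a 30-minute gap threshold; the paper's exact threshold may differ) splits a time-ordered behavior list into sessions, mirroring the three-session sequence in the diagram.

```python
def split_sessions(events, gap_minutes=30):
    """events: time-ordered list of (item_id, timestamp_minutes).
    Start a new session whenever the gap between consecutive
    behaviors exceeds the threshold (DSIN's session division step)."""
    sessions = []
    for item, ts in events:
        if sessions and ts - sessions[-1][-1][1] <= gap_minutes:
            sessions[-1].append((item, ts))   # same session: small gap
        else:
            sessions.append([(item, ts)])     # new session: large gap
    return sessions

events = [("b1", 0), ("b2", 5), ("b3", 12),                      # session 1
          ("b4", 300), ("b5", 310),                              # session 2
          ("b6", 900), ("b7", 905), ("b8", 911), ("b9", 920)]    # session 3
sessions = split_sessions(events)
```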

SIM: Search-based Interest Model (Alibaba, 2020)

For users with very long behavior histories (thousands of items), attention over the full sequence is too slow. SIM introduces a two-stage approach: first retrieve relevant behaviors, then apply attention.

Architecture:

  1. General Search Unit (GSU): Fast retrieval of top-K relevant behaviors
  2. Exact Search Unit (ESU): Precise attention over retrieved behaviors

GSU (soft search):

\text{rel}(b_i, a) = \mathbf{e}_{b_i}^T \mathbf{W} \mathbf{e}_a

Select top-K behaviors with highest relevance scores.

GSU (hard search):

Use category/brand matching to retrieve candidates, e.g., "all behaviors in same category as target ad."

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SIM: HANDLING LONG BEHAVIOR SEQUENCES                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: User has 10,000 past behaviors                                │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  DIN/DIEN: Attention over 10,000 items → O(10,000) per inference        │
│  → Too slow for real-time serving (<10ms budget)                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SIM SOLUTION: Two-stage retrieval + attention                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  All behaviors (10,000)                                                  │
│         │                                                                │
│         ▼                                                                │
│  ┌─────────────────────────────────────┐                                │
│  │  STAGE 1: General Search Unit (GSU) │                                │
│  │                                     │                                │
│  │  Option A: Soft search              │                                │
│  │    score_i = e_i^T W e_target       │                                │
│  │    Keep top-K (e.g., K=100)         │                                │
│  │                                     │                                │
│  │  Option B: Hard search              │                                │
│  │    Filter by category/brand match   │                                │
│  │    (Very fast, O(1) with index)    │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│         Relevant behaviors (100)                                        │
│                   │                                                      │
│                   ▼                                                      │
│  ┌─────────────────────────────────────┐                                │
│  │  STAGE 2: Exact Search Unit (ESU)   │                                │
│  │                                     │                                │
│  │  Full attention (like DIN/DIEN)     │                                │
│  │  over the K retrieved behaviors     │                                │
│  │                                     │                                │
│  │  Time encoding: add position info   │                                │
│  │  for temporal awareness             │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│                   ▼                                                      │
│          User interest vector                                           │
│                                                                          │
│  COMPLEXITY REDUCTION:                                                   │
│  ─────────────────────                                                   │
│  Original: O(N) attention = O(10,000)                                   │
│  SIM:      O(K) attention = O(100)     → 100x speedup!                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
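The two-stage pipeline can be sketched in a few lines of numpy. This is an illustrative soft-search GSU plus a dot-product-attention ESU (the function name, the choice of W, and the attention form are simplifications of mine, not the paper's exact parameterization):

```python
import numpy as np

def sim_interest(behaviors, target, W, k=100):
    """Two-stage SIM sketch: the GSU scores every behavior embedding
    against the target ad and keeps the top-k; the ESU then runs
    softmax attention over only those k behaviors."""
    # Stage 1 (GSU, soft search): rel_i = e_i^T W e_target, keep top-k
    scores = behaviors @ W @ target
    retrieved = behaviors[np.argsort(scores)[-k:]]
    # Stage 2 (ESU): softmax attention over the retrieved subset
    att = retrieved @ target
    att = np.exp(att - att.max())
    att /= att.sum()
    return att @ retrieved            # user interest vector

rng = np.random.default_rng(0)
d, n = 8, 10_000                      # 10,000 past behaviors, as in the diagram
behaviors = rng.normal(size=(n, d))
target = rng.normal(size=d)
W = np.eye(d)
interest = sim_interest(behaviors, target, W, k=100)  # attention over 100, not 10,000
```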

Part V: Multi-Task Learning for Advertising

The Multi-Objective Problem

In advertising, we care about multiple outcomes:

  • Click: Did user click the ad? (CTR)
  • Conversion: Did user complete a purchase? (CVR)
  • Engagement: Did user spend time on landing page?
  • Long-term: Did user become a repeat customer?

These objectives are related but distinct. Multi-task learning (MTL) enables joint modeling.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    WHY MULTI-TASK LEARNING?                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OPTION 1: Separate models                                              │
│  ─────────────────────────                                               │
│                                                                          │
│  CTR Model:  Features → DNN_1 → P(click)                                │
│  CVR Model:  Features → DNN_2 → P(conversion)                           │
│                                                                          │
│  Problems:                                                               │
│  • No shared learning (features learned independently)                  │
│  • Inconsistent predictions (CTR and CVR may disagree)                  │
│  • Sample selection bias for CVR (only see conversions after clicks)   │
│  • 2x serving cost                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTION 2: Multi-task model                                             │
│  ──────────────────────────                                              │
│                                                                          │
│  Features → Shared Layers → Task-Specific Heads → [P(click), P(conv)]   │
│                                                                          │
│  Benefits:                                                               │
│  • Shared representations learn general patterns                        │
│  • Transfer learning between tasks                                      │
│  • Single model for serving                                             │
│  • Auxiliary tasks regularize main task                                 │
│                                                                          │
│  Challenges:                                                             │
│  • Negative transfer (tasks may conflict)                               │
│  • Gradient balancing between tasks                                     │
│  • Different task difficulties                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Shared-Bottom Architecture

The simplest MTL approach: share bottom layers, use task-specific towers on top.

Architecture:

\mathbf{h}^{\text{shared}} = f^{\text{shared}}(\mathbf{x})

y_k = f_k^{\text{tower}}(\mathbf{h}^{\text{shared}}) \quad \text{for task } k

Loss:

\mathcal{L} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k

where \lambda_k balances task importance.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SHARED-BOTTOM ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│              Task 1        Task 2        Task 3                         │
│              (CTR)         (CVR)        (LTV)                           │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Tower 1 │    │Tower 2 │    │Tower 3 │                        │
│           │ (MLP)  │    │ (MLP)  │    │ (MLP)  │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Shared    │                                  │
│                        │   Bottom    │                                  │
│                        │   (MLP)     │                                  │
│                        └──────┬──────┘                                  │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │  Embedding  │                                  │
│                        │   Layer     │                                  │
│                        └──────┬──────┘                                  │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Input     │                                  │
│                        │  Features   │                                  │
│                        └─────────────┘                                  │
│                                                                          │
│  LIMITATION: Assumes all tasks benefit from same representation         │
│  → Negative transfer when tasks conflict                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
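A minimal numpy sketch of the shared-bottom forward pass and the weighted loss above (a single shared layer with linear towers; all names and shapes are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_bottom_forward(x, W_shared, towers):
    """One shared layer feeds every task; each task has its own tower."""
    h = relu(W_shared @ x)                              # shared representation
    return {task: sigmoid(w @ h) for task, w in towers.items()}

def mtl_loss(preds, labels, lambdas):
    """Weighted sum of per-task log losses: L = sum_k lambda_k * L_k."""
    eps = 1e-9
    return sum(
        lambdas[t] * -(labels[t] * np.log(preds[t] + eps)
                       + (1 - labels[t]) * np.log(1 - preds[t] + eps))
        for t in preds)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W_shared = rng.normal(size=(8, 16)) * 0.1
towers = {"ctr": rng.normal(size=8) * 0.1, "cvr": rng.normal(size=8) * 0.1}
preds = mtl_loss_input = shared_bottom_forward(x, W_shared, towers)
loss = mtl_loss(preds, {"ctr": 1.0, "cvr": 0.0}, {"ctr": 1.0, "cvr": 0.5})
```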

MMOE: Multi-gate Mixture-of-Experts (Google, 2018)

MMOE addresses negative transfer by learning task-specific combinations of shared experts.

Key idea: Instead of one shared bottom, use multiple expert networks and let each task decide which experts to use.

Architecture:

y_k = h_k\left(\sum_{i=1}^{n} g_k^{(i)}(\mathbf{x}) \cdot f_i(\mathbf{x})\right)

where:

  • f_i(\mathbf{x}) = expert i's output
  • g_k(\mathbf{x}) = \text{softmax}(\mathbf{W}_{gk} \mathbf{x}) = task k's gating weights
  • h_k = task k's tower
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MMOE: MIXTURE OF EXPERTS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│              Task 1        Task 2        Task 3                         │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Tower 1 │    │Tower 2 │    │Tower 3 │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Gate 1  │    │Gate 2  │    │Gate 3  │                        │
│           │.3 .5 .2│    │.6 .2 .2│    │.1 .3 .6│                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                         weighted sum                                     │
│                ┌──────────────┼─────────────┐                            │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Expert 1│    │Expert 2│    │Expert 3│                        │
│           │ (MLP)  │    │ (MLP)  │    │ (MLP)  │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Input     │                                  │
│                        └─────────────┘                                  │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ───────────                                                             │
│  • Task 1 uses mostly Expert 2 (weights [.3, .5, .2])                   │
│  • Task 3 uses mostly Expert 3 (weights [.1, .3, .6])                   │
│  • Tasks can specialize while still sharing some computation            │
│                                                                          │
│  GATING MECHANISM:                                                       │
│  ─────────────────                                                       │
│  g_k(x) = softmax(W_k · x)                                              │
│                                                                          │
│  Gate is input-dependent: different inputs may use different expert     │
│  combinations even for the same task                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
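The gating equation can be written out directly. The numpy toy below (tanh experts, linear gates and towers, all simplifications of mine) shows each task mixing the same expert outputs with its own input-dependent softmax gate:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def mmoe_forward(x, experts, gates, towers):
    """MMOE sketch: every task sees the same expert outputs f_i(x) but
    mixes them with its own gate g_k(x) = softmax(W_k @ x)."""
    f = np.stack([np.tanh(We @ x) for We in experts])   # (n_experts, d_h)
    out = {}
    for task in gates:
        g = softmax(gates[task] @ x)                    # task-specific gate weights
        mixed = g @ f                                   # weighted sum of experts
        out[task] = 1.0 / (1.0 + np.exp(-(towers[task] @ mixed)))
    return out

rng = np.random.default_rng(0)
d_in, d_h, n_exp = 16, 8, 3
experts = [rng.normal(size=(d_h, d_in)) * 0.1 for _ in range(n_exp)]
gates = {t: rng.normal(size=(n_exp, d_in)) * 0.1 for t in ("ctr", "cvr")}
towers = {t: rng.normal(size=d_h) * 0.1 for t in ("ctr", "cvr")}
preds = mmoe_forward(rng.normal(size=d_in), experts, gates, towers)
```

Because the gate depends on x, two different impressions can route the same task through different expert mixtures, which is the input-dependence the diagram calls out.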

PLE: Progressive Layered Extraction (Tencent, 2020)

PLE extends MMOE by separating task-specific experts from shared experts, and progressively refining representations across layers.

Key insight: Some knowledge should be task-specific from the start, not just at the tower level.

Architecture:

At each layer l:

\mathbf{h}_k^{(l)} = g_k^{(l)}(\mathbf{h}^{(l-1)}) \cdot [\mathbf{E}_k^{(l)}(\mathbf{h}_k^{(l-1)}); \mathbf{E}_s^{(l)}(\mathbf{h}_s^{(l-1)})]

where:

  • \mathbf{E}_k^{(l)} = task-specific experts for task k
  • \mathbf{E}_s^{(l)} = shared experts
  • g_k^{(l)} = gating network selecting from both
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLE: PROGRESSIVE LAYERED EXTRACTION                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    Task A Tower        Task B Tower                      │
│                         │                   │                            │
│              ┌──────────┴───────────────────┴──────────┐                │
│              │                                          │                │
│              ▼                                          ▼                │
│  EXTRACTION LAYER 2:                                                     │
│  ──────────────────────────────────────────────────────────────────     │
│                                                                          │
│   Task A         Shared          Task B                                 │
│   Experts        Experts         Experts                                │
│   ┌───┐┌───┐    ┌───┐┌───┐     ┌───┐┌───┐                              │
│   │EA1││EA2│    │ES1││ES2│     │EB1││EB2│                              │
│   └─┬─┘└─┬─┘    └─┬─┘└─┬─┘     └─┬─┘└─┬─┘                              │
│     │    │        │    │         │    │                                  │
│     └────┴────────┴────┴─────────┴────┘                                  │
│           │                │                                             │
│     Gate A selects   Gate B selects                                     │
│     from all 6       from all 6                                         │
│           │                │                                             │
│  ─────────┴────────────────┴─────────────────────────────────────────   │
│                                                                          │
│  EXTRACTION LAYER 1:                                                     │
│  ──────────────────────────────────────────────────────────────────     │
│                                                                          │
│   Task A         Shared          Task B                                 │
│   Experts        Experts         Experts                                │
│   ┌───┐┌───┐    ┌───┐┌───┐     ┌───┐┌───┐                              │
│   │EA1││EA2│    │ES1││ES2│     │EB1││EB2│                              │
│   └─┬─┘└─┬─┘    └─┬─┘└─┬─┘     └─┬─┘└─┬─┘                              │
│     │    │        │    │         │    │                                  │
│     └────┴────────┴────┴─────────┴────┘                                  │
│                    │                                                     │
│              ┌─────┴─────┐                                              │
│              │   Input   │                                              │
│              └───────────┘                                              │
│                                                                          │
│  COMPARISON WITH MMOE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  MMOE:                                                                   │
│  • All experts are shared                                               │
│  • Task-specific learning only in towers                                │
│                                                                          │
│  PLE:                                                                    │
│  • Explicit task-specific experts at each layer                         │
│  • Progressive refinement through multiple extraction layers            │
│  • Better handles conflicting tasks                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
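One extraction layer can be sketched as follows, assuming two experts per group and linear gates (a simplification; the paper uses MLP experts and stacks multiple extraction layers). The key routing rule is that each task's gate sees only its own experts plus the shared ones, while the shared gate sees all of them:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def ple_layer(h_a, h_s, h_b, E_a, E_s, E_b, g_a, g_s, g_b):
    """One PLE extraction layer: task gates mix own + shared experts;
    the shared gate mixes all experts."""
    f_a = [np.tanh(W @ h_a) for W in E_a]   # task-A experts
    f_s = [np.tanh(W @ h_s) for W in E_s]   # shared experts
    f_b = [np.tanh(W @ h_b) for W in E_b]   # task-B experts
    out_a = softmax(g_a @ h_a) @ np.stack(f_a + f_s)         # A: own + shared
    out_s = softmax(g_s @ h_s) @ np.stack(f_a + f_s + f_b)   # shared: all six
    out_b = softmax(g_b @ h_b) @ np.stack(f_b + f_s)         # B: own + shared
    return out_a, out_s, out_b

rng = np.random.default_rng(0)
d = 8
E_a = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
E_s = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
E_b = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
g_a = rng.normal(size=(4, d))   # selects among 2 own + 2 shared experts
g_s = rng.normal(size=(6, d))   # selects among all 6 experts
g_b = rng.normal(size=(4, d))
x = rng.normal(size=d)
a1, s1, b1 = ple_layer(x, x, x, E_a, E_s, E_b, g_a, g_s, g_b)  # layer 1: all inputs = x
```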

ESMM: Entire Space Multi-Task Model (Alibaba, 2018)

ESMM addresses the sample selection bias problem in CVR prediction.

The problem:

  • CVR (conversion rate) = P(conversion | click)
  • Training data: Only users who clicked
  • Deployment: Predict for all impressions (including non-clickers)

This is sample selection bias: the training distribution differs from the deployment distribution.

ESMM's solution: Model the entire sample space using the decomposition:

P(\text{conversion}) = P(\text{click}) \times P(\text{conversion} \mid \text{click})

or equivalently:

\text{CTCVR} = \text{CTR} \times \text{CVR}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ESMM: SAMPLE SELECTION BIAS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│                                                                          │
│                   All Impressions (1M)                                   │
│                          │                                               │
│              ┌───────────┴───────────┐                                  │
│              │                       │                                   │
│        Clicks (30K)           No Click (970K)                           │
│              │                       │                                   │
│      ┌───────┴───────┐               ×                                  │
│      │               │          (no conversion                          │
│  Conversion (1K)  No Conv (29K)  data here!)                            │
│                                                                          │
│  Traditional CVR model:                                                  │
│  • Trained on 30K clicks only                                           │
│  • Deployed on 1M impressions                                           │
│  • Distribution mismatch!                                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ESMM SOLUTION:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Model CTCVR (click AND convert) over entire impression space:          │
│                                                                          │
│  P(click ∩ conversion) = P(click) × P(conversion | click)               │
│        CTCVR           =   CTR    ×       CVR                           │
│                                                                          │
│  ARCHITECTURE:                                                           │
│                                                                          │
│       ┌─────────────┐                   ┌─────────────┐                 │
│       │  CTR Tower  │                   │  CVR Tower  │                 │
│       │   pCTR      │                   │   pCVR      │                 │
│       └──────┬──────┘                   └──────┬──────┘                 │
│              │                                  │                        │
│              │               ┌──────────────────┘                        │
│              │               │                                           │
│              ▼               ▼                                           │
│         ┌────────────────────────┐                                      │
│         │   pCTCVR = pCTR × pCVR │  ← Multiplied, not concatenated     │
│         └────────────────────────┘                                      │
│                                                                          │
│  TRAINING:                                                               │
│  ─────────                                                               │
│  • CTR: supervised on all impressions (click/no-click labels)           │
│  • CTCVR: supervised on all impressions (conversion labels)             │
│  • CVR: NO direct supervision—learned implicitly!                       │
│                                                                          │
│  Loss = L_CTR(pCTR, click_label) + L_CTCVR(pCTCVR, conversion_label)   │
│                                                                          │
│  BENEFIT:                                                                │
│  ────────                                                                │
│  • CVR implicitly trained on ALL samples (via CTCVR supervision)        │
│  • No sample selection bias                                             │
│  • CTR and CVR share embeddings (transfer learning)                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Mathematical insight:

Since CTCVR = CTR × CVR, the gradient flows through both towers:

\frac{\partial \mathcal{L}_{CTCVR}}{\partial \theta_{CVR}} = \frac{\partial \mathcal{L}_{CTCVR}}{\partial \text{pCTCVR}} \times \frac{\partial \text{pCTCVR}}{\partial \text{pCVR}} \times \frac{\partial \text{pCVR}}{\partial \theta_{CVR}}

= \frac{\partial \mathcal{L}_{CTCVR}}{\partial \text{pCTCVR}} \times \text{pCTR} \times \frac{\partial \text{pCVR}}{\partial \theta_{CVR}}

The CVR tower receives gradients weighted by CTR, which naturally emphasizes training on samples likely to click—exactly what we want for CVR estimation.
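The two-term training objective is compact enough to write out. Below is a numpy sketch of the ESMM loss with made-up predictions and all-impression labels (conversions occurring only where clicks occurred); note that the CVR predictions are supervised only through the product pCTCVR = pCTR × pCVR:

```python
import numpy as np

def log_loss(p, y, eps=1e-9):
    """Mean binary cross-entropy."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()

def esmm_loss(p_ctr, p_cvr, click, conv):
    """ESMM loss over the entire impression space: CTR and CTCVR get
    direct supervision; the CVR tower appears only inside the product
    pCTCVR = pCTR * pCVR, so it needs no click-conditioned labels."""
    p_ctcvr = p_ctr * p_cvr
    return log_loss(p_ctr, click) + log_loss(p_ctcvr, conv)

# All-impression labels: conversions are a subset of clicks.
click = np.array([1, 1, 0, 0, 1.0])
conv  = np.array([1, 0, 0, 0, 0.0])
p_ctr = np.array([0.9, 0.8, 0.1, 0.2, 0.7])   # hypothetical CTR tower outputs
p_cvr = np.array([0.6, 0.1, 0.5, 0.5, 0.2])   # hypothetical CVR tower outputs
loss = esmm_loss(p_ctr, p_cvr, click, conv)
```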


Part VI: Calibration and Position Bias

Why Calibration Matters

A well-calibrated model means: when you predict 10% CTR, approximately 10% of those impressions should result in clicks.

Definition (calibration):

\mathbb{E}[y \mid \hat{p}(\mathbf{x}) = p] = p \quad \forall p \in [0, 1]

In words: among all predictions with value p, the actual positive rate should be p.

Why calibration matters in advertising:

  1. Revenue optimization: Expected Revenue = pCTR × bid

    • Overestimated pCTR → overpay for impressions
    • Underestimated pCTR → lose valuable impressions
  2. Budget pacing: Advertisers set daily budgets assuming predicted CTRs are accurate

  3. Auction dynamics: Second-price auctions assume truthful bidding based on accurate value estimates
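Calibration is typically checked by binning predictions, which is exactly what a reliability diagram plots. The sketch below (a hypothetical `reliability_bins` helper) also computes the expected calibration error, the bin-weighted average gap between predicted and actual rates; on a simulator whose labels are drawn from the predicted probabilities, the ECE comes out near zero:

```python
import numpy as np

def reliability_bins(p_pred, y, n_bins=10):
    """Bucket predictions into equal-width bins and compare the mean
    predicted CTR with the empirical click rate per bucket.  Returns
    (predicted, actual) pairs per bin plus the expected calibration
    error (bin-weighted average |predicted - actual|)."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            avg_p, avg_y = p_pred[mask].mean(), y[mask].mean()
            rows.append((avg_p, avg_y))
            ece += mask.mean() * abs(avg_p - avg_y)
    return rows, ece

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.3, size=50_000)              # predicted CTRs
y = (rng.uniform(size=p.size) < p).astype(float)    # labels drawn from p: calibrated
rows, ece = reliability_bins(p, y)
```

A miscalibrated model shows up as bins where the pair (predicted, actual) sits off the diagonal, as in the reliability diagram that follows.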

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CALIBRATION IN AD SYSTEMS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WELL-CALIBRATED MODEL:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  Predicted CTR │ Actual CTR │ Calibration                               │
│  ──────────────┼────────────┼───────────────                            │
│      0.01      │    0.010   │   Perfect                                 │
│      0.05      │    0.052   │   Close                                   │
│      0.10      │    0.098   │   Close                                   │
│      0.20      │    0.195   │   Close                                   │
│                                                                          │
│  POORLY-CALIBRATED MODEL:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  Predicted CTR │ Actual CTR │ Problem                                   │
│  ──────────────┼────────────┼───────────────                            │
│      0.01      │    0.005   │   Overconfident                           │
│      0.05      │    0.030   │   Overconfident                           │
│      0.10      │    0.150   │   Underconfident                          │
│      0.20      │    0.250   │   Underconfident                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CALIBRATION PLOT (Reliability Diagram):                                │
│                                                                          │
│  Actual CTR                                                              │
│      │                              ╱                                   │
│  0.3 │                           ╱                                      │
│      │                        ╱    ● (well-calibrated)                 │
│  0.2 │                     ╱   ●                                        │
│      │                  ╱  ●                                            │
│  0.1 │               ╱●                                                 │
│      │            ╱●                                                    │
│    0 │─────────╱─────────────────────────                               │
│      0       0.1      0.2      0.3    Predicted CTR                     │
│                                                                          │
│  Perfect calibration: points lie on diagonal                            │
│  Above diagonal: underconfident (actual > predicted)                    │
│  Below diagonal: overconfident (actual < predicted)                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Calibration Methods

Platt Scaling:

Learn a post-hoc logistic transformation of the model's raw score (the logit of its predicted probability):

\hat{p}_{\text{calibrated}} = \sigma(A \cdot \text{logit}(\hat{p}_{\text{raw}}) + B)

where A, B are learned on a held-out validation set.

Isotonic Regression:

Non-parametric: learn a monotonic step function mapping raw scores to calibrated probabilities.

Temperature Scaling:

\hat{p}_{\text{calibrated}} = \sigma\left(\frac{\text{logit}(\hat{p}_{\text{raw}})}{T}\right)

where T > 1 softens predictions (less confident) and T < 1 sharpens them.
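A Platt-style fit can be done with a few lines of gradient descent on log loss over the validation set. This is a sketch, assuming the model outputs probabilities that we map back to logits first; note that temperature scaling is the special case A = 1/T, B = 0:

```python
import numpy as np

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(p_raw, y, lr=0.1, steps=2000):
    """Fit A, B in sigmoid(A * logit(p_raw) + B) by gradient descent on log loss."""
    z = logit(p_raw)
    A, B = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(A * z + B)
        g = p - y                          # dLogLoss / dlogit
        A -= lr * float(np.mean(g * z))
        B -= lr * float(np.mean(g))
    return A, B
```

Fitting on held-out data matters: fitting A, B on the training set just reproduces the model's own overconfidence.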


Position Bias

The problem: Ads in higher positions get more clicks regardless of relevance, simply because users see them first.

Observed CTR decomposition:

P(\text{click} | \text{ad}, \text{position}) = P(\text{examine} | \text{position}) \times P(\text{click} | \text{examine}, \text{ad})

where:

  • P(\text{examine} | \text{position}) = probability the user sees the ad (decreases with position)
  • P(\text{click} | \text{examine}, \text{ad}) = true relevance (what we want to estimate)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    POSITION BIAS IN AD CLICKS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER'S ATTENTION PATTERN:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Position 1: ████████████████████  P(examine) = 1.0                     │
│  Position 2: ██████████████████    P(examine) = 0.85                    │
│  Position 3: ████████████████      P(examine) = 0.70                    │
│  Position 4: ██████████████        P(examine) = 0.55                    │
│  Position 5: ████████████          P(examine) = 0.40                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE BIAS PROBLEM:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Scenario: Two ads with same true relevance (0.10 click prob if seen)   │
│                                                                          │
│  Ad A in position 1: Observed CTR = 1.0 × 0.10 = 0.10                   │
│  Ad B in position 5: Observed CTR = 0.4 × 0.10 = 0.04                   │
│                                                                          │
│  Naive model: "Ad A is 2.5x better than Ad B"  ← WRONG!                 │
│  Reality: They're equally good; position caused the difference          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS MATTERS FOR TRAINING:                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  Training data: Historical clicks with position information             │
│                                                                          │
│  If we ignore position bias:                                            │
│  • Ads that historically appeared in top positions → overestimated CTR  │
│  • Ads that historically appeared in bottom → underestimated CTR        │
│  • Rich get richer (biased ads keep getting top positions)              │
│                                                                          │
│  We need to estimate TRUE relevance, not position-confounded CTR        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Debiasing Approaches

1. Position as Feature:

Simply add position as an input feature during training, but set it to a reference position (e.g., position 1) during inference.

2. Propensity Weighting (Inverse Propensity Scoring):

Weight training samples inversely by their examination probability:

\mathcal{L} = -\sum_{i} \frac{1}{P(\text{examine} | \text{pos}_i)} \cdot \log P(\text{click} | \text{ad}_i)

This upweights samples from low positions, which users are less likely to examine in the first place.

3. Position Bias Models (PAL):

Model examination and relevance separately:

\hat{y} = \sigma(\text{logit}_{\text{relevance}}(\text{ad features}) + \text{logit}_{\text{position}}(\text{position}))

At inference, use only the relevance component.


Part VII: Real-Time Bidding (RTB)

The RTB Ecosystem

Real-Time Bidding is how most display/video ads are bought and sold. When a user loads a webpage:

  1. Publisher sends bid request to ad exchange (user info, context)
  2. Exchange broadcasts to multiple Demand-Side Platforms (DSPs)
  3. DSPs evaluate their advertisers' campaigns and submit bids
  4. Exchange runs auction, winner's ad is displayed
  5. Total time: <100ms
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RTB: REAL-TIME BIDDING FLOW                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER LOADS PAGE                                                         │
│       │                                                                  │
│       ▼  (1) Bid Request (~10ms)                                        │
│  ┌─────────────┐                                                        │
│  │  Publisher  │                                                        │
│  │   (SSP)     │                                                        │
│  └──────┬──────┘                                                        │
│         │                                                                │
│         ▼  (2) Broadcast to DSPs                                        │
│  ┌─────────────┐                                                        │
│  │ Ad Exchange │                                                        │
│  └──────┬──────┘                                                        │
│         │                                                                │
│    ┌────┴────┬─────────┬─────────┐                                      │
│    ▼         ▼         ▼         ▼                                      │
│ ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐                                     │
│ │DSP 1│  │DSP 2│  │DSP 3│  │DSP 4│  (3) Each DSP decides bid (~50ms)  │
│ └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘                                     │
│    │        │        │        │                                          │
│    │  ┌─────┴────────┴────────┴─────┐                                   │
│    │  │ For each campaign:          │                                   │
│    │  │ 1. Predict pCTR, pCVR       │                                   │
│    │  │ 2. Calculate expected value │                                   │
│    │  │ 3. Apply bidding strategy   │                                   │
│    │  │ 4. Check budget constraints │                                   │
│    │  └─────────────────────────────┘                                   │
│    │                                                                     │
│    ▼  (4) Submit bids                                                   │
│ ┌─────────────┐                                                         │
│ │ Ad Exchange │  (5) Run auction (usually 2nd price)                    │
│ └──────┬──────┘                                                         │
│        │                                                                 │
│        ▼  (6) Winner's ad served                                        │
│ ┌─────────────┐                                                         │
│ │  User sees  │                                                         │
│ │     ad      │                                                         │
│ └─────────────┘                                                         │
│                                                                          │
│  TOTAL LATENCY BUDGET: ~100ms                                           │
│  • Network round-trip: ~20ms                                            │
│  • DSP processing: ~50ms                                                │
│  • Exchange auction: ~10ms                                              │
│  • Ad rendering: ~20ms                                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Bid Optimization

The bidding problem: Given predicted CTR, conversion rate, and campaign constraints, what bid maximizes value?

Basic formulation (for CPA campaigns):

\text{bid} = \text{pCTR} \times \text{pCVR} \times \text{target CPA}

where target CPA is the advertiser's desired cost per acquisition.

Auction dynamics:

In a second-price auction, the winner pays the second-highest bid (plus a small increment). Bidding your true value is then a dominant strategy:

\text{bid}^* = \mathbb{E}[\text{value}] = \text{pCTR} \times \text{pCVR} \times \text{conversion value}

But real auctions have complications: budget constraints, pacing requirements, competition dynamics.
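As a worked example of the two bid formulas above (the numbers are illustrative):

```python
def expected_value_bid(p_ctr, p_cvr, value_per_conversion):
    """Truthful bid for a second-price auction: the impression's expected value.

    For CPA campaigns, the advertiser's target CPA plays the role of the
    conversion value: bid = pCTR x pCVR x target_CPA.
    """
    return p_ctr * p_cvr * value_per_conversion

# 2% click prob, 5% conversion prob given click, $50 per conversion
# -> the impression is worth 5 cents.
bid = expected_value_bid(0.02, 0.05, 50.0)
```

Note how small the numbers get: pCTR × pCVR is often 10⁻³ or less, which is why calibration errors at low probabilities translate directly into overpaying or underpaying.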


Budget Pacing

Problem: Advertiser has daily budget B but wants to spread impressions throughout the day (not exhaust the budget in the morning).

Pacing strategies:

  1. Probabilistic throttling: Bid on only a fraction of requests: P(\text{bid}) = \frac{B}{\text{expected daily spend without pacing}}

  2. Bid shading: Reduce bids by a pacing multiplier \lambda: \text{bid}_{\text{paced}} = \lambda \times \text{bid}_{\text{optimal}}

  3. PID controller: Dynamically adjust \lambda based on spend rate vs. target rate

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BUDGET PACING OVER A DAY                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Spend ($)                                                               │
│      │                                                                   │
│  $100│─────────────────────────────────────────────  Budget limit       │
│      │                                         ╱                        │
│   $80│                                      ╱                           │
│      │                                   ╱  ← Well-paced spend          │
│   $60│                                ╱                                 │
│      │                  ╱╱╱╱╱╱                                          │
│   $40│            ╱╱╱╱╱╱        ← Uniform pacing target                │
│      │      ╱╱╱╱╱╱                                                      │
│   $20│ ╱╱╱╱╱                                                            │
│      │╱                                                                  │
│    $0├───────────────────────────────────────────────────────           │
│      0    4    8    12   16   20   24  Hour                             │
│                                                                          │
│  WITHOUT PACING:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  Spend ($)                                                               │
│      │                                                                   │
│  $100│─────────█████████████████████████████████████  Budget exhausted │
│      │        █                                       by 2pm!           │
│   $80│       █                                                          │
│      │      █                                                           │
│   $60│     █                                                            │
│      │    █                                                             │
│   $40│   █                                                              │
│      │  █                                                               │
│   $20│ █                                                                │
│      │█                                                                 │
│    $0├───────────────────────────────────────────────────────           │
│      0    4    8    12   16   20   24  Hour                             │
│                                                                          │
│  Problem: Miss all evening traffic (often high-value!)                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

PID Controller for pacing:

\lambda_{t+1} = \lambda_t + K_p \cdot e_t + K_i \cdot \sum_{s=0}^{t} e_s + K_d \cdot (e_t - e_{t-1})

where e_t = \text{target spend rate} - \text{actual spend rate}.
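A minimal PID pacer sketch (the gains kp, ki, kd are illustrative and would be tuned per campaign):

```python
class PIDPacer:
    """PID controller adjusting the bid multiplier lambda toward a target spend rate."""

    def __init__(self, kp=0.5, ki=0.05, kd=0.1, lam=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lam = lam
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, target_rate, actual_rate):
        err = target_rate - actual_rate
        self.integral += err
        self.lam += (self.kp * err
                     + self.ki * self.integral
                     + self.kd * (err - self.prev_err))
        self.prev_err = err
        self.lam = max(0.0, self.lam)  # a bid multiplier can't go negative
        return self.lam
```

In production the controller typically runs every few minutes, comparing the campaign's spend velocity against a uniform (or traffic-weighted) target curve like the one in the diagram above.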


Game Theory of Ad Auctions

Ad auctions are strategic games where bidders compete for impressions. Understanding game-theoretic foundations is essential for optimal bidding.

Auction Formats in Digital Advertising:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AD AUCTION MECHANISMS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIRST-PRICE AUCTION:                                                    │
│  ────────────────────                                                    │
│  Winner pays their bid                                                  │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60]                                           │
│  Winner: Bidder 1 ($1.00)                                               │
│  Payment: $1.00                                                         │
│                                                                          │
│  Strategy: Shade bid below true value to increase profit margin         │
│  Equilibrium: Complex, depends on beliefs about competitors             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SECOND-PRICE AUCTION (Vickrey):                                        │
│  ───────────────────────────────                                         │
│  Winner pays second-highest bid                                         │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60]                                           │
│  Winner: Bidder 1 ($1.00)                                               │
│  Payment: $0.80 (second price)                                          │
│                                                                          │
│  Strategy: Bid true value (dominant strategy!)                          │
│  Equilibrium: Truthful bidding is optimal regardless of competitors     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENERALIZED SECOND-PRICE (GSP):                                        │
│  ───────────────────────────────                                         │
│  Multiple slots, each winner pays next-highest bid                      │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60] for 2 slots                               │
│  Slot 1: Bidder 1, pays $0.80                                          │
│  Slot 2: Bidder 2, pays $0.60                                          │
│                                                                          │
│  Strategy: NOT truthful! Bid shading is rational                        │
│  Equilibrium: Multiple equilibria exist                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Vickrey-Clarke-Groves (VCG) Mechanism:

The VCG mechanism extends second-price auctions to multiple items with the property that truthful bidding is a dominant strategy.

Payment rule:

p_i = \sum_{j \neq i} v_j(a_{-i}^*) - \sum_{j \neq i} v_j(a^*)

where:

  • a^* = optimal allocation with bidder i
  • a_{-i}^* = optimal allocation without bidder i
  • v_j(a) = bidder j's value under allocation a

Intuition: Bidder ii pays the externality they impose on others—the reduction in others' total value caused by ii's presence.

Why VCG matters: Under VCG, bidding your true value is always optimal, regardless of what others do. This simplifies bidder strategy and improves auction efficiency.
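The VCG payment rule can be computed directly from its definition. A sketch for position auctions, assuming bidder values and slot CTRs are both given sorted descending (so bidder i wins slot i):

```python
def vcg_payments(values, ctrs):
    """VCG payments for a position auction.

    values: per-click values, sorted descending (bidder i gets slot i).
    ctrs:   slot CTRs alpha_k, sorted descending, len(ctrs) <= len(values).
    Bidder i pays the welfare others lose because i is present.
    """
    def welfare(vals, slots):
        ranked = sorted(vals, reverse=True)
        return sum(v * a for v, a in zip(ranked, slots))

    payments = []
    for i in range(min(len(values), len(ctrs))):
        others = values[:i] + values[i + 1:]
        others_with_i = welfare(values, ctrs) - values[i] * ctrs[i]
        others_without_i = welfare(others, ctrs)
        payments.append(others_without_i - others_with_i)
    return payments
```

Applied to the example analyzed below (values $10, $8, $4; slot CTRs 1.0 and 0.5), it returns payments of $6 and $2, matching the hand calculation.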

Nash Equilibrium in GSP Auctions:

GSP (used by Google Ads for years) does NOT have truthful bidding as equilibrium. Instead, bidders shade bids strategically.

Symmetric Nash Equilibrium bid:

For a bidder with value v_i competing for position k:

b_i^* = v_i - \frac{(v_i - v_{i+1}) \cdot \alpha_{k+1}}{\alpha_k}

where \alpha_k is the click-through rate for position k (position 1 has the highest CTR).

Key insight: The amount of shading scales with the CTR ratio \alpha_{k+1}/\alpha_k between adjacent positions. If position 1 gets 10x more clicks than position 2, that ratio is small, competition for the top slot is fierce, and equilibrium bids stay close to true values; when adjacent positions have similar CTRs, bids are shaded heavily.
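The equilibrium bids can be computed bottom-up. This is a sketch following the standard symmetric-equilibrium recursion (Varian-style): the bidder just below the last slot bids their value, and each slot holder's bid mixes their value with the bid below according to the adjacent-CTR ratio:

```python
def gsp_sne(values, ctrs):
    """Symmetric Nash equilibrium bids and payments for GSP.

    values: per-click values sorted descending; ctrs: slot CTRs sorted descending.
    Returns (bids, expected payments per slot).
    """
    n_slots = len(ctrs)
    bids = [0.0] * len(values)
    # Bidders who win no slot bid their true value.
    for i in range(len(values) - 1, n_slots - 1, -1):
        bids[i] = values[i]
    # Slot holders, bottom-up: b_k = (1 - a_k/a_{k-1}) v_k + (a_k/a_{k-1}) b_{k+1}
    for k in range(n_slots - 1, 0, -1):
        ratio = ctrs[k] / ctrs[k - 1]
        b_next = bids[k + 1] if k + 1 < len(values) else 0.0
        bids[k] = (1 - ratio) * values[k] + ratio * b_next
    bids[0] = values[0]  # top bid doesn't affect the top bidder's own payment
    # GSP: slot k pays the next bid per click, so expected cost = b_{k+1} * alpha_k
    payments = [bids[k + 1] * ctrs[k] if k + 1 < len(values) else 0.0
                for k in range(n_slots)]
    return bids, payments
```

For values ($10, $8, $4) and CTRs (1.0, 0.5) this yields equilibrium bids ($10, $6, $4) and expected payments ($6, $2), the same revenue as VCG, which is the revenue-equivalence point the comparison table below makes.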

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EQUILIBRIUM ANALYSIS: GSP vs VCG                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Example: 3 bidders, 2 slots                                            │
│  Values: v₁ = $10, v₂ = $8, v₃ = $4                                    │
│  CTRs: α₁ = 1.0 (slot 1), α₂ = 0.5 (slot 2)                            │
│                                                                          │
│  TRUTHFUL BIDDING (VCG):                                                │
│  ───────────────────────                                                 │
│  Bids = Values: [$10, $8, $4]                                          │
│  Allocation: Bidder 1 → Slot 1, Bidder 2 → Slot 2                      │
│                                                                          │
│  VCG Payments:                                                          │
│  p₁ = (v₂·α₁ + v₃·α₂) - (v₂·α₂ + v₃·0) = ($8·1 + $4·0.5) - ($8·0.5)  │
│     = $10 - $4 = $6                                                     │
│  p₂ = (v₃·α₂) - (v₃·0) = $2 - $0 = $2                                  │
│                                                                          │
│  GSP EQUILIBRIUM (with bid shading):                                    │
│  ────────────────────────────────────                                    │
│  Equilibrium bids: b₁ > $6, b₂ = $6, b₃ = $4                           │
│  Payments: p₁ = $6·1.0 = $6, p₂ = $4·0.5 = $2 (as VCG)                 │
│                                                                          │
│  COMPARISON:                                                             │
│  ───────────                                                             │
│  │ Mechanism │ Revenue │ Efficiency │ Strategy Complexity │            │
│  ├───────────┼─────────┼────────────┼─────────────────────┤            │
│  │ VCG       │ $8      │ Optimal    │ Simple (truthful)   │            │
│  │ GSP       │ $8      │ Optimal    │ Complex (shade)     │            │
│  │ First-Prc │ Varies  │ Optimal    │ Complex (shade)     │            │
│                                                                          │
│  Revenue Equivalence Theorem: Under certain conditions, all             │
│  auction formats yield the same expected revenue!                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Reserve Prices and Optimal Auction Design:

Platforms set reserve prices r to extract more revenue from high-value bidders.

Myerson's optimal reserve (for values uniform on [0, v_{max}]):

r^* = \frac{v_{max}}{2}

Revenue impact:

\text{Revenue}(r) = r \cdot P(\text{win} | b \geq r) + \mathbb{E}[\text{second price} | b_1 \geq r, b_2 \geq r]

Setting r > 0 sacrifices some auctions (no winner) but extracts more from winners.
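The trade-off is easy to check by simulation. A sketch for two bidders with Uniform[0, 1] values, the setting in which Myerson's reserve is 1/2:

```python
import numpy as np

def second_price_revenue(reserve, n_bidders=2, n_auctions=200_000, seed=0):
    """Average revenue of a second-price auction with a reserve price."""
    rng = np.random.default_rng(seed)
    vals = np.sort(rng.uniform(0, 1, size=(n_auctions, n_bidders)), axis=1)
    top, second = vals[:, -1], vals[:, -2]
    sold = top >= reserve                # no sale if even the top value is below r
    price = np.maximum(second, reserve)  # winner pays max(second bid, reserve)
    return float(np.mean(np.where(sold, price, 0.0)))
```

With two uniform bidders the simulation reproduces the textbook numbers: revenue about 1/3 at r = 0, rising to about 5/12 at the optimal r = 1/2, despite some auctions going unsold.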

Modern auction trends:

  1. First-price auctions: Google and others moved from GSP to first-price (2019-2021)
  2. Header bidding: Multiple exchanges compete simultaneously
  3. Unified auctions: Combine direct deals with programmatic

Reinforcement Learning for Bid Optimization

Bidding is fundamentally a sequential decision problem: each bid affects budget, win rate, and future opportunities. RL provides a principled framework.

MDP Formulation for Bidding:

(\mathcal{S}, \mathcal{A}, P, R, \gamma)

State space \mathcal{S}:

s_t = (\text{budget remaining}, \text{time remaining}, \text{user features}, \text{ad features}, \text{context}, \text{historical win rate}, \text{market conditions})

Action space \mathcal{A}:

a_t = \text{bid amount} \in [0, b_{max}]

Or discretized: a_t \in \{0, 0.01, 0.02, ..., b_{max}\}

Or a bid multiplier: a_t \in \{0.5, 0.75, 1.0, 1.25, 1.5\} \times \text{base\_bid}

Transition dynamics P(s_{t+1} | s_t, a_t):

  • If win: budget decreases by payment, conversion may occur
  • If lose: budget unchanged, opportunity lost
  • Time always advances

Reward function R(s_t, a_t):

Code
R(s_t, a_t) =
  ┌ conversion_value - payment   if win and convert
  │ -payment                     if win and no convert
  └ 0                            if lose

Or for CPA goals:

R(s_t, a_t) = \mathbb{I}[\text{convert}] - \frac{\text{payment}}{\text{target\_CPA}}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RL BIDDING: MDP FORMULATION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STATE at time t:                                                        │
│  ────────────────                                                        │
│  s_t = [                                                                │
│    B_t,           # Budget remaining ($)                                │
│    T - t,         # Time remaining (hours)                              │
│    pCTR,          # Predicted click probability                         │
│    pCVR,          # Predicted conversion probability                    │
│    user_embed,    # User features (dense)                               │
│    context,       # Page, device, hour, etc.                            │
│    win_rate_t,    # Recent win rate                                     │
│    spend_rate_t   # Current spend velocity                              │
│  ]                                                                       │
│                                                                          │
│  ACTION:                                                                 │
│  ───────                                                                 │
│  a_t = bid_multiplier ∈ {0.5, 0.75, 1.0, 1.25, 1.5, 2.0}               │
│  actual_bid = a_t × base_bid                                            │
│  base_bid = pCTR × pCVR × target_CPA                                   │
│                                                                          │
│  TRANSITION:                                                             │
│  ───────────                                                             │
│                                                                          │
│  ┌─────────┐    win (p=w(bid))    ┌─────────────────────┐              │
│  │  Bid    │ ──────────────────► │ B_{t+1} = B_t - cost │              │
│  │  a_t    │                      │ reward = value - cost│              │
│  └────┬────┘                      └─────────────────────┘              │
│       │                                                                  │
│       │ lose (p=1-w(bid))                                               │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────┐                                                │
│  │ B_{t+1} = B_t       │                                                │
│  │ reward = 0          │                                                │
│  │ (opportunity lost)  │                                                │
│  └─────────────────────┘                                                │
│                                                                          │
│  OBJECTIVE:                                                              │
│  ──────────                                                              │
│  Maximize: E[Σ γ^t R_t] subject to Σ cost_t ≤ B (budget constraint)    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Value Function and Q-Learning:

State-value function:

V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid s_t = s\right]

Action-value function:

Q^\pi(s, a) = \mathbb{E}_\pi\left[R_t + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\right]

Bellman optimality equation:

Q^*(s, a) = \mathbb{E}\left[R + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]

Deep Q-Network (DQN) for bidding:

Approximate Q^*(s, a) with a neural network Q_\theta(s, a):

\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]

where \theta^- denotes the target network parameters (updated periodically for stability).
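For intuition, here is a tabular stand-in for the DQN: Q-learning over coarse (budget, time) buckets and the discrete bid multipliers from the MDP diagram. Bucket counts and multiplier values are illustrative:

```python
import numpy as np

class BidQLearner:
    """Tabular Q-learning over (budget bucket, time bucket) states and
    discrete bid-multiplier actions -- a toy stand-in for a DQN."""

    def __init__(self, n_budget=10, n_time=10,
                 multipliers=(0.5, 0.75, 1.0, 1.25, 1.5),
                 alpha=0.1, gamma=0.99, eps=0.1):
        self.q = np.zeros((n_budget, n_time, len(multipliers)))
        self.multipliers = multipliers
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state, rng):
        """Epsilon-greedy action selection; returns an index into multipliers."""
        if rng.random() < self.eps:
            return int(rng.integers(len(self.multipliers)))
        return int(np.argmax(self.q[state]))

    def update(self, state, action, reward, next_state, done):
        """One Bellman backup: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
        target = reward if done else reward + self.gamma * self.q[next_state].max()
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

A DQN replaces the table with Q_\theta and the backup with the squared-error loss above, but the Bellman target r + \gamma \max_{a'} Q(s', a') is the same.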

Policy Gradient Methods:

For continuous bid spaces, policy gradient methods work better than Q-learning.

Policy parametrization:

\pi_\theta(a | s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)

Policy gradient theorem:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s, a)\right]

Actor-Critic for bidding:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ACTOR-CRITIC BIDDING AGENT                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         ┌─────────────────┐                             │
│                         │      State      │                             │
│                         │   (features)    │                             │
│                         └────────┬────────┘                             │
│                                  │                                       │
│                    ┌─────────────┴─────────────┐                        │
│                    │                           │                         │
│                    ▼                           ▼                         │
│           ┌────────────────┐         ┌────────────────┐                 │
│           │     ACTOR      │         │    CRITIC      │                 │
│           │   π_θ(a|s)     │         │    V_φ(s)      │                 │
│           │               │         │                │                 │
│           │  Policy Net    │         │  Value Net     │                 │
│           └───────┬────────┘         └───────┬────────┘                 │
│                   │                          │                           │
│                   ▼                          ▼                           │
│              ┌─────────┐              ┌───────────┐                      │
│              │  Bid    │              │ Baseline  │                      │
│              │ Action  │              │  V(s)     │                      │
│              └────┬────┘              └─────┬─────┘                      │
│                   │                         │                            │
│                   ▼                         │                            │
│              ┌─────────┐                    │                            │
│              │ Auction │                    │                            │
│              │ Result  │                    │                            │
│              └────┬────┘                    │                            │
│                   │                         │                            │
│                   ▼                         ▼                            │
│              ┌─────────────────────────────────┐                        │
│              │  Advantage: A = R + γV(s') - V(s)│                       │
│              └─────────────────────────────────┘                        │
│                              │                                           │
│              ┌───────────────┴───────────────┐                          │
│              │                               │                           │
│              ▼                               ▼                           │
│    Actor update:                   Critic update:                       │
│    θ ← θ + α∇log π(a|s)·A         φ ← φ - β∇(V(s) - target)²          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
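The update rules in the diagram can be sketched in miniature. Below is a toy one-step advantage actor-critic update for a Gaussian bid policy with a linear mean and fixed variance; the hyperparameters and the linear parametrization are illustrative assumptions (production bidders use neural networks and batched training):

```python
import random

# Toy advantage actor-critic step for a Gaussian bid policy (illustrative
# sketch: linear actor mean, linear critic, fixed policy std-dev).
SIGMA = 0.5                 # fixed policy standard deviation (assumed)
ALPHA, BETA = 0.01, 0.05    # actor / critic learning rates (assumed)
GAMMA = 0.95                # discount factor

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sample_bid(theta, s, rng):
    """Actor: bid ~ N(theta . s, SIGMA^2)."""
    return rng.gauss(dot(theta, s), SIGMA)

def update(theta, phi, s, bid, reward, s_next):
    """One step: advantage A = R + gamma*V(s') - V(s), then both updates."""
    advantage = reward + GAMMA * dot(phi, s_next) - dot(phi, s)
    # grad_mu log N(bid | mu, sigma^2) = (bid - mu) / sigma^2; mu = theta . s,
    # so the gradient w.r.t. theta_i picks up a factor s_i (chain rule).
    coef = (bid - dot(theta, s)) / SIGMA ** 2
    theta = [t + ALPHA * coef * si * advantage for t, si in zip(theta, s)]
    # Critic: gradient step reducing (V(s) - target)^2
    phi = [p + BETA * advantage * si for p, si in zip(phi, s)]
    return theta, phi

# One simulated auction round
rng = random.Random(0)
theta, phi = [0.0, 0.0], [0.0, 0.0]
s, s_next = [1.0, 0.3], [1.0, 0.1]
bid = sample_bid(theta, s, rng)
reward = 1.0 if bid > 0.2 else 0.0   # stand-in for the auction outcome
theta, phi = update(theta, phi, s, bid, reward, s_next)
```

Note the critic appears only through the advantage; it never chooses bids, it just reduces the variance of the actor's gradient.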

Handling Budget Constraints (Constrained RL):

Budget constraints make this a Constrained MDP (CMDP):

\max_\pi \mathbb{E}\left[\sum_t R_t\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t c_t\right] \leq B

Lagrangian relaxation:

\mathcal{L}(\pi, \lambda) = \mathbb{E}\left[\sum_t R_t\right] - \lambda \left(\mathbb{E}\left[\sum_t c_t\right] - B\right)

Solve via dual gradient descent: update \pi to maximize \mathcal{L}, update \lambda to minimize it.

Practical constraint handling:

  1. Penalty shaping: Add -\alpha \cdot \max(0, \text{spend} - \text{budget}) to reward
  2. Budget as state: Include remaining budget in state, learn budget-aware policy
  3. Post-hoc projection: Clip actions to respect constraints
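A minimal sketch of the dual approach: shade bids by 1/(1+λ) and ascend on λ until expected spend meets the budget. The second-price cost model and the shading form below are standard simplifying assumptions, not the only choice:

```python
# Toy dual-gradient budget pacing (assumptions: second-price cost model,
# bids shaded as value / (1 + lambda), fixed per-period value/cost samples).
def pace_bids(values, costs_if_win, budget, eta=0.05, rounds=200):
    """Find a shading multiplier lambda so expected spend meets budget."""
    lam = 0.0
    for _ in range(rounds):
        spend = 0.0
        for v, c in zip(values, costs_if_win):
            bid = v / (1.0 + lam)       # shaded bid
            if bid >= c:                # win when the bid clears the price
                spend += c              # second-price: pay the clearing price
        # Dual ascent on lambda: raise it when overspending, lower otherwise
        lam = max(0.0, lam + eta * (spend - budget) / max(budget, 1e-9))
    return lam
```

When the budget is slack, λ stays at zero and bids equal values (truthful bidding); as the budget tightens, λ grows and bids shade down uniformly.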

Offline RL for Bidding:

Online RL exploration can be costly (bad bids lose money). Offline RL learns from historical auction logs.

Challenge: Distribution shift—learned policy may choose bids never seen in data.

Conservative Q-Learning (CQL):

\mathcal{L}_{CQL}(\theta) = \mathcal{L}_{Q}(\theta) + \alpha \cdot \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp(Q_\theta(s,a)) - \mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]\right]

The penalty term discourages high Q-values for out-of-distribution actions.

Multi-Agent Considerations:

All bidders are simultaneously learning and adapting, creating a multi-agent RL problem.

Challenges:

  • Non-stationarity: Other bidders' policies change over time
  • Partial observability: Can't see competitors' states or strategies
  • Credit assignment: Win/loss depends on others' bids

Approaches:

  1. Opponent modeling: Estimate competitors' bidding strategies
  2. Robust RL: Optimize for worst-case competitor behavior
  3. Mean-field approximation: Model aggregate competition as a distribution
  4. Regret minimization: Guarantee no-regret against arbitrary competitors

Bid Landscape Forecasting

To optimize bids, we need to understand the competitive landscape: what bids are needed to win at different rates?

Win rate function:

w(b) = P(\text{win auction} \mid \text{bid} = b)

This is typically modeled as:

  • Log-normal: w(b) = \Phi\left(\frac{\log b - \mu}{\sigma}\right)
  • Empirical: Learn from historical auction data

Optimal bidding with win rate:

\text{Expected profit} = w(b) \cdot (\text{value} - \text{cost}(b))

For second-price auctions, \text{cost}(b) \approx \mathbb{E}[\text{second price} \mid \text{win at } b].
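Under the log-normal win-rate model, the profit-maximizing bid can be found by a simple grid search. The 0.8·b cost proxy below is a made-up stand-in for the conditional second-price expectation; μ and σ would come from fitted auction logs:

```python
import math

# Bid landscape sketch: log-normal win-rate curve plus grid-search bidding.
def win_rate(b, mu=0.0, sigma=1.0):
    """w(b) = Phi((log b - mu) / sigma), the log-normal win-rate model."""
    z = (math.log(b) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF

def expected_profit(b, value, mu=0.0, sigma=1.0):
    # Crude cost proxy: assume expected payment ~ 0.8 * bid (illustrative)
    return win_rate(b, mu, sigma) * (value - 0.8 * b)

def best_bid(value, grid=None):
    """Maximize expected profit over a bid grid."""
    grid = grid or [0.05 * i for i in range(1, 100)]
    return max(grid, key=lambda b: expected_profit(b, value))
```

Higher bids win more often but earn less per win; the argmax sits where the marginal win probability stops paying for the marginal cost.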


Part VIII: Production Considerations

Latency Requirements

Ad systems have extreme latency requirements:

| Component | Budget |
|---|---|
| Total end-to-end | <100ms |
| Feature retrieval | <10ms |
| Model inference | <10ms |
| Ranking logic | <5ms |
| Network overhead | ~50ms |

Techniques for low-latency inference:

  1. Model distillation: Train small "student" model to mimic large "teacher"
  2. Quantization: INT8 or even INT4 inference
  3. Pruning: Remove unimportant weights
  4. Caching: Precompute user/item embeddings
  5. Cascade ranking: Cheap model filters 10K→100 candidates, expensive model ranks 100
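The cascade idea in point 5 fits in a few lines. This is a sketch; `cheap_score` and `expensive_score` stand in for the two models:

```python
# Two-stage cascade ranking: a cheap score prunes the candidate set,
# an expensive score ranks only the survivors.
def cascade_rank(candidates, cheap_score, expensive_score, keep=100, top=10):
    # Stage 1: cheap model over the full candidate set (e.g. 10K ads)
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    # Stage 2: expensive model only on the shortlist
    return sorted(shortlist, key=expensive_score, reverse=True)[:top]
```

The expensive model's cost is paid on `keep` items instead of the full set, which is what makes the <10ms inference budget feasible.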

Feature Store Architecture

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE STORE FOR AD SERVING                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OFFLINE PIPELINE (batch):                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Raw Data → Feature Engineering → Feature Store (offline)               │
│  (Spark/Flink)    (daily/hourly)     (Hive, S3)                         │
│                                                                          │
│  Features: User historical stats, item aggregates, long-term behavior   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ONLINE PIPELINE (real-time):                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  Events → Stream Processing → Feature Store (online)                    │
│  (Kafka)     (Flink)            (Redis, DynamoDB)                       │
│                                                                          │
│  Features: Real-time counts, recent clicks, session features            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SERVING PATH:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Bid Request → Feature Retrieval → Model Inference → Bid Response       │
│                     │                                                    │
│           ┌─────────┴─────────┐                                         │
│           ▼                   ▼                                          │
│      Online Store        Offline Store                                  │
│      (Redis: <1ms)       (preloaded cache)                              │
│                                                                          │
│  FEATURE FRESHNESS REQUIREMENTS:                                         │
│  ────────────────────────────────                                        │
│                                                                          │
│  • User embedding: Updated daily (offline OK)                           │
│  • User recent clicks: Updated real-time (online required)              │
│  • Ad historical CTR: Updated hourly (near-line)                        │
│  • Context features: Computed at request time                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
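The serving path above can be sketched with plain dicts standing in for the online store (Redis) and the preloaded offline cache; all keys and values are made up for illustration:

```python
# Feature retrieval sketch: merge offline-cached and online features,
# with online values taking precedence (they are fresher).
ONLINE_STORE = {"user:42:recent_clicks": 3}                     # Redis stand-in
OFFLINE_CACHE = {"user:42:embedding_ctr": 0.021,                # batch features
                 "ad:7:hist_ctr": 0.015}

def get_features(user_id, ad_id, context):
    feats = dict(context)  # request-time context features
    for store in (OFFLINE_CACHE, ONLINE_STORE):  # online overrides offline
        for key, val in store.items():
            if key.startswith(f"user:{user_id}:") or key.startswith(f"ad:{ad_id}:"):
                feats[key.split(":", 2)[2]] = val
    return feats
```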

Model Update and A/B Testing

Continuous training pipeline:

  1. Daily retraining: Incorporate yesterday's clicks/conversions
  2. Incremental updates: Online learning on streaming data
  3. Shadow deployment: New model runs alongside production, compare metrics
  4. Gradual rollout: 1% → 5% → 20% → 50% → 100% traffic

A/B testing considerations:

  • Interference: Users in treatment may affect control (competition for same inventory)
  • Delayed conversions: Need to wait days/weeks for full conversion data
  • Novelty effects: New models may appear better initially due to exploration
  • Metric selection: CTR? Revenue? Long-term user satisfaction?

Part IX: Advanced Topics

Delayed Feedback Modeling

Conversions often happen hours or days after clicks. How do we train when labels are incomplete?

Approaches:

  1. Attribution window: Only count conversions within X days of click
  2. Importance weighting: Weight older samples higher (more complete labels)
  3. Elapsed-time model: P(\text{conversion}) = f(\text{features}) \times g(\text{elapsed time})

Delayed feedback model (Chapelle, 2014):

P(\text{observed conversion by time } t) = P(\text{conversion}) \times P(\text{delay} \leq t)

Model both the conversion probability and delay distribution.
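A sketch of this factorization with an assumed exponential delay distribution. The second function gives the mixture likelihood of a not-yet-converted example (either it never converts, or it converted but the delay exceeds the elapsed time), which is what the training loss actually needs:

```python
import math

# Delayed-feedback factorization sketch with an exponential delay
# assumption: P(delay <= t) = 1 - exp(-lambda_d * t).
def p_observed(p_conv, lambda_d, t):
    """P(conversion observed by elapsed time t)."""
    return p_conv * (1.0 - math.exp(-lambda_d * t))

def p_not_yet_observed(p_conv, lambda_d, t):
    """Likelihood of a still-unlabeled example at elapsed time t:
    never converts, OR converts with delay greater than t."""
    return (1.0 - p_conv) + p_conv * math.exp(-lambda_d * t)
```

In training, positives contribute log p_observed and unlabeled examples contribute log p_not_yet_observed, so recent clicks are not wrongly treated as hard negatives.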

Fraud Detection

Click fraud costs advertisers billions annually. Detection approaches:

  1. Anomaly detection: Unusual click patterns, timing, sources
  2. Behavioral modeling: Bots have different behavior than humans
  3. IP/device fingerprinting: Identify fraudulent traffic sources
  4. Conversion modeling: Fraudulent clicks rarely convert

Privacy-Preserving Advertising

With increasing privacy regulations (GDPR, CCPA) and deprecation of third-party cookies:

  1. Federated learning: Train models without centralizing user data
  2. Differential privacy: Add noise to prevent individual identification
  3. On-device prediction: Run models locally on user devices
  4. Cohort-based targeting: Target groups, not individuals (Google's Topics API)

Part X: Generative AI and LLM-Powered Advertising

The emergence of Large Language Models is transforming advertising beyond traditional CTR prediction. GenAI impacts every stage of the advertising pipeline: creative generation, audience understanding, personalization, and optimization.

The GenAI Advertising Stack

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    GENAI IN ADVERTISING PIPELINE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL PIPELINE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Advertiser → Fixed Creative → Targeting Rules → CTR Model → User       │
│               (one ad copy)    (demographics)    (predict)              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENAI-ENHANCED PIPELINE:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  Advertiser → LLM Creative   → Semantic         → Neural      → User    │
│               Generation       Audience           Ranking               │
│               (1000s of        Understanding      + LLM                 │
│               variations)      (intent, context)  Personalization       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENAI TOUCHPOINTS:                                                      │
│                                                                          │
│  1. CREATIVE GENERATION                                                  │
│     • Ad copy variations                                                │
│     • Headline optimization                                             │
│     • Image generation (DALL-E, Midjourney)                             │
│     • Video script generation                                           │
│                                                                          │
│  2. AUDIENCE UNDERSTANDING                                               │
│     • Intent classification from search queries                         │
│     • Semantic user profiling                                           │
│     • Contextual page understanding                                     │
│     • Conversation-based preference elicitation                         │
│                                                                          │
│  3. PERSONALIZATION                                                      │
│     • Dynamic ad copy adaptation                                        │
│     • Real-time message tailoring                                       │
│     • Conversational ad experiences                                     │
│                                                                          │
│  4. OPTIMIZATION                                                         │
│     • LLM-as-judge for ad quality                                       │
│     • Automated A/B test analysis                                       │
│     • Campaign strategy recommendations                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

LLM-Powered Creative Generation

Traditional advertising requires human copywriters to create ad variations. LLMs can generate thousands of variations automatically, enabling true personalization at scale.

The Creative Generation Problem:

Given:

  • Product/service description
  • Brand guidelines and tone
  • Target audience segment
  • Platform constraints (character limits, format)

Generate: Optimized ad copy that maximizes engagement

Multi-Armed Bandit for Creative Selection:

With LLM-generated variations (potentially thousands), we need efficient exploration to find winners without wasting budget on poor performers. The Upper Confidence Bound (UCB) algorithm balances exploitation (showing best-performing creatives) with exploration (testing uncertain ones):

A_t = \arg\max_a \left[\hat{\mu}_a + c\sqrt{\frac{\ln t}{N_a}}\right]

where:

  • \hat{\mu}_a = estimated CTR for creative a (exploitation term)
  • N_a = number of times creative a has been shown
  • c = exploration constant (typically 1.0-2.0)
  • \sqrt{\frac{\ln t}{N_a}} = exploration bonus (decreases as we test creative a more)

Intuition: The exploration bonus is large when N_a is small (we haven't tested this creative much, so we're uncertain). As we show the creative more, the bonus shrinks and the algorithm relies more on observed performance. This prevents premature convergence to locally optimal creatives while still exploiting known winners.

Practical considerations:

  • Cold start: New LLM-generated creatives start with high exploration bonus
  • Batch updates: In practice, update \hat{\mu}_a periodically (hourly/daily) rather than per-impression
  • Contextual bandits: Extend to \hat{\mu}_a(x) where x is user context—different creatives may perform better for different users
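The UCB rule above in runnable form (pure Python; the per-creative CTRs are inflated well beyond realistic values so the simulation separates the arms quickly):

```python
import math
import random

# UCB creative selection sketch with simulated clicks.
def ucb_pick(clicks, shows, t, c=1.5):
    """Return the creative index maximizing mu_hat + c*sqrt(ln t / N)."""
    best, best_score = 0, float("-inf")
    for a in range(len(shows)):
        if shows[a] == 0:
            return a                      # untested arm: always explore first
        score = clicks[a] / shows[a] + c * math.sqrt(math.log(t) / shows[a])
        if score > best_score:
            best, best_score = a, score
    return best

rng = random.Random(1)
true_ctr = [0.2, 0.5, 0.3]   # made-up "true" CTRs, exaggerated for the demo
clicks, shows = [0, 0, 0], [0, 0, 0]
for t in range(1, 5001):
    a = ucb_pick(clicks, shows, t)
    shows[a] += 1
    clicks[a] += 1 if rng.random() < true_ctr[a] else 0
```

After a few thousand impressions, the best creative (index 1) dominates the impression counts while the others retain a small exploration share.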
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM CREATIVE GENERATION PIPELINE                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT:                                                                  │
│  ──────                                                                  │
│  Product: "Running shoes with carbon fiber plate"                       │
│  Brand: Nike                                                            │
│  Audience: Competitive runners, 25-40                                   │
│  Platform: Google Search (30 char headline, 90 char description)        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LLM GENERATION (with constraints):                                      │
│  ──────────────────────────────────                                      │
│                                                                          │
│  Variation 1 (Performance focus):                                       │
│    Headline: "Shave Minutes Off Your PR"                                │
│    Description: "Carbon-plated running shoes engineered for speed.     │
│                  Free shipping on orders over $100."                    │
│                                                                          │
│  Variation 2 (Technology focus):                                        │
│    Headline: "Carbon Fiber Technology"                                  │
│    Description: "Experience the same tech as Olympic marathoners.      │
│                  Shop the new collection today."                        │
│                                                                          │
│  Variation 3 (Social proof):                                            │
│    Headline: "Worn by World Champions"                                  │
│    Description: "Join 100,000+ runners who improved their times.       │
│                  Rated 4.9 stars by elite athletes."                    │
│                                                                          │
│  Variation 4 (Urgency):                                                 │
│    Headline: "Limited Edition Colors"                                   │
│    Description: "Race-day ready carbon shoes. Only 500 pairs left.     │
│                  Order now for guaranteed delivery."                    │
│                                                                          │
│  ... (100s more variations)                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SELECTION:                                                              │
│  ──────────                                                              │
│                                                                          │
│  1. LLM-as-Judge filters low-quality/off-brand variations              │
│  2. Multi-armed bandit explores promising variations                    │
│  3. CTR model provides exploitation signal                              │
│  4. Best performers get more budget                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Quality Control with LLM-as-Judge:

Not all generated creatives are good. Use a judge model to filter before deployment:

\text{Quality}(c) = \text{LLM}_{\text{judge}}(c, \text{brand\_guidelines}, \text{policy})

Multi-dimensional scoring (each 1-5 scale):

| Dimension | What It Measures | Failure Examples |
|---|---|---|
| Brand alignment | Tone, voice, values match brand | Luxury brand with casual slang |
| Policy compliance | No prohibited claims | Unsubstantiated health claims |
| Clarity | Message is understandable | Confusing or ambiguous copy |
| Persuasiveness | Compelling call-to-action | Weak or missing CTA |
| Grammar | Correct language | Typos, awkward phrasing |
| Factual accuracy | Claims are true | Wrong prices, features |

Composite score:

\text{Score}_{\text{final}} = \min(\text{Score}_{\text{policy}}, \text{Score}_{\text{factual}}) \times \text{mean}(\text{other scores})

The \min ensures policy/factual violations are hard blockers regardless of other qualities.
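Once the per-dimension judge scores are collected, the composite is a one-liner (the key names below are illustrative):

```python
# Composite creative-quality score: policy and factual scores act as a
# hard gate via min(); the remaining dimensions are averaged.
def composite_score(scores):
    """scores: dict mapping dimension name -> 1-5 judge rating."""
    gate = min(scores["policy"], scores["factual"])
    soft = [v for k, v in scores.items() if k not in ("policy", "factual")]
    return gate * sum(soft) / len(soft)
```

A creative with perfect brand and clarity scores but a policy score of 1 still lands near the bottom of the ranking, which is the intended behavior.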

Implementation options:

  1. Prompt-based: Describe scoring criteria in prompt, ask LLM to rate
  2. Fine-tuned judge: Train classifier on human-labeled creative quality data
  3. Ensemble: Multiple judges vote, require consensus for approval
  4. Hybrid: LLM pre-filter + human review for borderline cases

Threshold tuning:

  • High threshold (4.5+): Fewer creatives pass, higher average quality, less variety
  • Low threshold (3.5+): More creatives pass, more variety, some quality risk
  • Adaptive threshold: Start high for new campaigns, lower as you build trust

Semantic Audience Understanding

Traditional targeting uses demographic segments (age, gender, location). LLMs enable semantic targeting based on intent and context.

Intent Understanding from Search Queries:

\text{Intent}(q) = \text{LLM}_{\text{classifier}}(q) \in \{\text{informational}, \text{navigational}, \text{transactional}, \text{commercial}\}

Beyond simple classification, LLMs extract nuanced intent:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC INTENT EXTRACTION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Query: "best running shoes for marathon under $200"                    │
│                                                                          │
│  TRADITIONAL KEYWORD MATCHING:                                           │
│  ─────────────────────────────                                           │
│  Keywords: [running, shoes, marathon, $200]                             │
│  Match ads containing these keywords                                    │
│                                                                          │
│  LLM SEMANTIC UNDERSTANDING:                                             │
│  ───────────────────────────                                             │
│  {                                                                       │
│    "intent": "transactional",                                           │
│    "product_category": "performance_running_shoes",                     │
│    "use_case": "marathon_racing",                                       │
│    "experience_level": "intermediate_to_advanced",                      │
│    "price_sensitivity": "high",                                         │
│    "price_ceiling": 200,                                                │
│    "decision_stage": "comparison_shopping",                             │
│    "implicit_needs": [                                                  │
│      "durability_for_long_distance",                                    │
│      "energy_return",                                                   │
│      "lightweight"                                                      │
│    ],                                                                    │
│    "likely_follow_up_interests": [                                      │
│      "running_socks",                                                   │
│      "hydration_gear",                                                  │
│      "marathon_training_plans"                                          │
│    ]                                                                     │
│  }                                                                       │
│                                                                          │
│  TARGETING IMPLICATIONS:                                                 │
│  ───────────────────────                                                 │
│  • Show mid-tier shoes (not budget, not premium)                        │
│  • Emphasize marathon-specific features                                 │
│  • Highlight value proposition (quality at price point)                 │
│  • Cross-sell complementary gear                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

User Embedding from Behavior + LLM:

Combine traditional collaborative filtering embeddings with LLM-derived semantic embeddings:

\mathbf{u}_{\text{combined}} = \alpha \cdot \mathbf{u}_{\text{CF}} + (1-\alpha) \cdot \mathbf{u}_{\text{LLM}}

where:

  • \mathbf{u}_{\text{CF}} = embedding from click/purchase history (DIN/DIEN style)
  • \mathbf{u}_{\text{LLM}} = embedding from LLM understanding of user's content consumption
  • \alpha = blending weight (typically 0.3-0.7, tuned via validation)

When to use different \alpha values:

| Scenario | Recommended \alpha | Rationale |
|---|---|---|
| User has rich click history | 0.7-0.8 | Trust behavioral signals |
| New/cold-start user | 0.2-0.3 | Rely on semantic understanding |
| High-intent queries | 0.4-0.5 | Balance both signals |
| Content-heavy domains (news, articles) | 0.3-0.4 | LLM understands content better |

Implementation approaches:

  1. Late fusion: Compute \mathbf{u}_{\text{CF}} and \mathbf{u}_{\text{LLM}} separately, combine at serving time
  2. Early fusion: Concatenate behavioral and semantic features, let model learn combination
  3. Learned fusion: Train a small network to predict optimal \alpha per user
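Late fusion reduces to a weighted vector sum. The α heuristic below is a crude stand-in for the table's guidance; the click-count threshold is an assumption:

```python
# Late-fusion sketch: blend a behavioral (CF) and a semantic (LLM) user
# embedding with weight alpha. Plain lists stand in for embedding vectors.
def blend(u_cf, u_llm, alpha):
    """u_combined = alpha * u_cf + (1 - alpha) * u_llm, elementwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(u_cf, u_llm)]

def pick_alpha(num_clicks, threshold=50):
    # Heuristic echoing the table: trust behavior when history is rich,
    # fall back toward the semantic embedding for cold-start users.
    return 0.75 if num_clicks >= threshold else 0.25
```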

Contextual Page Understanding:

Instead of simple page categorization (sports, news, entertainment), LLMs understand page content semantically:

\text{PageContext}(p) = \text{LLM}_{\text{encoder}}(\text{content}(p))

How it works:

  1. Extract page text (title, headings, body, metadata)
  2. Pass through LLM encoder (e.g., fine-tuned BERT, or frozen GPT embeddings)
  3. Resulting embedding captures semantic meaning, not just category

This enables:

  • Brand safety: Understand if content discusses sensitive topics (violence, controversy) even without keyword matches
  • Contextual relevance: Match "marathon training tips" article to running shoe ads even if "shoes" isn't mentioned
  • Sentiment alignment: Place upbeat ads on positive content, avoid juxtaposition issues
  • Topic nuance: Distinguish "Apple (company)" from "apple (fruit)" for targeting

Dynamic Personalization at Serving Time

The most transformative application: personalize ad creative in real-time based on user context.

Personalization Hierarchy:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PERSONALIZATION DEPTH LEVELS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 0: No Personalization                                            │
│  ──────────────────────────────                                          │
│  Same ad shown to everyone                                              │
│  "Buy Nike Running Shoes Today"                                         │
│                                                                          │
│  LEVEL 1: Segment-Based (Traditional)                                   │
│  ─────────────────────────────────────                                   │
│  Different ads per demographic segment                                  │
│  Male 25-34: "Dominate Your Next Race"                                  │
│  Female 25-34: "Run Your Personal Best"                                 │
│                                                                          │
│  LEVEL 2: Behavioral (DIN/DIEN era)                                     │
│  ──────────────────────────────────                                      │
│  Ad selection based on user history                                     │
│  User viewed marathons → Show marathon shoe ads                         │
│  User viewed trails → Show trail shoe ads                               │
│                                                                          │
│  LEVEL 3: LLM Dynamic Personalization                                   │
│  ─────────────────────────────────────                                   │
│  Ad CONTENT adapted in real-time                                        │
│                                                                          │
│  User A (searched "Boston Marathon qualifying times"):                  │
│  "Qualify for Boston: Shoes Trusted by 50,000+ BQ Runners"             │
│                                                                          │
│  User B (searched "couch to 5k beginner"):                              │
│  "Start Your Running Journey: Comfort-First Design for New Runners"    │
│                                                                          │
│  User C (browsing running injury articles):                             │
│  "Run Pain-Free: Engineered Support for Injury Prevention"             │
│                                                                          │
│  Same product, completely different messaging!                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Real-Time Personalization Architecture:

\text{Ad}_{\text{personalized}} = \text{LLM}(\text{template}, \text{user\_context}, \text{product\_info})

But LLM inference is too slow for real-time bidding (<50ms). Solutions:

  1. Pre-computation: Generate top-K personalized variants offline, select at serving time
  2. Template + Slot Filling: LLM generates templates, fast model fills slots
  3. Cached Personas: Pre-compute ads for user personas, map users to personas
  4. Speculative Generation: Generate during page load, before ad request
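The cached-personas option (3) can be sketched as a nearest-persona lookup into a pre-generated creative cache. The persona embeddings, names, and ad copy below are made up for illustration:

```python
# Serving-time sketch: map the user embedding to its nearest persona,
# then look up the pre-generated creative (no LLM in the critical path).
PERSONAS = {"beginner_runner": [1.0, 0.0], "marathon_focused": [0.0, 1.0]}
CREATIVE_CACHE = {
    ("shoe_x", "beginner_runner"): "Start Your Running Journey",
    ("shoe_x", "marathon_focused"): "Race-Day Ready Carbon Shoes",
}

def nearest_persona(user_emb):
    def dist(name):  # squared Euclidean distance is enough for argmin
        return sum((a - b) ** 2 for a, b in zip(user_emb, PERSONAS[name]))
    return min(PERSONAS, key=dist)

def serve_ad(product, user_emb):
    return CREATIVE_CACHE[(product, nearest_persona(user_emb))]
```

Both steps are dictionary lookups plus a small distance computation, which is how the approach stays within a real-time bidding latency budget.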
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REAL-TIME PERSONALIZATION ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OFFLINE (Batch):                                                        │
│  ────────────────                                                        │
│                                                                          │
│  For each (product, persona) pair:                                      │
│    LLM generates N ad variations                                        │
│    Store in Creative Cache                                              │
│                                                                          │
│  Personas: {beginner_runner, competitive_runner, injury_recovery,       │
│             casual_fitness, marathon_focused, trail_enthusiast, ...}    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ONLINE (Real-time):                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. User request arrives                                                │
│           │                                                              │
│           ▼                                                              │
│  2. ┌─────────────────┐                                                 │
│     │ Persona Classifier│  (Fast: embedding similarity, ~1ms)           │
│     │ user → persona    │                                               │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  3. ┌─────────────────┐                                                 │
│     │ Creative Cache   │  (Lookup: product × persona, ~1ms)             │
│     │ Lookup           │                                                │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  4. ┌─────────────────┐                                                 │
│     │ CTR Model        │  (Score personalized creative, ~5ms)           │
│     │ Prediction       │                                                │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  5. Return personalized ad                                              │
│                                                                          │
│  Total latency: <10ms (no LLM inference in critical path!)             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
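The online path above can be sketched in a few lines. This is a toy illustration, not production code: the persona names, embedding values, and cache contents are invented, and a real system would use a trained classifier and a distributed cache rather than in-memory dicts.

```python
import numpy as np

# Hypothetical persona embeddings, precomputed offline (values illustrative).
PERSONA_EMBEDDINGS = {
    "beginner_runner":    np.array([0.9, 0.1, 0.0]),
    "competitive_runner": np.array([0.1, 0.9, 0.2]),
    "trail_enthusiast":   np.array([0.0, 0.3, 0.9]),
}

# Creative cache: (product_id, persona) -> pre-generated ad copy.
CREATIVE_CACHE = {
    ("pegasus41", "beginner_runner"):    "Cushioned comfort for your first miles.",
    ("pegasus41", "competitive_runner"): "Train hard. Recover fast.",
    ("pegasus41", "trail_enthusiast"):   "Grip that goes off-road.",
}

def classify_persona(user_embedding: np.ndarray) -> str:
    """Fast persona assignment via cosine similarity (~1ms in practice)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(PERSONA_EMBEDDINGS, key=lambda p: cos(user_embedding, PERSONA_EMBEDDINGS[p]))

def serve_ad(user_embedding: np.ndarray, product_id: str) -> str:
    """Online path: no LLM call, just classification + cache lookup."""
    persona = classify_persona(user_embedding)
    return CREATIVE_CACHE[(product_id, persona)]

# A user close to the beginner_runner embedding gets the beginner creative.
print(serve_ad(np.array([0.8, 0.2, 0.1]), "pegasus41"))
```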

LLM-Enhanced CTR Prediction

Beyond creative generation, LLMs can directly improve CTR prediction models.

Feature Augmentation with LLM Embeddings:

Traditional CTR models use sparse ID features. Add dense semantic features from LLMs:

\hat{y} = f(\mathbf{x}_{\text{sparse}}, \mathbf{e}_{\text{user}}^{\text{LLM}}, \mathbf{e}_{\text{ad}}^{\text{LLM}}, \mathbf{e}_{\text{context}}^{\text{LLM}})

where \mathbf{e}^{\text{LLM}} are embeddings from an LLM encoder.

Benefits:

  • Cold-start handling: New ads have semantic embeddings even without click history
  • Generalization: Similar products share similar embeddings
  • Cross-domain transfer: Knowledge transfers across ad categories
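A minimal sketch of this feature augmentation, with a logistic head standing in for the deep CTR model f. The ID vocabulary size, embedding dimensions, and toy LLM embeddings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ctr_features(sparse_ids, llm_embeddings, id_vocab_size=1000, id_dim=8):
    """Concatenate sparse-ID embeddings with frozen LLM embeddings.

    sparse_ids: list of integer feature IDs (user_id bucket, ad_id bucket, ...).
    llm_embeddings: dense vectors from an LLM encoder (assumed precomputed).
    """
    # Hypothetical embedding table for sparse IDs (learned in a real model).
    id_table = rng.normal(0, 0.1, size=(id_vocab_size, id_dim))
    id_part = np.concatenate([id_table[i] for i in sparse_ids])
    llm_part = np.concatenate(llm_embeddings)
    return np.concatenate([id_part, llm_part])

def predict_ctr(x, weights, bias=0.0):
    """Logistic regression head standing in for the deep model f(.)."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

x = ctr_features(
    sparse_ids=[42, 7],                       # e.g. user_id bucket, ad_id bucket
    llm_embeddings=[np.ones(4), np.zeros(4)]  # e_user^LLM, e_ad^LLM (toy vectors)
)
w = rng.normal(0, 0.1, size=x.shape)
print(f"predicted CTR: {predict_ctr(x, w):.3f}")
```

The key point the sketch captures: a brand-new ad with no click history still contributes a meaningful dense `e_ad^LLM`, so the model is never scoring from pure cold start.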

Semantic Similarity for Candidate Retrieval:

Use LLM embeddings for initial candidate retrieval:

\text{Candidates} = \text{ANN}(\mathbf{e}_{\text{query}}^{\text{LLM}}, \{\mathbf{e}_{\text{ad}}^{\text{LLM}}\})

Then apply traditional CTR models for final ranking.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM-ENHANCED TWO-TOWER RETRIEVAL                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  QUERY TOWER:                          AD TOWER:                        │
│  ─────────────                          ─────────                        │
│                                                                          │
│  User Query                             Ad Content                       │
│  + User History                         + Ad Metadata                    │
│       │                                      │                           │
│       ▼                                      ▼                           │
│  ┌─────────────┐                       ┌─────────────┐                  │
│  │ LLM Encoder │                       │ LLM Encoder │                  │
│  │ (shared)    │                       │ (shared)    │                  │
│  └──────┬──────┘                       └──────┬──────┘                  │
│         │                                     │                          │
│         ▼                                     ▼                          │
│  ┌─────────────┐                       ┌─────────────┐                  │
│  │ Projection  │                       │ Projection  │                  │
│  │ Layer       │                       │ Layer       │                  │
│  └──────┬──────┘                       └──────┬──────┘                  │
│         │                                     │                          │
│         ▼                                     ▼                          │
│      e_query                               e_ad                          │
│         │                                     │                          │
│         └──────────────┬──────────────────────┘                          │
│                        │                                                 │
│                        ▼                                                 │
│              score = <e_query, e_ad>                                    │
│                        │                                                 │
│                        ▼                                                 │
│              Top-K candidates → CTR ranking                             │
│                                                                          │
│  LLM BENEFITS:                                                           │
│  ─────────────                                                           │
│  • "marathon training" query matches "26.2 mile race shoes" ad          │
│  • "gift for runner dad" matches "men's premium running shoes"          │
│  • Semantic understanding beyond keyword matching                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
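A stripped-down version of the scoring step at the bottom of the diagram. Exact dot-product search replaces the ANN index purely for readability (production systems would use an approximate index such as FAISS or HNSW), and the embeddings are toy values.

```python
import numpy as np

def retrieve_top_k(e_query: np.ndarray, ad_embeddings: np.ndarray, k: int = 2):
    """Exact dot-product retrieval; production swaps in an ANN index."""
    scores = ad_embeddings @ e_query   # <e_query, e_ad> for every ad
    top = np.argsort(-scores)[:k]      # highest-scoring ad indices
    return top.tolist(), scores[top].tolist()

# Toy embeddings: semantically related query/ad pairs have high dot products.
e_query = np.array([1.0, 0.0, 0.5])    # "marathon training"
ads = np.array([
    [0.9, 0.1, 0.5],   # "26.2 mile race shoes"  -> similar direction
    [0.0, 1.0, 0.0],   # "yoga mats"             -> orthogonal
    [0.8, 0.0, 0.6],   # "long-run trainers"     -> similar direction
])
indices, scores = retrieve_top_k(e_query, ads, k=2)
print(indices)  # the two running-related ads: [0, 2]
```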

Conversational Advertising

LLMs enable a new paradigm: conversational ad experiences where users interact with ads through dialogue.

Use Cases:

  1. Product Discovery: "I need shoes for my first marathon. What do you recommend?"
  2. Objection Handling: "Why are these so expensive?" → Explain value proposition
  3. Personalized Recommendations: Multi-turn dialogue to understand needs
  4. Purchase Assistance: Guide through size selection, shipping options

Conversational Ad Architecture:

\text{Response}_t = \text{LLM}(\text{Product\_Info}, \text{Dialogue}_{1:t-1}, \text{User\_Message}_t, \text{Brand\_Guidelines})

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONVERSATIONAL AD EXPERIENCE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Traditional Ad:                                                         │
│  ───────────────                                                         │
│  ┌─────────────────────────────────────────┐                            │
│  │  Nike ZoomX Vaporfly - $250            │                            │
│  │  The fastest marathon shoe ever.       │                            │
│  │  [Shop Now]                            │                            │
│  └─────────────────────────────────────────┘                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Conversational Ad:                                                      │
│  ──────────────────                                                      │
│  ┌─────────────────────────────────────────┐                            │
│  │  Nike Running Assistant                │                            │
│  │                                        │                            │
│  │  User: "Is this shoe good for a       │                            │
│  │         beginner marathon runner?"     │                            │
│  │                                        │                            │
│  │  Nike: "Great question! The Vaporfly │                            │
│  │   is our elite racing shoe, designed  │                            │
│  │   for experienced runners chasing PRs.│                            │
│  │   For your first marathon, I'd        │                            │
│  │   recommend the Pegasus 41 - it's     │                            │
│  │   more cushioned for training miles   │                            │
│  │   and race day comfort. Would you     │                            │
│  │   like to see it?"                    │                            │
│  │                                        │                            │
│  │  [See Pegasus 41] [Tell me more]      │                            │
│  │  [Compare both]                       │                            │
│  └─────────────────────────────────────────┘                            │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│  • Higher engagement (dialogue > static ad)                             │
│  • Better matching (understand actual needs)                            │
│  • Trust building (honest recommendations)                              │
│  • Data collection (explicit preference signals)                        │
│                                                                          │
│  CHALLENGES:                                                             │
│  ───────────                                                             │
│  • Latency (LLM inference per turn)                                     │
│  • Brand safety (LLM may say wrong things)                              │
│  • Cost (compute per conversation)                                      │
│  • Measurement (how to attribute conversions)                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Measurement and Attribution for Conversational Ads:

Conversational ads create new measurement challenges—how do you attribute value across a multi-turn dialogue?

\text{ConversationValue} = \sum_{t=1}^{T} \gamma^{T-t} \cdot r_t

where:

  • r_t = reward signal at turn t (click, add-to-cart, purchase)
  • \gamma = discount factor (earlier turns get less credit)
  • T = total conversation turns
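The discounted sum translates directly to code; the reward values below are illustrative.

```python
def conversation_value(rewards, gamma=0.9):
    """sum_{t=1}^{T} gamma^(T-t) * r_t : later turns get more credit."""
    T = len(rewards)
    return sum(gamma ** (T - t) * r for t, r in enumerate(rewards, start=1))

# Turn rewards: click (0.1), add-to-cart (0.5), purchase (1.0)
rewards = [0.1, 0.5, 1.0]
print(round(conversation_value(rewards, gamma=0.9), 3))  # 0.081 + 0.45 + 1.0 = 1.531
```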

Attribution approaches:

| Model | Description | When to Use |
|-------|-------------|-------------|
| Last-touch | Credit final turn before conversion | Simple, but ignores discovery value |
| First-touch | Credit conversation initiation | Values engagement, ignores persuasion |
| Linear | Equal credit to all turns | Fair, but doesn't capture turn importance |
| Position-based | 40% first, 40% last, 20% middle | Balances discovery and conversion |
| Data-driven | ML model learns credit assignment | Best accuracy, requires volume |
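As a concrete example, the position-based (40/40/20) scheme can be implemented as a per-turn credit vector that sums to 1:

```python
def position_based_credit(num_turns, first=0.4, last=0.4):
    """Position-based attribution: per-turn credit summing to 1."""
    if num_turns == 1:
        return [1.0]
    if num_turns == 2:
        # No middle turns: split all credit between first and last.
        return [first / (first + last), last / (first + last)]
    middle = (1.0 - first - last) / (num_turns - 2)
    return [first] + [middle] * (num_turns - 2) + [last]

# First and last turns get 0.4 each; the remaining 20% splits across middle turns.
print(position_based_credit(5))
```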

Key metrics for conversational ads:

  • Engagement rate: % of users who respond to first message
  • Conversation depth: Average turns per conversation
  • Resolution rate: % of conversations ending in desired action
  • Deflection rate: % of users who abandon mid-conversation
  • Cost per conversation: Total LLM compute / conversations
  • Incremental lift: Conversion rate vs. static ad control group

LLM-Based Campaign Optimization

Beyond individual ads, LLMs can optimize entire campaigns.

Automated A/B Test Analysis:

\text{Insights} = \text{LLM}(\text{Test\_Results}, \text{Campaign\_Context}, \text{Historical\_Learnings})

LLMs can:

  • Identify statistically significant results (accounting for multiple comparisons)
  • Explain WHY certain variants won (not just that they won)
  • Suggest follow-up tests based on observed patterns
  • Detect Simpson's paradox and other statistical pitfalls
  • Identify segment-level winners that differ from overall winners

What makes LLM analysis different from traditional dashboards:

Traditional: "Variant B has 5% higher CTR with p<0.05"

LLM analysis: "Variant B outperformed because its 'limited time' messaging created urgency. However, this effect was concentrated in mobile users during evening hours—desktop users showed no significant difference. Consider: (1) testing urgency messaging specifically for mobile evening campaigns, (2) investigating why desktop users didn't respond (perhaps they need more product details before urgency appeals work)."

Budget Allocation Recommendations:

\text{Allocation} = \text{LLM}(\text{Campaign\_Performance}, \text{Market\_Conditions}, \text{Advertiser\_Goals})

LLMs analyze cross-channel performance and recommend budget shifts. Key capabilities:

  • Diminishing returns detection: "Search is hitting saturation—incremental CPA increasing. Consider shifting 20% to display prospecting."
  • Opportunity identification: "Competitor X reduced spend on 'running shoes' keywords—bid landscape favorable for expansion."
  • Goal alignment: "Current allocation optimizes for clicks, but your stated goal is conversions. Recommend shifting budget from awareness to consideration campaigns."
  • Seasonality anticipation: "Marathon season approaching—recommend 30% budget increase for running category starting week 8."
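Diminishing-returns detection, the first capability above, reduces to tracking marginal CPA across spend levels: when the incremental cost per extra conversion climbs steeply, the channel is saturating. A sketch with made-up numbers:

```python
def marginal_cpa(spend_points, conversion_points):
    """Incremental cost per acquisition between consecutive spend levels."""
    return [
        (s2 - s1) / (c2 - c1) if c2 > c1 else float("inf")
        for (s1, c1), (s2, c2) in zip(
            zip(spend_points, conversion_points),
            zip(spend_points[1:], conversion_points[1:]),
        )
    ]

# Conversions grow quickly at low spend, then saturate: marginal CPA spikes.
spend = [1000, 2000, 3000, 4000]
convs = [100, 180, 220, 230]
print(marginal_cpa(spend, convs))  # [12.5, 25.0, 100.0]
```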

Audience Expansion with LLM Reasoning:

Traditional lookalike audiences use statistical similarity. LLMs add semantic reasoning about WHY an audience works, enabling more thoughtful expansion.

Given a high-performing audience segment, LLM reasons about similar segments:

Code
High-performing segment: "Marathon runners who clicked on nutrition ads"

LLM reasoning:
"This audience responds well because they're health-conscious athletes
focused on performance optimization. Similar audiences might include:
1. Triathletes (similar endurance focus)
2. CrossFit enthusiasts (performance-oriented)
3. Cycling enthusiasts (endurance athletes)
4. Health app power users (quantified-self mindset)

Recommendation: Test expansion to triathlon audiences first,
as they have the closest intent profile."

Personalization Ethics and Guardrails

LLM-powered personalization raises important ethical considerations.

Risks:

  1. Manipulation: Hyper-personalized messaging could exploit psychological vulnerabilities
  2. Filter bubbles: Users only see ads reinforcing existing preferences
  3. Privacy: Deep personalization requires extensive data collection
  4. Deception: AI-generated content may mislead users about what's human vs. machine

Guardrails:

\text{Ad}_{\text{final}} = \text{Filter}(\text{Ad}_{\text{personalized}}, \text{Ethics\_Policy}, \text{Regulations})

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PERSONALIZATION GUARDRAILS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CONTENT FILTERS:                                                        │
│  ────────────────                                                        │
│  • No health claims without substantiation                              │
│  • No urgency manipulation ("Only 1 left!" when false)                  │
│  • No exploitation of negative emotions                                 │
│  • No discrimination based on protected characteristics                 │
│                                                                          │
│  TRANSPARENCY REQUIREMENTS:                                              │
│  ──────────────────────────                                              │
│  • Disclose AI-generated content                                        │
│  • Explain why ad was shown (ad preferences)                            │
│  • Allow users to opt out of personalization                            │
│                                                                          │
│  TECHNICAL CONTROLS:                                                     │
│  ───────────────────                                                     │
│  • LLM output classifiers for harmful content                           │
│  • Human review for new personalization strategies                      │
│  • A/B test ethics review board                                         │
│  • Audit trails for personalization decisions                           │
│                                                                          │
│  REGULATORY COMPLIANCE:                                                  │
│  ──────────────────────                                                  │
│  • GDPR: Data minimization, right to explanation                        │
│  • CCPA: Opt-out rights, disclosure requirements                        │
│  • FTC: Truth in advertising, endorsement disclosure                    │
│  • Industry self-regulation (NAI, DAA)                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
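As a first line of defense, the content filters above can be approximated by a pattern-based classifier. The patterns here are illustrative only; production systems layer ML output classifiers and human review on top of anything this simple.

```python
import re

# Illustrative policy: phrases the content filters would reject.
BANNED_PATTERNS = [
    r"only \d+ left",            # false-scarcity urgency
    r"cures?\b",                 # unsubstantiated health claims
    r"guaranteed weight loss",
]

def passes_guardrails(ad_text: str) -> bool:
    """Reject generated copy that matches any banned pattern (case-insensitive)."""
    return not any(re.search(p, ad_text, re.IGNORECASE) for p in BANNED_PATTERNS)

print(passes_guardrails("Only 1 left! Buy now!"))  # False
```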

The Future: Agentic Advertising

The next frontier: AI agents that autonomously manage advertising campaigns.

Agentic Capabilities:

  1. Autonomous Budget Management: Agent monitors performance and reallocates budget without human intervention
  2. Creative Evolution: Agent generates, tests, and iterates on ad creative continuously
  3. Competitive Response: Agent detects competitor actions and adjusts strategy
  4. Cross-Channel Orchestration: Agent coordinates messaging across search, social, display, email

Architecture:

\text{Action}_t = \text{Agent}(\text{State}_t, \text{Goals}, \text{Constraints})

where State includes: current performance, budget status, market conditions, competitive landscape.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AGENTIC ADVERTISING SYSTEM                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    ┌─────────────────────────┐                          │
│                    │    Advertising Agent    │                          │
│                    │    (LLM + Tools)        │                          │
│                    └───────────┬─────────────┘                          │
│                                │                                         │
│           ┌────────────────────┼────────────────────┐                   │
│           │                    │                    │                    │
│           ▼                    ▼                    ▼                    │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│    │  Creative   │     │   Budget    │     │  Audience   │             │
│    │  Generator  │     │  Optimizer  │     │  Expander   │             │
│    └─────────────┘     └─────────────┘     └─────────────┘             │
│           │                    │                    │                    │
│           ▼                    ▼                    ▼                    │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│    │  Ad        │     │  Bid        │     │  Targeting  │             │
│    │  Platform  │     │  Management │     │  Rules      │             │
│    │  APIs      │     │  APIs       │     │  APIs       │             │
│    └─────────────┘     └─────────────┘     └─────────────┘             │
│                                                                          │
│  AGENT WORKFLOW:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  1. Monitor: Continuously track campaign KPIs                           │
│  2. Analyze: Identify underperforming segments/creatives                │
│  3. Plan: Decide on optimization actions                                │
│  4. Execute: Make changes via platform APIs                             │
│  5. Learn: Update strategy based on results                             │
│                                                                          │
│  HUMAN OVERSIGHT:                                                        │
│  ────────────────                                                        │
│  • Budget limits (agent can't exceed authorized spend)                  │
│  • Approval gates (major strategy changes need human OK)                │
│  • Alert thresholds (unusual patterns trigger human review)             │
│  • Audit logs (all agent actions recorded)                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
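One iteration of the monitor-analyze-plan-execute loop, with an approval gate as a human-oversight guardrail. The campaign fields, thresholds, and reallocation rule are invented for illustration; a real agent would call platform APIs and an LLM planner rather than a hard-coded heuristic.

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    max_daily_spend: float     # agent cannot exceed authorized budget
    approval_threshold: float  # reallocations above this need human OK

def agent_step(state: dict, constraints: Constraints):
    """One monitor -> analyze -> plan -> execute iteration.

    Returns (action, needs_human_approval).
    """
    # Analyze: find the worst- and best-performing campaigns by CPA.
    worst = max(state["campaigns"], key=lambda c: c["cpa"])
    best = min(state["campaigns"], key=lambda c: c["cpa"])

    # Plan: shift 10% of the worst campaign's budget to the best one.
    shift = 0.10 * worst["budget"]
    action = {"from": worst["id"], "to": best["id"], "amount": shift}

    # Guardrail: large reallocations are gated on human approval.
    needs_approval = shift > constraints.approval_threshold
    return action, needs_approval

state = {"campaigns": [
    {"id": "search", "budget": 1000.0, "cpa": 12.0},
    {"id": "display", "budget": 800.0, "cpa": 45.0},
]}
action, gated = agent_step(state, Constraints(max_daily_spend=2000.0,
                                              approval_threshold=50.0))
print(action, gated)  # shifts budget from "display" to "search"; gated since 80 > 50
```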

Summary: The Modern Ad ML Stack

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODERN AD ML ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                        SERVING LAYER                             │    │
│  │  • Low-latency inference (<10ms)                                │    │
│  │  • Model cascade (filter → rank)                                │    │
│  │  • Feature store integration                                    │    │
│  │  • A/B testing framework                                        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                        MODEL LAYER                                │  │
│  │                                                                   │  │
│  │  CTR Model:          CVR Model:         Bid Model:               │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │  │
│  │  │ DeepFM/DCN  │    │ ESMM-style  │    │ Bid Shading │          │  │
│  │  │ + DIN/DIEN  │    │ Multi-task  │    │ + Pacing    │          │  │
│  │  │ behavior    │    │             │    │             │          │  │
│  │  └─────────────┘    └─────────────┘    └─────────────┘          │  │
│  │                                                                   │  │
│  │  Multi-task Framework: PLE / MMOE                                │  │
│  │  Calibration: Platt scaling / Isotonic regression                │  │
│  │  Position bias: PAL / Propensity weighting                       │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                       FEATURE LAYER                               │  │
│  │                                                                   │  │
│  │  Online Store (Redis):        Offline Store (Hive):              │  │
│  │  • Real-time counts           • User embeddings                  │  │
│  │  • Session features           • Historical aggregates            │  │
│  │  • Recent behaviors           • Item statistics                  │  │
│  │                                                                   │  │
│  │  Feature Engineering: Categorical encoding, crosses, embeddings  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                        DATA LAYER                                 │  │
│  │                                                                   │  │
│  │  • Click/impression logs                                         │  │
│  │  • Conversion tracking (with delayed attribution)                │  │
│  │  • User behavior sequences                                       │  │
│  │  • Fraud detection signals                                       │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
