
Machine Learning for Advertising: CTR Prediction, Ad Ranking, and Bidding Systems

Comprehensive guide to ML systems powering digital advertising. From logistic regression to deep CTR models, user behavior sequences to multi-task learning, and real-time bidding optimization—understand the algorithms behind the $600B+ ad industry.


The Scale of Advertising ML

Digital advertising is a $600+ billion industry, and machine learning is its backbone. Every time you see an ad online, dozens of ML models have executed in milliseconds: predicting whether you'll click, estimating conversion probability, optimizing bids, and ranking thousands of candidate ads.

This isn't just recommendation systems with a different name. Advertising ML has unique challenges:

┌─────────────────────────────────────────────────────────────────────────┐
│           ADVERTISING ML vs GENERAL RECOMMENDATION                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GENERAL RECSYS:                                                         │
│  ───────────────                                                         │
│  Goal: Maximize user engagement/satisfaction                            │
│  Items: Products, videos, songs (relatively stable catalog)             │
│  Feedback: Implicit (views, clicks) or explicit (ratings)               │
│  Latency: 100-500ms acceptable                                          │
│  Stakes: Poor recommendations → user leaves                             │
│                                                                          │
│  ADVERTISING ML:                                                         │
│  ───────────────                                                         │
│  Goal: Maximize revenue while maintaining user experience               │
│  Items: Ads (constantly changing, millions of advertisers)              │
│  Feedback: Sparse (CTR ~1-3%), delayed conversions                      │
│  Latency: <10-50ms required (real-time bidding)                         │
│  Stakes: Wrong predictions → lose money (pay per impression/click)      │
│                                                                          │
│  UNIQUE CHALLENGES:                                                      │
│  ─────────────────                                                       │
│  • Feature sparsity: Billions of feature combinations                   │
│  • Class imbalance: 97-99% negative examples                            │
│  • Multi-stakeholder: Users, advertisers, platform                      │
│  • Position bias: Top positions get more clicks regardless of relevance │
│  • Delayed feedback: Conversions may happen days later                  │
│  • Adversarial dynamics: Click fraud, bid manipulation                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

This post covers the complete advertising ML stack, from foundational CTR prediction to advanced user modeling and real-time bidding optimization.


Part I: Foundations of CTR Prediction

The CTR Prediction Problem

Click-Through Rate (CTR) prediction is the cornerstone of advertising ML. Given a user $u$, an ad $a$, and a context $c$ (time, device, page), predict the probability that the user will click:

$$P(\text{click} = 1 \mid u, a, c)$$

This probability directly determines:

  • Ad ranking: Higher predicted CTR → higher position
  • Bid optimization: Expected value $= P(\text{click}) \times \text{bid}$
  • Revenue: Platform typically charges per click (CPC) or per impression (CPM)
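To make the ranking role of pCTR concrete, here is a minimal sketch that orders hypothetical CPC candidates by expected value per impression; the ad IDs, predicted CTRs, and bids are all invented numbers:

```python
# Hypothetical candidates: (ad_id, predicted CTR, CPC bid in dollars).
candidates = [
    ("ad_a", 0.031, 0.50),
    ("ad_b", 0.012, 1.40),
    ("ad_c", 0.045, 0.30),
]

def expected_value(p_click: float, bid: float) -> float:
    """Expected revenue per impression under CPC pricing: P(click) * bid."""
    return p_click * bid

# Rank by expected value, highest first.
ranked = sorted(candidates, key=lambda ad: expected_value(ad[1], ad[2]),
                reverse=True)
# The highest bid does not automatically win: pCTR and bid trade off.
```

Note that ad_c has the highest pCTR and ad_b the highest bid, yet the expected-value ordering can differ from both; real ranking functions also fold in quality scores, as the diagram below shows.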
┌─────────────────────────────────────────────────────────────────────────┐
│                    CTR PREDICTION IN THE AD STACK                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User visits page                                                        │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────┐    Candidate ads     ┌─────────────────────┐           │
│  │   Ad       │ ─────────────────────►│   CTR Prediction    │           │
│  │  Request   │    (1000s of ads)     │      Model          │           │
│  └─────────────┘                      └──────────┬──────────┘           │
│                                                   │                      │
│                                         P(click) for each ad             │
│                                                   │                      │
│                                                   ▼                      │
│                                       ┌─────────────────────┐           │
│                                       │   Ranking Function   │           │
│                                       │                      │           │
│                                       │  Score = f(pCTR,     │           │
│                                       │           bid,       │           │
│                                       │           quality)   │           │
│                                       └──────────┬──────────┘           │
│                                                   │                      │
│                                           Top K ads                      │
│                                                   │                      │
│                                                   ▼                      │
│                                       ┌─────────────────────┐           │
│                                       │    Ad Displayed     │           │
│                                       └─────────────────────┘           │
│                                                                          │
│  The entire pipeline must complete in <50ms                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Feature Engineering: The Foundation

Before diving into models, understand that advertising ML is fundamentally about feature interactions. A user who is "male, age 25-34, interested in sports" seeing an ad for "Nike running shoes" on a "sports news website" at "7pm on weekday" has a very different click probability than any individual feature would suggest.

The features in CTR prediction are typically:

┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE CATEGORIES IN CTR PREDICTION                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER FEATURES:                                                          │
│  ──────────────                                                          │
│  • Demographics: age, gender, location, language                        │
│  • Behavioral: past clicks, purchases, browsing history                 │
│  • Contextual: device, OS, browser, time of day                         │
│  • Aggregated: click rate on category, avg session duration             │
│                                                                          │
│  AD FEATURES:                                                            │
│  ────────────                                                            │
│  • Creative: ad ID, advertiser ID, campaign ID                          │
│  • Content: category, keywords, landing page domain                     │
│  • Historical: ad CTR, conversion rate, quality score                   │
│  • Bid: bid amount, budget remaining, campaign age                      │
│                                                                          │
│  CONTEXT FEATURES:                                                       │
│  ─────────────────                                                       │
│  • Publisher: site ID, page category, content keywords                  │
│  • Position: ad slot, above/below fold                                  │
│  • Temporal: hour, day of week, season, holidays                        │
│  • Request: referrer, search query (if search ads)                      │
│                                                                          │
│  CROSS FEATURES (manually engineered):                                   │
│  ─────────────────────────────────────                                   │
│  • user_gender × ad_category                                            │
│  • user_age × hour_of_day                                               │
│  • device_type × ad_format                                              │
│  • user_interest × ad_keyword                                           │
│                                                                          │
│  SCALE: Typically 10^6 to 10^9 sparse features after one-hot encoding  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The key insight: most features are categorical and high-cardinality. User ID alone might have billions of values. When one-hot encoded, the feature vector becomes extremely sparse but extremely high-dimensional.
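A standard way to cope with this dimensionality is the hashing trick: instead of materializing a billion-dimensional one-hot vector, each `field=value` string is hashed into a fixed number of buckets, and only the active indices are stored. A minimal sketch, with an illustrative bucket count and invented feature names (production systems typically use a fast non-cryptographic hash such as MurmurHash rather than MD5):

```python
import hashlib

def bucket(feature: str, n_buckets: int = 2 ** 20) -> int:
    """Deterministically hash a 'field=value' string into one of n_buckets slots."""
    digest = hashlib.md5(feature.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical ad request: four categorical features become four active
# indices in an implicit one-hot vector of ~10^6 dimensions.
request = {
    "user_gender": "male",
    "user_age": "25-34",
    "ad_category": "sports",
    "device": "mobile",
}
active = sorted(bucket(f"{field}={value}") for field, value in request.items())
```

Collisions are accepted as noise: with $2^{20}$ buckets and a handful of active features per request, they are rare enough that model quality is barely affected, while the parameter table stays bounded.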


Part II: Evolution of CTR Models

Stage 1: Logistic Regression (The Baseline)

The journey begins with logistic regression, still used in production at many companies for its simplicity and interpretability.

Model formulation:

$$\hat{y} = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i\right)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, $x_i$ are features, and $w_i$ are learned weights.

Loss function (binary cross-entropy):

$$\mathcal{L} = -\frac{1}{N}\sum_{j=1}^{N}\left[y_j \log(\hat{y}_j) + (1-y_j)\log(1-\hat{y}_j)\right]$$

┌─────────────────────────────────────────────────────────────────────────┐
│                    LOGISTIC REGRESSION FOR CTR                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Input: Sparse feature vector x ∈ {0,1}^n (one-hot encoded)             │
│                                                                          │
│  Example (simplified):                                                   │
│  ─────────────────────                                                   │
│  user_gender=male:     [1, 0]           (2 dims)                        │
│  user_age=25-34:       [0, 0, 1, 0, 0]  (5 dims)                        │
│  ad_category=sports:   [0, 0, 0, 1, 0]  (5 dims)                        │
│  device=mobile:        [0, 1]           (2 dims)                        │
│                                                                          │
│  Concatenated: x = [1,0,0,0,1,0,0,0,0,0,1,0,0,1]                        │
│                                                                          │
│  Prediction:                                                             │
│  ───────────                                                             │
│  z = w₀ + w₁·1 + w₃·1 + w₁₁·1 + w₁₄·1                                  │
│    = bias + w_male + w_age25-34 + w_sports + w_mobile                   │
│                                                                          │
│  ŷ = σ(z) = P(click)                                                    │
│                                                                          │
│  LIMITATION: Only captures first-order effects                          │
│  ───────────                                                             │
│  Cannot model: "males interested in sports click more on sports ads"    │
│  This requires explicit feature crosses: x_male × x_sports              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why LR works despite its simplicity:

  1. Interpretability: Weight $w_i$ directly shows feature importance
  2. Scalability: Can train on billions of examples with SGD
  3. Sparsity: Most weights are zero (L1 regularization)
  4. Online learning: Weights can be updated incrementally

Why LR is insufficient:

  • Requires manual feature engineering for interactions
  • Cannot learn non-linear patterns
  • Feature crosses explode combinatorially: $O(n^2)$ for pairwise, $O(n^k)$ for $k$-way
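A minimal numpy sketch of online LR training on sparse one-hot features. The setup is a toy: a single invented feature pattern (arbitrary indices 7 and 42) that clicks roughly 80% of the time, so the model's prediction for that pattern should settle near 0.8:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, active, y, lr=0.05):
    """One online SGD update on a sparse example.
    active: indices of the one-hot features that are 1 for this impression."""
    z = b + w[active].sum()      # dot product with a one-hot vector
    p = sigmoid(z)               # predicted click probability
    g = p - y                    # dL/dz for binary cross-entropy
    w[active] -= lr * g          # only active weights are touched
    return w, b - lr * g

n_features = 1_000
w, b = np.zeros(n_features), 0.0
rng = np.random.default_rng(0)
# Toy impression stream: the same pattern clicks ~80% of the time.
for _ in range(5_000):
    y = 1.0 if rng.random() < 0.8 else 0.0
    w, b = sgd_step(w, b, np.array([7, 42]), y)

p_hat = sigmoid(b + w[np.array([7, 42])].sum())   # approaches ~0.8
```

Because the gradient only touches the active weights, each update costs $O(\text{active features})$ rather than $O(n)$, which is what makes training on billions of impressions feasible.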

Stage 2: Polynomial/Feature Cross Models

To capture interactions, we can explicitly model feature crosses:

Degree-2 Polynomial:

$$\hat{y} = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij} x_i x_j\right)$$

The problem: for $n$ features, we now have $\frac{n(n-1)}{2}$ pairwise interaction terms. With millions of sparse features, this is computationally infeasible and leads to severe overfitting (most pairs never co-occur in training data).


Stage 3: Factorization Machines (FM)

The breakthrough: Instead of learning a weight $w_{ij}$ for each feature pair, learn a latent vector $\mathbf{v}_i \in \mathbb{R}^k$ for each feature and model interactions as dot products.

FM formulation (Rendle, 2010):

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

where $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{if} v_{jf}$ is the dot product.

┌─────────────────────────────────────────────────────────────────────────┐
│                    FACTORIZATION MACHINES                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEY INSIGHT: Factorize the interaction weight matrix                   │
│  ───────────────────────────────────────────────────                    │
│                                                                          │
│  Instead of:  W_ij (n² parameters for pairwise interactions)            │
│                                                                          │
│  Learn:       V ∈ ℝ^(n×k) where k << n                                  │
│               W_ij ≈ <v_i, v_j> = Σ v_if · v_jf                         │
│                                                                          │
│  Parameter reduction:                                                    │
│  ───────────────────                                                     │
│  Full interactions: O(n²)  →  With FM: O(n·k)                           │
│                                                                          │
│  Example: n = 10⁶ features, k = 64                                      │
│  Full: 10¹² parameters (impossible)                                     │
│  FM:   64 × 10⁶ = 6.4 × 10⁷ parameters (tractable)                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENERALIZATION POWER:                                                   │
│  ────────────────────                                                    │
│                                                                          │
│  Even if (feature_i, feature_j) never co-occur in training data,        │
│  FM can estimate their interaction through the latent vectors:          │
│                                                                          │
│  v_i learned from (feature_i, feature_k) co-occurrences                 │
│  v_j learned from (feature_j, feature_k) co-occurrences                 │
│  → <v_i, v_j> provides reasonable interaction estimate                  │
│                                                                          │
│  This is similar to how matrix factorization in RecSys handles          │
│  user-item pairs never seen in training.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Efficient computation (the FM trick):

The naive computation of pairwise interactions is $O(kn^2)$. But FM can be computed in $O(kn)$:

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 - \sum_{i=1}^{n} v_{if}^2 x_i^2\right]$$

Derivation:

Starting from the identity:

$$\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n} a_i^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} a_i a_j$$

Let $a_i = v_{if} x_i$:

$$\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 = \sum_{i=1}^{n} v_{if}^2 x_i^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} v_{if} v_{jf} x_i x_j$$

Rearranging and summing over $f$:

$$\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{if} x_i\right)^2 - \sum_{i=1}^{n} v_{if}^2 x_i^2\right]$$

This reformulation enables linear-time computation—critical for real-time serving.
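The identity is easy to verify numerically. A short numpy sketch comparing the naive pairwise sum against the linear-time form (the dimensions and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 8
V = rng.normal(scale=0.1, size=(n, k))   # one latent vector per feature (rows)
x = rng.random(n)                        # feature values (dense here for clarity)

# Naive O(k n^2): explicit sum over all pairs i < j of <v_i, v_j> x_i x_j.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# FM trick, O(k n): 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
s = V.T @ x                              # shape (k,): per-factor weighted sums
fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

assert np.isclose(naive, fast)
```

For sparse one-hot inputs the sums run only over the active features, so a prediction costs $O(k \cdot \text{active features})$, which is why FMs fit comfortably inside real-time serving budgets.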


Stage 4: Field-aware Factorization Machines (FFM)

Limitation of FM: The same latent vector $\mathbf{v}_i$ is used regardless of which feature it interacts with.

FFM insight (Juan et al., 2016): Different interactions may require different representations. A user's "sports interest" should interact differently with "ad category" versus "time of day."

FFM formulation:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_{i,f_j}, \mathbf{v}_{j,f_i} \rangle x_i x_j$$

where $f_j$ denotes the field of feature $j$, and $\mathbf{v}_{i,f}$ is feature $i$'s latent vector for interacting with field $f$.

┌─────────────────────────────────────────────────────────────────────────┐
│                    FM vs FFM: FIELD-AWARE INTERACTIONS                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIELDS: Groups of related features                                      │
│  ──────────────────────────────────                                      │
│                                                                          │
│  Field 1 (User):     user_id, user_age, user_gender                     │
│  Field 2 (Ad):       ad_id, ad_category, advertiser_id                  │
│  Field 3 (Context):  hour, day_of_week, device                          │
│  Field 4 (Publisher): site_id, page_category                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FM: Each feature has ONE latent vector                                  │
│  ────────────────────────────────────                                    │
│                                                                          │
│  user_age=25-34:  v_age = [0.1, 0.3, -0.2, ...]                         │
│                                                                          │
│  Interaction with ad_category:    <v_age, v_category>                   │
│  Interaction with hour:           <v_age, v_hour>                       │
│  (Same v_age used for both!)                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FFM: Each feature has latent vector PER FIELD                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  user_age=25-34:                                                         │
│    v_age,Ad      = [0.1, 0.3, -0.2, ...]  (for Ad field)                │
│    v_age,Context = [0.4, -0.1, 0.5, ...]  (for Context field)           │
│    v_age,Pub     = [-0.2, 0.2, 0.1, ...]  (for Publisher field)         │
│                                                                          │
│  Interaction with ad_category:    <v_age,Ad, v_category,User>           │
│  Interaction with hour:           <v_age,Context, v_hour,User>          │
│  (Different vectors for different fields!)                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PARAMETER COUNT:                                                        │
│  ────────────────                                                        │
│  FM:  n × k                                                              │
│  FFM: n × F × k  (F = number of fields)                                 │
│                                                                          │
│  Tradeoff: More expressive but more parameters and slower training      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

FFM won several Kaggle CTR prediction competitions and became a standard baseline in the industry.
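A toy numpy sketch of the field-aware interaction term, with three invented features in three fields; note how each feature looks up the latent vector keyed by the *other* feature's field:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
fields = ["User", "Ad", "Context"]
# One latent vector per (feature, target field): each feature stores a
# (n_fields, k) block. Feature names are illustrative.
latent = {
    "user_age=25-34":     rng.normal(size=(len(fields), k)),
    "ad_category=sports": rng.normal(size=(len(fields), k)),
    "hour=19":            rng.normal(size=(len(fields), k)),
}
field_of = {"user_age=25-34": 0, "ad_category=sports": 1, "hour=19": 2}

def ffm_interactions(active):
    """Sum of <v_{i,f_j}, v_{j,f_i}> over all active feature pairs
    (binary features, so the x_i x_j factor is 1)."""
    feats = list(active)
    total = 0.0
    for a in range(len(feats)):
        for b in range(a + 1, len(feats)):
            fi, fj = feats[a], feats[b]
            # Each feature uses the vector specific to the other's field.
            total += latent[fi][field_of[fj]] @ latent[fj][field_of[fi]]
    return total

score = ffm_interactions(latent.keys())
```

The pairwise loop is quadratic in the number of *active* features (typically a few dozen per request), not in the full feature space, so FFM remains servable despite the $n \times F \times k$ parameter count.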


Part III: Deep Learning for CTR Prediction

The Deep Learning Revolution

Around 2016, deep learning entered CTR prediction. The key insight: neural networks can automatically learn feature interactions without manual engineering.

Wide & Deep Learning (Google, 2016)

Google's Wide & Deep architecture combines memorization (wide) with generalization (deep).

Motivation:

  • Memorization: Learning specific feature co-occurrences from history
    • "Users who installed app X often install app Y"
    • Requires feature engineering but captures precise patterns
  • Generalization: Learning transferable patterns
    • "Users interested in fitness apps like health-related apps"
    • DNNs learn general representations but may over-generalize

Architecture:

$$\hat{y} = \sigma\left(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T \mathbf{a}^{(L)} + b\right)$$

where:

  • $\mathbf{x}$ = raw features
  • $\phi(\mathbf{x})$ = cross-product transformations (manual feature crosses)
  • $\mathbf{a}^{(L)}$ = final hidden layer of the deep network
┌─────────────────────────────────────────────────────────────────────────┐
│                    WIDE & DEEP ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         ┌──────────────┐                                │
│                         │   Output     │                                │
│                         │   σ(·)       │                                │
│                         └──────┬───────┘                                │
│                                │                                         │
│                    ┌───────────┴───────────┐                            │
│                    │                       │                             │
│            ┌───────┴───────┐       ┌───────┴───────┐                    │
│            │     WIDE      │       │     DEEP      │                    │
│            │   (Linear)    │       │    (DNN)      │                    │
│            └───────┬───────┘       └───────┬───────┘                    │
│                    │                       │                             │
│            ┌───────┴───────┐       ┌───────┴───────┐                    │
│            │ Raw Features  │       │    Hidden     │                    │
│            │      +        │       │    Layers     │                    │
│            │ Cross Features│       │   (ReLU)      │                    │
│            │ (manual)      │       └───────┬───────┘                    │
│            └───────┬───────┘               │                             │
│                    │               ┌───────┴───────┐                    │
│                    │               │   Embedding   │                    │
│                    │               │    Layer      │                    │
│                    │               └───────┬───────┘                    │
│                    │                       │                             │
│                    └───────────┬───────────┘                            │
│                                │                                         │
│                    ┌───────────┴───────────┐                            │
│                    │    Sparse Features    │                            │
│                    │   (categorical IDs)   │                            │
│                    └───────────────────────┘                            │
│                                                                          │
│  WIDE COMPONENT: Memorization                                           │
│  ─────────────────────────────                                          │
│  • Linear model on raw + crossed features                               │
│  • Crossed features like: installed_app × impression_app                │
│  • Captures specific, frequent patterns                                 │
│                                                                          │
│  DEEP COMPONENT: Generalization                                         │
│  ──────────────────────────────                                         │
│  • Embeddings for categorical features                                  │
│  • Multiple hidden layers with ReLU                                     │
│  • Learns dense representations that generalize                         │
│                                                                          │
│  JOINT TRAINING: Both components trained end-to-end                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key insight: The wide component still requires manual feature crosses. Can we automate this?


DeepFM (Huawei, 2017)

DeepFM replaces the wide component's manual crosses with a Factorization Machine, achieving automatic feature interaction learning at both low and high orders.

Architecture:

$$\hat{y} = \sigma\left(y_{\text{FM}} + y_{\text{DNN}}\right)$$

where:

$$y_{\text{FM}} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

$$y_{\text{DNN}} = \mathbf{w}^T \mathbf{a}^{(L)} + b$$

┌─────────────────────────────────────────────────────────────────────────┐
│                    DeepFM ARCHITECTURE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                              ┌──────────────┐                           │
│                              │   Output     │                           │
│                              │   σ(·)       │                           │
│                              └──────┬───────┘                           │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │         ADD           │                       │
│                         └───────────┬───────────┘                       │
│                    ┌────────────────┼────────────────┐                  │
│                    │                │                │                   │
│            ┌───────┴───────┐ ┌──────┴──────┐ ┌──────┴──────┐           │
│            │   1st Order   │ │  2nd Order  │ │    Deep     │           │
│            │   (Linear)    │ │    (FM)     │ │   (DNN)     │           │
│            └───────┬───────┘ └──────┬──────┘ └──────┬──────┘           │
│                    │                │               │                    │
│                    │                │        ┌──────┴──────┐            │
│                    │                │        │   Hidden    │            │
│                    │                │        │   Layers    │            │
│                    │                │        └──────┬──────┘            │
│                    │                │               │                    │
│                    └────────────────┴───────────────┘                   │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │    SHARED Embeddings  │  ← KEY INNOVATION    │
│                         └───────────┬───────────┘                       │
│                                     │                                    │
│                         ┌───────────┴───────────┐                       │
│                         │   Sparse Features     │                       │
│                         └───────────────────────┘                       │
│                                                                          │
│  KEY INNOVATIONS:                                                        │
│  ────────────────                                                        │
│  1. FM replaces manual feature crosses (automatic 2nd-order)            │
│  2. DNN captures higher-order interactions                              │
│  3. SHARED embeddings between FM and DNN                                │
│     - Reduces parameters                                                │
│     - FM and DNN reinforce each other                                   │
│  4. No pre-training required (end-to-end)                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why shared embeddings matter:

The embedding vector $\mathbf{v}_i$ serves dual purposes:

  1. In FM: Direct dot-product interactions $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$
  2. In DNN: Concatenated as input for higher-order learning

This parameter sharing creates a synergy: the FM component provides explicit 2nd-order signals that help the DNN converge faster, while the DNN's gradients improve the embeddings used by FM.
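A compact numpy sketch of a DeepFM forward pass, under simplifying assumptions (binary one-hot features, one active feature per field; all layer sizes, the single hidden layer, and the random initialization are illustrative). The point to notice is that one embedding table `V` feeds both the FM term and the MLP input:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, k, hidden = 100, 8, 16
m = 4                                   # active features per example (one per field)

w0, w = 0.0, rng.normal(scale=0.01, size=n_features)   # 1st-order weights
V = rng.normal(scale=0.01, size=(n_features, k))       # SHARED embeddings
W1 = rng.normal(scale=0.1, size=(m * k, hidden))       # DNN hidden layer
W2 = rng.normal(scale=0.1, size=hidden)                # DNN output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deepfm_forward(active):
    """active: indices of the m one-hot features for one example."""
    emb = V[active]                                    # (m, k), used by BOTH parts
    # FM part: 1st order + pairwise interactions via the O(kn) identity.
    s = emb.sum(axis=0)
    y_fm = w0 + w[active].sum() + 0.5 * (s @ s - (emb ** 2).sum())
    # Deep part: the same embeddings, concatenated into the MLP.
    h = np.maximum(emb.reshape(-1) @ W1, 0.0)          # ReLU hidden layer
    y_dnn = h @ W2
    return sigmoid(y_fm + y_dnn)

p = deepfm_forward(np.array([3, 27, 55, 91]))          # hypothetical indices
```

In training, gradients from both heads flow into the same rows of `V`, which is the mechanism behind the synergy described above.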


Deep & Cross Network (DCN) (Google, 2017)

DCN introduces an elegant cross network that explicitly models feature interactions of arbitrary order without the combinatorial explosion.

Cross Layer formulation:

$$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l$$

where:

  • $\mathbf{x}_0$ = input features
  • $\mathbf{x}_l$ = output of layer $l$
  • $\mathbf{w}_l, \mathbf{b}_l$ = learnable parameters
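A minimal numpy sketch of stacking cross layers (dimensions and random weights are arbitrary; biases are zeroed for clarity). Each layer multiplies the original input $\mathbf{x}_0$ by a scalar projection of the current state and adds a residual:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    return x0 * (xl @ w) + b + xl      # (xl @ w) is a scalar

rng = np.random.default_rng(7)
d = 6
x0 = rng.random(d)
ws = [rng.normal(size=d) for _ in range(3)]
bs = [np.zeros(d) for _ in range(3)]

x = x0
for l in range(3):   # each layer raises the maximum interaction order by one
    x = cross_layer(x0, x, ws[l], bs[l])
# After 3 cross layers, x contains polynomial terms of x0 up to degree 4,
# at a cost of only 2*d parameters per layer.
```

This is the appeal of the cross network: interaction order grows linearly with depth while the parameter count stays $O(d)$ per layer, avoiding the $O(n^2)$ blow-up of explicit crosses.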
┌─────────────────────────────────────────────────────────────────────────┐
│                    CROSS NETWORK: HOW IT WORKS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Cross Layer Operation:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  x_{l+1} = x_0 · (x_l^T · w_l) + b_l + x_l                              │
│          = x_0 · (scalar) + b_l + x_l                                   │
│                                                                          │
│  Breakdown:                                                              │
│  ──────────                                                              │
│  1. x_l^T · w_l  →  scalar (dot product)                                │
│  2. x_0 · scalar →  feature-weighted x_0                                │
│  3. + x_l        →  residual connection                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INTERACTION ORDER ANALYSIS:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Layer 0: x_0 = [x_1, x_2, x_3]  (1st order features)                   │
│                                                                          │
│  Layer 1: x_1 = x_0 · (x_0^T w_0) + x_0                                 │
│           Contains: x_1, x_2, x_3           (1st order)                 │
│                     x_1², x_1x_2, x_1x_3... (2nd order)                 │
│                                                                          │
│  Layer 2: x_2 = x_0 · (x_1^T w_1) + x_1                                 │
│           Contains: 1st, 2nd order (from x_1)                           │
│                     3rd order (x_0 × 2nd order terms)                   │
│                                                                          │
│  Layer L: Contains interactions up to order (L+1)                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PARAMETER EFFICIENCY:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Each cross layer: d parameters (w_l) + d parameters (b_l) = 2d         │
│  L cross layers: 2Ld parameters                                          │
│                                                                          │
│  Compare to polynomial: d^(L+1) parameters for order-(L+1) interactions │
│                                                                          │
│  Example: d=1000, L=3                                                   │
│  Cross Network: 6,000 parameters                                        │
│  Full polynomial: 10^12 parameters (impossible!)                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
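The cross layer is short enough to write out directly. The following NumPy sketch (dimensions chosen for illustration) stacks a few layers and confirms the 2Ld parameter count from the diagram.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 6, 3                      # feature dim, number of cross layers

def cross_layer(x0, xl, w, b):
    # x_{l+1} = x0 * (xl^T w) + b + xl ; (xl @ w) is a scalar,
    # so the layer costs O(d) and adds only 2d parameters (w and b)
    return x0 * (xl @ w) + b + xl

x0 = rng.normal(size=d)
ws = [rng.normal(size=d) for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]

x = x0
for w, b in zip(ws, bs):
    x = cross_layer(x0, x, w, b)

n_params = sum(w.size + b.size for w, b in zip(ws, bs))
print(n_params)   # 2 * L * d = 36
```

Each pass re-multiplies by `x0`, which is how the interaction order climbs by one per layer while the parameter count stays linear in d.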

DCN-v2 (2020): The original DCN's cross layer uses rank-1 weight matrices (\mathbf{x}_0 \mathbf{x}_l^T). DCN-v2 generalizes to full-rank:

\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (\mathbf{W}_l \mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l

where \odot is the element-wise product and \mathbf{W}_l \in \mathbb{R}^{d \times d}.

This increases expressiveness at the cost of more parameters; a practical compromise uses the low-rank decomposition \mathbf{W}_l = \mathbf{U}_l \mathbf{V}_l^T where \mathbf{U}_l, \mathbf{V}_l \in \mathbb{R}^{d \times r}.
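A quick NumPy sketch of the low-rank DCN-v2 layer (sizes illustrative): applying `V.T` then `U` never materializes the d×d matrix, dropping the per-layer cost from O(d²) to O(dr) while computing the same function as the full-rank layer with \mathbf{W}_l = \mathbf{U}_l\mathbf{V}_l^T.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 1000, 32                       # feature dim, low rank (r << d)

def cross_layer_v2_lowrank(x0, xl, U, V, b):
    # DCN-v2 with W_l = U V^T: x_{l+1} = x0 ⊙ (U (V^T xl) + b) + xl
    # Two thin matvecs instead of one d x d matvec.
    return x0 * (U @ (V.T @ xl) + b) + xl

U = rng.normal(0, 0.01, (d, r))
V = rng.normal(0, 0.01, (d, r))
b = np.zeros(d)
x0 = rng.normal(size=d)
out = cross_layer_v2_lowrank(x0, x0, U, V, b)   # first cross layer (xl = x0)
```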


xDeepFM (Microsoft, 2018)

xDeepFM introduces the Compressed Interaction Network (CIN) to learn explicit, bounded-degree feature interactions at the vector level (not bit level).

Key insight: In DeepFM's DNN, interactions happen at the bit level (individual embedding dimensions). CIN operates at the vector level (entire embedding vectors), which is more interpretable and controllable.

CIN formulation:

X^k_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left(X^{k-1}_{i,*} \circ X^0_{j,*}\right)

where:

  • X^0 \in \mathbb{R}^{m \times D}: input feature embeddings (m features, D dimensions)
  • X^k \in \mathbb{R}^{H_k \times D}: output of layer k (H_k feature maps)
  • \circ: Hadamard (element-wise) product
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CIN: COMPRESSED INTERACTION NETWORK                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: m field embeddings, each D-dimensional                          │
│                                                                          │
│         X^0 = [e_1, e_2, ..., e_m]  ∈ ℝ^(m × D)                        │
│                                                                          │
│  LAYER k: Compute interactions with original embeddings                  │
│  ─────────────────────────────────────────────────────                   │
│                                                                          │
│  Step 1: Outer product along embedding dimension                        │
│                                                                          │
│          Z^k = X^(k-1) ⊗ X^0  ∈ ℝ^(H_{k-1} × m × D)                    │
│                                                                          │
│          Each Z^k_{i,j} = X^(k-1)_i ⊙ X^0_j  (Hadamard product)        │
│                                                                          │
│  Step 2: Compress along the H_{k-1} × m dimensions                      │
│                                                                          │
│          X^k_h = Σ_i Σ_j W^k_{h,i,j} · Z^k_{i,j}                        │
│                                                                          │
│          Output: X^k ∈ ℝ^(H_k × D)                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INTERACTION ORDER:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Layer 1: X^1 involves X^0 ⊗ X^0 → 2nd order interactions              │
│  Layer 2: X^2 involves X^1 ⊗ X^0 → 3rd order interactions              │
│  Layer k: Contains exactly (k+1)-order interactions                     │
│                                                                          │
│  Unlike DNN where interaction order is implicit and unbounded,          │
│  CIN gives explicit control over maximum interaction degree.            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OUTPUT: Sum pooling over each layer                                     │
│                                                                          │
│          p^k = Σ_h Σ_d X^k_{h,d}  (scalar per layer)                    │
│                                                                          │
│          y_CIN = [p^1, p^2, ..., p^T]  (T layers)                       │
│                                                                          │
│  Final: Concatenate with linear + DNN outputs                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
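The two CIN steps map cleanly onto `einsum`. This is a minimal sketch of one layer plus sum pooling, with illustrative sizes; the `id,jd->ijd` call is the Hadamard outer product and `hij,ijd->hd` is the compression.

```python
import numpy as np

rng = np.random.default_rng(3)
m, D, H1 = 5, 8, 4                  # fields, embedding dim, feature maps in layer 1

def cin_layer(X_prev, X0, W):
    # Step 1: Z[i, j, d] = X_prev[i, d] * X0[j, d]  (Hadamard along embedding dim)
    Z = np.einsum('id,jd->ijd', X_prev, X0)
    # Step 2: compress the (H_prev x m) interaction grid into H_k feature maps:
    # X_next[h, d] = sum_{i,j} W[h, i, j] * Z[i, j, d]
    return np.einsum('hij,ijd->hd', W, Z)

X0 = rng.normal(size=(m, D))
W1 = rng.normal(size=(H1, m, m))
X1 = cin_layer(X0, X0, W1)          # 2nd-order interactions, shape (H1, D)

p1 = X1.sum()                       # sum pooling → one scalar per layer
```

Feeding `X1` and `X0` back into `cin_layer` with a new weight tensor gives the 3rd-order layer, and so on, keeping interactions at the vector level throughout.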

AutoInt (2019)

AutoInt applies multi-head self-attention to learn feature interactions, treating each feature embedding as a token.

Architecture:

\tilde{\mathbf{e}}_m^{(h)} = \sum_{k=1}^{M} \alpha_{m,k}^{(h)} \left(\mathbf{W}_V^{(h)} \mathbf{e}_k\right)

where:

\alpha_{m,k}^{(h)} = \frac{\exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k))}{\sum_{l=1}^{M} \exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_l))}

\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k) = \langle \mathbf{W}_Q^{(h)} \mathbf{e}_m, \mathbf{W}_K^{(h)} \mathbf{e}_k \rangle

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AutoInt: ATTENTION FOR FEATURE INTERACTION            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INTUITION: Treat feature embeddings like tokens in a transformer       │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Input: M feature embeddings [e_1, e_2, ..., e_M]                       │
│                                                                          │
│       e_1         e_2         e_3         e_4                           │
│    (user_age)  (ad_cat)    (hour)     (device)                          │
│        │           │           │           │                             │
│        ▼           ▼           ▼           ▼                             │
│  ┌─────────────────────────────────────────────────┐                    │
│  │           Multi-Head Self-Attention             │                    │
│  │                                                 │                    │
│  │   α_11  α_12  α_13  α_14                        │                    │
│  │   α_21  α_22  α_23  α_24     (attention matrix)│                    │
│  │   α_31  α_32  α_33  α_34                        │                    │
│  │   α_41  α_42  α_43  α_44                        │                    │
│  │                                                 │                    │
│  │   α_ij = how much feature i attends to j       │                    │
│  └─────────────────────────────────────────────────┘                    │
│        │           │           │           │                             │
│        ▼           ▼           ▼           ▼                             │
│      ẽ_1         ẽ_2         ẽ_3         ẽ_4                           │
│  (contextualized embeddings)                                             │
│                                                                          │
│  WHAT ATTENTION LEARNS:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  High α_12: "user_age" strongly interacts with "ad_category"            │
│  High α_34: "hour" strongly interacts with "device"                     │
│                                                                          │
│  Unlike FM (all pairs weighted equally by dot product),                 │
│  attention learns WHICH interactions matter for each example.           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STACKING: Multiple attention layers = higher-order interactions        │
│                                                                          │
│  Layer 1: ẽ^1 = Attn(e, e, e)      → 2nd order                         │
│  Layer 2: ẽ^2 = Attn(ẽ^1, ẽ^1, ẽ^1) → 3rd order                        │
│  Layer L: up to (L+1)-order interactions                                │
│                                                                          │
│  Residual connections: ẽ^l = ẽ^(l-1) + Attn(...)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Advantages of attention-based interaction:

  1. Dynamic: Attention weights depend on the specific input (unlike FM's fixed weights)
  2. Interpretable: Can visualize which features interact
  3. Efficient: Self-attention is well-optimized on modern hardware
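A single interacting layer, single head, can be sketched in a few lines of NumPy (sizes illustrative; the head dimension is set equal to the embedding dimension so the residual add works without a projection):

```python
import numpy as np

rng = np.random.default_rng(4)
M, d, d_k = 4, 8, 8                 # number of fields, embedding dim, head dim (= d here)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interacting_layer(E, WQ, WK, WV):
    # psi(e_m, e_k) = <WQ e_m, WK e_k>; alpha = row-wise softmax; residual add
    Q, K, V = E @ WQ.T, E @ WK.T, E @ WV.T
    alpha = softmax(Q @ K.T, axis=-1)      # (M, M) attention matrix
    return E + alpha @ V                   # contextualized embeddings

E = rng.normal(size=(M, d))
WQ, WK, WV = (rng.normal(0, 0.1, (d_k, d)) for _ in range(3))
E_tilde = interacting_layer(E, WQ, WK, WV)
```

Stacking the call L times yields interactions up to order L+1, as in the diagram; row `m` of `alpha` is exactly the \alpha_{m,k} distribution from the formula above.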

FiBiNET (Sina Weibo, 2019)

FiBiNET introduces SENET-like attention to dynamically reweight feature importance before interaction.

Key innovation: Not all features are equally important for every prediction. FiBiNET learns to squeeze (aggregate) and excite (reweight) features.

SENET Layer:

z_i = F_{sq}(\mathbf{e}_i) = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)} (squeeze: average pooling per field)

\mathbf{A} = F_{ex}(\mathbf{z}) = \sigma_2(\mathbf{W}_2 \cdot \sigma_1(\mathbf{W}_1 \cdot \mathbf{z})) (excite: two FC layers)

\mathbf{V} = F_{scale}(\mathbf{A}, \mathbf{E}) = [\mathbf{a}_1 \cdot \mathbf{e}_1, ..., \mathbf{a}_f \cdot \mathbf{e}_f] (reweight)

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FiBiNET: FEATURE IMPORTANCE LEARNING                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Not all features equally important for all predictions        │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Example: Predicting click for "luxury watch" ad                        │
│                                                                          │
│  Important features: user_income, user_age, user_interests              │
│  Less important: hour_of_day, browser_type                              │
│                                                                          │
│  But for "fast food" ad:                                                │
│  Important features: hour_of_day, user_location                         │
│  Less important: user_income                                            │
│                                                                          │
│  SENET learns to dynamically reweight based on the input!               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ARCHITECTURE:                                                           │
│                                                                          │
│  Input embeddings: E = [e_1, e_2, ..., e_f]  (f fields)                 │
│                                                                          │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  SQUEEZE: Global average pooling per field           │               │
│  │                                                      │               │
│  │  z_i = mean(e_i)  →  z = [z_1, z_2, ..., z_f]       │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  EXCITE: Two FC layers with reduction ratio r        │               │
│  │                                                      │               │
│  │  s = W_1 · z       (f → f/r)                        │               │
│  │  s = ReLU(s)                                        │               │
│  │  a = W_2 · s       (f/r → f)                        │               │
│  │  a = sigmoid(a)    (importance weights)             │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  REWEIGHT: Scale embeddings by importance            │               │
│  │                                                      │               │
│  │  v_i = a_i · e_i   →  V = [v_1, v_2, ..., v_f]      │               │
│  └──────────────────────────────────────────────────────┘               │
│                         │                                                │
│                         ▼                                                │
│  ┌──────────────────────────────────────────────────────┐               │
│  │  Bilinear interaction on reweighted embeddings       │               │
│  └──────────────────────────────────────────────────────┘               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
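The squeeze-excite-reweight pipeline fits in one function. A minimal NumPy sketch, with illustrative sizes and random weights standing in for the trained FC layers:

```python
import numpy as np

rng = np.random.default_rng(5)
f, k, r = 6, 8, 2                   # fields, embedding dim, reduction ratio

def senet_layer(E, W1, W2):
    z = E.mean(axis=1)                                      # squeeze: one scalar per field, (f,)
    a = 1 / (1 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))     # excite: f -> f/r -> f, sigmoid weights
    return a[:, None] * E                                   # reweight each field's embedding

E = rng.normal(size=(f, k))
W1 = rng.normal(0, 0.1, (f // r, f))
W2 = rng.normal(0, 0.1, (f, f // r))
V = senet_layer(E, W1, W2)          # same shape as E, importance-scaled
```

The bottleneck (`f → f/r → f`) keeps the excite network tiny while forcing it to summarize cross-field importance, mirroring the original SENET design for channels.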

Part IV: User Behavior Sequence Modeling

The Behavior Sequence Problem

So far, we've treated user features as static. But in advertising, user behavior history is critical. A user who just searched for "running shoes" is much more likely to click on a Nike ad than their static demographic profile suggests.

Challenge: How do we model sequences of past behaviors to predict future ad clicks?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    USER BEHAVIOR IN AD PREDICTION                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User's recent behavior sequence:                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  t-5: Viewed "Nike Air Max" product page                                │
│  t-4: Searched "best running shoes 2024"                                │
│  t-3: Clicked ad for "Adidas Ultraboost"                                │
│  t-2: Read article "Marathon Training Guide"                            │
│  t-1: Added "Running Socks" to cart                                     │
│  t:   Current ad impression: "Nike Running Shoes"                       │
│                                                                          │
│  QUESTION: How does this history inform P(click)?                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  NAIVE APPROACH: Aggregate all behaviors                                │
│  ───────────────────────────────────────                                 │
│                                                                          │
│  user_embedding = mean([e_nike, e_search, e_adidas, e_article, e_socks])│
│                                                                          │
│  Problems:                                                               │
│  • All behaviors weighted equally                                       │
│  • Recent behaviors not prioritized                                     │
│  • Relationship to target ad ignored                                    │
│                                                                          │
│  BETTER: Weight behaviors by relevance to current ad                    │
│  ─────────────────────────────────────────────────                       │
│                                                                          │
│  For "Nike Running Shoes" ad:                                           │
│  • "Nike Air Max" view: HIGH relevance (same brand + category)          │
│  • "Adidas Ultraboost" click: MEDIUM relevance (competitor)             │
│  • "Marathon Training" read: MEDIUM relevance (related interest)        │
│  • "Running Socks" cart: LOW relevance (different product type)         │
│                                                                          │
│  This is the core idea behind DIN, DIEN, and related models.           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Deep Interest Network (DIN) (Alibaba, 2018)

DIN introduces target-aware attention over user behavior sequences. Instead of treating all past behaviors equally, DIN computes attention weights based on relevance to the target ad.

Key insight: User interests are diverse and locally activated. When predicting clicks on a "Nike shoe" ad, past behaviors related to sports/shoes should matter more than unrelated behaviors.

Attention mechanism:

\mathbf{v}_U = f(\mathbf{v}_A) = \sum_{j=1}^{N} a(\mathbf{e}_j, \mathbf{e}_A) \cdot \mathbf{e}_j

where:

  • \mathbf{e}_j = embedding of behavior j
  • \mathbf{e}_A = embedding of target ad
  • a(\cdot, \cdot) = attention function

Attention function (activation unit):

a(\mathbf{e}_j, \mathbf{e}_A) = \mathbf{w}^T \cdot \text{PReLU}\left(\mathbf{W} \cdot [\mathbf{e}_j, \mathbf{e}_A, \mathbf{e}_j \odot \mathbf{e}_A, \mathbf{e}_j - \mathbf{e}_A] + \mathbf{b}\right)

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIN: DEEP INTEREST NETWORK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TARGET AD: Nike Running Shoes (embedding e_A)                          │
│                                                                          │
│  USER BEHAVIOR HISTORY:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│     e_1          e_2          e_3          e_4          e_5             │
│   (Nike Air)  (Search)    (Adidas)    (Article)    (Socks)              │
│       │           │           │           │           │                  │
│       └───────────┴───────────┴───────────┴───────────┘                  │
│                               │                                          │
│                   ┌───────────┴───────────┐                              │
│                   │   Attention Unit       │                             │
│                   │   a(e_j, e_A)         │← Target ad e_A              │
│                   └───────────┬───────────┘                              │
│                               │                                          │
│           ┌───────────────────┼───────────────────┐                      │
│           │                   │                   │                      │
│        a_1=0.6            a_2=0.3            a_3=0.4  ...                │
│           │                   │                   │                      │
│           ▼                   ▼                   ▼                      │
│        a_1·e_1 +          a_2·e_2 +          a_3·e_3 + ...              │
│                               │                                          │
│                               ▼                                          │
│                    ┌─────────────────┐                                  │
│                    │  User Interest  │                                  │
│                    │  Representation │                                  │
│                    │      v_U        │                                  │
│                    └────────┬────────┘                                  │
│                             │                                            │
│                             ▼                                            │
│               ┌─────────────────────────────┐                           │
│               │  Concatenate with other     │                           │
│               │  features → MLP → P(click)  │                           │
│               └─────────────────────────────┘                           │
│                                                                          │
│  KEY PROPERTIES:                                                         │
│  ───────────────                                                         │
│  1. Attention weights NOT normalized (sum ≠ 1)                          │
│     - Allows varying total interest intensity                           │
│     - User with strong interest → larger ||v_U||                        │
│                                                                          │
│  2. Activation unit uses both similarity AND difference                 │
│     - [e_j, e_A]: raw features                                          │
│     - [e_j ⊙ e_A]: element-wise product (similarity)                   │
│     - [e_j - e_A]: difference (captures contrast)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why unnormalized attention?

In standard attention (e.g., transformers), weights sum to 1. DIN deliberately avoids normalization:

  • Normalized: User with 10 relevant items and user with 1 relevant item produce similar magnitude outputs
  • Unnormalized: User with more relevant items has larger interest representation, capturing interest intensity
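The activation unit and unnormalized pooling can be sketched as follows in NumPy (all sizes and weights illustrative; a two-layer MLP stands in for the paper's activation unit):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, H = 5, 8, 16                  # history length, embedding dim, hidden units

def prelu(x, alpha=0.25):
    return np.where(x > 0, x, alpha * x)

def din_pool(E_hist, e_ad, W, b, w):
    # Activation-unit input per behavior: [e_j, e_A, e_j ⊙ e_A, e_j - e_A]
    feats = np.concatenate(
        [E_hist, np.tile(e_ad, (len(E_hist), 1)), E_hist * e_ad, E_hist - e_ad],
        axis=1)                                  # (N, 4d)
    a = prelu(feats @ W.T + b) @ w               # (N,) raw weights — note: NO softmax
    return a @ E_hist                            # weighted sum = interest vector v_U

E_hist = rng.normal(size=(N, d))
e_ad = rng.normal(size=d)
W = rng.normal(0, 0.1, (H, 4 * d)); b = np.zeros(H); w = rng.normal(0, 0.1, H)
v_U = din_pool(E_hist, e_ad, W, b, w)
```

Because the weights `a` are never normalized, adding more relevant behaviors grows ||v_U|| instead of diluting it, which is the intensity-preserving property described above.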

Deep Interest Evolution Network (DIEN) (Alibaba, 2019)

DIEN extends DIN by modeling the temporal evolution of user interests, not just their static representation.

Key insight: User interests evolve over time. The sequence [search shoes → view Nike → view Adidas → buy Nike] tells a story of interest development that static attention misses.

Two-layer architecture:

  1. Interest Extractor Layer: GRU captures sequential patterns
  2. Interest Evolving Layer: Attention-augmented GRU focuses on target-relevant evolution
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIEN: INTEREST EVOLUTION NETWORK                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LAYER 1: INTEREST EXTRACTOR (GRU)                                      │
│  ─────────────────────────────────                                       │
│                                                                          │
│  b_1 → b_2 → b_3 → b_4 → b_5   (behavior sequence)                      │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐                                          │
│  │GRU│→│GRU│→│GRU│→│GRU│→│GRU│                                          │
│  └──┘  └──┘  └──┘  └──┘  └──┘                                          │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  h_1    h_2    h_3    h_4    h_5   (hidden states = interest states)    │
│                                                                          │
│  Auxiliary loss: Predict next behavior from h_t                         │
│  L_aux = -Σ [log σ(h_t·e_{b_{t+1}}) + log(1 - σ(h_t·e_{b'_t}))]          │
│          (positive + negative samples)                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 2: INTEREST EVOLVING (AUGRU)                                     │
│  ──────────────────────────────────                                      │
│                                                                          │
│  h_1    h_2    h_3    h_4    h_5                                        │
│   │      │      │      │      │                                          │
│   │   ┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐                                     │
│   │   │Attn ││Attn ││Attn ││Attn │  ← attention w.r.t. target ad       │
│   │   └──┬──┘└──┬──┘└──┬──┘└──┬──┘                                     │
│   │   a_1│   a_2│   a_3│   a_4│                                         │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  ┌────┐┌────┐┌────┐┌────┐┌────┐                                        │
│  │AUGRU││AUGRU││AUGRU││AUGRU││AUGRU│                                    │
│  └────┘└────┘└────┘└────┘└────┘                                        │
│   │      │      │      │      │                                          │
│   ▼      ▼      ▼      ▼      ▼                                          │
│  h'_1   h'_2   h'_3   h'_4   h'_5                                       │
│                               │                                          │
│                               ▼                                          │
│                    Final interest state h'_T                            │
│                    (used for prediction)                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  AUGRU (Attention Update GRU):                                          │
│  ─────────────────────────────                                           │
│                                                                          │
│  Standard GRU:                                                          │
│    ũ_t = σ(W_u · [h_{t-1}, i_t] + b_u)     (update gate)               │
│    h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, i_t])  (candidate)                │
│    h_t = (1 - ũ_t) ⊙ h_{t-1} + ũ_t ⊙ h̃_t                              │
│                                                                          │
│  AUGRU modification:                                                     │
│    u'_t = a_t · ũ_t                         (attention-scaled update)  │
│    h'_t = (1 - u'_t) ⊙ h'_{t-1} + u'_t ⊙ h̃_t                          │
│                                                                          │
│  Effect: Low attention a_t → small update → ignore this behavior       │
│          High attention a_t → normal update → incorporate behavior      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Auxiliary loss for interest extraction:

The auxiliary loss ensures hidden states \mathbf{h}_t actually capture user interests by requiring them to predict the next behavior:

\mathcal{L}_{aux} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\left[\log \sigma(\mathbf{h}_t^i \cdot \mathbf{e}_{b_{t+1}}^i) + \log(1 - \sigma(\mathbf{h}_t^i \cdot \mathbf{e}_{b'_t}^i))\right]

where b'_t is a negative sample (an item the user did not interact with).
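The AUGRU modification above is easy to state in code. Below is a minimal numpy sketch of a single AUGRU step (the `augru_step` name, the weight shapes, and the omitted bias terms are illustrative choices, not from the paper): with attention a_t = 0 the hidden state passes through unchanged, which is exactly the "ignore this behavior" effect described in the diagram.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augru_step(h_prev, x_t, a_t, params):
    """One AUGRU step: the attention score a_t rescales the update gate,
    so low-attention behaviors barely change the hidden state."""
    Wu, Wr, Wh = params["Wu"], params["Wr"], params["Wh"]
    concat = np.concatenate([h_prev, x_t])
    u = sigmoid(Wu @ concat)                                # update gate
    r = sigmoid(Wr @ concat)                                # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    u_scaled = a_t * u                                      # attention-scaled update (AUGRU)
    return (1 - u_scaled) * h_prev + u_scaled * h_cand

rng = np.random.default_rng(0)
d = 4
params = {k: rng.normal(size=(d, 2 * d)) * 0.1 for k in ("Wu", "Wr", "Wh")}
h = np.zeros(d)
x = rng.normal(size=d)

h_ignored = augru_step(h, x, a_t=0.0, params=params)  # zero attention: state unchanged
h_used = augru_step(h, x, a_t=1.0, params=params)     # full attention: normal GRU update
```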


DSIN: Deep Session Interest Network (Alibaba, 2019)

DSIN recognizes that user behavior naturally clusters into sessions. Within a session, behaviors are highly related; across sessions, interests may differ significantly.

Architecture:

  1. Session Division: Split behavior sequence into sessions (e.g., by time gaps)
  2. Intra-Session Interest: Self-attention within each session
  3. Inter-Session Interest: Bi-LSTM across sessions to capture evolution
  4. Session Interest Activation: Attention with target ad
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DSIN: SESSION-BASED INTEREST MODELING                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAW BEHAVIOR SEQUENCE:                                                  │
│  ──────────────────────                                                  │
│  [b1, b2, b3] | gap | [b4, b5] | gap | [b6, b7, b8, b9]                │
│  └──Session 1──┘     └Session 2┘     └───Session 3────┘                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 1: INTRA-SESSION (Self-Attention per session)                    │
│  ───────────────────────────────────────────────────                    │
│                                                                          │
│  Session 1: [b1, b2, b3] → Self-Attention → Interest I_1                │
│  Session 2: [b4, b5]     → Self-Attention → Interest I_2                │
│  Session 3: [b6, b7, b8, b9] → Self-Attention → Interest I_3            │
│                                                                          │
│  Self-attention captures relationships within session:                  │
│  "viewed Nike, then searched shoes, then viewed Nike sizes"             │
│  → coherent shopping intent                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 2: INTER-SESSION (Bi-LSTM across sessions)                       │
│  ────────────────────────────────────────────────                        │
│                                                                          │
│    I_1 ───→ I_2 ───→ I_3      (forward LSTM)                           │
│    I_1 ←─── I_2 ←─── I_3      (backward LSTM)                          │
│                                                                          │
│  Captures interest evolution across sessions:                           │
│  "First explored options, then compared prices, then ready to buy"      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LAYER 3: SESSION ACTIVATION (Target-aware attention)                   │
│  ────────────────────────────────────────────────────                    │
│                                                                          │
│    [I_1, I_2, I_3] × Attention(target_ad) → weighted sum                │
│                                                                          │
│  Recent relevant session may be more important than old relevant one    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
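DSIN's first stage, session division, is simple to sketch. The helper below (a hypothetical `split_sessions` with a 30-minute gap threshold; the paper's exact threshold may differ) splits a time-ordered behavior list into sessions, mirroring the three-session sequence in the diagram.

```python
def split_sessions(events, gap_minutes=30):
    """events: time-ordered list of (item_id, timestamp_minutes).
    Start a new session whenever the gap between consecutive
    behaviors exceeds the threshold (DSIN's session division step)."""
    sessions = []
    for item, ts in events:
        if sessions and ts - sessions[-1][-1][1] <= gap_minutes:
            sessions[-1].append((item, ts))   # same session: small gap
        else:
            sessions.append([(item, ts)])     # new session: large gap
    return sessions

events = [("b1", 0), ("b2", 5), ("b3", 12),                      # session 1
          ("b4", 300), ("b5", 310),                              # session 2
          ("b6", 900), ("b7", 905), ("b8", 911), ("b9", 920)]    # session 3
sessions = split_sessions(events)
```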

SIM: Search-based Interest Model (Alibaba, 2020)

For users with very long behavior histories (thousands of items), attention over the full sequence is too slow. SIM introduces a two-stage approach: first retrieve relevant behaviors, then apply attention.

Architecture:

  1. General Search Unit (GSU): Fast retrieval of top-K relevant behaviors
  2. Exact Search Unit (ESU): Precise attention over retrieved behaviors

GSU (soft search):

\text{rel}(b_i, a) = \mathbf{e}_{b_i}^T \mathbf{W} \mathbf{e}_a

Select top-K behaviors with highest relevance scores.

GSU (hard search):

Use category/brand matching to retrieve candidates, e.g., "all behaviors in same category as target ad."

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SIM: HANDLING LONG BEHAVIOR SEQUENCES                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: User has 10,000 past behaviors                                │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  DIN/DIEN: Attention over 10,000 items → O(10,000) per inference        │
│  → Too slow for real-time serving (<10ms budget)                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SIM SOLUTION: Two-stage retrieval + attention                          │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  All behaviors (10,000)                                                  │
│         │                                                                │
│         ▼                                                                │
│  ┌─────────────────────────────────────┐                                │
│  │  STAGE 1: General Search Unit (GSU) │                                │
│  │                                     │                                │
│  │  Option A: Soft search              │                                │
│  │    score_i = e_i^T W e_target       │                                │
│  │    Keep top-K (e.g., K=100)         │                                │
│  │                                     │                                │
│  │  Option B: Hard search              │                                │
│  │    Filter by category/brand match   │                                │
│  │    (Very fast, O(1) with index)    │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│         Relevant behaviors (100)                                        │
│                   │                                                      │
│                   ▼                                                      │
│  ┌─────────────────────────────────────┐                                │
│  │  STAGE 2: Exact Search Unit (ESU)   │                                │
│  │                                     │                                │
│  │  Full attention (like DIN/DIEN)     │                                │
│  │  over the K retrieved behaviors     │                                │
│  │                                     │                                │
│  │  Time encoding: add position info   │                                │
│  │  for temporal awareness             │                                │
│  └────────────────┬────────────────────┘                                │
│                   │                                                      │
│                   ▼                                                      │
│          User interest vector                                           │
│                                                                          │
│  COMPLEXITY REDUCTION:                                                   │
│  ─────────────────────                                                   │
│  Original: O(N) attention = O(10,000)                                   │
│  SIM:      O(K) attention = O(100)     → 100x speedup!                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
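The two-stage pipeline can be sketched in a few lines of numpy. This is an illustrative soft-search GSU plus a dot-product-attention ESU (the function name, the choice of W, and the attention form are simplifications of mine, not the paper's exact parameterization):

```python
import numpy as np

def sim_interest(behaviors, target, W, k=100):
    """Two-stage SIM sketch: the GSU scores every behavior embedding
    against the target ad and keeps the top-k; the ESU then runs
    softmax attention over only those k behaviors."""
    # Stage 1 (GSU, soft search): rel_i = e_i^T W e_target, keep top-k
    scores = behaviors @ W @ target
    retrieved = behaviors[np.argsort(scores)[-k:]]
    # Stage 2 (ESU): softmax attention over the retrieved subset
    att = retrieved @ target
    att = np.exp(att - att.max())
    att /= att.sum()
    return att @ retrieved            # user interest vector

rng = np.random.default_rng(0)
d, n = 8, 10_000                      # 10,000 past behaviors, as in the diagram
behaviors = rng.normal(size=(n, d))
target = rng.normal(size=d)
W = np.eye(d)
interest = sim_interest(behaviors, target, W, k=100)  # attention over 100, not 10,000
```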

Part V: Multi-Task Learning for Advertising

The Multi-Objective Problem

In advertising, we care about multiple outcomes:

  • Click: Did user click the ad? (CTR)
  • Conversion: Did user complete a purchase? (CVR)
  • Engagement: Did user spend time on landing page?
  • Long-term: Did user become a repeat customer?

These objectives are related but distinct. Multi-task learning (MTL) enables joint modeling.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    WHY MULTI-TASK LEARNING?                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OPTION 1: Separate models                                              │
│  ─────────────────────────                                               │
│                                                                          │
│  CTR Model:  Features → DNN_1 → P(click)                                │
│  CVR Model:  Features → DNN_2 → P(conversion)                           │
│                                                                          │
│  Problems:                                                               │
│  • No shared learning (features learned independently)                  │
│  • Inconsistent predictions (CTR and CVR may disagree)                  │
│  • Sample selection bias for CVR (only see conversions after clicks)   │
│  • 2x serving cost                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTION 2: Multi-task model                                             │
│  ──────────────────────────                                              │
│                                                                          │
│  Features → Shared Layers → Task-Specific Heads → [P(click), P(conv)]   │
│                                                                          │
│  Benefits:                                                               │
│  • Shared representations learn general patterns                        │
│  • Transfer learning between tasks                                      │
│  • Single model for serving                                             │
│  • Auxiliary tasks regularize main task                                 │
│                                                                          │
│  Challenges:                                                             │
│  • Negative transfer (tasks may conflict)                               │
│  • Gradient balancing between tasks                                     │
│  • Different task difficulties                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Shared-Bottom Architecture

The simplest MTL approach: share bottom layers, use task-specific towers on top.

Architecture:

\mathbf{h}^{\text{shared}} = f^{\text{shared}}(\mathbf{x})

y_k = f_k^{\text{tower}}(\mathbf{h}^{\text{shared}}) \quad \text{for task } k

Loss:

\mathcal{L} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k

where \lambda_k balances task importance.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SHARED-BOTTOM ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│              Task 1        Task 2        Task 3                         │
│              (CTR)         (CVR)        (LTV)                           │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Tower 1 │    │Tower 2 │    │Tower 3 │                        │
│           │ (MLP)  │    │ (MLP)  │    │ (MLP)  │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Shared    │                                  │
│                        │   Bottom    │                                  │
│                        │   (MLP)     │                                  │
│                        └──────┬──────┘                                  │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │  Embedding  │                                  │
│                        │   Layer     │                                  │
│                        └──────┬──────┘                                  │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Input     │                                  │
│                        │  Features   │                                  │
│                        └─────────────┘                                  │
│                                                                          │
│  LIMITATION: Assumes all tasks benefit from same representation         │
│  → Negative transfer when tasks conflict                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
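A minimal numpy sketch of the shared-bottom forward pass and the weighted loss above (a single shared layer with linear towers; all names and shapes are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_bottom_forward(x, W_shared, towers):
    """One shared layer feeds every task; each task has its own tower."""
    h = relu(W_shared @ x)                              # shared representation
    return {task: sigmoid(w @ h) for task, w in towers.items()}

def mtl_loss(preds, labels, lambdas):
    """Weighted sum of per-task log losses: L = sum_k lambda_k * L_k."""
    eps = 1e-9
    return sum(
        lambdas[t] * -(labels[t] * np.log(preds[t] + eps)
                       + (1 - labels[t]) * np.log(1 - preds[t] + eps))
        for t in preds)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W_shared = rng.normal(size=(8, 16)) * 0.1
towers = {"ctr": rng.normal(size=8) * 0.1, "cvr": rng.normal(size=8) * 0.1}
preds = mtl_loss_input = shared_bottom_forward(x, W_shared, towers)
loss = mtl_loss(preds, {"ctr": 1.0, "cvr": 0.0}, {"ctr": 1.0, "cvr": 0.5})
```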

MMOE: Multi-gate Mixture-of-Experts (Google, 2018)

MMOE addresses negative transfer by learning task-specific combinations of shared experts.

Key idea: Instead of one shared bottom, use multiple expert networks and let each task decide which experts to use.

Architecture:

y_k = h_k\left(\sum_{i=1}^{n} g_k^{(i)}(\mathbf{x}) \cdot f_i(\mathbf{x})\right)

where:

  • f_i(\mathbf{x}) = expert i's output
  • g_k(\mathbf{x}) = \text{softmax}(\mathbf{W}_{gk} \mathbf{x}) = task k's gating weights
  • h_k = task k's tower
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MMOE: MIXTURE OF EXPERTS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│              Task 1        Task 2        Task 3                         │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Tower 1 │    │Tower 2 │    │Tower 3 │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Gate 1  │    │Gate 2  │    │Gate 3  │                        │
│           │.3 .5 .2│    │.6 .2 .2│    │.1 .3 .6│                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                         weighted sum                                     │
│                ┌──────────────┼─────────────┐                            │
│                │              │             │                            │
│                ▼              ▼             ▼                            │
│           ┌────────┐    ┌────────┐    ┌────────┐                        │
│           │Expert 1│    │Expert 2│    │Expert 3│                        │
│           │ (MLP)  │    │ (MLP)  │    │ (MLP)  │                        │
│           └────┬───┘    └────┬───┘    └────┬───┘                        │
│                │              │             │                            │
│                └──────────────┼─────────────┘                            │
│                               │                                          │
│                        ┌──────┴──────┐                                  │
│                        │   Input     │                                  │
│                        └─────────────┘                                  │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ───────────                                                             │
│  • Task 1 uses mostly Expert 2 (weights [.3, .5, .2])                   │
│  • Task 3 uses mostly Expert 3 (weights [.1, .3, .6])                   │
│  • Tasks can specialize while still sharing some computation            │
│                                                                          │
│  GATING MECHANISM:                                                       │
│  ─────────────────                                                       │
│  g_k(x) = softmax(W_k · x)                                              │
│                                                                          │
│  Gate is input-dependent: different inputs may use different expert     │
│  combinations even for the same task                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
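The gating equation can be written out directly. The numpy toy below (tanh experts, linear gates and towers, all simplifications of mine) shows each task mixing the same expert outputs with its own input-dependent softmax gate:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def mmoe_forward(x, experts, gates, towers):
    """MMOE sketch: every task sees the same expert outputs f_i(x) but
    mixes them with its own gate g_k(x) = softmax(W_k @ x)."""
    f = np.stack([np.tanh(We @ x) for We in experts])   # (n_experts, d_h)
    out = {}
    for task in gates:
        g = softmax(gates[task] @ x)                    # task-specific gate weights
        mixed = g @ f                                   # weighted sum of experts
        out[task] = 1.0 / (1.0 + np.exp(-(towers[task] @ mixed)))
    return out

rng = np.random.default_rng(0)
d_in, d_h, n_exp = 16, 8, 3
experts = [rng.normal(size=(d_h, d_in)) * 0.1 for _ in range(n_exp)]
gates = {t: rng.normal(size=(n_exp, d_in)) * 0.1 for t in ("ctr", "cvr")}
towers = {t: rng.normal(size=d_h) * 0.1 for t in ("ctr", "cvr")}
preds = mmoe_forward(rng.normal(size=d_in), experts, gates, towers)
```

Because the gate depends on x, two different impressions can route the same task through different expert mixtures, which is the input-dependence the diagram calls out.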

PLE: Progressive Layered Extraction (Tencent, 2020)

PLE extends MMOE by separating task-specific experts from shared experts, and progressively refining representations across layers.

Key insight: Some knowledge should be task-specific from the start, not just at the tower level.

Architecture:

At each layer l:

\mathbf{h}_k^{(l)} = g_k^{(l)}(\mathbf{h}^{(l-1)}) \cdot [\mathbf{E}_k^{(l)}(\mathbf{h}_k^{(l-1)}); \mathbf{E}_s^{(l)}(\mathbf{h}_s^{(l-1)})]

where:

  • \mathbf{E}_k^{(l)} = task-specific experts for task k
  • \mathbf{E}_s^{(l)} = shared experts
  • g_k^{(l)} = gating network selecting from both
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PLE: PROGRESSIVE LAYERED EXTRACTION                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    Task A Tower        Task B Tower                      │
│                         │                   │                            │
│              ┌──────────┴───────────────────┴──────────┐                │
│              │                                          │                │
│              ▼                                          ▼                │
│  EXTRACTION LAYER 2:                                                     │
│  ──────────────────────────────────────────────────────────────────     │
│                                                                          │
│   Task A         Shared          Task B                                 │
│   Experts        Experts         Experts                                │
│   ┌───┐┌───┐    ┌───┐┌───┐     ┌───┐┌───┐                              │
│   │EA1││EA2│    │ES1││ES2│     │EB1││EB2│                              │
│   └─┬─┘└─┬─┘    └─┬─┘└─┬─┘     └─┬─┘└─┬─┘                              │
│     │    │        │    │         │    │                                  │
│     └────┴────────┴────┴─────────┴────┘                                  │
│           │                │                                             │
│     Gate A selects   Gate B selects                                     │
│     from all 6       from all 6                                         │
│           │                │                                             │
│  ─────────┴────────────────┴─────────────────────────────────────────   │
│                                                                          │
│  EXTRACTION LAYER 1:                                                     │
│  ──────────────────────────────────────────────────────────────────     │
│                                                                          │
│   Task A         Shared          Task B                                 │
│   Experts        Experts         Experts                                │
│   ┌───┐┌───┐    ┌───┐┌───┐     ┌───┐┌───┐                              │
│   │EA1││EA2│    │ES1││ES2│     │EB1││EB2│                              │
│   └─┬─┘└─┬─┘    └─┬─┘└─┬─┘     └─┬─┘└─┬─┘                              │
│     │    │        │    │         │    │                                  │
│     └────┴────────┴────┴─────────┴────┘                                  │
│                    │                                                     │
│              ┌─────┴─────┐                                              │
│              │   Input   │                                              │
│              └───────────┘                                              │
│                                                                          │
│  COMPARISON WITH MMOE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  MMOE:                                                                   │
│  • All experts are shared                                               │
│  • Task-specific learning only in towers                                │
│                                                                          │
│  PLE:                                                                    │
│  • Explicit task-specific experts at each layer                         │
│  • Progressive refinement through multiple extraction layers            │
│  • Better handles conflicting tasks                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
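One extraction layer can be sketched as follows, assuming two experts per group and linear gates (a simplification; the paper uses MLP experts and stacks multiple extraction layers). The key routing rule is that each task's gate sees only its own experts plus the shared ones, while the shared gate sees all of them:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def ple_layer(h_a, h_s, h_b, E_a, E_s, E_b, g_a, g_s, g_b):
    """One PLE extraction layer: task gates mix own + shared experts;
    the shared gate mixes all experts."""
    f_a = [np.tanh(W @ h_a) for W in E_a]   # task-A experts
    f_s = [np.tanh(W @ h_s) for W in E_s]   # shared experts
    f_b = [np.tanh(W @ h_b) for W in E_b]   # task-B experts
    out_a = softmax(g_a @ h_a) @ np.stack(f_a + f_s)         # A: own + shared
    out_s = softmax(g_s @ h_s) @ np.stack(f_a + f_s + f_b)   # shared: all six
    out_b = softmax(g_b @ h_b) @ np.stack(f_b + f_s)         # B: own + shared
    return out_a, out_s, out_b

rng = np.random.default_rng(0)
d = 8
E_a = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
E_s = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
E_b = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
g_a = rng.normal(size=(4, d))   # selects among 2 own + 2 shared experts
g_s = rng.normal(size=(6, d))   # selects among all 6 experts
g_b = rng.normal(size=(4, d))
x = rng.normal(size=d)
a1, s1, b1 = ple_layer(x, x, x, E_a, E_s, E_b, g_a, g_s, g_b)  # layer 1: all inputs = x
```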

ESMM: Entire Space Multi-Task Model (Alibaba, 2018)

ESMM addresses the sample selection bias problem in CVR prediction.

The problem:

  • CVR (conversion rate) = P(conversion | click)
  • Training data: Only users who clicked
  • Deployment: Predict for all impressions (including non-clickers)

This is sample selection bias: the training distribution differs from the deployment distribution.

ESMM's solution: Model the entire sample space using the decomposition:

P(\text{conversion}) = P(\text{click}) \times P(\text{conversion} \mid \text{click})

or equivalently:

\text{CTCVR} = \text{CTR} \times \text{CVR}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ESMM: SAMPLE SELECTION BIAS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│                                                                          │
│                   All Impressions (1M)                                   │
│                          │                                               │
│              ┌───────────┴───────────┐                                  │
│              │                       │                                   │
│        Clicks (30K)           No Click (970K)                           │
│              │                       │                                   │
│      ┌───────┴───────┐               ×                                  │
│      │               │          (no conversion                          │
│  Conversion (1K)  No Conv (29K)  data here!)                            │
│                                                                          │
│  Traditional CVR model:                                                  │
│  • Trained on 30K clicks only                                           │
│  • Deployed on 1M impressions                                           │
│  • Distribution mismatch!                                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ESMM SOLUTION:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Model CTCVR (click AND convert) over entire impression space:          │
│                                                                          │
│  P(click ∩ conversion) = P(click) × P(conversion | click)               │
│        CTCVR           =   CTR    ×       CVR                           │
│                                                                          │
│  ARCHITECTURE:                                                           │
│                                                                          │
│       ┌─────────────┐                   ┌─────────────┐                 │
│       │  CTR Tower  │                   │  CVR Tower  │                 │
│       │   pCTR      │                   │   pCVR      │                 │
│       └──────┬──────┘                   └──────┬──────┘                 │
│              │                                  │                        │
│              │               ┌──────────────────┘                        │
│              │               │                                           │
│              ▼               ▼                                           │
│         ┌────────────────────────┐                                      │
│         │   pCTCVR = pCTR × pCVR │  ← Multiplied, not concatenated     │
│         └────────────────────────┘                                      │
│                                                                          │
│  TRAINING:                                                               │
│  ─────────                                                               │
│  • CTR: supervised on all impressions (click/no-click labels)           │
│  • CTCVR: supervised on all impressions (conversion labels)             │
│  • CVR: NO direct supervision—learned implicitly!                       │
│                                                                          │
│  Loss = L_CTR(pCTR, click_label) + L_CTCVR(pCTCVR, conversion_label)   │
│                                                                          │
│  BENEFIT:                                                                │
│  ────────                                                                │
│  • CVR implicitly trained on ALL samples (via CTCVR supervision)        │
│  • No sample selection bias                                             │
│  • CTR and CVR share embeddings (transfer learning)                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Mathematical insight:

Since CTCVR = CTR × CVR, the gradient flows through both towers:

\frac{\partial \mathcal{L}_{CTCVR}}{\partial \theta_{CVR}} = \frac{\partial \mathcal{L}_{CTCVR}}{\partial \text{pCTCVR}} \times \frac{\partial \text{pCTCVR}}{\partial \text{pCVR}} \times \frac{\partial \text{pCVR}}{\partial \theta_{CVR}}

= \frac{\partial \mathcal{L}_{CTCVR}}{\partial \text{pCTCVR}} \times \text{pCTR} \times \frac{\partial \text{pCVR}}{\partial \theta_{CVR}}

The CVR tower receives gradients weighted by CTR, which naturally emphasizes training on samples likely to click—exactly what we want for CVR estimation.
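The two-term training objective is compact enough to write out. Below is a numpy sketch of the ESMM loss with made-up predictions and all-impression labels (conversions occurring only where clicks occurred); note that the CVR predictions are supervised only through the product pCTCVR = pCTR × pCVR:

```python
import numpy as np

def log_loss(p, y, eps=1e-9):
    """Mean binary cross-entropy."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()

def esmm_loss(p_ctr, p_cvr, click, conv):
    """ESMM loss over the entire impression space: CTR and CTCVR get
    direct supervision; the CVR tower appears only inside the product
    pCTCVR = pCTR * pCVR, so it needs no click-conditioned labels."""
    p_ctcvr = p_ctr * p_cvr
    return log_loss(p_ctr, click) + log_loss(p_ctcvr, conv)

# All-impression labels: conversions are a subset of clicks.
click = np.array([1, 1, 0, 0, 1.0])
conv  = np.array([1, 0, 0, 0, 0.0])
p_ctr = np.array([0.9, 0.8, 0.1, 0.2, 0.7])   # hypothetical CTR tower outputs
p_cvr = np.array([0.6, 0.1, 0.5, 0.5, 0.2])   # hypothetical CVR tower outputs
loss = esmm_loss(p_ctr, p_cvr, click, conv)
```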


Part VI: Calibration and Position Bias

Why Calibration Matters

A well-calibrated model means: when you predict 10% CTR, approximately 10% of those impressions should result in clicks.

Definition (calibration):

\mathbb{E}[y \mid \hat{p}(\mathbf{x}) = p] = p \quad \forall p \in [0, 1]

In words: among all predictions with value p, the actual positive rate should be p.

Why calibration matters in advertising:

  1. Revenue optimization: Expected Revenue = pCTR × bid

    • Overestimated pCTR → overpay for impressions
    • Underestimated pCTR → lose valuable impressions
  2. Budget pacing: Advertisers set daily budgets assuming predicted CTRs are accurate

  3. Auction dynamics: Second-price auctions assume truthful bidding based on accurate value estimates
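Calibration is typically checked by binning predictions, which is exactly what a reliability diagram plots. The sketch below (a hypothetical `reliability_bins` helper) also computes the expected calibration error, the bin-weighted average gap between predicted and actual rates; on a simulator whose labels are drawn from the predicted probabilities, the ECE comes out near zero:

```python
import numpy as np

def reliability_bins(p_pred, y, n_bins=10):
    """Bucket predictions into equal-width bins and compare the mean
    predicted CTR with the empirical click rate per bucket.  Returns
    (predicted, actual) pairs per bin plus the expected calibration
    error (bin-weighted average |predicted - actual|)."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            avg_p, avg_y = p_pred[mask].mean(), y[mask].mean()
            rows.append((avg_p, avg_y))
            ece += mask.mean() * abs(avg_p - avg_y)
    return rows, ece

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.3, size=50_000)              # predicted CTRs
y = (rng.uniform(size=p.size) < p).astype(float)    # labels drawn from p: calibrated
rows, ece = reliability_bins(p, y)
```

A miscalibrated model shows up as bins where the pair (predicted, actual) sits off the diagonal, as in the reliability diagram that follows.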

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CALIBRATION IN AD SYSTEMS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WELL-CALIBRATED MODEL:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  Predicted CTR │ Actual CTR │ Calibration                               │
│  ──────────────┼────────────┼───────────────                            │
│      0.01      │    0.010   │   Perfect                                 │
│      0.05      │    0.052   │   Close                                   │
│      0.10      │    0.098   │   Close                                   │
│      0.20      │    0.195   │   Close                                   │
│                                                                          │
│  POORLY-CALIBRATED MODEL:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  Predicted CTR │ Actual CTR │ Problem                                   │
│  ──────────────┼────────────┼───────────────                            │
│      0.01      │    0.005   │   Overconfident                           │
│      0.05      │    0.030   │   Overconfident                           │
│      0.10      │    0.150   │   Underconfident                          │
│      0.20      │    0.250   │   Underconfident                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CALIBRATION PLOT (Reliability Diagram):                                │
│                                                                          │
│  Actual CTR                                                              │
│      │                              ╱                                   │
│  0.3 │                           ╱                                      │
│      │                        ╱    ● (well-calibrated)                 │
│  0.2 │                     ╱   ●                                        │
│      │                  ╱  ●                                            │
│  0.1 │               ╱●                                                 │
│      │            ╱●                                                    │
│    0 │─────────╱─────────────────────────                               │
│      0       0.1      0.2      0.3    Predicted CTR                     │
│                                                                          │
│  Perfect calibration: points lie on diagonal                            │
│  Above diagonal: underconfident (actual > predicted)                    │
│  Below diagonal: overconfident (actual < predicted)                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Calibration Methods

Platt Scaling:

Learn a post-hoc logistic transformation of the model's raw score (the logit of its predicted probability):

\hat{p}_{\text{calibrated}} = \sigma(A \cdot \text{logit}(\hat{p}_{\text{raw}}) + B)

where A, B are learned on a held-out validation set.

Isotonic Regression:

Non-parametric: learn a monotonic step function mapping raw scores to calibrated probabilities.

Temperature Scaling:

\hat{p}_{\text{calibrated}} = \sigma\left(\frac{\text{logit}(\hat{p}_{\text{raw}})}{T}\right)

where T > 1 softens predictions (less confident) and T < 1 sharpens them.
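A Platt-style fit can be done with a few lines of gradient descent on log loss over the validation set. This is a sketch, assuming the model outputs probabilities that we map back to logits first; note that temperature scaling is the special case A = 1/T, B = 0:

```python
import numpy as np

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(p_raw, y, lr=0.1, steps=2000):
    """Fit A, B in sigmoid(A * logit(p_raw) + B) by gradient descent on log loss."""
    z = logit(p_raw)
    A, B = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(A * z + B)
        g = p - y                          # dLogLoss / dlogit
        A -= lr * float(np.mean(g * z))
        B -= lr * float(np.mean(g))
    return A, B
```

Fitting on held-out data matters: fitting A, B on the training set just reproduces the model's own overconfidence.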


Position Bias

The problem: Ads in higher positions get more clicks regardless of relevance, simply because users see them first.

Observed CTR decomposition:

P(\text{click} | \text{ad}, \text{position}) = P(\text{examine} | \text{position}) \times P(\text{click} | \text{examine}, \text{ad})

where:

  • P(\text{examine} | \text{position}) = probability the user sees the ad (decreases with position)
  • P(\text{click} | \text{examine}, \text{ad}) = true relevance (what we want to estimate)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    POSITION BIAS IN AD CLICKS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER'S ATTENTION PATTERN:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Position 1: ████████████████████  P(examine) = 1.0                     │
│  Position 2: ██████████████████    P(examine) = 0.85                    │
│  Position 3: ████████████████      P(examine) = 0.70                    │
│  Position 4: ██████████████        P(examine) = 0.55                    │
│  Position 5: ████████████          P(examine) = 0.40                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE BIAS PROBLEM:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  Scenario: Two ads with same true relevance (0.10 click prob if seen)   │
│                                                                          │
│  Ad A in position 1: Observed CTR = 1.0 × 0.10 = 0.10                   │
│  Ad B in position 5: Observed CTR = 0.4 × 0.10 = 0.04                   │
│                                                                          │
│  Naive model: "Ad A is 2.5x better than Ad B"  ← WRONG!                 │
│  Reality: They're equally good; position caused the difference          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS MATTERS FOR TRAINING:                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  Training data: Historical clicks with position information             │
│                                                                          │
│  If we ignore position bias:                                            │
│  • Ads that historically appeared in top positions → overestimated CTR  │
│  • Ads that historically appeared in bottom → underestimated CTR        │
│  • Rich get richer (biased ads keep getting top positions)              │
│                                                                          │
│  We need to estimate TRUE relevance, not position-confounded CTR        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Debiasing Approaches

1. Position as Feature:

Simply add position as an input feature during training, but set it to a reference position (e.g., position 1) during inference.

2. Propensity Weighting (Inverse Propensity Scoring):

Weight training samples inversely by their examination probability:

\mathcal{L} = -\sum_{i} \frac{1}{P(\text{examine} | \text{pos}_i)} \cdot \log P(\text{click} | \text{ad}_i)

This upweights samples from low positions, which users are less likely to examine in the first place.

3. Position Bias Models (PAL):

Model examination and relevance separately:

\hat{y} = \sigma(\text{logit}_{\text{relevance}}(\text{ad features}) + \text{logit}_{\text{position}}(\text{position}))

At inference, use only the relevance component.


Part VII: Real-Time Bidding (RTB)

The RTB Ecosystem

Real-Time Bidding is how most display/video ads are bought and sold. When a user loads a webpage:

  1. Publisher sends bid request to ad exchange (user info, context)
  2. Exchange broadcasts to multiple Demand-Side Platforms (DSPs)
  3. DSPs evaluate their advertisers' campaigns and submit bids
  4. Exchange runs auction, winner's ad is displayed
  5. Total time: <100ms
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RTB: REAL-TIME BIDDING FLOW                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER LOADS PAGE                                                         │
│       │                                                                  │
│       ▼  (1) Bid Request (~10ms)                                        │
│  ┌─────────────┐                                                        │
│  │  Publisher  │                                                        │
│  │   (SSP)     │                                                        │
│  └──────┬──────┘                                                        │
│         │                                                                │
│         ▼  (2) Broadcast to DSPs                                        │
│  ┌─────────────┐                                                        │
│  │ Ad Exchange │                                                        │
│  └──────┬──────┘                                                        │
│         │                                                                │
│    ┌────┴────┬─────────┬─────────┐                                      │
│    ▼         ▼         ▼         ▼                                      │
│ ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐                                     │
│ │DSP 1│  │DSP 2│  │DSP 3│  │DSP 4│  (3) Each DSP decides bid (~50ms)  │
│ └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘                                     │
│    │        │        │        │                                          │
│    │  ┌─────┴────────┴────────┴─────┐                                   │
│    │  │ For each campaign:          │                                   │
│    │  │ 1. Predict pCTR, pCVR       │                                   │
│    │  │ 2. Calculate expected value │                                   │
│    │  │ 3. Apply bidding strategy   │                                   │
│    │  │ 4. Check budget constraints │                                   │
│    │  └─────────────────────────────┘                                   │
│    │                                                                     │
│    ▼  (4) Submit bids                                                   │
│ ┌─────────────┐                                                         │
│ │ Ad Exchange │  (5) Run auction (usually 2nd price)                    │
│ └──────┬──────┘                                                         │
│        │                                                                 │
│        ▼  (6) Winner's ad served                                        │
│ ┌─────────────┐                                                         │
│ │  User sees  │                                                         │
│ │     ad      │                                                         │
│ └─────────────┘                                                         │
│                                                                          │
│  TOTAL LATENCY BUDGET: ~100ms                                           │
│  • Network round-trip: ~20ms                                            │
│  • DSP processing: ~50ms                                                │
│  • Exchange auction: ~10ms                                              │
│  • Ad rendering: ~20ms                                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Bid Optimization

The bidding problem: Given predicted CTR, conversion rate, and campaign constraints, what bid maximizes value?

Basic formulation (for CPA campaigns):

\text{bid} = \text{pCTR} \times \text{pCVR} \times \text{target CPA}

where target CPA is the advertiser's desired cost per acquisition.

Auction dynamics:

In a second-price auction, the winner pays the second-highest bid (plus a small increment). Bidding your true value is then a dominant strategy:

\text{bid}^* = \mathbb{E}[\text{value}] = \text{pCTR} \times \text{pCVR} \times \text{conversion value}

But real auctions have complications: budget constraints, pacing requirements, competition dynamics.
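As a worked example of the two bid formulas above (the numbers are illustrative):

```python
def expected_value_bid(p_ctr, p_cvr, value_per_conversion):
    """Truthful bid for a second-price auction: the impression's expected value.

    For CPA campaigns, the advertiser's target CPA plays the role of the
    conversion value: bid = pCTR x pCVR x target_CPA.
    """
    return p_ctr * p_cvr * value_per_conversion

# 2% click prob, 5% conversion prob given click, $50 per conversion
# -> the impression is worth 5 cents.
bid = expected_value_bid(0.02, 0.05, 50.0)
```

Note how small the numbers get: pCTR × pCVR is often 10⁻³ or less, which is why calibration errors at low probabilities translate directly into overpaying or underpaying.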


Budget Pacing

Problem: Advertiser has daily budget B but wants to spread impressions throughout the day (not exhaust the budget in the morning).

Pacing strategies:

  1. Probabilistic throttling: Bid on only a fraction of requests: P(\text{bid}) = \frac{B}{\text{expected daily spend without pacing}}

  2. Bid shading: Reduce bids by a pacing multiplier \lambda: \text{bid}_{\text{paced}} = \lambda \times \text{bid}_{\text{optimal}}

  3. PID controller: Dynamically adjust \lambda based on spend rate vs. target rate

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BUDGET PACING OVER A DAY                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Spend ($)                                                               │
│      │                                                                   │
│  $100│─────────────────────────────────────────────  Budget limit       │
│      │                                         ╱                        │
│   $80│                                      ╱                           │
│      │                                   ╱  ← Well-paced spend          │
│   $60│                                ╱                                 │
│      │                  ╱╱╱╱╱╱                                          │
│   $40│            ╱╱╱╱╱╱        ← Uniform pacing target                │
│      │      ╱╱╱╱╱╱                                                      │
│   $20│ ╱╱╱╱╱                                                            │
│      │╱                                                                  │
│    $0├───────────────────────────────────────────────────────           │
│      0    4    8    12   16   20   24  Hour                             │
│                                                                          │
│  WITHOUT PACING:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  Spend ($)                                                               │
│      │                                                                   │
│  $100│─────────█████████████████████████████████████  Budget exhausted │
│      │        █                                       by 2pm!           │
│   $80│       █                                                          │
│      │      █                                                           │
│   $60│     █                                                            │
│      │    █                                                             │
│   $40│   █                                                              │
│      │  █                                                               │
│   $20│ █                                                                │
│      │█                                                                 │
│    $0├───────────────────────────────────────────────────────           │
│      0    4    8    12   16   20   24  Hour                             │
│                                                                          │
│  Problem: Miss all evening traffic (often high-value!)                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

PID Controller for pacing:

\lambda_{t+1} = \lambda_t + K_p \cdot e_t + K_i \cdot \sum_{s=0}^{t} e_s + K_d \cdot (e_t - e_{t-1})

where e_t = \text{target spend rate} - \text{actual spend rate}.
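A minimal PID pacer sketch (the gains kp, ki, kd are illustrative and would be tuned per campaign):

```python
class PIDPacer:
    """PID controller adjusting the bid multiplier lambda toward a target spend rate."""

    def __init__(self, kp=0.5, ki=0.05, kd=0.1, lam=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lam = lam
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, target_rate, actual_rate):
        err = target_rate - actual_rate
        self.integral += err
        self.lam += (self.kp * err
                     + self.ki * self.integral
                     + self.kd * (err - self.prev_err))
        self.prev_err = err
        self.lam = max(0.0, self.lam)  # a bid multiplier can't go negative
        return self.lam
```

In production the controller typically runs every few minutes, comparing the campaign's spend velocity against a uniform (or traffic-weighted) target curve like the one in the diagram above.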


Game Theory of Ad Auctions

Ad auctions are strategic games where bidders compete for impressions. Understanding game-theoretic foundations is essential for optimal bidding.

Auction Formats in Digital Advertising:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AD AUCTION MECHANISMS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FIRST-PRICE AUCTION:                                                    │
│  ────────────────────                                                    │
│  Winner pays their bid                                                  │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60]                                           │
│  Winner: Bidder 1 ($1.00)                                               │
│  Payment: $1.00                                                         │
│                                                                          │
│  Strategy: Shade bid below true value to increase profit margin         │
│  Equilibrium: Complex, depends on beliefs about competitors             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SECOND-PRICE AUCTION (Vickrey):                                        │
│  ───────────────────────────────                                         │
│  Winner pays second-highest bid                                         │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60]                                           │
│  Winner: Bidder 1 ($1.00)                                               │
│  Payment: $0.80 (second price)                                          │
│                                                                          │
│  Strategy: Bid true value (dominant strategy!)                          │
│  Equilibrium: Truthful bidding is optimal regardless of competitors     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENERALIZED SECOND-PRICE (GSP):                                        │
│  ───────────────────────────────                                         │
│  Multiple slots, each winner pays next-highest bid                      │
│                                                                          │
│  Bids: [$1.00, $0.80, $0.60] for 2 slots                               │
│  Slot 1: Bidder 1, pays $0.80                                          │
│  Slot 2: Bidder 2, pays $0.60                                          │
│                                                                          │
│  Strategy: NOT truthful! Bid shading is rational                        │
│  Equilibrium: Multiple equilibria exist                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Vickrey-Clarke-Groves (VCG) Mechanism:

The VCG mechanism extends second-price auctions to multiple items with the property that truthful bidding is a dominant strategy.

Payment rule:

p_i = \sum_{j \neq i} v_j(a_{-i}^*) - \sum_{j \neq i} v_j(a^*)

where:

  • a^* = optimal allocation with bidder i
  • a_{-i}^* = optimal allocation without bidder i
  • v_j(a) = bidder j's value under allocation a

Intuition: Bidder ii pays the externality they impose on others—the reduction in others' total value caused by ii's presence.

Why VCG matters: Under VCG, bidding your true value is always optimal, regardless of what others do. This simplifies bidder strategy and improves auction efficiency.
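The VCG payment rule can be computed directly from its definition. A sketch for position auctions, assuming bidder values and slot CTRs are both given sorted descending (so bidder i wins slot i):

```python
def vcg_payments(values, ctrs):
    """VCG payments for a position auction.

    values: per-click values, sorted descending (bidder i gets slot i).
    ctrs:   slot CTRs alpha_k, sorted descending, len(ctrs) <= len(values).
    Bidder i pays the welfare others lose because i is present.
    """
    def welfare(vals, slots):
        ranked = sorted(vals, reverse=True)
        return sum(v * a for v, a in zip(ranked, slots))

    payments = []
    for i in range(min(len(values), len(ctrs))):
        others = values[:i] + values[i + 1:]
        others_with_i = welfare(values, ctrs) - values[i] * ctrs[i]
        others_without_i = welfare(others, ctrs)
        payments.append(others_without_i - others_with_i)
    return payments
```

Applied to the example analyzed below (values $10, $8, $4; slot CTRs 1.0 and 0.5), it returns payments of $6 and $2, matching the hand calculation.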

Nash Equilibrium in GSP Auctions:

GSP (used by Google Ads for years) does NOT have truthful bidding as equilibrium. Instead, bidders shade bids strategically.

Symmetric Nash Equilibrium bid:

For a bidder with value v_i competing for position k:

b_i^* = v_i - \frac{(v_i - v_{i+1}) \cdot \alpha_{k+1}}{\alpha_k}

where \alpha_k is the click-through rate for position k (position 1 has the highest CTR).

Key insight: The amount of shading scales with the CTR ratio \alpha_{k+1}/\alpha_k between adjacent positions. If position 1 gets 10x more clicks than position 2, that ratio is small, competition for the top slot is fierce, and equilibrium bids stay close to true values; when adjacent positions have similar CTRs, bids are shaded heavily.
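The equilibrium bids can be computed bottom-up. This is a sketch following the standard symmetric-equilibrium recursion (Varian-style): the bidder just below the last slot bids their value, and each slot holder's bid mixes their value with the bid below according to the adjacent-CTR ratio:

```python
def gsp_sne(values, ctrs):
    """Symmetric Nash equilibrium bids and payments for GSP.

    values: per-click values sorted descending; ctrs: slot CTRs sorted descending.
    Returns (bids, expected payments per slot).
    """
    n_slots = len(ctrs)
    bids = [0.0] * len(values)
    # Bidders who win no slot bid their true value.
    for i in range(len(values) - 1, n_slots - 1, -1):
        bids[i] = values[i]
    # Slot holders, bottom-up: b_k = (1 - a_k/a_{k-1}) v_k + (a_k/a_{k-1}) b_{k+1}
    for k in range(n_slots - 1, 0, -1):
        ratio = ctrs[k] / ctrs[k - 1]
        b_next = bids[k + 1] if k + 1 < len(values) else 0.0
        bids[k] = (1 - ratio) * values[k] + ratio * b_next
    bids[0] = values[0]  # top bid doesn't affect the top bidder's own payment
    # GSP: slot k pays the next bid per click, so expected cost = b_{k+1} * alpha_k
    payments = [bids[k + 1] * ctrs[k] if k + 1 < len(values) else 0.0
                for k in range(n_slots)]
    return bids, payments
```

For values ($10, $8, $4) and CTRs (1.0, 0.5) this yields equilibrium bids ($10, $6, $4) and expected payments ($6, $2), the same revenue as VCG, which is the revenue-equivalence point the comparison table below makes.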

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EQUILIBRIUM ANALYSIS: GSP vs VCG                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Example: 3 bidders, 2 slots                                            │
│  Values: v₁ = $10, v₂ = $8, v₃ = $4                                    │
│  CTRs: α₁ = 1.0 (slot 1), α₂ = 0.5 (slot 2)                            │
│                                                                          │
│  TRUTHFUL BIDDING (VCG):                                                │
│  ───────────────────────                                                 │
│  Bids = Values: [$10, $8, $4]                                          │
│  Allocation: Bidder 1 → Slot 1, Bidder 2 → Slot 2                      │
│                                                                          │
│  VCG Payments:                                                          │
│  p₁ = (v₂·α₁ + v₃·α₂) - (v₂·α₂ + v₃·0) = ($8·1 + $4·0.5) - ($8·0.5)  │
│     = $10 - $4 = $6                                                     │
│  p₂ = (v₃·α₂) - (v₃·0) = $2 - $0 = $2                                  │
│                                                                          │
│  GSP EQUILIBRIUM (with bid shading):                                    │
│  ────────────────────────────────────                                    │
│  Equilibrium bids: b₁ > $6, b₂ = $6, b₃ = $4                           │
│  Payments: p₁ = $6·1.0 = $6, p₂ = $4·0.5 = $2 (as VCG)                 │
│                                                                          │
│  COMPARISON:                                                             │
│  ───────────                                                             │
│  │ Mechanism │ Revenue │ Efficiency │ Strategy Complexity │            │
│  ├───────────┼─────────┼────────────┼─────────────────────┤            │
│  │ VCG       │ $8      │ Optimal    │ Simple (truthful)   │            │
│  │ GSP       │ $8      │ Optimal    │ Complex (shade)     │            │
│  │ First-Prc │ Varies  │ Optimal    │ Complex (shade)     │            │
│                                                                          │
│  Revenue Equivalence Theorem: Under certain conditions, all             │
│  auction formats yield the same expected revenue!                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Reserve Prices and Optimal Auction Design:

Platforms set reserve prices r to extract more revenue from high-value bidders.

Myerson's optimal reserve (for values uniform on [0, v_{max}]):

r^* = \frac{v_{max}}{2}

Revenue impact:

\text{Revenue}(r) = r \cdot P(\text{win} | b \geq r) + \mathbb{E}[\text{second price} | b_1 \geq r, b_2 \geq r]

Setting r > 0 sacrifices some auctions (no winner) but extracts more from winners.
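The trade-off is easy to check by simulation. A sketch for two bidders with Uniform[0, 1] values, the setting in which Myerson's reserve is 1/2:

```python
import numpy as np

def second_price_revenue(reserve, n_bidders=2, n_auctions=200_000, seed=0):
    """Average revenue of a second-price auction with a reserve price."""
    rng = np.random.default_rng(seed)
    vals = np.sort(rng.uniform(0, 1, size=(n_auctions, n_bidders)), axis=1)
    top, second = vals[:, -1], vals[:, -2]
    sold = top >= reserve                # no sale if even the top value is below r
    price = np.maximum(second, reserve)  # winner pays max(second bid, reserve)
    return float(np.mean(np.where(sold, price, 0.0)))
```

With two uniform bidders the simulation reproduces the textbook numbers: revenue about 1/3 at r = 0, rising to about 5/12 at the optimal r = 1/2, despite some auctions going unsold.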

Modern auction trends:

  1. First-price auctions: Google and others moved from GSP to first-price (2019-2021)
  2. Header bidding: Multiple exchanges compete simultaneously
  3. Unified auctions: Combine direct deals with programmatic

Reinforcement Learning for Bid Optimization

Bidding is fundamentally a sequential decision problem: each bid affects budget, win rate, and future opportunities. RL provides a principled framework.

MDP Formulation for Bidding:

(\mathcal{S}, \mathcal{A}, P, R, \gamma)

State space \mathcal{S}:

s_t = (\text{budget remaining}, \text{time remaining}, \text{user features}, \text{ad features}, \text{context}, \text{historical win rate}, \text{market conditions})

Action space \mathcal{A}:

a_t = \text{bid amount} \in [0, b_{max}]

Or discretized: a_t \in \{0, 0.01, 0.02, ..., b_{max}\}

Or a bid multiplier: a_t \in \{0.5, 0.75, 1.0, 1.25, 1.5\} \times \text{base\_bid}

Transition dynamics P(s_{t+1} | s_t, a_t):

  • If win: budget decreases by payment, conversion may occur
  • If lose: budget unchanged, opportunity lost
  • Time always advances

Reward function R(s_t, a_t):

Code
R(s_t, a_t) =
  ┌ conversion_value - payment   if win and convert
  │ -payment                     if win and no convert
  └ 0                            if lose

Or for CPA goals:

R(s_t, a_t) = \mathbb{I}[\text{convert}] - \frac{\text{payment}}{\text{target\_CPA}}

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RL BIDDING: MDP FORMULATION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STATE at time t:                                                        │
│  ────────────────                                                        │
│  s_t = [                                                                │
│    B_t,           # Budget remaining ($)                                │
│    T - t,         # Time remaining (hours)                              │
│    pCTR,          # Predicted click probability                         │
│    pCVR,          # Predicted conversion probability                    │
│    user_embed,    # User features (dense)                               │
│    context,       # Page, device, hour, etc.                            │
│    win_rate_t,    # Recent win rate                                     │
│    spend_rate_t   # Current spend velocity                              │
│  ]                                                                       │
│                                                                          │
│  ACTION:                                                                 │
│  ───────                                                                 │
│  a_t = bid_multiplier ∈ {0.5, 0.75, 1.0, 1.25, 1.5, 2.0}               │
│  actual_bid = a_t × base_bid                                            │
│  base_bid = pCTR × pCVR × target_CPA                                   │
│                                                                          │
│  TRANSITION:                                                             │
│  ───────────                                                             │
│                                                                          │
│  ┌─────────┐    win (p=w(bid))    ┌─────────────────────┐              │
│  │  Bid    │ ──────────────────► │ B_{t+1} = B_t - cost │              │
│  │  a_t    │                      │ reward = value - cost│              │
│  └────┬────┘                      └─────────────────────┘              │
│       │                                                                  │
│       │ lose (p=1-w(bid))                                               │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────┐                                                │
│  │ B_{t+1} = B_t       │                                                │
│  │ reward = 0          │                                                │
│  │ (opportunity lost)  │                                                │
│  └─────────────────────┘                                                │
│                                                                          │
│  OBJECTIVE:                                                              │
│  ──────────                                                              │
│  Maximize: E[Σ γ^t R_t] subject to Σ cost_t ≤ B (budget constraint)    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Value Function and Q-Learning:

State-value function:

V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid s_t = s\right]

Action-value function:

Q^\pi(s, a) = \mathbb{E}_\pi\left[R_t + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\right]

Bellman optimality equation:

Q^*(s, a) = \mathbb{E}\left[R + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]

Deep Q-Network (DQN) for bidding:

Approximate Q^*(s, a) with a neural network Q_\theta(s, a):

\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]

where \theta^- denotes the target network parameters (updated periodically for stability).
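For intuition, here is a tabular stand-in for the DQN: Q-learning over coarse (budget, time) buckets and the discrete bid multipliers from the MDP diagram. Bucket counts and multiplier values are illustrative:

```python
import numpy as np

class BidQLearner:
    """Tabular Q-learning over (budget bucket, time bucket) states and
    discrete bid-multiplier actions -- a toy stand-in for a DQN."""

    def __init__(self, n_budget=10, n_time=10,
                 multipliers=(0.5, 0.75, 1.0, 1.25, 1.5),
                 alpha=0.1, gamma=0.99, eps=0.1):
        self.q = np.zeros((n_budget, n_time, len(multipliers)))
        self.multipliers = multipliers
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state, rng):
        """Epsilon-greedy action selection; returns an index into multipliers."""
        if rng.random() < self.eps:
            return int(rng.integers(len(self.multipliers)))
        return int(np.argmax(self.q[state]))

    def update(self, state, action, reward, next_state, done):
        """One Bellman backup: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
        target = reward if done else reward + self.gamma * self.q[next_state].max()
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

A DQN replaces the table with Q_\theta and the backup with the squared-error loss above, but the Bellman target r + \gamma \max_{a'} Q(s', a') is the same.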

Policy Gradient Methods:

For continuous bid spaces, policy gradient methods work better than Q-learning.

Policy parametrization:

\pi_\theta(a | s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)

Policy gradient theorem:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s, a)\right]

Actor-Critic for bidding:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ACTOR-CRITIC BIDDING AGENT                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         ┌─────────────────┐                             │
│                         │      State      │                             │
│                         │   (features)    │                             │
│                         └────────┬────────┘                             │
│                                  │                                       │
│                    ┌─────────────┴─────────────┐                        │
│                    │                           │                         │
│                    ▼                           ▼                         │
│           ┌────────────────┐         ┌────────────────┐                 │
│           │     ACTOR      │         │    CRITIC      │                 │
│           │   π_θ(a|s)     │         │    V_φ(s)      │                 │
│           │               │         │                │                 │
│           │  Policy Net    │         │  Value Net     │                 │
│           └───────┬────────┘         └───────┬────────┘                 │
│                   │                          │                           │
│                   ▼                          ▼                           │
│              ┌─────────┐              ┌───────────┐                      │
│              │  Bid    │              │ Baseline  │                      │
│              │ Action  │              │  V(s)     │                      │
│              └────┬────┘              └─────┬─────┘                      │
│                   │                         │                            │
│                   ▼                         │                            │
│              ┌─────────┐                    │                            │
│              │ Auction │                    │                            │
│              │ Result  │                    │                            │
│              └────┬────┘                    │                            │
│                   │                         │                            │
│                   ▼                         ▼                            │
│              ┌─────────────────────────────────┐                        │
│              │  Advantage: A = R + γV(s') - V(s)│                       │
│              └─────────────────────────────────┘                        │
│                              │                                           │
│              ┌───────────────┴───────────────┐                          │
│              │                               │                           │
│              ▼                               ▼                           │
│    Actor update:                   Critic update:                       │
│    θ ← θ + α∇log π(a|s)·A         φ ← φ - β∇(V(s) - target)²          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
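The update rules in the diagram can be sketched in miniature. Below is a toy one-step advantage actor-critic update for a Gaussian bid policy with a linear mean and fixed variance; the hyperparameters and the linear parametrization are illustrative assumptions (production bidders use neural networks and batched training):

```python
import random

# Toy advantage actor-critic step for a Gaussian bid policy (illustrative
# sketch: linear actor mean, linear critic, fixed policy std-dev).
SIGMA = 0.5                 # fixed policy standard deviation (assumed)
ALPHA, BETA = 0.01, 0.05    # actor / critic learning rates (assumed)
GAMMA = 0.95                # discount factor

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sample_bid(theta, s, rng):
    """Actor: bid ~ N(theta . s, SIGMA^2)."""
    return rng.gauss(dot(theta, s), SIGMA)

def update(theta, phi, s, bid, reward, s_next):
    """One step: advantage A = R + gamma*V(s') - V(s), then both updates."""
    advantage = reward + GAMMA * dot(phi, s_next) - dot(phi, s)
    # grad_mu log N(bid | mu, sigma^2) = (bid - mu) / sigma^2; mu = theta . s,
    # so the gradient w.r.t. theta_i picks up a factor s_i (chain rule).
    coef = (bid - dot(theta, s)) / SIGMA ** 2
    theta = [t + ALPHA * coef * si * advantage for t, si in zip(theta, s)]
    # Critic: gradient step reducing (V(s) - target)^2
    phi = [p + BETA * advantage * si for p, si in zip(phi, s)]
    return theta, phi

# One simulated auction round
rng = random.Random(0)
theta, phi = [0.0, 0.0], [0.0, 0.0]
s, s_next = [1.0, 0.3], [1.0, 0.1]
bid = sample_bid(theta, s, rng)
reward = 1.0 if bid > 0.2 else 0.0   # stand-in for the auction outcome
theta, phi = update(theta, phi, s, bid, reward, s_next)
```

Note the critic appears only through the advantage; it never chooses bids, it just reduces the variance of the actor's gradient.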

Handling Budget Constraints (Constrained RL):

Budget constraints make this a Constrained MDP (CMDP):

\max_\pi \mathbb{E}\left[\sum_t R_t\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t c_t\right] \leq B

Lagrangian relaxation:

\mathcal{L}(\pi, \lambda) = \mathbb{E}\left[\sum_t R_t\right] - \lambda \left(\mathbb{E}\left[\sum_t c_t\right] - B\right)

Solve via dual gradient descent: update \pi to maximize \mathcal{L}, update \lambda to minimize it.

Practical constraint handling:

  1. Penalty shaping: Add -\alpha \cdot \max(0, \text{spend} - \text{budget}) to reward
  2. Budget as state: Include remaining budget in state, learn budget-aware policy
  3. Post-hoc projection: Clip actions to respect constraints
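A minimal sketch of the dual approach: shade bids by 1/(1+λ) and ascend on λ until expected spend meets the budget. The second-price cost model and the shading form below are standard simplifying assumptions, not the only choice:

```python
# Toy dual-gradient budget pacing (assumptions: second-price cost model,
# bids shaded as value / (1 + lambda), fixed per-period value/cost samples).
def pace_bids(values, costs_if_win, budget, eta=0.05, rounds=200):
    """Find a shading multiplier lambda so expected spend meets budget."""
    lam = 0.0
    for _ in range(rounds):
        spend = 0.0
        for v, c in zip(values, costs_if_win):
            bid = v / (1.0 + lam)       # shaded bid
            if bid >= c:                # win when the bid clears the price
                spend += c              # second-price: pay the clearing price
        # Dual ascent on lambda: raise it when overspending, lower otherwise
        lam = max(0.0, lam + eta * (spend - budget) / max(budget, 1e-9))
    return lam
```

When the budget is slack, λ stays at zero and bids equal values (truthful bidding); as the budget tightens, λ grows and bids shade down uniformly.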

Offline RL for Bidding:

Online RL exploration can be costly (bad bids lose money). Offline RL learns from historical auction logs.

Challenge: Distribution shift—learned policy may choose bids never seen in data.

Conservative Q-Learning (CQL):

\mathcal{L}_{CQL}(\theta) = \mathcal{L}_{Q}(\theta) + \alpha \cdot \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp(Q_\theta(s,a)) - \mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]\right]

The penalty term discourages high Q-values for out-of-distribution actions.

Multi-Agent Considerations:

All bidders are simultaneously learning and adapting, creating a multi-agent RL problem.

Challenges:

  • Non-stationarity: Other bidders' policies change over time
  • Partial observability: Can't see competitors' states or strategies
  • Credit assignment: Win/loss depends on others' bids

Approaches:

  1. Opponent modeling: Estimate competitors' bidding strategies
  2. Robust RL: Optimize for worst-case competitor behavior
  3. Mean-field approximation: Model aggregate competition as a distribution
  4. Regret minimization: Guarantee no-regret against arbitrary competitors

Bid Landscape Forecasting

To optimize bids, we need to understand the competitive landscape: what bids are needed to win at different rates?

Win rate function:

w(b) = P(\text{win auction} \mid \text{bid} = b)

This is typically modeled as:

  • Log-normal: w(b) = \Phi\left(\frac{\log b - \mu}{\sigma}\right)
  • Empirical: Learn from historical auction data

Optimal bidding with win rate:

\text{Expected profit} = w(b) \cdot (\text{value} - \text{cost}(b))

For second-price auctions, \text{cost}(b) \approx \mathbb{E}[\text{second price} \mid \text{win at } b].
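Under the log-normal win-rate model, the profit-maximizing bid can be found by a simple grid search. The 0.8·b cost proxy below is a made-up stand-in for the conditional second-price expectation; μ and σ would come from fitted auction logs:

```python
import math

# Bid landscape sketch: log-normal win-rate curve plus grid-search bidding.
def win_rate(b, mu=0.0, sigma=1.0):
    """w(b) = Phi((log b - mu) / sigma), the log-normal win-rate model."""
    z = (math.log(b) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF

def expected_profit(b, value, mu=0.0, sigma=1.0):
    # Crude cost proxy: assume expected payment ~ 0.8 * bid (illustrative)
    return win_rate(b, mu, sigma) * (value - 0.8 * b)

def best_bid(value, grid=None):
    """Maximize expected profit over a bid grid."""
    grid = grid or [0.05 * i for i in range(1, 100)]
    return max(grid, key=lambda b: expected_profit(b, value))
```

Higher bids win more often but earn less per win; the argmax sits where the marginal win probability stops paying for the marginal cost.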


Part VIII: Production Considerations

Latency Requirements

Ad systems have extreme latency requirements:

| Component | Budget |
|---|---|
| Total end-to-end | <100ms |
| Feature retrieval | <10ms |
| Model inference | <10ms |
| Ranking logic | <5ms |
| Network overhead | ~50ms |

Techniques for low-latency inference:

  1. Model distillation: Train small "student" model to mimic large "teacher"
  2. Quantization: INT8 or even INT4 inference
  3. Pruning: Remove unimportant weights
  4. Caching: Precompute user/item embeddings
  5. Cascade ranking: Cheap model filters 10K→100 candidates, expensive model ranks 100
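The cascade idea in point 5 fits in a few lines. This is a sketch; `cheap_score` and `expensive_score` stand in for the two models:

```python
# Two-stage cascade ranking: a cheap score prunes the candidate set,
# an expensive score ranks only the survivors.
def cascade_rank(candidates, cheap_score, expensive_score, keep=100, top=10):
    # Stage 1: cheap model over the full candidate set (e.g. 10K ads)
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    # Stage 2: expensive model only on the shortlist
    return sorted(shortlist, key=expensive_score, reverse=True)[:top]
```

The expensive model's cost is paid on `keep` items instead of the full set, which is what makes the <10ms inference budget feasible.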

Feature Store Architecture

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE STORE FOR AD SERVING                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OFFLINE PIPELINE (batch):                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Raw Data → Feature Engineering → Feature Store (offline)               │
│  (Spark/Flink)    (daily/hourly)     (Hive, S3)                         │
│                                                                          │
│  Features: User historical stats, item aggregates, long-term behavior   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ONLINE PIPELINE (real-time):                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  Events → Stream Processing → Feature Store (online)                    │
│  (Kafka)     (Flink)            (Redis, DynamoDB)                       │
│                                                                          │
│  Features: Real-time counts, recent clicks, session features            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SERVING PATH:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Bid Request → Feature Retrieval → Model Inference → Bid Response       │
│                     │                                                    │
│           ┌─────────┴─────────┐                                         │
│           ▼                   ▼                                          │
│      Online Store        Offline Store                                  │
│      (Redis: <1ms)       (preloaded cache)                              │
│                                                                          │
│  FEATURE FRESHNESS REQUIREMENTS:                                         │
│  ────────────────────────────────                                        │
│                                                                          │
│  • User embedding: Updated daily (offline OK)                           │
│  • User recent clicks: Updated real-time (online required)              │
│  • Ad historical CTR: Updated hourly (near-line)                        │
│  • Context features: Computed at request time                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
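The serving path above can be sketched with plain dicts standing in for the online store (Redis) and the preloaded offline cache; all keys and values are made up for illustration:

```python
# Feature retrieval sketch: merge offline-cached and online features,
# with online values taking precedence (they are fresher).
ONLINE_STORE = {"user:42:recent_clicks": 3}                     # Redis stand-in
OFFLINE_CACHE = {"user:42:embedding_ctr": 0.021,                # batch features
                 "ad:7:hist_ctr": 0.015}

def get_features(user_id, ad_id, context):
    feats = dict(context)  # request-time context features
    for store in (OFFLINE_CACHE, ONLINE_STORE):  # online overrides offline
        for key, val in store.items():
            if key.startswith(f"user:{user_id}:") or key.startswith(f"ad:{ad_id}:"):
                feats[key.split(":", 2)[2]] = val
    return feats
```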

Model Update and A/B Testing

Continuous training pipeline:

  1. Daily retraining: Incorporate yesterday's clicks/conversions
  2. Incremental updates: Online learning on streaming data
  3. Shadow deployment: New model runs alongside production, compare metrics
  4. Gradual rollout: 1% → 5% → 20% → 50% → 100% traffic

A/B testing considerations:

  • Interference: Users in treatment may affect control (competition for same inventory)
  • Delayed conversions: Need to wait days/weeks for full conversion data
  • Novelty effects: New models may appear better initially due to exploration
  • Metric selection: CTR? Revenue? Long-term user satisfaction?

Part IX: Advanced Topics

Delayed Feedback Modeling

Conversions often happen hours or days after clicks. How do we train when labels are incomplete?

Approaches:

  1. Attribution window: Only count conversions within X days of click
  2. Importance weighting: Weight older samples higher (more complete labels)
  3. Elapsed-time model: P(\text{conversion}) = f(\text{features}) \times g(\text{elapsed time})

Delayed feedback model (Chapelle, 2014):

P(\text{observed conversion by time } t) = P(\text{conversion}) \times P(\text{delay} \leq t)

Model both the conversion probability and delay distribution.
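A sketch of this factorization with an assumed exponential delay distribution. The second function gives the mixture likelihood of a not-yet-converted example (either it never converts, or it converted but the delay exceeds the elapsed time), which is what the training loss actually needs:

```python
import math

# Delayed-feedback factorization sketch with an exponential delay
# assumption: P(delay <= t) = 1 - exp(-lambda_d * t).
def p_observed(p_conv, lambda_d, t):
    """P(conversion observed by elapsed time t)."""
    return p_conv * (1.0 - math.exp(-lambda_d * t))

def p_not_yet_observed(p_conv, lambda_d, t):
    """Likelihood of a still-unlabeled example at elapsed time t:
    never converts, OR converts with delay greater than t."""
    return (1.0 - p_conv) + p_conv * math.exp(-lambda_d * t)
```

In training, positives contribute log p_observed and unlabeled examples contribute log p_not_yet_observed, so recent clicks are not wrongly treated as hard negatives.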

Fraud Detection

Click fraud costs advertisers billions annually. Detection approaches:

  1. Anomaly detection: Unusual click patterns, timing, sources
  2. Behavioral modeling: Bots have different behavior than humans
  3. IP/device fingerprinting: Identify fraudulent traffic sources
  4. Conversion modeling: Fraudulent clicks rarely convert

Privacy-Preserving Advertising

With increasing privacy regulations (GDPR, CCPA) and deprecation of third-party cookies:

  1. Federated learning: Train models without centralizing user data
  2. Differential privacy: Add noise to prevent individual identification
  3. On-device prediction: Run models locally on user devices
  4. Cohort-based targeting: Target groups, not individuals (Google's Topics API)

Part X: Generative AI and LLM-Powered Advertising

The emergence of Large Language Models is transforming advertising beyond traditional CTR prediction. GenAI impacts every stage of the advertising pipeline: creative generation, audience understanding, personalization, and optimization.

The GenAI Advertising Stack

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    GENAI IN ADVERTISING PIPELINE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL PIPELINE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Advertiser → Fixed Creative → Targeting Rules → CTR Model → User       │
│               (one ad copy)    (demographics)    (predict)              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENAI-ENHANCED PIPELINE:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  Advertiser → LLM Creative   → Semantic         → Neural      → User    │
│               Generation       Audience           Ranking               │
│               (1000s of        Understanding      + LLM                 │
│               variations)      (intent, context)  Personalization       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GENAI TOUCHPOINTS:                                                      │
│                                                                          │
│  1. CREATIVE GENERATION                                                  │
│     • Ad copy variations                                                │
│     • Headline optimization                                             │
│     • Image generation (DALL-E, Midjourney)                             │
│     • Video script generation                                           │
│                                                                          │
│  2. AUDIENCE UNDERSTANDING                                               │
│     • Intent classification from search queries                         │
│     • Semantic user profiling                                           │
│     • Contextual page understanding                                     │
│     • Conversation-based preference elicitation                         │
│                                                                          │
│  3. PERSONALIZATION                                                      │
│     • Dynamic ad copy adaptation                                        │
│     • Real-time message tailoring                                       │
│     • Conversational ad experiences                                     │
│                                                                          │
│  4. OPTIMIZATION                                                         │
│     • LLM-as-judge for ad quality                                       │
│     • Automated A/B test analysis                                       │
│     • Campaign strategy recommendations                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

LLM-Powered Creative Generation

Traditional advertising requires human copywriters to create ad variations. LLMs can generate thousands of variations automatically, enabling true personalization at scale.

The Creative Generation Problem:

Given:

  • Product/service description
  • Brand guidelines and tone
  • Target audience segment
  • Platform constraints (character limits, format)

Generate: Optimized ad copy that maximizes engagement

Multi-Armed Bandit for Creative Selection:

With LLM-generated variations (potentially thousands), we need efficient exploration to find winners without wasting budget on poor performers. The Upper Confidence Bound (UCB) algorithm balances exploitation (showing best-performing creatives) with exploration (testing uncertain ones):

A_t = \arg\max_a \left[\hat{\mu}_a + c\sqrt{\frac{\ln t}{N_a}}\right]

where:

  • \hat{\mu}_a = estimated CTR for creative a (exploitation term)
  • N_a = number of times creative a has been shown
  • c = exploration constant (typically 1.0-2.0)
  • \sqrt{\frac{\ln t}{N_a}} = exploration bonus (decreases as we test creative a more)

Intuition: The exploration bonus is large when N_a is small (we haven't tested this creative much, so we're uncertain). As we show the creative more, the bonus shrinks and the algorithm relies more on observed performance. This prevents premature convergence to locally optimal creatives while still exploiting known winners.

Practical considerations:

  • Cold start: New LLM-generated creatives start with high exploration bonus
  • Batch updates: In practice, update \hat{\mu}_a periodically (hourly/daily) rather than per-impression
  • Contextual bandits: Extend to \hat{\mu}_a(x) where x is user context—different creatives may perform better for different users
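The UCB rule above in runnable form (pure Python; the per-creative CTRs are inflated well beyond realistic values so the simulation separates the arms quickly):

```python
import math
import random

# UCB creative selection sketch with simulated clicks.
def ucb_pick(clicks, shows, t, c=1.5):
    """Return the creative index maximizing mu_hat + c*sqrt(ln t / N)."""
    best, best_score = 0, float("-inf")
    for a in range(len(shows)):
        if shows[a] == 0:
            return a                      # untested arm: always explore first
        score = clicks[a] / shows[a] + c * math.sqrt(math.log(t) / shows[a])
        if score > best_score:
            best, best_score = a, score
    return best

rng = random.Random(1)
true_ctr = [0.2, 0.5, 0.3]   # made-up "true" CTRs, exaggerated for the demo
clicks, shows = [0, 0, 0], [0, 0, 0]
for t in range(1, 5001):
    a = ucb_pick(clicks, shows, t)
    shows[a] += 1
    clicks[a] += 1 if rng.random() < true_ctr[a] else 0
```

After a few thousand impressions, the best creative (index 1) dominates the impression counts while the others retain a small exploration share.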
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM CREATIVE GENERATION PIPELINE                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT:                                                                  │
│  ──────                                                                  │
│  Product: "Running shoes with carbon fiber plate"                       │
│  Brand: Nike                                                            │
│  Audience: Competitive runners, 25-40                                   │
│  Platform: Google Search (30 char headline, 90 char description)        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LLM GENERATION (with constraints):                                      │
│  ──────────────────────────────────                                      │
│                                                                          │
│  Variation 1 (Performance focus):                                       │
│    Headline: "Shave Minutes Off Your PR"                                │
│    Description: "Carbon-plated running shoes engineered for speed.     │
│                  Free shipping on orders over $100."                    │
│                                                                          │
│  Variation 2 (Technology focus):                                        │
│    Headline: "Carbon Fiber Technology"                                  │
│    Description: "Experience the same tech as Olympic marathoners.      │
│                  Shop the new collection today."                        │
│                                                                          │
│  Variation 3 (Social proof):                                            │
│    Headline: "Worn by World Champions"                                  │
│    Description: "Join 100,000+ runners who improved their times.       │
│                  Rated 4.9 stars by elite athletes."                    │
│                                                                          │
│  Variation 4 (Urgency):                                                 │
│    Headline: "Limited Edition Colors"                                   │
│    Description: "Race-day ready carbon shoes. Only 500 pairs left.     │
│                  Order now for guaranteed delivery."                    │
│                                                                          │
│  ... (100s more variations)                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SELECTION:                                                              │
│  ──────────                                                              │
│                                                                          │
│  1. LLM-as-Judge filters low-quality/off-brand variations              │
│  2. Multi-armed bandit explores promising variations                    │
│  3. CTR model provides exploitation signal                              │
│  4. Best performers get more budget                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Quality Control with LLM-as-Judge:

Not all generated creatives are good. Use a judge model to filter before deployment:

\text{Quality}(c) = \text{LLM}_{\text{judge}}(c, \text{brand\_guidelines}, \text{policy})

Multi-dimensional scoring (each 1-5 scale):

| Dimension | What It Measures | Failure Examples |
|---|---|---|
| Brand alignment | Tone, voice, values match brand | Luxury brand with casual slang |
| Policy compliance | No prohibited claims | Unsubstantiated health claims |
| Clarity | Message is understandable | Confusing or ambiguous copy |
| Persuasiveness | Compelling call-to-action | Weak or missing CTA |
| Grammar | Correct language | Typos, awkward phrasing |
| Factual accuracy | Claims are true | Wrong prices, features |

Composite score:

\text{Score}_{\text{final}} = \min(\text{Score}_{\text{policy}}, \text{Score}_{\text{factual}}) \times \text{mean}(\text{other scores})

The \min ensures policy/factual violations are hard blockers regardless of other qualities.
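Once the per-dimension judge scores are collected, the composite is a one-liner (the key names below are illustrative):

```python
# Composite creative-quality score: policy and factual scores act as a
# hard gate via min(); the remaining dimensions are averaged.
def composite_score(scores):
    """scores: dict mapping dimension name -> 1-5 judge rating."""
    gate = min(scores["policy"], scores["factual"])
    soft = [v for k, v in scores.items() if k not in ("policy", "factual")]
    return gate * sum(soft) / len(soft)
```

A creative with perfect brand and clarity scores but a policy score of 1 still lands near the bottom of the ranking, which is the intended behavior.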

Implementation options:

  1. Prompt-based: Describe scoring criteria in prompt, ask LLM to rate
  2. Fine-tuned judge: Train classifier on human-labeled creative quality data
  3. Ensemble: Multiple judges vote, require consensus for approval
  4. Hybrid: LLM pre-filter + human review for borderline cases

Threshold tuning:

  • High threshold (4.5+): Fewer creatives pass, higher average quality, less variety
  • Low threshold (3.5+): More creatives pass, more variety, some quality risk
  • Adaptive threshold: Start high for new campaigns, lower as you build trust

Semantic Audience Understanding

Traditional targeting uses demographic segments (age, gender, location). LLMs enable semantic targeting based on intent and context.

Intent Understanding from Search Queries:

\text{Intent}(q) = \text{LLM}_{\text{classifier}}(q) \in \{\text{informational}, \text{navigational}, \text{transactional}, \text{commercial}\}

Beyond simple classification, LLMs extract nuanced intent:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC INTENT EXTRACTION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Query: "best running shoes for marathon under $200"                    │
│                                                                          │
│  TRADITIONAL KEYWORD MATCHING:                                           │
│  ─────────────────────────────                                           │
│  Keywords: [running, shoes, marathon, $200]                             │
│  Match ads containing these keywords                                    │
│                                                                          │
│  LLM SEMANTIC UNDERSTANDING:                                             │
│  ───────────────────────────                                             │
│  {                                                                       │
│    "intent": "transactional",                                           │
│    "product_category": "performance_running_shoes",                     │
│    "use_case": "marathon_racing",                                       │
│    "experience_level": "intermediate_to_advanced",                      │
│    "price_sensitivity": "high",                                         │
│    "price_ceiling": 200,                                                │
│    "decision_stage": "comparison_shopping",                             │
│    "implicit_needs": [                                                  │
│      "durability_for_long_distance",                                    │
│      "energy_return",                                                   │
│      "lightweight"                                                      │
│    ],                                                                    │
│    "likely_follow_up_interests": [                                      │
│      "running_socks",                                                   │
│      "hydration_gear",                                                  │
│      "marathon_training_plans"                                          │
│    ]                                                                     │
│  }                                                                       │
│                                                                          │
│  TARGETING IMPLICATIONS:                                                 │
│  ───────────────────────                                                 │
│  • Show mid-tier shoes (not budget, not premium)                        │
│  • Emphasize marathon-specific features                                 │
│  • Highlight value proposition (quality at price point)                 │
│  • Cross-sell complementary gear                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

User Embedding from Behavior + LLM:

Combine traditional collaborative filtering embeddings with LLM-derived semantic embeddings:

\mathbf{u}_{\text{combined}} = \alpha \cdot \mathbf{u}_{\text{CF}} + (1-\alpha) \cdot \mathbf{u}_{\text{LLM}}

where:

  • \mathbf{u}_{\text{CF}} = embedding from click/purchase history (DIN/DIEN style)
  • \mathbf{u}_{\text{LLM}} = embedding from LLM understanding of user's content consumption
  • \alpha = blending weight (typically 0.3-0.7, tuned via validation)

When to use different \alpha values:

| Scenario | Recommended \alpha | Rationale |
|---|---|---|
| User has rich click history | 0.7-0.8 | Trust behavioral signals |
| New/cold-start user | 0.2-0.3 | Rely on semantic understanding |
| High-intent queries | 0.4-0.5 | Balance both signals |
| Content-heavy domains (news, articles) | 0.3-0.4 | LLM understands content better |

Implementation approaches:

  1. Late fusion: Compute \mathbf{u}_{\text{CF}} and \mathbf{u}_{\text{LLM}} separately, combine at serving time
  2. Early fusion: Concatenate behavioral and semantic features, let model learn combination
  3. Learned fusion: Train a small network to predict optimal \alpha per user
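Late fusion reduces to a weighted vector sum. The α heuristic below is a crude stand-in for the table's guidance; the click-count threshold is an assumption:

```python
# Late-fusion sketch: blend a behavioral (CF) and a semantic (LLM) user
# embedding with weight alpha. Plain lists stand in for embedding vectors.
def blend(u_cf, u_llm, alpha):
    """u_combined = alpha * u_cf + (1 - alpha) * u_llm, elementwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(u_cf, u_llm)]

def pick_alpha(num_clicks, threshold=50):
    # Heuristic echoing the table: trust behavior when history is rich,
    # fall back toward the semantic embedding for cold-start users.
    return 0.75 if num_clicks >= threshold else 0.25
```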

Contextual Page Understanding:

Instead of simple page categorization (sports, news, entertainment), LLMs understand page content semantically:

\text{PageContext}(p) = \text{LLM}_{\text{encoder}}(\text{content}(p))

How it works:

  1. Extract page text (title, headings, body, metadata)
  2. Pass through LLM encoder (e.g., fine-tuned BERT, or frozen GPT embeddings)
  3. Resulting embedding captures semantic meaning, not just category

This enables:

  • Brand safety: Understand if content discusses sensitive topics (violence, controversy) even without keyword matches
  • Contextual relevance: Match "marathon training tips" article to running shoe ads even if "shoes" isn't mentioned
  • Sentiment alignment: Place upbeat ads on positive content, avoid juxtaposition issues
  • Topic nuance: Distinguish "Apple (company)" from "apple (fruit)" for targeting

Dynamic Personalization at Serving Time

The most transformative application: personalize ad creative in real-time based on user context.

Personalization Hierarchy:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PERSONALIZATION DEPTH LEVELS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 0: No Personalization                                            │
│  ──────────────────────────────                                          │
│  Same ad shown to everyone                                              │
│  "Buy Nike Running Shoes Today"                                         │
│                                                                          │
│  LEVEL 1: Segment-Based (Traditional)                                   │
│  ─────────────────────────────────────                                   │
│  Different ads per demographic segment                                  │
│  Male 25-34: "Dominate Your Next Race"                                  │
│  Female 25-34: "Run Your Personal Best"                                 │
│                                                                          │
│  LEVEL 2: Behavioral (DIN/DIEN era)                                     │
│  ──────────────────────────────────                                      │
│  Ad selection based on user history                                     │
│  User viewed marathons → Show marathon shoe ads                         │
│  User viewed trails → Show trail shoe ads                               │
│                                                                          │
│  LEVEL 3: LLM Dynamic Personalization                                   │
│  ─────────────────────────────────────                                   │
│  Ad CONTENT adapted in real-time                                        │
│                                                                          │
│  User A (searched "Boston Marathon qualifying times"):                  │
│  "Qualify for Boston: Shoes Trusted by 50,000+ BQ Runners"             │
│                                                                          │
│  User B (searched "couch to 5k beginner"):                              │
│  "Start Your Running Journey: Comfort-First Design for New Runners"    │
│                                                                          │
│  User C (browsing running injury articles):                             │
│  "Run Pain-Free: Engineered Support for Injury Prevention"             │
│                                                                          │
│  Same product, completely different messaging!                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Real-Time Personalization Architecture:

\text{Ad}_{\text{personalized}} = \text{LLM}(\text{template}, \text{user\_context}, \text{product\_info})

But LLM inference is too slow for real-time bidding (<50ms). Solutions:

  1. Pre-computation: Generate top-K personalized variants offline, select at serving time
  2. Template + Slot Filling: LLM generates templates, fast model fills slots
  3. Cached Personas: Pre-compute ads for user personas, map users to personas
  4. Speculative Generation: Generate during page load, before ad request
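The cached-personas option (3) can be sketched as a nearest-persona lookup into a pre-generated creative cache. The persona embeddings, names, and ad copy below are made up for illustration:

```python
# Serving-time sketch: map the user embedding to its nearest persona,
# then look up the pre-generated creative (no LLM in the critical path).
PERSONAS = {"beginner_runner": [1.0, 0.0], "marathon_focused": [0.0, 1.0]}
CREATIVE_CACHE = {
    ("shoe_x", "beginner_runner"): "Start Your Running Journey",
    ("shoe_x", "marathon_focused"): "Race-Day Ready Carbon Shoes",
}

def nearest_persona(user_emb):
    def dist(name):  # squared Euclidean distance is enough for argmin
        return sum((a - b) ** 2 for a, b in zip(user_emb, PERSONAS[name]))
    return min(PERSONAS, key=dist)

def serve_ad(product, user_emb):
    return CREATIVE_CACHE[(product, nearest_persona(user_emb))]
```

Both steps are dictionary lookups plus a small distance computation, which is how the approach stays within a real-time bidding latency budget.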
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REAL-TIME PERSONALIZATION ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OFFLINE (Batch):                                                        │
│  ────────────────                                                        │
│                                                                          │
│  For each (product, persona) pair:                                      │
│    LLM generates N ad variations                                        │
│    Store in Creative Cache                                              │
│                                                                          │
│  Personas: {beginner_runner, competitive_runner, injury_recovery,       │
│             casual_fitness, marathon_focused, trail_enthusiast, ...}    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ONLINE (Real-time):                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. User request arrives                                                │
│           │                                                              │
│           ▼                                                              │
│  2. ┌─────────────────┐                                                 │
│     │ Persona Classifier│  (Fast: embedding similarity, ~1ms)           │
│     │ user → persona    │                                               │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  3. ┌─────────────────┐                                                 │
│     │ Creative Cache   │  (Lookup: product × persona, ~1ms)             │
│     │ Lookup           │                                                │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  4. ┌─────────────────┐                                                 │
│     │ CTR Model        │  (Score personalized creative, ~5ms)           │
│     │ Prediction       │                                                │
│     └────────┬──────────┘                                               │
│              │                                                           │
│              ▼                                                           │
│  5. Return personalized ad                                              │
│                                                                          │
│  Total latency: <10ms (no LLM inference in critical path!)             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
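The online path above can be sketched in a few lines. This is a toy illustration, not production code: the persona names, embedding values, and cache contents are invented, and a real system would use a trained classifier and a distributed cache rather than in-memory dicts.

```python
import numpy as np

# Hypothetical persona embeddings, precomputed offline (values illustrative).
PERSONA_EMBEDDINGS = {
    "beginner_runner":    np.array([0.9, 0.1, 0.0]),
    "competitive_runner": np.array([0.1, 0.9, 0.2]),
    "trail_enthusiast":   np.array([0.0, 0.3, 0.9]),
}

# Creative cache: (product_id, persona) -> pre-generated ad copy.
CREATIVE_CACHE = {
    ("pegasus41", "beginner_runner"):    "Cushioned comfort for your first miles.",
    ("pegasus41", "competitive_runner"): "Train hard. Recover fast.",
    ("pegasus41", "trail_enthusiast"):   "Grip that goes off-road.",
}

def classify_persona(user_embedding: np.ndarray) -> str:
    """Fast persona assignment via cosine similarity (~1ms in practice)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(PERSONA_EMBEDDINGS, key=lambda p: cos(user_embedding, PERSONA_EMBEDDINGS[p]))

def serve_ad(user_embedding: np.ndarray, product_id: str) -> str:
    """Online path: no LLM call, just classification + cache lookup."""
    persona = classify_persona(user_embedding)
    return CREATIVE_CACHE[(product_id, persona)]

# A user close to the beginner_runner embedding gets the beginner creative.
print(serve_ad(np.array([0.8, 0.2, 0.1]), "pegasus41"))
```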

LLM-Enhanced CTR Prediction

Beyond creative generation, LLMs can directly improve CTR prediction models.

Feature Augmentation with LLM Embeddings:

Traditional CTR models use sparse ID features. Add dense semantic features from LLMs:

\hat{y} = f(\mathbf{x}_{\text{sparse}}, \mathbf{e}_{\text{user}}^{\text{LLM}}, \mathbf{e}_{\text{ad}}^{\text{LLM}}, \mathbf{e}_{\text{context}}^{\text{LLM}})

where \mathbf{e}^{\text{LLM}} are embeddings from an LLM encoder.

Benefits:

  • Cold-start handling: New ads have semantic embeddings even without click history
  • Generalization: Similar products share similar embeddings
  • Cross-domain transfer: Knowledge transfers across ad categories
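A minimal sketch of this feature augmentation, with a logistic head standing in for the deep CTR model f. The ID vocabulary size, embedding dimensions, and toy LLM embeddings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ctr_features(sparse_ids, llm_embeddings, id_vocab_size=1000, id_dim=8):
    """Concatenate sparse-ID embeddings with frozen LLM embeddings.

    sparse_ids: list of integer feature IDs (user_id bucket, ad_id bucket, ...).
    llm_embeddings: dense vectors from an LLM encoder (assumed precomputed).
    """
    # Hypothetical embedding table for sparse IDs (learned in a real model).
    id_table = rng.normal(0, 0.1, size=(id_vocab_size, id_dim))
    id_part = np.concatenate([id_table[i] for i in sparse_ids])
    llm_part = np.concatenate(llm_embeddings)
    return np.concatenate([id_part, llm_part])

def predict_ctr(x, weights, bias=0.0):
    """Logistic regression head standing in for the deep model f(.)."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

x = ctr_features(
    sparse_ids=[42, 7],                       # e.g. user_id bucket, ad_id bucket
    llm_embeddings=[np.ones(4), np.zeros(4)]  # e_user^LLM, e_ad^LLM (toy vectors)
)
w = rng.normal(0, 0.1, size=x.shape)
print(f"predicted CTR: {predict_ctr(x, w):.3f}")
```

The key point the sketch captures: a brand-new ad with no click history still contributes a meaningful dense `e_ad^LLM`, so the model is never scoring from pure cold start.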

Semantic Similarity for Candidate Retrieval:

Use LLM embeddings for initial candidate retrieval:

\text{Candidates} = \text{ANN}(\mathbf{e}_{\text{query}}^{\text{LLM}}, \{\mathbf{e}_{\text{ad}}^{\text{LLM}}\})

Then apply traditional CTR models for final ranking.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM-ENHANCED TWO-TOWER RETRIEVAL                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  QUERY TOWER:                          AD TOWER:                        │
│  ─────────────                          ─────────                        │
│                                                                          │
│  User Query                             Ad Content                       │
│  + User History                         + Ad Metadata                    │
│       │                                      │                           │
│       ▼                                      ▼                           │
│  ┌─────────────┐                       ┌─────────────┐                  │
│  │ LLM Encoder │                       │ LLM Encoder │                  │
│  │ (shared)    │                       │ (shared)    │                  │
│  └──────┬──────┘                       └──────┬──────┘                  │
│         │                                     │                          │
│         ▼                                     ▼                          │
│  ┌─────────────┐                       ┌─────────────┐                  │
│  │ Projection  │                       │ Projection  │                  │
│  │ Layer       │                       │ Layer       │                  │
│  └──────┬──────┘                       └──────┬──────┘                  │
│         │                                     │                          │
│         ▼                                     ▼                          │
│      e_query                               e_ad                          │
│         │                                     │                          │
│         └──────────────┬──────────────────────┘                          │
│                        │                                                 │
│                        ▼                                                 │
│              score = <e_query, e_ad>                                    │
│                        │                                                 │
│                        ▼                                                 │
│              Top-K candidates → CTR ranking                             │
│                                                                          │
│  LLM BENEFITS:                                                           │
│  ─────────────                                                           │
│  • "marathon training" query matches "26.2 mile race shoes" ad          │
│  • "gift for runner dad" matches "men's premium running shoes"          │
│  • Semantic understanding beyond keyword matching                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
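A stripped-down version of the scoring step at the bottom of the diagram. Exact dot-product search replaces the ANN index purely for readability (production systems would use an approximate index such as FAISS or HNSW), and the embeddings are toy values.

```python
import numpy as np

def retrieve_top_k(e_query: np.ndarray, ad_embeddings: np.ndarray, k: int = 2):
    """Exact dot-product retrieval; production swaps in an ANN index."""
    scores = ad_embeddings @ e_query   # <e_query, e_ad> for every ad
    top = np.argsort(-scores)[:k]      # highest-scoring ad indices
    return top.tolist(), scores[top].tolist()

# Toy embeddings: semantically related query/ad pairs have high dot products.
e_query = np.array([1.0, 0.0, 0.5])    # "marathon training"
ads = np.array([
    [0.9, 0.1, 0.5],   # "26.2 mile race shoes"  -> similar direction
    [0.0, 1.0, 0.0],   # "yoga mats"             -> orthogonal
    [0.8, 0.0, 0.6],   # "long-run trainers"     -> similar direction
])
indices, scores = retrieve_top_k(e_query, ads, k=2)
print(indices)  # the two running-related ads: [0, 2]
```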

Conversational Advertising

LLMs enable a new paradigm: conversational ad experiences where users interact with ads through dialogue.

Use Cases:

  1. Product Discovery: "I need shoes for my first marathon. What do you recommend?"
  2. Objection Handling: "Why are these so expensive?" → Explain value proposition
  3. Personalized Recommendations: Multi-turn dialogue to understand needs
  4. Purchase Assistance: Guide through size selection, shipping options

Conversational Ad Architecture:

\text{Response}_t = \text{LLM}(\text{Product\_Info}, \text{Dialogue}_{1:t-1}, \text{User\_Message}_t, \text{Brand\_Guidelines})

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONVERSATIONAL AD EXPERIENCE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Traditional Ad:                                                         │
│  ───────────────                                                         │
│  ┌─────────────────────────────────────────┐                            │
│  │  Nike ZoomX Vaporfly - $250            │                            │
│  │  The fastest marathon shoe ever.       │                            │
│  │  [Shop Now]                            │                            │
│  └─────────────────────────────────────────┘                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Conversational Ad:                                                      │
│  ──────────────────                                                      │
│  ┌─────────────────────────────────────────┐                            │
│  │  Nike Running Assistant                │                            │
│  │                                        │                            │
│  │  User: "Is this shoe good for a       │                            │
│  │         beginner marathon runner?"     │                            │
│  │                                        │                            │
│  │  Nike: "Great question! The Vaporfly │                            │
│  │   is our elite racing shoe, designed  │                            │
│  │   for experienced runners chasing PRs.│                            │
│  │   For your first marathon, I'd        │                            │
│  │   recommend the Pegasus 41 - it's     │                            │
│  │   more cushioned for training miles   │                            │
│  │   and race day comfort. Would you     │                            │
│  │   like to see it?"                    │                            │
│  │                                        │                            │
│  │  [See Pegasus 41] [Tell me more]      │                            │
│  │  [Compare both]                       │                            │
│  └─────────────────────────────────────────┘                            │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│  • Higher engagement (dialogue > static ad)                             │
│  • Better matching (understand actual needs)                            │
│  • Trust building (honest recommendations)                              │
│  • Data collection (explicit preference signals)                        │
│                                                                          │
│  CHALLENGES:                                                             │
│  ───────────                                                             │
│  • Latency (LLM inference per turn)                                     │
│  • Brand safety (LLM may say wrong things)                              │
│  • Cost (compute per conversation)                                      │
│  • Measurement (how to attribute conversions)                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Measurement and Attribution for Conversational Ads:

Conversational ads create new measurement challenges—how do you attribute value across a multi-turn dialogue?

\text{ConversationValue} = \sum_{t=1}^{T} \gamma^{T-t} \cdot r_t

where:

  • r_t = reward signal at turn t (click, add-to-cart, purchase)
  • \gamma = discount factor (earlier turns get less credit)
  • T = total conversation turns
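The discounted sum translates directly to code; the reward values below are illustrative.

```python
def conversation_value(rewards, gamma=0.9):
    """sum_{t=1}^{T} gamma^(T-t) * r_t : later turns get more credit."""
    T = len(rewards)
    return sum(gamma ** (T - t) * r for t, r in enumerate(rewards, start=1))

# Turn rewards: click (0.1), add-to-cart (0.5), purchase (1.0)
rewards = [0.1, 0.5, 1.0]
print(round(conversation_value(rewards, gamma=0.9), 3))  # 0.081 + 0.45 + 1.0 = 1.531
```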

Attribution approaches:

| Model | Description | When to Use |
|-------|-------------|-------------|
| Last-touch | Credit final turn before conversion | Simple, but ignores discovery value |
| First-touch | Credit conversation initiation | Values engagement, ignores persuasion |
| Linear | Equal credit to all turns | Fair, but doesn't capture turn importance |
| Position-based | 40% first, 40% last, 20% middle | Balances discovery and conversion |
| Data-driven | ML model learns credit assignment | Best accuracy, requires volume |
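As a concrete example, the position-based (40/40/20) scheme can be implemented as a per-turn credit vector that sums to 1:

```python
def position_based_credit(num_turns, first=0.4, last=0.4):
    """Position-based attribution: per-turn credit summing to 1."""
    if num_turns == 1:
        return [1.0]
    if num_turns == 2:
        # No middle turns: split all credit between first and last.
        return [first / (first + last), last / (first + last)]
    middle = (1.0 - first - last) / (num_turns - 2)
    return [first] + [middle] * (num_turns - 2) + [last]

# First and last turns get 0.4 each; the remaining 20% splits across middle turns.
print(position_based_credit(5))
```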

Key metrics for conversational ads:

  • Engagement rate: % of users who respond to first message
  • Conversation depth: Average turns per conversation
  • Resolution rate: % of conversations ending in desired action
  • Deflection rate: % of users who abandon mid-conversation
  • Cost per conversation: Total LLM compute / conversations
  • Incremental lift: Conversion rate vs. static ad control group

LLM-Based Campaign Optimization

Beyond individual ads, LLMs can optimize entire campaigns.

Automated A/B Test Analysis:

\text{Insights} = \text{LLM}(\text{Test\_Results}, \text{Campaign\_Context}, \text{Historical\_Learnings})

LLMs can:

  • Identify statistically significant results (accounting for multiple comparisons)
  • Explain WHY certain variants won (not just that they won)
  • Suggest follow-up tests based on observed patterns
  • Detect Simpson's paradox and other statistical pitfalls
  • Identify segment-level winners that differ from overall winners

What makes LLM analysis different from traditional dashboards:

Traditional: "Variant B has 5% higher CTR with p<0.05"

LLM analysis: "Variant B outperformed because its 'limited time' messaging created urgency. However, this effect was concentrated in mobile users during evening hours—desktop users showed no significant difference. Consider: (1) testing urgency messaging specifically for mobile evening campaigns, (2) investigating why desktop users didn't respond (perhaps they need more product details before urgency appeals work)."

Budget Allocation Recommendations:

\text{Allocation} = \text{LLM}(\text{Campaign\_Performance}, \text{Market\_Conditions}, \text{Advertiser\_Goals})

LLMs analyze cross-channel performance and recommend budget shifts. Key capabilities:

  • Diminishing returns detection: "Search is hitting saturation—incremental CPA increasing. Consider shifting 20% to display prospecting."
  • Opportunity identification: "Competitor X reduced spend on 'running shoes' keywords—bid landscape favorable for expansion."
  • Goal alignment: "Current allocation optimizes for clicks, but your stated goal is conversions. Recommend shifting budget from awareness to consideration campaigns."
  • Seasonality anticipation: "Marathon season approaching—recommend 30% budget increase for running category starting week 8."
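Diminishing-returns detection, the first capability above, reduces to tracking marginal CPA across spend levels: when the incremental cost per extra conversion climbs steeply, the channel is saturating. A sketch with made-up numbers:

```python
def marginal_cpa(spend_points, conversion_points):
    """Incremental cost per acquisition between consecutive spend levels."""
    return [
        (s2 - s1) / (c2 - c1) if c2 > c1 else float("inf")
        for (s1, c1), (s2, c2) in zip(
            zip(spend_points, conversion_points),
            zip(spend_points[1:], conversion_points[1:]),
        )
    ]

# Conversions grow quickly at low spend, then saturate: marginal CPA spikes.
spend = [1000, 2000, 3000, 4000]
convs = [100, 180, 220, 230]
print(marginal_cpa(spend, convs))  # [12.5, 25.0, 100.0]
```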

Audience Expansion with LLM Reasoning:

Traditional lookalike audiences use statistical similarity. LLMs add semantic reasoning about WHY an audience works, enabling more thoughtful expansion.

Given a high-performing audience segment, LLM reasons about similar segments:

Code
High-performing segment: "Marathon runners who clicked on nutrition ads"

LLM reasoning:
"This audience responds well because they're health-conscious athletes
focused on performance optimization. Similar audiences might include:
1. Triathletes (similar endurance focus)
2. CrossFit enthusiasts (performance-oriented)
3. Cycling enthusiasts (endurance athletes)
4. Health app power users (quantified-self mindset)

Recommendation: Test expansion to triathlon audiences first,
as they have the closest intent profile."

Personalization Ethics and Guardrails

LLM-powered personalization raises important ethical considerations.

Risks:

  1. Manipulation: Hyper-personalized messaging could exploit psychological vulnerabilities
  2. Filter bubbles: Users only see ads reinforcing existing preferences
  3. Privacy: Deep personalization requires extensive data collection
  4. Deception: AI-generated content may mislead users about what's human vs. machine

Guardrails:

\text{Ad}_{\text{final}} = \text{Filter}(\text{Ad}_{\text{personalized}}, \text{Ethics\_Policy}, \text{Regulations})

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PERSONALIZATION GUARDRAILS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CONTENT FILTERS:                                                        │
│  ────────────────                                                        │
│  • No health claims without substantiation                              │
│  • No urgency manipulation ("Only 1 left!" when false)                  │
│  • No exploitation of negative emotions                                 │
│  • No discrimination based on protected characteristics                 │
│                                                                          │
│  TRANSPARENCY REQUIREMENTS:                                              │
│  ──────────────────────────                                              │
│  • Disclose AI-generated content                                        │
│  • Explain why ad was shown (ad preferences)                            │
│  • Allow users to opt out of personalization                            │
│                                                                          │
│  TECHNICAL CONTROLS:                                                     │
│  ───────────────────                                                     │
│  • LLM output classifiers for harmful content                           │
│  • Human review for new personalization strategies                      │
│  • A/B test ethics review board                                         │
│  • Audit trails for personalization decisions                           │
│                                                                          │
│  REGULATORY COMPLIANCE:                                                  │
│  ──────────────────────                                                  │
│  • GDPR: Data minimization, right to explanation                        │
│  • CCPA: Opt-out rights, disclosure requirements                        │
│  • FTC: Truth in advertising, endorsement disclosure                    │
│  • Industry self-regulation (NAI, DAA)                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
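As a first line of defense, the content filters above can be approximated by a pattern-based classifier. The patterns here are illustrative only; production systems layer ML output classifiers and human review on top of anything this simple.

```python
import re

# Illustrative policy: phrases the content filters would reject.
BANNED_PATTERNS = [
    r"only \d+ left",            # false-scarcity urgency
    r"cures?\b",                 # unsubstantiated health claims
    r"guaranteed weight loss",
]

def passes_guardrails(ad_text: str) -> bool:
    """Reject generated copy that matches any banned pattern (case-insensitive)."""
    return not any(re.search(p, ad_text, re.IGNORECASE) for p in BANNED_PATTERNS)

print(passes_guardrails("Only 1 left! Buy now!"))  # False
```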

The Future: Agentic Advertising

The next frontier: AI agents that autonomously manage advertising campaigns.

Agentic Capabilities:

  1. Autonomous Budget Management: Agent monitors performance and reallocates budget without human intervention
  2. Creative Evolution: Agent generates, tests, and iterates on ad creative continuously
  3. Competitive Response: Agent detects competitor actions and adjusts strategy
  4. Cross-Channel Orchestration: Agent coordinates messaging across search, social, display, email

Architecture:

\text{Action}_t = \text{Agent}(\text{State}_t, \text{Goals}, \text{Constraints})

where State includes: current performance, budget status, market conditions, competitive landscape.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AGENTIC ADVERTISING SYSTEM                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    ┌─────────────────────────┐                          │
│                    │    Advertising Agent    │                          │
│                    │    (LLM + Tools)        │                          │
│                    └───────────┬─────────────┘                          │
│                                │                                         │
│           ┌────────────────────┼────────────────────┐                   │
│           │                    │                    │                    │
│           ▼                    ▼                    ▼                    │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│    │  Creative   │     │   Budget    │     │  Audience   │             │
│    │  Generator  │     │  Optimizer  │     │  Expander   │             │
│    └─────────────┘     └─────────────┘     └─────────────┘             │
│           │                    │                    │                    │
│           ▼                    ▼                    ▼                    │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│    │  Ad        │     │  Bid        │     │  Targeting  │             │
│    │  Platform  │     │  Management │     │  Rules      │             │
│    │  APIs      │     │  APIs       │     │  APIs       │             │
│    └─────────────┘     └─────────────┘     └─────────────┘             │
│                                                                          │
│  AGENT WORKFLOW:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  1. Monitor: Continuously track campaign KPIs                           │
│  2. Analyze: Identify underperforming segments/creatives                │
│  3. Plan: Decide on optimization actions                                │
│  4. Execute: Make changes via platform APIs                             │
│  5. Learn: Update strategy based on results                             │
│                                                                          │
│  HUMAN OVERSIGHT:                                                        │
│  ────────────────                                                        │
│  • Budget limits (agent can't exceed authorized spend)                  │
│  • Approval gates (major strategy changes need human OK)                │
│  • Alert thresholds (unusual patterns trigger human review)             │
│  • Audit logs (all agent actions recorded)                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
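One iteration of the monitor-analyze-plan-execute loop, with an approval gate as a human-oversight guardrail. The campaign fields, thresholds, and reallocation rule are invented for illustration; a real agent would call platform APIs and an LLM planner rather than a hard-coded heuristic.

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    max_daily_spend: float     # agent cannot exceed authorized budget
    approval_threshold: float  # reallocations above this need human OK

def agent_step(state: dict, constraints: Constraints):
    """One monitor -> analyze -> plan -> execute iteration.

    Returns (action, needs_human_approval).
    """
    # Analyze: find the worst- and best-performing campaigns by CPA.
    worst = max(state["campaigns"], key=lambda c: c["cpa"])
    best = min(state["campaigns"], key=lambda c: c["cpa"])

    # Plan: shift 10% of the worst campaign's budget to the best one.
    shift = 0.10 * worst["budget"]
    action = {"from": worst["id"], "to": best["id"], "amount": shift}

    # Guardrail: large reallocations are gated on human approval.
    needs_approval = shift > constraints.approval_threshold
    return action, needs_approval

state = {"campaigns": [
    {"id": "search", "budget": 1000.0, "cpa": 12.0},
    {"id": "display", "budget": 800.0, "cpa": 45.0},
]}
action, gated = agent_step(state, Constraints(max_daily_spend=2000.0,
                                              approval_threshold=50.0))
print(action, gated)  # shifts budget from "display" to "search"; gated since 80 > 50
```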

Summary: The Modern Ad ML Stack

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODERN AD ML ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                        SERVING LAYER                             │    │
│  │  • Low-latency inference (<10ms)                                │    │
│  │  • Model cascade (filter → rank)                                │    │
│  │  • Feature store integration                                    │    │
│  │  • A/B testing framework                                        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                        MODEL LAYER                                │  │
│  │                                                                   │  │
│  │  CTR Model:          CVR Model:         Bid Model:               │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │  │
│  │  │ DeepFM/DCN  │    │ ESMM-style  │    │ Bid Shading │          │  │
│  │  │ + DIN/DIEN  │    │ Multi-task  │    │ + Pacing    │          │  │
│  │  │ behavior    │    │             │    │             │          │  │
│  │  └─────────────┘    └─────────────┘    └─────────────┘          │  │
│  │                                                                   │  │
│  │  Multi-task Framework: PLE / MMOE                                │  │
│  │  Calibration: Platt scaling / Isotonic regression                │  │
│  │  Position bias: PAL / Propensity weighting                       │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                       FEATURE LAYER                               │  │
│  │                                                                   │  │
│  │  Online Store (Redis):        Offline Store (Hive):              │  │
│  │  • Real-time counts           • User embeddings                  │  │
│  │  • Session features           • Historical aggregates            │  │
│  │  • Recent behaviors           • Item statistics                  │  │
│  │                                                                   │  │
│  │  Feature Engineering: Categorical encoding, crosses, embeddings  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                │                                         │
│  ┌─────────────────────────────┼─────────────────────────────────────┐  │
│  │                        DATA LAYER                                 │  │
│  │                                                                   │  │
│  │  • Click/impression logs                                         │  │
│  │  • Conversion tracking (with delayed attribution)                │  │
│  │  • User behavior sequences                                       │  │
│  │  • Fraud detection signals                                       │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
