
Recommendation Systems: From Collaborative Filtering to Deep Learning

In-depth journey through recommendation system architectures. From the Netflix Prize and matrix factorization to neural collaborative filtering and two-tower models—understand the foundations before the transformer revolution.


The Evolution of Recommendation Systems

Recommendation systems are among the most impactful applications of machine learning. They power what you watch on Netflix, what you buy on Amazon, what you listen to on Spotify, and what you scroll on TikTok. Understanding their evolution—from simple heuristics to trillion-parameter transformers—is essential for any ML engineer.

This post covers the foundational architectures that came before the transformer revolution. These methods remain crucial: many production systems still use them, and modern approaches build upon their insights.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              RECOMMENDATION SYSTEMS TIMELINE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1990s          CONTENT-BASED & COLLABORATIVE FILTERING                 │
│  ─────          Nearest neighbors, TF-IDF, item-item similarity         │
│                                                                          │
│  2006-2009      MATRIX FACTORIZATION ERA (Netflix Prize)                │
│  ─────────      SVD, ALS, BPR - latent factors revolutionize CF         │
│                                                                          │
│  2016           DEEP LEARNING ENTERS RECSYS                             │
│  ─────          GRU4Rec, Wide & Deep, Neural Collaborative Filtering    │
│                                                                          │
│  2017-2018      TWO-TOWER MODELS & FEATURE INTERACTIONS                 │
│  ─────────      DSSM, YouTube DNN, DeepFM, DCN                          │
│                                                                          │
│  2018+          ATTENTION & TRANSFORMERS                                │
│  ─────          SASRec, BERT4Rec, HSTU (see next post)                  │
│                                                                          │
│  2024+          GENERATIVE & LLM-POWERED                                │
│  ─────          TIGER, OneRec, LLM augmentation (see GenAI post)        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part I: Classical Recommendation Methods

The Three Paradigms

Before diving into algorithms, understand the three fundamental approaches:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              THREE PARADIGMS OF RECOMMENDATION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. CONTENT-BASED FILTERING                                             │
│  ─────────────────────────────                                           │
│                                                                          │
│  "Recommend items similar to what you've liked before"                  │
│                                                                          │
│  User liked: Action movies with Tom Cruise                              │
│  Recommend: More action movies, more Tom Cruise movies                  │
│                                                                          │
│  Pros: No cold-start for items, explainable                            │
│  Cons: Limited discovery, requires item features                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. COLLABORATIVE FILTERING                                             │
│  ───────────────────────────────                                         │
│                                                                          │
│  "Users who liked X also liked Y"                                       │
│                                                                          │
│  User A liked: [Inception, Interstellar, The Matrix]                   │
│  User B liked: [Inception, Interstellar, Arrival]                      │
│  → Recommend Arrival to User A, The Matrix to User B                   │
│                                                                          │
│  Pros: Discovers unexpected items, no item features needed             │
│  Cons: Cold-start problem, popularity bias                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. HYBRID METHODS                                                       │
│  ──────────────────                                                      │
│                                                                          │
│  Combine content and collaborative signals                              │
│                                                                          │
│  Examples:                                                               │
│  • Use content for cold items, CF for warm items                       │
│  • Weighted ensemble of both approaches                                 │
│  • Joint models (modern deep learning)                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

User-Based Collaborative Filtering

User-based collaborative filtering was the original recommender algorithm, pioneered at MIT Media Lab and GroupLens in the mid-1990s. The core intuition is beautifully simple: if two users agreed in the past, they'll probably agree in the future.

How it works step-by-step:

  1. Build the user-item matrix: Rows are users, columns are items, values are ratings (or 0 for no interaction). A user who rated 100 movies has 100 non-zero entries in their row.

  2. Compute user-user similarity: For each pair of users, calculate how similar their rating patterns are. The most common metric is cosine similarity—it measures the angle between two rating vectors, ignoring magnitude (a user who rates everything 5 stars vs 3 stars can still be similar in preferences).

  3. Find k-nearest neighbors: For a target user, find the k users with highest similarity scores. These are the user's "taste neighbors."

  4. Predict ratings: To predict what the target user would rate an unseen item, take a weighted average of how the neighbors rated it. Users more similar to the target get higher weight.

The similarity formula:

$$\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui} \cdot r_{vi}}{\sqrt{\sum_{i \in I_u} r_{ui}^2} \cdot \sqrt{\sum_{i \in I_v} r_{vi}^2}}$$

Where $I_{uv}$ is the set of items both users rated, and $r_{ui}$ is user $u$'s rating for item $i$.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_k(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_k(u)} |\text{sim}(u, v)|}$$

This is a weighted average: neighbors who are more similar contribute more to the prediction.
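Both formulas are easy to check on a toy matrix. A minimal NumPy sketch (all ratings invented for illustration):

```python
import numpy as np

# Toy ratings: 3 users x 4 items, 0 = unrated
R = np.array([[5.0, 3.0, 0.0, 1.0],   # user 0
              [4.0, 0.0, 0.0, 1.0],   # user 1 (target, similar to user 0)
              [1.0, 1.0, 0.0, 5.0]])  # user 2 (opposite taste)

def cosine_sim(a, b):
    # sim(u, v): dot product over the product of vector norms
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of the target user to the two users who rated item 1
sims = np.array([cosine_sim(R[1], R[0]), cosine_sim(R[1], R[2])])
ratings_item1 = np.array([R[0, 1], R[2, 1]])

# Weighted-average prediction of user 1's rating for item 1
pred = (sims * ratings_item1).sum() / np.abs(sims).sum()
```

Because user 0 is far more similar to the target than user 2 is, the prediction lands much closer to user 0's rating of 3 than to user 2's rating of 1.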

Why it often fails in practice:

  • Sparsity: With millions of users and items, the user-item matrix is 99%+ zeros. Finding users who overlap on rated items is hard.
  • Scalability: Computing all pairwise user similarities is O(n² × m) where n is users and m is items. This doesn't scale.
  • Cold start: New users have no ratings, so they have no neighbors.

Despite these limitations, understanding user-based CF is essential—it introduces the fundamental concepts that all modern recommenders build upon.

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class UserBasedCF:
    """
    User-based collaborative filtering.
    Find similar users → recommend their items.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.user_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
                     0 = no interaction, otherwise = rating/score
        """
        self.user_item_matrix = ratings

        # Compute user-user similarity
        # Cosine similarity on rating vectors
        self.user_similarity = cosine_similarity(ratings)

        # Zero out self-similarity
        np.fill_diagonal(self.user_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""

        # Find k most similar users who rated this item
        item_ratings = self.user_item_matrix[:, item_id]
        users_who_rated = np.where(item_ratings > 0)[0]

        if len(users_who_rated) == 0:
            return 0  # No data

        # Get similarities to these users
        similarities = self.user_similarity[user_id, users_who_rated]

        # Take top-k
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_users = users_who_rated[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average of their ratings
        if top_k_sims.sum() == 0:
            return item_ratings[top_k_users].mean()

        weighted_sum = (top_k_sims * item_ratings[top_k_users]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations for user."""

        # Items user hasn't interacted with
        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        # Predict scores for unseen items
        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]

        # Sort by score
        scores.sort(key=lambda x: x[1], reverse=True)

        return [item_id for item_id, _ in scores[:n]]

Problems with User-Based CF:

  • Scalability: Computing user-user similarity is O(n²) in users
  • Sparsity: Most user pairs have no common items
  • Volatility: User preferences change; similarity matrix becomes stale

Item-Based Collaborative Filtering

In 2003, Amazon published a paper that revolutionized e-commerce recommendations. Their key insight: instead of finding similar users, find similar items. This seemingly small change solves most of user-based CF's problems.

Why item-based is better than user-based:

  1. Stability: Item relationships are more stable than user relationships. "The Matrix" and "Inception" will always be similar sci-fi movies. But a user's taste can change—they might suddenly start watching documentaries.

  2. Scalability: You have fewer items than users (typically 10-100x fewer). Computing item-item similarity is O(m²) where m is items, vs O(n²) for users. More importantly, item similarities can be precomputed offline.

  3. Sparsity is less severe: Any two popular items have many users in common (who rated both). But two random users might have rated zero items in common.

  4. Explainability: "We recommend X because you bought Y, and customers who bought Y also bought X" is intuitive. User-based explanations ("we recommend X because user #847592 liked it") are creepy.

How it works:

  1. Build item-item similarity matrix: For each pair of items, compute similarity based on which users rated them. Two items are similar if the same users tend to rate them similarly.

  2. For a user's prediction: Look at items the user has already rated. For each candidate item, find its similarity to the user's rated items. The prediction is a weighted sum.

The item similarity formula:

$$\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} r_{ui} \cdot r_{uj}}{\sqrt{\sum_{u \in U_i} r_{ui}^2} \cdot \sqrt{\sum_{u \in U_j} r_{uj}^2}}$$

Where $U_{ij}$ is the set of users who rated both items.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{j \in N_k(i) \cap I_u} \text{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N_k(i) \cap I_u} |\text{sim}(i, j)|}$$

For item $i$, look at the $k$ most similar items that the user has rated, and take a weighted average of those ratings.

Production considerations:

  • Precompute offline: Item similarities rarely change. Compute once daily/weekly.
  • Truncate to top-k: Don't store all m² similarities. Keep only top-100 neighbors per item.
  • Incremental updates: When a new item gets enough ratings, compute its similarities without recomputing everything.
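The top-k truncation step can be sketched in a few lines, assuming a precomputed dense similarity matrix (the function name and toy values are illustrative):

```python
import numpy as np

def truncated_item_neighbors(sim, k):
    """Keep only each item's top-k most similar neighbors.

    sim: (num_items, num_items) similarities, diagonal already zeroed.
    Returns (indices, values), each of shape (num_items, k).
    """
    idx = np.argsort(sim, axis=1)[:, -k:]        # top-k columns per row
    vals = np.take_along_axis(sim, idx, axis=1)  # matching similarity values
    return idx, vals

# Toy symmetric similarity matrix for 4 items
sim = np.array([[0.0, 0.9, 0.1, 0.3],
                [0.9, 0.0, 0.2, 0.4],
                [0.1, 0.2, 0.0, 0.8],
                [0.3, 0.4, 0.8, 0.0]])
idx, vals = truncated_item_neighbors(sim, k=2)
# Item 0 keeps items 3 and 1 (similarities 0.3 and 0.9)
```

Storing only k neighbors per item turns an m² table into an m×k table, which is what makes serving item-based CF cheap at scale.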

Item-to-item collaborative filtering remained at the heart of Amazon's recommendations for many years, later enhanced with deep learning features.

Python
class ItemBasedCF:
    """
    Item-based collaborative filtering (Amazon's approach).
    Find similar items → recommend based on user's history.

    Key insight: Item similarities are more stable than user similarities.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.item_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
        """
        self.user_item_matrix = ratings

        # Compute item-item similarity
        # Transpose: each row is an item's rating profile across users
        self.item_similarity = cosine_similarity(ratings.T)

        # Zero out self-similarity
        np.fill_diagonal(self.item_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""

        # Find items user has rated
        user_ratings = self.user_item_matrix[user_id]
        rated_items = np.where(user_ratings > 0)[0]

        if len(rated_items) == 0:
            return 0

        # Similarity of target item to user's rated items
        similarities = self.item_similarity[item_id, rated_items]

        # Take top-k similar items
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_items = rated_items[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average
        if top_k_sims.sum() == 0:
            return user_ratings[top_k_items].mean()

        weighted_sum = (top_k_sims * user_ratings[top_k_items]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations."""

        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]

        scores.sort(key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in scores[:n]]

Why Item-Based is Better:

  • Stability: Items don't change; their similarity is stable
  • Scalability: Pre-compute item similarities offline
  • Explainability: "Because you liked X, and Y is similar to X"

Part II: Matrix Factorization (The Netflix Prize Era)

The $1 Million Challenge

In 2006, Netflix offered $1 million to anyone who could beat their recommendation algorithm by 10%. This sparked a revolution in collaborative filtering.

The key insight: Latent factors. Instead of computing explicit item-item or user-user similarity, learn low-dimensional representations that capture underlying preferences.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MATRIX FACTORIZATION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM:                                                                │
│  ─────────                                                               │
│                                                                          │
│  Rating Matrix R (num_users × num_items):                               │
│                                                                          │
│          Item1  Item2  Item3  Item4  ...                                │
│  User1  [  5      ?      3      ?    ...]                               │
│  User2  [  ?      4      ?      2    ...]                               │
│  User3  [  4      ?      5      ?    ...]                               │
│  ...                                                                     │
│                                                                          │
│  Most entries are missing (? = unknown)                                 │
│  Goal: Predict the missing entries                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: FACTORIZE R ≈ P × Q^T                                        │
│  ─────────────────────────────────                                       │
│                                                                          │
│  P: User matrix (num_users × k)                                         │
│     Each row = user's latent preferences                                │
│                                                                          │
│  Q: Item matrix (num_items × k)                                         │
│     Each row = item's latent attributes                                 │
│                                                                          │
│  Prediction: r̂_ui = p_u · q_i (dot product)                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  INTUITION:                                                              │
│  ───────────                                                             │
│                                                                          │
│  k latent factors might represent:                                      │
│  • Factor 1: How much the movie is action vs drama                     │
│  • Factor 2: How mainstream vs indie                                    │
│  • Factor 3: How visually stunning vs dialogue-driven                  │
│  • ...                                                                   │
│                                                                          │
│  User embedding captures: "I like action, mainstream, visual"          │
│  Item embedding captures: "This movie is action, indie, visual"        │
│  Dot product: How well do they match?                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

SVD and ALS

Matrix factorization is the technique that won the Netflix Prize. The idea is elegant: decompose the sparse user-item rating matrix into two smaller, dense matrices that capture latent factors.

Wait, what is SVD exactly?

First, let's clarify the terminology. Singular Value Decomposition (SVD) is a fundamental theorem from linear algebra:

Any matrix $R$ can be decomposed as $R = U \Sigma V^T$

Where:

  • $U$ = orthonormal matrix (users × k) — left singular vectors
  • $\Sigma$ = diagonal matrix (k × k) — singular values, in decreasing order
  • $V^T$ = orthonormal matrix (k × items) — right singular vectors

Classical SVD gives the optimal low-rank approximation of any matrix (Eckart-Young theorem). If you keep only the top-k singular values, you get the closest rank-k matrix to the original.
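The Eckart-Young property is easy to verify numerically on a small, fully observed matrix (which is exactly the setting where classical SVD applies):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 5))             # small, fully observed matrix

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2
R_k = (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k approximation U_k Σ_k V_k^T

# Eckart-Young: the Frobenius error of the best rank-k approximation
# equals the norm of the dropped singular values
err = np.linalg.norm(R - R_k, 'fro')
expected = np.sqrt((s[k:] ** 2).sum())
```

No rank-k matrix can get closer to `R` in Frobenius norm than `R_k`; the residual is entirely determined by the singular values we threw away.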

The problem: Classical SVD doesn't work for recommendations

Netflix's rating matrix is ~99% empty (users rate roughly 1% of the catalog). Classical SVD:

  1. Requires a complete matrix (no missing values)
  2. Would treat missing ratings as zeros (wrong!)
  3. Is computationally expensive (O(min(m,n)²·max(m,n)) for an m×n matrix)

Code
Netflix Matrix Reality:
─────────────────────────────────────────────────────────────────────────
Users: 480,000
Movies: 18,000
Possible ratings: 8.6 billion
Actual ratings: 100 million (~1.2% filled)

If we treat missing = 0, SVD thinks users HATE 99% of movies.
That's obviously wrong—they just haven't seen them.

The recommendation "SVD" is actually something different

What the Netflix Prize community calls "SVD" is actually regularized matrix factorization optimized only on observed entries. It's inspired by SVD but doesn't compute the actual decomposition.

The key insight: Only minimize error on ratings that exist:

$$\mathcal{L} = \sum_{(u,i) \in \text{observed}} (r_{ui} - p_u \cdot q_i)^2$$

This is fundamentally different from classical SVD which considers all entries.

The core idea:

Instead of storing explicit similarities, learn a low-dimensional representation for each user and item. A user is represented by a vector of k numbers (their "taste profile"), and an item by another k-vector (its "attribute profile"). The predicted rating is just the dot product of these vectors.

Why latent factors work:

Think of the k dimensions as hidden attributes the algorithm discovers. For movies:

  • Dimension 1 might capture "how action-packed vs dramatic"
  • Dimension 2 might capture "how mainstream vs indie"
  • Dimension 3 might capture "how serious vs comedic"
  • etc.

A user who loves action but hates comedy would have [high, ?, low, ...] in their vector. An action comedy would have [high, ?, high, ...]. Their dot product captures partial compatibility.

The magic: we never define what the dimensions mean. The algorithm discovers useful latent factors automatically from patterns in the data. When you examine learned factors, they often align with interpretable concepts—but sometimes capture subtle patterns humans wouldn't notice.

The math:

Given rating matrix $R$ (users × items), find matrices $P$ (users × k) and $Q$ (items × k) such that:

$$R \approx P \cdot Q^T$$

The prediction for user $u$ and item $i$ is:

$$\hat{r}_{ui} = p_u \cdot q_i = \sum_{f=1}^{k} p_{uf} \cdot q_{if}$$

Adding biases (crucial for accuracy):

Raw dot products miss important patterns: some users rate everything high (generous raters), some items are universally popular (hit movies). The improved model adds biases:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$$

Where:

  • $\mu$ = global average rating (e.g., 3.5 on Netflix)
  • $b_u$ = user bias (how much user $u$ deviates from average)
  • $b_i$ = item bias (how much item $i$ deviates from average)
  • $p_u \cdot q_i$ = user-item interaction (personalization beyond biases)

Example decomposition:

Code
Predicting Alice's rating for "Inception":
─────────────────────────────────────────────────────────────────────────
Global average μ = 3.5

Alice's bias b_alice = +0.3 (she rates slightly above average)
Inception's bias b_inception = +0.8 (it's a popular, well-rated movie)

Alice's latent vector p_alice = [0.8, -0.2, 0.5]  (likes action, dislikes romance, likes complex plots)
Inception's latent vector q_inception = [0.9, -0.1, 0.7]  (action, minimal romance, complex plot)

Dot product: 0.8×0.9 + (-0.2)×(-0.1) + 0.5×0.7 = 0.72 + 0.02 + 0.35 = 1.09

Prediction: 3.5 + 0.3 + 0.8 + 1.09 = 5.69 → clipped to 5.0 (max rating)
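The arithmetic above can be reproduced in a few lines of NumPy:

```python
import numpy as np

mu, b_alice, b_inception = 3.5, 0.3, 0.8
p_alice = np.array([0.8, -0.2, 0.5])       # Alice's latent factors
q_inception = np.array([0.9, -0.1, 0.7])   # Inception's latent factors

# μ + b_u + b_i + p_u · q_i
raw = mu + b_alice + b_inception + p_alice @ q_inception   # 5.69
pred = min(raw, 5.0)                                       # clip to rating scale
```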

Two training approaches:

1. SGD (Stochastic Gradient Descent):

Iterate over observed ratings one at a time, compute gradient, update parameters:

Code
For each observed rating r_ui:
    error = r_ui - (μ + b_u + b_i + p_u · q_i)

    b_u ← b_u + η(error - λ·b_u)
    b_i ← b_i + η(error - λ·b_i)
    p_u ← p_u + η(error·q_i - λ·p_u)
    q_i ← q_i + η(error·p_u - λ·q_i)

Pros: Simple, works well, low memory.
Cons: Sequential (hard to parallelize), sensitive to learning rate.
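Putting the update rules together, here is a minimal SGD trainer for the biased model (hyperparameters and the toy dataset are illustrative, not tuned):

```python
import numpy as np

def train_mf_sgd(ratings, k=8, epochs=200, lr=0.02, reg=0.02, seed=0):
    """Biased matrix factorization trained by SGD on observed ratings only.

    ratings: list of (user, item, rating) triples.
    Returns a predict(u, i) function.
    """
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1

    mu = np.mean([r for _, _, r in ratings])   # global mean
    b_u = np.zeros(n_users)                    # user biases
    b_i = np.zeros(n_items)                    # item biases
    P = rng.normal(0, 0.1, (n_users, k))       # user factors
    Q = rng.normal(0, 0.1, (n_items, k))       # item factors

    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            # Tuple assignment: both updates use the pre-update values
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + P[u] @ Q[i]
    return predict

# Toy dataset: users 0-1 like items 0-1, users 2-3 like items 2-3
data = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
        (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5),
        (0, 2, 1), (2, 0, 1)]
predict = train_mf_sgd(data)
```

After training, the model ranks each user's own cluster above the other one, which biases alone cannot do here (the two item groups have symmetric rating statistics).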

2. ALS (Alternating Least Squares):

The key insight: If we fix $Q$, solving for $P$ is a least squares problem with a closed-form solution. Vice versa.

Code
Repeat until convergence:
    1. Fix Q, solve for each p_u:  p_u = (Q^T Q + λI)^{-1} Q^T r_u
    2. Fix P, solve for each q_i:  q_i = (P^T P + λI)^{-1} P^T r_i

Pros: Embarrassingly parallel (each user/item is independent), guaranteed convergence, works great for implicit feedback.
Cons: More complex, requires matrix inversion per user/item.

ALS became the standard for large-scale systems (Spark MLlib, etc.) because you can distribute user updates across thousands of machines.
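A compact NumPy sketch of the alternating solve, using `np.linalg.solve` for the closed-form step (the toy matrix, λ, and iteration count are illustrative):

```python
import numpy as np

def als_step(R, mask, fixed, lam):
    """One half-iteration of ALS: closed-form least squares per row.

    R:     (n, m) ratings, 0 where unobserved
    mask:  (n, m) boolean, True where observed
    fixed: (m, k) factor matrix held fixed (Q when solving for P)
    """
    n, k = R.shape[0], fixed.shape[1]
    out = np.zeros((n, k))
    for u in range(n):
        F = fixed[mask[u]]                   # factors of observed columns
        A = F.T @ F + lam * np.eye(k)        # (Q^T Q + λI)
        b = F.T @ R[u, mask[u]]              # Q^T r_u
        out[u] = np.linalg.solve(A, b)       # closed-form row solve
    return out

R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])
mask = R > 0
rng = np.random.default_rng(0)
k, lam = 2, 0.1
P = rng.normal(0, 0.1, (4, k))
Q = rng.normal(0, 0.1, (4, k))

for _ in range(20):
    P = als_step(R, mask, Q, lam)            # fix Q, solve for P
    Q = als_step(R.T, mask.T, P, lam)        # fix P, solve for Q

pred = P @ Q.T
train_rmse = np.sqrt(((pred - R)[mask] ** 2).mean())
```

Each call to `als_step` is a loop over independent rows, which is exactly what makes the method embarrassingly parallel in systems like Spark MLlib.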

The loss function:

$$\mathcal{L} = \sum_{(u,i) \in \text{observed}} (r_{ui} - \hat{r}_{ui})^2 + \lambda(||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2)$$

The regularization term $\lambda$ prevents overfitting to the sparse observed ratings—critical because we're learning millions of parameters from limited data.


How SVD++ Won the Netflix Prize

The Netflix Prize wasn't won by basic SVD—it required several crucial innovations developed primarily by Yehuda Koren and his collaborators at AT&T Labs. Understanding these reveals what matters in production recommendation systems.

The challenge:

On October 2, 2006, Netflix released 100,480,507 ratings from 480,189 users on 17,770 movies. The goal: beat their existing system (Cinematch, RMSE = 0.9525 on test set) by 10% to achieve RMSE ≤ 0.8572. The prize: $1 million. Within a week, a team had already beaten Cinematch.

Code
Netflix Prize Timeline (verified from competition records):
─────────────────────────────────────────────────────────────────────────
Oct 2006 (start):               Cinematch baseline = 0.9525 RMSE
Nov 2006 (1 month in):          Best = ~0.9042 (~5% improvement)
Nov 2007 (Progress Prize #1):   KorBell = 0.8712 (8.43% improvement)
                                → $50,000 prize, 2000+ hours of work
2008 (Progress Prize #2):       BellKor = 0.8616 (~9.5% improvement)
Jun 26, 2009:                   BellKor's Pragmatic Chaos = 0.8558 (10.05%)
Jul 25, 2009:                   "The Ensemble" matches at 0.8554 (10.09%)
Sep 21, 2009 (WINNER):          BellKor's Pragmatic Chaos = 0.8567 (10.06%)

The final 1.5% improvement took 2+ years of work by thousands of teams!

The photo finish: On the final day, two teams crossed the 10% threshold within minutes of each other. "The Ensemble" actually achieved a slightly better RMSE (0.8554 vs 0.8558), but BellKor's Pragmatic Chaos had submitted 20 minutes earlier. Per competition rules, BellKor won.


Innovation 1: SVD++ (Implicit Feedback)

Published by Koren in "Factorization Meets the Neighborhood" (KDD 2008), SVD++ was the breakthrough that made BellKor a frontrunner.

Basic SVD only uses explicit ratings. But there's more signal: the mere fact that a user rated an item carries information, regardless of whether the rating was high or low. A user who rated 50 horror movies is signaling interest in horror.

The SVD++ prediction formula:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)$$

Breaking down each component:

| Symbol | Meaning | Intuition |
|---|---|---|
| $\mu$ | Global mean rating | Baseline (e.g., 3.5 stars) |
| $b_u$ | User bias | "Alice rates 0.3 stars above average" |
| $b_i$ | Item bias | "Inception is rated 0.8 stars above average" |
| $p_u$ | User latent factors | Explicit preferences from ratings given |
| $q_i$ | Item latent factors | Item characteristics |
| $y_j$ | Implicit item factors | What rating item $j$ reveals about user taste |
| $I_u$ | Items user $u$ rated | Set of all items (regardless of rating value) |
| $\lvert I_u \rvert^{-1/2}$ | Normalization | Keeps heavy raters' implicit sums on the same scale |

Why implicit feedback helps:

The term $\sum_{j \in I_u} y_j$ captures "what kind of user rates these specific items." Even without knowing the rating values, we learn something:

  • A user who rated many Marvel movies → probably likes superhero films
  • A user who rated many foreign films → probably enjoys non-Hollywood cinema
  • A user who rated both → has diverse taste

The $y_j$ vectors learn these implicit signals automatically during training.
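The prediction rule itself is simple; a sketch with randomly initialized factors (in a real system $p_u$, $q_i$, and the $y_j$ would be learned, so every value here is illustrative):

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, Y, rated_items):
    """SVD++: r̂_ui = μ + b_u + b_i + q_i · (p_u + |I_u|^{-1/2} Σ_{j∈I_u} y_j)."""
    implicit = Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

rng = np.random.default_rng(0)
k, n_items = 4, 10
Y = rng.normal(0, 0.1, (n_items, k))   # implicit item factors y_j
p_u = rng.normal(0, 0.1, k)            # user's explicit factors
q_i = rng.normal(0, 0.1, k)            # target item's factors

# Every rated item contributes, regardless of the rating value given
r_hat = svdpp_predict(3.5, 0.3, 0.8, p_u, q_i, Y, rated_items=[0, 2, 5])
```

Note that changing *which* items the user has rated changes the prediction, even though no rating values appear in the implicit term.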

Performance: SVD++ alone achieved RMSE of 0.8911, while standard SVD achieved 0.8914. The combination with other techniques made the real difference.


Innovation 2: Temporal Dynamics (timeSVD++)

Published in "Collaborative Filtering with Temporal Dynamics" (KDD 2009), this paper showed that user preferences and item perceptions drift over time.

Koren discovered striking patterns in the Netflix data:

  • Rating scale shift: In early 2004, mean ratings jumped from ~3.4 to ~3.6 stars (Netflix changed their UI)
  • Movie aging effect: Older movies consistently receive higher ratings than newer ones
  • User preference drift: Individual users' tastes evolve over months/years

The timeSVD++ model makes parameters time-dependent:

Code
Time-dependent parameters:
─────────────────────────────────────────────────────────────────────────
User bias over time:   b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}

  b_u         = static user bias (doesn't change)
  α_u         = user-specific slope (some users' ratings drift faster)
  dev_u(t)    = sign(t - t̄_u) · |t - t̄_u|^β  (deviation from user's mean time)
  b_{u,Bin(t)} = bias for time period (weekly/monthly buckets)

Item bias over time:   b_i(t) = b_i + b_{i,Bin(t)}

  Movies don't have inherent drift, but perception changes over time

User factors over time: p_u(t) = p_u + α_u · dev_u(t)

  User preferences evolve (someone who liked comedies in 2005 might
  prefer dramas in 2008)
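The time-dependent user bias above can be sketched directly from the formulas (β is a tunable exponent; 0.4 and all parameter values here are for illustration only):

```python
import numpy as np

def dev_u(t, t_mean, beta=0.4):
    """Deviation term: sign(t - t̄_u) · |t - t̄_u|^β, with t in days."""
    return np.sign(t - t_mean) * np.abs(t - t_mean) ** beta

def user_bias_t(b_u, alpha_u, t, t_mean, b_u_bin):
    """b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}."""
    return b_u + alpha_u * dev_u(t, t_mean) + b_u_bin

# A drifting user, rating 300 days after their mean rating date
b = user_bias_t(b_u=0.1, alpha_u=0.02, t=600, t_mean=300, b_u_bin=0.05)
```

The sublinear exponent means drift grows quickly at first and then flattens, so a rating five years from a user's mean date is not treated as five times more "drifted" than one a year away.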

Dramatic impact: A timeSVD++ model with only 10 latent factors outperformed standard SVD with 200 factors. With 200 factors, timeSVD++ achieved RMSE = 0.8762—a massive improvement.


Innovation 3: Restricted Boltzmann Machines (RBMs)

While matrix factorization dominated, Ruslan Salakhutdinov's RBM approach provided a fundamentally different model that captured complementary patterns.

RBMs are probabilistic neural networks that learn a joint distribution over user ratings. Unlike SVD which learns point estimates, RBMs model uncertainty in preferences.

Performance comparison:

  • SVD alone: RMSE = 0.8914
  • RBM alone: RMSE = 0.8990
  • SVD + RBM blend: RMSE = 0.88

The key insight: models with different inductive biases make different errors. RBMs captured patterns that SVD missed (and vice versa), making their combination more powerful than either alone.


Innovation 4: Neighborhood Models

Despite deep learning hype, simple item-item similarity models captured patterns that latent factor models missed. The winning solutions combined both:

$$\hat{r}_{ui} = \underbrace{\mu + b_u + b_i + q_i^T p_u}_{\text{Latent factors (global patterns)}} + \underbrace{\sum_{j \in R^k(i;u)} w_{ij}(r_{uj} - b_{uj})}_{\text{Neighborhood (local patterns)}}$$

Where $R^k(i;u)$ is the set of $k$ items most similar to $i$ that user $u$ has rated, and $w_{ij}$ are learned item-item weights.

This hybrid captured:

  • Global patterns (latent factors): "Users who like action movies tend to like this"
  • Local patterns (neighborhood): "Users who liked Inception also liked Interstellar"
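The hybrid prediction is just the sum of the two terms; a sketch with made-up factor values and neighbor weights:

```python
import numpy as np

def hybrid_predict(mu, b_u, b_i, p_u, q_i, w, residuals):
    """r̂_ui = (μ + b_u + b_i + q_i·p_u) + Σ_j w_ij (r_uj - b_uj)."""
    latent = mu + b_u + b_i + q_i @ p_u       # global patterns
    neighborhood = w @ residuals              # local patterns
    return latent + neighborhood

p_u = np.array([0.5, -0.3])               # user latent factors
q_i = np.array([0.4, 0.2])                # item latent factors
w = np.array([0.3, 0.1])                  # learned weights for 2 neighbors
residuals = np.array([1.2, -0.4])         # r_uj - b_uj for those neighbors

r_hat = hybrid_predict(3.5, 0.2, 0.4, p_u, q_i, w, residuals)
```

Here the latent part contributes 4.24 and the neighborhood correction adds 0.32: the local term nudges the global estimate using the user's actual ratings on similar items.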

Innovation 5: Massive Ensembling

The final winning submission from BellKor's Pragmatic Chaos (a merger of BellKor, Pragmatic Theory, and BigChaos teams) combined hundreds of individual predictors:

Code
BellKor's Pragmatic Chaos ensemble composition:
─────────────────────────────────────────────────────────────────────────
MATRIX FACTORIZATION VARIANTS:
- Standard SVD (multiple k values: 20, 50, 100, 200, 500, 2000)
- SVD++ with implicit feedback
- timeSVD++ with temporal dynamics
- Asymmetric factor models

RESTRICTED BOLTZMANN MACHINES:
- Standard RBM
- Conditional RBM (with temporal features)
- Factored RBM

NEIGHBORHOOD MODELS:
- Item-item collaborative filtering
- User-user collaborative filtering
- Pearson correlation variants
- Cosine similarity variants

k-NEAREST NEIGHBORS:
- Various k values and distance metrics
- With and without significance weighting

REGRESSION MODELS:
- Linear regression on movie/user features
- Gradient Boosted Decision Trees

BLENDING:
- Linear regression on predictor outputs
- Neural network stacking (single hidden layer)
- Gradient Boosted Trees for nonlinear combination

Total: 100+ individual models, yielding 800+ predictors after parameter variations

The bitter lesson:

The winning ensemble was never fully deployed by Netflix. Here's what actually happened:

  1. 2007 Progress Prize solution was deployed: The simpler SVD+RBM combination from the first Progress Prize was productionized by Netflix engineers, as it was already significantly better than Cinematch.

  2. Grand Prize ensemble was too complex: Netflix evaluated the final ensemble offline but concluded "the additional accuracy gains did not seem to justify the engineering effort needed to bring them into a production environment."

  3. Business context changed: By 2009, Netflix was shifting from DVD rentals to streaming. The rating prediction problem became less central than other recommendation challenges (what to show on the homepage, etc.).

The real value of the Prize:

  1. Academic papers: Koren's work on SVD++, timeSVD++, and matrix factorization became foundational—his "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009) has 45,000+ citations
  2. Talent pipeline: Many participants joined Netflix and other tech companies
  3. Proof of concept: Demonstrated that recommendations could be dramatically improved with ML
  4. Open research: Spurred massive investment in recommendation system research

What practitioners should take from the Netflix Prize:

  1. Biases explain most variance: The μ + b_u + b_i terms often matter more than the latent factor product p_u · q_i. Always include them.

  2. Implicit feedback is gold: What users interact with (even without explicit ratings) is hugely informative. Modern systems rely almost entirely on implicit signals.

  3. Temporal dynamics matter: User preferences drift. Systems that don't account for time become stale.

  4. Ensemble diversity beats ensemble size: RBM + SVD worked because they made different errors. 100 SVD variants wouldn't help as much.

  5. Production ≠ Competition: The winning 800-predictor ensemble scored 10.06%, but a simple 3-model blend might achieve 9.5% and actually be deployable.

  6. The final 1% is exponentially harder: Going from 0% to 8% took one team 1 year. Going from 8% to 10% took the entire community 2 more years.
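Takeaway 1 is easy to check in code: a bias-only baseline using damped (regularized) means already produces sensible predictions. A minimal sketch (the toy ratings and the damping constant lam are illustrative):

```python
import numpy as np
from collections import defaultdict

ratings = [  # (user, item, rating) toy data
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
    (1, 2, 2.0), (2, 1, 3.0), (2, 2, 1.0),
]
lam = 5.0  # damping: shrinks biases for users/items with few ratings

mu = np.mean([r for _, _, r in ratings])

# Item bias: damped mean deviation from the global mean
item_dev = defaultdict(list)
for _, i, r in ratings:
    item_dev[i].append(r - mu)
b_i = {i: sum(d) / (lam + len(d)) for i, d in item_dev.items()}

# User bias: damped mean deviation from mu + b_i
user_dev = defaultdict(list)
for u, i, r in ratings:
    user_dev[u].append(r - mu - b_i[i])
b_u = {u: sum(d) / (lam + len(d)) for u, d in user_dev.items()}

def baseline(u, i):
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

print(baseline(0, 0))  # above the global mean: user 0 rates high, item 0 is rated high
```

This μ + b_u + b_i baseline is exactly the starting point the latent factors refine; on Netflix-style data it already closes most of the gap to the full model.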

To make these ideas concrete, here is a basic matrix factorization model in PyTorch:

Python
import torch
import torch.nn as nn
import numpy as np  # used by the ALS implementation below

class MatrixFactorization(nn.Module):
    """
    Basic matrix factorization for collaborative filtering.
    R ≈ P × Q^T + biases
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        num_factors: int = 50,
    ):
        super().__init__()

        # User and item embeddings (latent factors)
        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)

        # Bias terms (important for accuracy)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

        # Initialize
        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        """
        Predict ratings for user-item pairs.

        Args:
            user_ids: (batch_size,)
            item_ids: (batch_size,)

        Returns:
            predictions: (batch_size,)
        """
        # Get embeddings
        user_emb = self.user_factors(user_ids)  # (B, k)
        item_emb = self.item_factors(item_ids)  # (B, k)

        # Dot product
        dot = (user_emb * item_emb).sum(dim=1)  # (B,)

        # Add biases
        # squeeze(-1), not squeeze(): a plain squeeze would also drop the
        # batch dimension when batch_size == 1
        user_b = self.user_bias(user_ids).squeeze(-1)
        item_b = self.item_bias(item_ids).squeeze(-1)

        return self.global_bias + user_b + item_b + dot

def train_mf_sgd(model, train_data, epochs=20, lr=0.01, reg=0.02):
    """
    Train matrix factorization with SGD.

    Args:
        train_data: List of (user_id, item_id, rating) tuples
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)

    for epoch in range(epochs):
        total_loss = 0

        for user_id, item_id, rating in train_data:
            user_id = torch.tensor([user_id])
            item_id = torch.tensor([item_id])
            rating = torch.tensor([rating], dtype=torch.float)

            # Forward
            pred = model(user_id, item_id)

            # MSE loss
            loss = (pred - rating) ** 2

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(train_data):.4f}")

# Alternating Least Squares (ALS) - Better for implicit feedback
class ALS:
    """
    Alternating Least Squares for matrix factorization.
    Better for large-scale implicit feedback (views, clicks).

    Key idea: Fix users, solve for items (closed-form).
              Then fix items, solve for users. Repeat.
    """

    def __init__(self, num_factors: int = 50, reg: float = 0.01, iterations: int = 15):
        self.k = num_factors
        self.reg = reg
        self.iterations = iterations

    def fit(self, interactions: np.ndarray):
        """
        Args:
            interactions: (num_users, num_items) binary or count matrix
        """
        num_users, num_items = interactions.shape

        # Initialize factors randomly
        self.user_factors = np.random.normal(0, 0.01, (num_users, self.k))
        self.item_factors = np.random.normal(0, 0.01, (num_items, self.k))

        # Confidence matrix (for implicit feedback)
        # Higher counts = higher confidence
        C = 1 + 40 * interactions  # Confidence scaling

        for iteration in range(self.iterations):
            # Fix items, solve for users
            self._solve_users(interactions, C)

            # Fix users, solve for items
            self._solve_items(interactions, C)

            print(f"Iteration {iteration}: Loss = {self._compute_loss(interactions, C):.4f}")

    def _solve_users(self, R, C):
        """Solve for user factors with items fixed."""
        YtY = self.item_factors.T @ self.item_factors
        reg_I = self.reg * np.eye(self.k)

        for u in range(R.shape[0]):
            # Weighted ridge regression: (Y^T C_u Y + reg*I) x_u = Y^T C_u r_u
            # Reuse the cached YtY via Y^T C_u Y = Y^T Y + Y^T (C_u - I) Y
            Cu_minus_I = np.diag(C[u] - 1.0)
            A = YtY + self.item_factors.T @ Cu_minus_I @ self.item_factors + reg_I
            b = self.item_factors.T @ (C[u] * R[u])
            self.user_factors[u] = np.linalg.solve(A, b)

    def _solve_items(self, R, C):
        """Solve for item factors with users fixed."""
        XtX = self.user_factors.T @ self.user_factors
        reg_I = self.reg * np.eye(self.k)

        for i in range(R.shape[1]):
            # Same closed form with users fixed: (X^T C_i X + reg*I) x_i = X^T C_i r_i
            Ci_minus_I = np.diag(C[:, i] - 1.0)
            A = XtX + self.user_factors.T @ Ci_minus_I @ self.user_factors + reg_I
            b = self.user_factors.T @ (C[:, i] * R[:, i])
            self.item_factors[i] = np.linalg.solve(A, b)

    def _compute_loss(self, R, C):
        """Compute weighted squared error."""
        pred = self.user_factors @ self.item_factors.T
        diff = R - pred
        return np.sum(C * diff ** 2)
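Each inner ALS step is an exact weighted ridge regression, which we can sanity-check in isolation: the closed-form solution should beat any nearby perturbation. A self-contained sketch (toy dimensions; the 1 + 40·r confidence scaling follows the convention above):

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, k, reg = 12, 4, 0.1

Y = rng.normal(0, 0.1, (n_items, k))              # fixed item factors
r_u = (rng.random(n_items) < 0.3).astype(float)   # one user's binary preferences
c_u = 1 + 40 * r_u                                # confidence weights

def weighted_loss(x):
    """Weighted squared error plus L2, for this single user."""
    return np.sum(c_u * (r_u - Y @ x) ** 2) + reg * np.sum(x ** 2)

# Closed form: (Y^T C_u Y + reg*I) x_u = Y^T C_u r_u
A = Y.T @ (c_u[:, None] * Y) + reg * np.eye(k)
b = Y.T @ (c_u * r_u)
x_u = np.linalg.solve(A, b)

# The exact solution beats any nearby perturbation (strict convexity)
for _ in range(5):
    assert weighted_loss(x_u + rng.normal(0, 0.01, k)) >= weighted_loss(x_u)
```

Because each subproblem is solved exactly, every ALS half-step is guaranteed not to increase the overall loss, which is why ALS converges so reliably.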

SVD++: The Netflix Prize Winning Algorithm

SVD++ extends basic matrix factorization by incorporating implicit feedback—the fact that a user rated an item carries information beyond the rating value itself. This was one of the key innovations that helped win the Netflix Prize.

The core insight:

A user who rated 50 horror movies signals interest in horror, regardless of whether those ratings were 2 stars or 5 stars. SVD++ captures this by adding a term that sums up the "implicit factors" of all items a user has rated.

The prediction formula:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)

The implementation below follows Koren's original formulation from "Factorization Meets the Neighborhood" (KDD 2008):

Python
import torch
import torch.nn as nn
import numpy as np
from collections import defaultdict

class SVDPlusPlus(nn.Module):
    """
    SVD++ (Koren, KDD 2008) - The core algorithm behind the Netflix Prize win.

    Key innovation: Combines explicit ratings with implicit feedback.
    The mere fact that a user rated an item (regardless of rating value)
    provides signal about user preferences.

    Prediction formula:
    r̂_ui = μ + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j)

    Where:
    - μ: global mean rating
    - b_u: user bias
    - b_i: item bias
    - p_u: user latent factors (from explicit ratings)
    - q_i: item latent factors
    - y_j: implicit item factors (what rating item j reveals about user taste)
    - I_u: set of items user u has rated (implicit signal)
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        num_factors: int = 50,
        global_mean: float = 0.0,
    ):
        super().__init__()

        self.num_users = num_users
        self.num_items = num_items
        self.num_factors = num_factors
        self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)

        # Bias terms (μ + b_u + b_i explains most variance!)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)

        # Explicit factor matrices (p_u and q_i)
        self.user_factors = nn.Embedding(num_users, num_factors)  # p_u
        self.item_factors = nn.Embedding(num_items, num_factors)  # q_i

        # IMPLICIT factor matrix (y_j) - THE KEY SVD++ INNOVATION
        # Each item has an implicit factor that captures "what kind of user rates this"
        self.implicit_item_factors = nn.Embedding(num_items, num_factors)  # y_j

        # Initialize with small random values (important for convergence)
        self._init_weights()

    def _init_weights(self):
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)
        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)
        nn.init.normal_(self.implicit_item_factors.weight, std=0.01)

    def forward(
        self,
        user_ids: torch.Tensor,
        item_ids: torch.Tensor,
        user_item_lists: list[list[int]],
    ) -> torch.Tensor:
        """
        Predict ratings for user-item pairs.

        Args:
            user_ids: (batch_size,) user indices
            item_ids: (batch_size,) item indices
            user_item_lists: For each user in batch, list of ALL items they've rated
                             This is the implicit feedback signal!

        Returns:
            predictions: (batch_size,) predicted ratings
        """
        batch_size = user_ids.shape[0]

        # Biases
        b_u = self.user_bias(user_ids).squeeze(-1)  # (B,)
        b_i = self.item_bias(item_ids).squeeze(-1)  # (B,)

        # Explicit user/item factors
        p_u = self.user_factors(user_ids)  # (B, k)
        q_i = self.item_factors(item_ids)  # (B, k)

        # IMPLICIT FEEDBACK TERM: Σ y_j for all items user has rated
        # This is what makes SVD++ different from basic SVD!
        implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)

        for idx, items_rated in enumerate(user_item_lists):
            if len(items_rated) > 0:
                # Get implicit factors for all items this user rated
                items_tensor = torch.tensor(items_rated, device=user_ids.device)
                y_j = self.implicit_item_factors(items_tensor)  # (num_rated, k)

                # Sum and normalize by sqrt(|I_u|) to prevent bias toward active users
                normalization = 1.0 / np.sqrt(len(items_rated))
                implicit_sum[idx] = y_j.sum(dim=0) * normalization

        # Combine explicit and implicit: p_u + |I_u|^(-1/2) Σ y_j
        user_representation = p_u + implicit_sum  # (B, k)

        # Final prediction: μ + b_u + b_i + q_i^T (p_u + implicit)
        interaction = (q_i * user_representation).sum(dim=-1)  # (B,)
        prediction = self.global_mean + b_u + b_i + interaction

        return prediction

def train_svdpp(
    model: SVDPlusPlus,
    train_data: list[tuple],  # List of (user_id, item_id, rating)
    user_items: dict[int, list[int]],  # user_id -> list of item_ids they rated
    epochs: int = 20,
    lr: float = 0.007,
    reg: float = 0.02,
):
    """
    Train SVD++ using SGD (following Koren's original approach).

    Args:
        model: SVD++ model
        train_data: List of (user_id, item_id, rating) tuples
        user_items: Dict mapping user_id to list of all items they've rated
        epochs: Number of training epochs
        lr: Learning rate (Koren used 0.007)
        reg: Regularization strength (Koren used 0.02)
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)

    for epoch in range(epochs):
        total_loss = 0.0
        np.random.shuffle(train_data)

        for user_id, item_id, rating in train_data:
            # Get implicit feedback for this user
            items_rated = user_items.get(user_id, [])

            # Convert to tensors
            user_tensor = torch.tensor([user_id])
            item_tensor = torch.tensor([item_id])
            rating_tensor = torch.tensor([rating], dtype=torch.float32)

            # Forward pass
            pred = model(user_tensor, item_tensor, [items_rated])

            # MSE loss
            loss = (pred - rating_tensor) ** 2

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        rmse = np.sqrt(total_loss / len(train_data))
        print(f"Epoch {epoch + 1}/{epochs}: RMSE = {rmse:.4f}")

# Example usage:
"""
# Prepare data
train_data = [(0, 5, 4.0), (0, 10, 3.5), (1, 5, 5.0), ...]  # (user, item, rating)

# Build user_items dict for implicit feedback
user_items = defaultdict(list)
for user_id, item_id, _ in train_data:
    user_items[user_id].append(item_id)

# Compute global mean
global_mean = np.mean([r for _, _, r in train_data])

# Initialize and train
model = SVDPlusPlus(
    num_users=10000,
    num_items=5000,
    num_factors=50,
    global_mean=global_mean,
)
train_svdpp(model, train_data, user_items, epochs=20)
"""

SVD++ Training: The Exact Gradient Derivation

Understanding exactly how SVD++ learns is crucial. Here we derive the gradients from first principles—this is what Koren did in his original implementation.

The Loss Function (Regularized Squared Error):

For a single observed rating r_ui, the loss is:

\mathcal{L}_{ui} = \frac{1}{2}\left(r_{ui} - \hat{r}_{ui}\right)^2 + \frac{\lambda_1}{2}\left(b_u^2 + b_i^2\right) + \frac{\lambda_2}{2}\left(\|p_u\|^2 + \|q_i\|^2 + \sum_{j \in I_u}\|y_j\|^2\right)

Where:

  • First term: Prediction error (what we want to minimize)
  • Second term: Regularization on biases (prevents extreme user/item biases)
  • Third term: Regularization on latent factors (prevents overfitting)

Define the prediction error:

e_{ui} = r_{ui} - \hat{r}_{ui} = r_{ui} - \left(\mu + b_u + b_i + q_i^T \left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)\right)

Gradient Derivation for Each Parameter:

Taking partial derivatives of the loss with respect to each parameter:

1. User bias gradient: \frac{\partial \mathcal{L}}{\partial b_u} = -e_{ui} + \lambda_1 b_u

2. Item bias gradient: \frac{\partial \mathcal{L}}{\partial b_i} = -e_{ui} + \lambda_1 b_i

3. User factor gradient: \frac{\partial \mathcal{L}}{\partial p_u} = -e_{ui} \cdot q_i + \lambda_2 p_u

4. Item factor gradient: \frac{\partial \mathcal{L}}{\partial q_i} = -e_{ui} \cdot \left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right) + \lambda_2 q_i

5. Implicit factor gradient (for each item j that user u rated): \frac{\partial \mathcal{L}}{\partial y_j} = -e_{ui} \cdot |I_u|^{-1/2} \cdot q_i + \lambda_2 y_j

The SGD Update Rules:

Using gradient descent with learning rate γ, we update each parameter in the direction opposite its gradient:

Code
For each observed rating (u, i, r_ui):

    # 1. Compute prediction
    implicit_sum = (1/√|I_u|) * Σ_{j ∈ I_u} y_j
    r̂_ui = μ + b_u + b_i + q_i · (p_u + implicit_sum)

    # 2. Compute error
    e_ui = r_ui - r̂_ui

    # 3. Update biases (move toward reducing error)
    b_u ← b_u + γ · (e_ui - λ₁ · b_u)
    b_i ← b_i + γ · (e_ui - λ₁ · b_i)

    # 4. Update user factors
    p_u ← p_u + γ · (e_ui · q_i - λ₂ · p_u)

    # 5. Update item factors (note: uses p_u + implicit, not just p_u)
    q_i ← q_i + γ · (e_ui · (p_u + implicit_sum) - λ₂ · q_i)

    # 6. Update ALL implicit factors for items this user rated
    for j in I_u:
        y_j ← y_j + γ · (e_ui · (1/√|I_u|) · q_i - λ₂ · y_j)

Why updating y_j is expensive:

For each training example, we must update y_j for every item the user has rated. If a user rated 100 movies, that's 100 updates per training example. This makes SVD++ significantly slower than basic SVD (about 3-5x slower in practice).
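A common mitigation, sketched below on toy data, is to process ratings grouped by user: compute the implicit sum once per user, accumulate the shared y_j gradient term across that user's ratings, and write the updates back once at the end. This is a hedged simplification (regularization is applied once per user, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_items, k = 5, 8, 4
lr, reg = 0.007, 0.015
mu = 3.6

p = rng.normal(0, 0.01, (n_users, k))   # user factors
q = rng.normal(0, 0.01, (n_items, k))   # item factors
y = rng.normal(0, 0.01, (n_items, k))   # implicit factors
y0 = y.copy()                           # kept only to show y changed

# Ratings grouped by user: user -> list of (item, rating); toy data
by_user = {
    0: [(1, 4.0), (2, 5.0), (5, 3.0)],
    1: [(0, 2.0), (3, 4.0)],
}

for u, user_ratings in by_user.items():
    items = np.array([i for i, _ in user_ratings])
    norm = 1.0 / np.sqrt(len(items))
    implicit = norm * y[items].sum(axis=0)   # computed once per user
    y_acc = np.zeros(k)                      # shared gradient term, accumulated

    for i, r in user_ratings:
        e = r - (mu + q[i] @ (p[u] + implicit))   # biases omitted for brevity
        p[u] += lr * (e * q[i] - reg * p[u])
        q[i] += lr * (e * (p[u] + implicit) - reg * q[i])
        y_acc += e * norm * q[i]              # same term for every j in I_u

    # One write-back per user instead of one per rating
    y[items] += lr * (y_acc - reg * y[items])
```

Since the y_j gradient is identical for every item the user rated (up to regularization), accumulating it and writing back once turns O(|I_u|) updates per rating into O(|I_u|) per user.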

Koren's Exact Hyperparameters (from the Netflix Prize):

| Parameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | γ | 0.007 | For factors p_u, q_i, y_j |
| Bias learning rate | γ₁ | 0.005 | For biases b_u, b_i |
| Factor regularization | λ₂ | 0.015 | Prevents factor overfitting |
| Bias regularization | λ₁ | 0.005 | Lower than for factors |
| Learning rate decay | – | 0.9× per epoch | Multiply γ by 0.9 after each epoch |
| Number of epochs | – | 20-30 | More epochs with decay |
| Number of factors | k | 50-200 | 200 factors for final submission |

The Learning Rate Decay Schedule:

Koren found that decaying the learning rate was crucial for convergence:

Python
def train_svdpp_with_decay(
    model,
    train_data,
    user_items,
    epochs: int = 30,
    initial_lr: float = 0.007,
    lr_decay: float = 0.9,
    reg_factors: float = 0.015,
    reg_biases: float = 0.005,
):
    """
    Train SVD++ with Koren's exact hyperparameters and learning rate decay.
    """
    current_lr = initial_lr

    for epoch in range(epochs):
        total_se = 0.0  # Squared error sum

        # Shuffle training data each epoch (important for SGD!)
        np.random.shuffle(train_data)

        for user_id, item_id, rating in train_data:
            items_rated = user_items[user_id]
            num_rated = len(items_rated)

            if num_rated == 0:
                continue

            # Compute implicit sum
            norm_factor = 1.0 / np.sqrt(num_rated)
            implicit_sum = norm_factor * sum(model.y[j] for j in items_rated)

            # Prediction
            pred = (model.global_mean +
                    model.b_u[user_id] +
                    model.b_i[item_id] +
                    np.dot(model.q[item_id], model.p[user_id] + implicit_sum))

            # Clamp prediction to valid range
            pred = max(1.0, min(5.0, pred))

            # Error
            err = rating - pred
            total_se += err ** 2

            # Update biases
            model.b_u[user_id] += current_lr * (err - reg_biases * model.b_u[user_id])
            model.b_i[item_id] += current_lr * (err - reg_biases * model.b_i[item_id])

            # Cache for efficiency
            user_implicit = model.p[user_id] + implicit_sum

            # Update user factors
            model.p[user_id] += current_lr * (err * model.q[item_id] - reg_factors * model.p[user_id])

            # Update item factors
            model.q[item_id] += current_lr * (err * user_implicit - reg_factors * model.q[item_id])

            # Update implicit factors (expensive!)
            for j in items_rated:
                model.y[j] += current_lr * (err * norm_factor * model.q[item_id] - reg_factors * model.y[j])

        # Compute RMSE
        rmse = np.sqrt(total_se / len(train_data))
        print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}, LR = {current_lr:.6f}")

        # Decay learning rate
        current_lr *= lr_decay

Convergence Behavior:

Code
Typical training progression on Netflix data:
─────────────────────────────────────────────────────────────────────────
Epoch 1:  RMSE = 0.9234  LR = 0.007000  (big initial improvement)
Epoch 2:  RMSE = 0.9089  LR = 0.006300
Epoch 3:  RMSE = 0.9012  LR = 0.005670
Epoch 5:  RMSE = 0.8967  LR = 0.004592
Epoch 10: RMSE = 0.8923  LR = 0.002824  (slowing down)
Epoch 15: RMSE = 0.8911  LR = 0.001736
Epoch 20: RMSE = 0.8907  LR = 0.001067
Epoch 25: RMSE = 0.8905  LR = 0.000656  (nearly converged)
Epoch 30: RMSE = 0.8904  LR = 0.000403

Final RMSE ≈ 0.8904 (on training set)
Probe RMSE ≈ 0.8911 (on held-out validation)

Why the Gradient Updates Work:

Let's trace through one example intuitively:

Code
Scenario: User Alice (u=42) rates movie "Inception" (i=1701) with 5 stars.
Alice has previously rated: [Matrix, Interstellar, Blade Runner] (all sci-fi)
Our model predicts: 4.2 stars

Error: e = 5.0 - 4.2 = +0.8 (we under-predicted!)

Updates (with γ=0.007):

1. b_u (Alice's bias) goes UP by 0.007 × 0.8 = +0.0056
   → "Alice tends to rate higher than we thought"

2. b_i (Inception's bias) goes UP by +0.0056
   → "Inception is better than we thought"

3. p_u (Alice's taste vector) moves TOWARD q_inception
   → "Alice likes what Inception represents"

4. q_i (Inception's attribute vector) moves TOWARD (p_alice + implicit)
   → "Inception matches users like Alice"

5. For each j in [Matrix, Interstellar, Blade Runner]:
   y_j moves TOWARD q_inception (scaled by 1/√3)
   → "Users who rate Matrix/Interstellar/Blade Runner also like Inception"

The implicit feedback update (step 5) is the key insight: we're learning that the act of rating sci-fi movies predicts liking Inception, even before we look at the rating values.


timeSVD++: Adding Temporal Dynamics

timeSVD++ extends SVD++ by making parameters time-dependent. User preferences drift over time, and movie perceptions change as they age. This was described in Koren's "Collaborative Filtering with Temporal Dynamics" (KDD 2009).

Key discoveries from the Netflix data:

  1. Rating scale shift (early 2004): Mean ratings jumped from ~3.4 to ~3.6 stars when Netflix changed their UI
  2. Movie aging effect: Older movies consistently receive higher ratings
  3. User preference drift: Individual tastes evolve over months/years

The time-dependent prediction formula:

\hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + q_i^T \left(p_u(t) + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)

Where the time-dependent terms are:

b_u(t) = b_u + \alpha_u \cdot \text{dev}_u(t) + b_{u,\text{Bin}(t)}

b_i(t) = b_i + b_{i,\text{Bin}(t)}

p_u(t) = p_u + \alpha_u \cdot \text{dev}_u(t)

Python
class TimeSVDPlusPlus(nn.Module):
    """
    timeSVD++ (Koren, KDD 2009) - The full Netflix Prize winning model.

    Extends SVD++ with temporal dynamics:
    - User biases drift over time
    - Item biases change as movies age
    - User preferences evolve

    A timeSVD++ model with 10 factors outperforms standard SVD with 200 factors!

    Time-dependent prediction:
    r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + |I_u|^(-1/2) Σ y_j)

    Where:
    b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
    b_i(t) = b_i + b_{i,Bin(t)}
    p_u(t) = p_u + α_u · dev_u(t)
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        num_factors: int = 50,
        num_time_bins: int = 30,  # e.g., 30 bins for monthly granularity
        global_mean: float = 0.0,
        beta: float = 0.4,  # Power for dev function (Koren used 0.4)
    ):
        super().__init__()

        self.num_factors = num_factors
        self.num_time_bins = num_time_bins
        self.beta = beta
        self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)

        # Static biases and factors (same as SVD++)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)
        self.implicit_item_factors = nn.Embedding(num_items, num_factors)

        # TEMPORAL COMPONENTS - THE timeSVD++ INNOVATION

        # α_u: User-specific drift rate (some users' tastes change faster)
        self.user_alpha = nn.Embedding(num_users, 1)

        # b_{u,Bin(t)}: User bias per time bin
        self.user_bias_time = nn.Embedding(num_users * num_time_bins, 1)

        # b_{i,Bin(t)}: Item bias per time bin (movies perceived differently over time)
        self.item_bias_time = nn.Embedding(num_items * num_time_bins, 1)

        # User factor drift (p_u changes over time)
        self.user_factors_alpha = nn.Embedding(num_users, num_factors)

        # Store user mean times (computed from training data)
        self.register_buffer('user_mean_time', torch.zeros(num_users))

        self._init_weights()

    def _init_weights(self):
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)
        nn.init.zeros_(self.user_alpha.weight)
        nn.init.zeros_(self.user_bias_time.weight)
        nn.init.zeros_(self.item_bias_time.weight)
        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)
        nn.init.normal_(self.implicit_item_factors.weight, std=0.01)
        nn.init.normal_(self.user_factors_alpha.weight, std=0.01)

    def compute_dev(self, user_ids: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
        """
        Compute dev_u(t) = sign(t - t̄_u) · |t - t̄_u|^β

        This captures how far a rating is from the user's "typical" time.
        Users who rate early vs late in their Netflix membership differ.
        """
        mean_times = self.user_mean_time[user_ids]
        diff = times - mean_times
        sign = torch.sign(diff)
        dev = sign * (torch.abs(diff) ** self.beta)
        return dev

    def forward(
        self,
        user_ids: torch.Tensor,
        item_ids: torch.Tensor,
        times: torch.Tensor,  # Normalized time values (0 to num_time_bins-1)
        time_bins: torch.Tensor,  # Discretized time bin indices
        user_item_lists: list[list[int]],
    ) -> torch.Tensor:
        """
        Predict ratings with temporal dynamics.

        Args:
            user_ids: (batch_size,) user indices
            item_ids: (batch_size,) item indices
            times: (batch_size,) continuous time values
            time_bins: (batch_size,) discretized time bin indices
            user_item_lists: For each user, list of items they've rated

        Returns:
            predictions: (batch_size,) predicted ratings
        """
        batch_size = user_ids.shape[0]

        # === STATIC COMPONENTS (same as SVD++) ===
        b_u_static = self.user_bias(user_ids).squeeze(-1)
        b_i_static = self.item_bias(item_ids).squeeze(-1)
        p_u_static = self.user_factors(user_ids)
        q_i = self.item_factors(item_ids)

        # === TEMPORAL BIAS COMPONENTS ===

        # dev_u(t) for this batch
        dev = self.compute_dev(user_ids, times)  # (B,)

        # α_u · dev_u(t) - user-specific drift
        alpha_u = self.user_alpha(user_ids).squeeze(-1)  # (B,)
        user_drift = alpha_u * dev  # (B,)

        # b_{u,Bin(t)} - user bias for this time bin
        user_time_indices = user_ids * self.num_time_bins + time_bins
        b_u_time = self.user_bias_time(user_time_indices).squeeze(-1)  # (B,)

        # b_{i,Bin(t)} - item bias for this time bin
        item_time_indices = item_ids * self.num_time_bins + time_bins
        b_i_time = self.item_bias_time(item_time_indices).squeeze(-1)  # (B,)

        # Full time-dependent biases:
        # b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
        # b_i(t) = b_i + b_{i,Bin(t)}
        b_u_t = b_u_static + user_drift + b_u_time
        b_i_t = b_i_static + b_i_time

        # === TEMPORAL USER FACTORS ===
        # p_u(t) = p_u + α_u · dev_u(t)
        p_u_alpha = self.user_factors_alpha(user_ids)  # (B, k)
        p_u_t = p_u_static + p_u_alpha * dev.unsqueeze(-1)  # (B, k)

        # === IMPLICIT FEEDBACK (same as SVD++) ===
        implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)
        for idx, items_rated in enumerate(user_item_lists):
            if len(items_rated) > 0:
                items_tensor = torch.tensor(items_rated, device=user_ids.device)
                y_j = self.implicit_item_factors(items_tensor)
                normalization = 1.0 / np.sqrt(len(items_rated))
                implicit_sum[idx] = y_j.sum(dim=0) * normalization

        # === FINAL PREDICTION ===
        # r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + implicit)
        user_representation = p_u_t + implicit_sum
        interaction = (q_i * user_representation).sum(dim=-1)
        prediction = self.global_mean + b_u_t + b_i_t + interaction

        return prediction

def precompute_user_mean_times(
    train_data: list[tuple],  # (user_id, item_id, rating, time)
    num_users: int,
) -> torch.Tensor:
    """
    Compute t̄_u (mean rating time) for each user.
    Required for the dev_u(t) function in timeSVD++.
    """
    user_times = defaultdict(list)
    for user_id, _, _, time in train_data:
        user_times[user_id].append(time)

    mean_times = torch.zeros(num_users)
    for user_id, times in user_times.items():
        mean_times[user_id] = np.mean(times)

    return mean_times

# Example usage:
"""
# Training data now includes timestamps
train_data = [
    (user_id, item_id, rating, timestamp),
    ...
]

# Convert timestamps to time bins (e.g., weeks or months)
def timestamp_to_bin(ts, min_ts, max_ts, num_bins=30):
    normalized = (ts - min_ts) / (max_ts - min_ts)
    return int(normalized * (num_bins - 1))

# Precompute user mean times
user_mean_times = precompute_user_mean_times(train_data, num_users)

# Initialize model
model = TimeSVDPlusPlus(
    num_users=480189,    # Netflix had 480K users
    num_items=17770,     # Netflix had 17.7K movies
    num_factors=50,
    num_time_bins=30,
    global_mean=3.6,
)
model.user_mean_time = user_mean_times

# Training loop similar to SVD++ but with time information
"""

Performance comparison on Netflix:

| Model | Factors | RMSE | Notes |
|---|---|---|---|
| SVD (baseline) | 200 | 0.8914 | No implicit, no time |
| SVD++ | 200 | 0.8911 | With implicit feedback |
| timeSVD++ | 10 | < 0.8914 | 10 factors beats 200! |
| timeSVD++ | 200 | 0.8762 | Best single model |

The key insight: temporal dynamics are more important than model capacity. A small model that understands time beats a large model that doesn't.

BPR: Bayesian Personalized Ranking

Most real-world recommendation data is implicit: clicks, views, purchases, time spent, not explicit 1-5 star ratings. This fundamentally changes the problem. With explicit ratings, you predict "user U would rate item I 4.2 stars." With implicit feedback, you only observe what users interacted with; an unobserved item might be disliked or simply never seen.

The key insight of BPR:

Don't predict absolute scores. Instead, learn to rank items correctly. If user U clicked on item I but not item J, then I should rank higher than J for user U. This pairwise comparison is the core of BPR.

The training approach:

  1. Sample a user $u$
  2. Sample a positive item $i$ (user interacted with it)
  3. Sample a negative item $j$ (user didn't interact with it)
  4. Train the model so that $\hat{r}_{ui} > \hat{r}_{uj}$

The BPR loss function:

$\mathcal{L}_{BPR} = -\sum_{(u, i, j)} \ln \sigma(\hat{r}_{ui} - \hat{r}_{uj}) + \lambda \|\Theta\|^2$

Where $\sigma$ is the sigmoid function. This loss pushes the model to score positive items higher than negative items, with a margin determined by the sigmoid.
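As a quick numeric sanity check (toy scores, not from the text), the loss for one triplet is small when the positive already outranks the negative and large when the ordering is flipped:

```python
import numpy as np

def bpr_loss(pos_score: float, neg_score: float) -> float:
    """-ln(sigmoid(pos - neg)): near zero when pos >> neg."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(pos_score - neg_score)))))

# Correctly ordered pair: small loss, small gradient
print(round(bpr_loss(2.0, -1.0), 4))  # 0.0486
# Mis-ordered pair: large loss, pushing the scores apart
print(round(bpr_loss(-1.0, 2.0), 4))  # 3.0486
```

Note that only the score difference matters, never the absolute values: shifting both scores by a constant leaves the loss unchanged.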

Why BPR works better for implicit feedback:

  • Point-wise losses fail: MSE on 0/1 labels treats all non-interactions as "dislike." But a user might love an item they just haven't seen yet.
  • Pairwise comparison is natural: We're confident the user prefers clicked items over random ones. We're not confident they hate random ones.
  • Ranking is what matters: In production, we show the top-k items. Getting the ranking right matters more than exact scores.

Negative sampling strategies:

The choice of negative samples dramatically affects learning:

  • Uniform random: Simple but wastes training on easy negatives
  • Popularity-based: Sample popular items as negatives (harder negatives)
  • In-batch negatives: Use other users' positives as negatives (efficient)
  • Hard negative mining: Find items the model incorrectly ranks high
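The training loop later in this section uses uniform sampling; a popularity-based sampler is a small change. This is a sketch (the function name and the 0.75 smoothing exponent are illustrative choices, not from the text):

```python
import numpy as np

def sample_popularity_negative(item_counts: np.ndarray, user_positives: set,
                               rng: np.random.Generator, alpha: float = 0.75) -> int:
    """Draw a negative item with probability ∝ count^alpha,
    rejecting items the user has already interacted with."""
    probs = item_counts.astype(float) ** alpha  # alpha < 1 dampens head items
    probs /= probs.sum()
    while True:
        item = int(rng.choice(len(item_counts), p=probs))
        if item not in user_positives:
            return item

rng = np.random.default_rng(0)
counts = np.array([100, 10, 1, 50])  # item 0 is most popular
neg = sample_popularity_negative(counts, user_positives={0}, rng=rng)
print(neg)  # an item other than 0, biased toward the popular item 3
```

Popular negatives are "harder": the model cannot dismiss them as obscure, so it must learn genuinely personalized scores.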
Python
class BPR(nn.Module):
    """
    Bayesian Personalized Ranking.
    Optimizes: user prefers positive item over negative item.

    Key insight: We don't need absolute scores, just relative ranking.
    """

    def __init__(self, num_users: int, num_items: int, num_factors: int = 50):
        super().__init__()

        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)

        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)

    def forward(self, user_ids, pos_item_ids, neg_item_ids):
        """
        Compute BPR loss for triplets.

        Args:
            user_ids: (batch_size,)
            pos_item_ids: (batch_size,) - items user interacted with
            neg_item_ids: (batch_size,) - sampled negative items
        """
        user_emb = self.user_factors(user_ids)
        pos_emb = self.item_factors(pos_item_ids)
        neg_emb = self.item_factors(neg_item_ids)

        # Scores
        pos_score = (user_emb * pos_emb).sum(dim=1)
        neg_score = (user_emb * neg_emb).sum(dim=1)

        # BPR loss: -log(sigmoid(pos - neg))
        loss = -torch.log(torch.sigmoid(pos_score - neg_score) + 1e-8)

        return loss.mean()

    @torch.no_grad()
    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n items for user.

        Note: production systems also filter out items the user has
        already interacted with before taking the top-n.
        """
        user_emb = self.user_factors.weight[user_id]  # (k,)
        all_scores = self.item_factors.weight @ user_emb  # (num_items,)

        top_items = torch.topk(all_scores, n).indices.tolist()
        return top_items

def train_bpr(model, train_interactions, epochs=20, lr=0.01):
    """
    Train BPR with negative sampling.

    Args:
        train_interactions: List of (user_id, item_id) positive pairs
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    num_items = model.item_factors.num_embeddings

    # Build user->items dict for negative sampling
    user_items = {}
    for user, item in train_interactions:
        if user not in user_items:
            user_items[user] = set()
        user_items[user].add(item)

    for epoch in range(epochs):
        total_loss = 0
        np.random.shuffle(train_interactions)

        for user, pos_item in train_interactions:
            # Sample negative item (not in user's history)
            neg_item = np.random.randint(0, num_items)
            while neg_item in user_items[user]:
                neg_item = np.random.randint(0, num_items)

            user_t = torch.tensor([user])
            pos_t = torch.tensor([pos_item])
            neg_t = torch.tensor([neg_item])

            loss = model(user_t, pos_t, neg_t)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(train_interactions):.4f}")

Matrix Factorization Legacy:

  • Netflix Prize winner used an ensemble of 107 models, many MF-based
  • Fundamental insight: Embeddings capture latent preferences
  • This idea persists in all modern deep learning RecSys

Part III: Deep Learning for Recommendations

The Limitations of Matrix Factorization

Matrix factorization is linear: prediction = dot product of embeddings. It cannot capture complex, non-linear interactions between user preferences and item attributes.

Code
MF limitation:
If user likes [action, comedy] and item is [action, romance],
MF computes: action*action + comedy*romance = positive (maybe wrong)

We need non-linear combinations:
"User likes action AND comedy together" (not just sum of factors)

Neural Collaborative Filtering (NCF)

Matrix factorization uses a simple dot product to combine user and item embeddings. But a dot product is a linear operation—it can only model linear interactions between latent factors. What if the true user-item relationship is nonlinear?

NCF (WWW 2017) answers this by replacing the dot product with a neural network. The key insight: let the network learn an arbitrary interaction function between user and item representations.

The architecture:

  1. Embedding layer: Map user ID → user embedding, item ID → item embedding (same as MF)
  2. Interaction layers: Instead of dot product, concatenate embeddings and pass through MLPs
  3. Output layer: Predict probability of interaction

Two interaction approaches combined:

NCF actually combines two sub-networks:

  1. GMF (Generalized Matrix Factorization): Element-wise product of embeddings (like MF, but with learnable weights): $\phi^{GMF} = p_u \odot q_i$

  2. MLP: Concatenate embeddings and pass through hidden layers: $\phi^{MLP} = \text{MLP}([p_u; q_i])$

The final prediction combines both: $\hat{y}_{ui} = \sigma(h^T [\phi^{GMF}; \phi^{MLP}])$

Why this matters:

  • Nonlinear interactions: A user might like action movies AND comedies, but hate action-comedies. Linear models can't capture this. Neural networks can.
  • Automatic feature learning: MLPs learn what combinations of latent factors matter, instead of assuming simple dot product.
  • Flexible capacity: Add more layers or neurons to increase model capacity.

The trade-off:

NCF is more expressive but also more prone to overfitting. Matrix factorization has an elegant closed-form interpretation; NCF is a black box. In practice, the gains from NCF over well-tuned MF are often modest (5-10%), and MF remains competitive for many applications.

Python
class NCF(nn.Module):
    """
    Neural Collaborative Filtering.
    Replace dot product with learnable MLP.

    Architecture:
    - GMF branch: Generalized Matrix Factorization (element-wise product)
    - MLP branch: Multi-layer perceptron for non-linear interactions
    - Combine: Concatenate and predict
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        gmf_dim: int = 32,
        mlp_dims: list[int] = [64, 32, 16],
    ):
        super().__init__()

        # GMF embeddings
        self.gmf_user = nn.Embedding(num_users, gmf_dim)
        self.gmf_item = nn.Embedding(num_items, gmf_dim)

        # MLP embeddings (separate from GMF)
        self.mlp_user = nn.Embedding(num_users, mlp_dims[0] // 2)
        self.mlp_item = nn.Embedding(num_items, mlp_dims[0] // 2)

        # MLP layers
        layers = []
        for i in range(len(mlp_dims) - 1):
            layers.append(nn.Linear(mlp_dims[i], mlp_dims[i + 1]))
            layers.append(nn.ReLU())
        self.mlp = nn.Sequential(*layers)

        # Final prediction layer
        self.predict = nn.Linear(gmf_dim + mlp_dims[-1], 1)

        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Embedding):
                nn.init.normal_(m.weight, std=0.01)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        """
        Args:
            user_ids: (batch_size,)
            item_ids: (batch_size,)

        Returns:
            predictions: (batch_size,)
        """
        # GMF branch: element-wise product
        gmf_user_emb = self.gmf_user(user_ids)
        gmf_item_emb = self.gmf_item(item_ids)
        gmf_out = gmf_user_emb * gmf_item_emb  # (B, gmf_dim)

        # MLP branch: concatenate and transform
        mlp_user_emb = self.mlp_user(user_ids)
        mlp_item_emb = self.mlp_item(item_ids)
        mlp_input = torch.cat([mlp_user_emb, mlp_item_emb], dim=1)
        mlp_out = self.mlp(mlp_input)  # (B, mlp_dims[-1])

        # Combine GMF and MLP
        combined = torch.cat([gmf_out, mlp_out], dim=1)

        # Predict
        pred = self.predict(combined).squeeze(-1)  # (B,) even when batch size is 1

        return torch.sigmoid(pred)  # For binary (click/no-click)

NCF Controversy: A 2020 RecSys paper showed that with proper hyperparameters, simple dot products can match NCF. The lesson: Don't overcomplicate unless you have evidence non-linearity helps.

Wide & Deep Learning

Google's Wide & Deep (2016) paper introduced one of the most influential architectures in industrial recommender systems. It powers Google Play app recommendations and has been adopted across the industry.

The core insight: Memorization vs Generalization

Recommendation systems face a fundamental tension:

  1. Memorization: Learn specific patterns from historical data. "Users who installed Netflix also install Hulu." This captures rare but valuable co-occurrences.

  2. Generalization: Learn broad patterns that apply to new situations. "Users who like streaming apps might like any new streaming app." This handles new items and sparse data.

Linear models (like logistic regression) are great at memorization through feature crosses, but require manual feature engineering. Deep neural networks automatically learn features and generalize well, but can over-generalize and miss important specific patterns.

Wide & Deep combines both:

Code
         ┌─────────────────────────────────────────┐
         │              Wide & Deep                 │
         │                                          │
         │   Wide (Linear)    Deep (DNN)           │
         │        ↓               ↓                 │
         │   Memorization    Generalization        │
         │        ↓               ↓                 │
         │        └───────┬───────┘                │
         │                ↓                         │
         │           Combined                       │
         │           Prediction                     │
         └─────────────────────────────────────────┘

The Wide component:

A linear model on cross-product features: $y_{wide} = w^T [x, \phi(x)] + b$

Where $\phi(x)$ are cross-product transformations like:

  • AND(installed_app=netflix, impression_app=hulu)
  • AND(user_country=US, app_language=english)

These must be manually engineered based on domain knowledge.
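In code, a cross-product transformation is just an AND of indicator features. A minimal sketch (the feature names follow the examples above; the helper itself is illustrative):

```python
def cross_features(example: dict, crosses: list) -> dict:
    """Emit one binary feature per configured cross; the feature fires (=1)
    for the value pair present in this example, all other pairs are implicitly 0."""
    out = {}
    for f1, f2 in crosses:
        out[f"AND({f1}={example[f1]}, {f2}={example[f2]})"] = 1
    return out

example = {"installed_app": "netflix", "impression_app": "hulu",
           "user_country": "US", "app_language": "english"}
feats = cross_features(example, [("installed_app", "impression_app"),
                                 ("user_country", "app_language")])
print(feats)
# {'AND(installed_app=netflix, impression_app=hulu)': 1,
#  'AND(user_country=US, app_language=english)': 1}
```

In production these string keys are hashed into a fixed-size sparse vector that feeds the linear wide component.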

The Deep component:

A standard feedforward network on dense embeddings: $a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$

Categorical features are embedded; continuous features are normalized. The network automatically learns feature interactions.

The combined prediction:

$P(Y=1|x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)$

Both components are trained jointly end-to-end.

Why this works in practice:

  • Best of both worlds: Wide captures specific user-item affinities; deep generalizes to new combinations
  • Handles cold start: New items with features but no history can get recommendations through the deep component
  • Production-friendly: The wide component can use any feature engineering you already have
Python
class WideAndDeep(nn.Module):
    """
    Wide & Deep Learning for Recommendations.

    Wide component: Linear model for memorization
    - Captures specific patterns: "users who bought X also bought Y"

    Deep component: DNN for generalization
    - Captures general patterns: "users who like action movies"
    """

    def __init__(
        self,
        num_wide_features: int,
        num_deep_features: int,
        deep_dims: list[int] = [256, 128, 64],
    ):
        super().__init__()

        # Wide: Linear model on cross-product features
        self.wide = nn.Linear(num_wide_features, 1)

        # Deep: MLP on dense features
        layers = [nn.Linear(num_deep_features, deep_dims[0]), nn.ReLU()]
        for i in range(len(deep_dims) - 1):
            layers.append(nn.Linear(deep_dims[i], deep_dims[i + 1]))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(deep_dims[-1], 1))
        self.deep = nn.Sequential(*layers)

    def forward(self, wide_features: torch.Tensor, deep_features: torch.Tensor):
        """
        Args:
            wide_features: (B, num_wide_features) - sparse cross-product features
            deep_features: (B, num_deep_features) - dense embeddings
        """
        wide_out = self.wide(wide_features)
        deep_out = self.deep(deep_features)

        return torch.sigmoid(wide_out + deep_out)

Wide & Deep at Google:

  • Used for Google Play app recommendations
  • Wide: Manually engineered cross-features (e.g., app × installed_apps)
  • Deep: Embeddings for categorical features

Part IV: Feature Interaction Models

DeepFM: Automating Feature Crosses

Wide & Deep's limitation: you must manually engineer the cross features for the wide component. This requires domain expertise and experimentation. What if we could automatically learn all useful feature interactions?

DeepFM (IJCAI 2017) solves this by replacing Wide & Deep's manual crosses with Factorization Machines (FM), which automatically model all pairwise feature interactions.

Understanding Factorization Machines:

FM models all second-order feature interactions:

$y_{FM} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$

Where:

  • $w_0, w_i$ are first-order weights (like linear regression)
  • $v_i \in \mathbb{R}^k$ is the latent vector for feature $i$
  • $\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$ is the interaction strength

The key insight: instead of learning $n^2$ interaction weights directly (intractable for sparse data), learn $n \times k$ latent vectors and compute interactions via dot products. This makes sparse interactions learnable.
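Concretely, the pairwise sum reduces to $\frac{1}{2}\big(\|\sum_i x_i v_i\|^2 - \sum_i \|x_i v_i\|^2\big)$, an $O(nk)$ computation (the DeepFM code below uses the same trick). A quick numeric check on random toy data:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 6, 4                  # 6 features, latent dimension 4
V = rng.normal(size=(n, k))  # latent vectors v_i
x = rng.normal(size=n)       # feature values x_i

# Naive O(n^2 k): explicit double loop over all pairs i < j
naive = sum(float(V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Efficient O(n k): 0.5 * (square-of-sum minus sum-of-squares)
vx = V * x[:, None]
fast = 0.5 * float((vx.sum(axis=0) ** 2).sum() - (vx ** 2).sum())

print(np.isclose(naive, fast))  # True
```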

The DeepFM architecture:

Code
          ┌────────────────────────────────────────┐
          │               DeepFM                    │
          │                                         │
          │ ┌─────────┐         ┌─────────────┐   │
          │ │   FM    │         │    DNN      │   │
          │ │ (auto   │         │ (high-order │   │
          │ │ 2nd     │         │ interactions│   │
          │ │ order)  │         │ )           │   │
          │ └────┬────┘         └──────┬──────┘   │
          │      │                     │           │
          │      └──────────┬──────────┘           │
          │                 ↓                      │
          │         Final Prediction               │
          └────────────────────────────────────────┘

FM component captures all 2nd-order interactions automatically:

  • user_age × item_category
  • user_country × item_language
  • hour_of_day × item_type
  • ... all pairs, without manual feature engineering

DNN component captures higher-order interactions through deep layers.

Why DeepFM beats Wide & Deep:

  1. No manual feature engineering: FM automatically learns which pairs interact
  2. Shared embeddings: FM and DNN share the same embedding layer, reducing parameters and improving learning
  3. End-to-end learning: All interactions are learned jointly

The math:

$\hat{y} = \sigma(y_{FM} + y_{DNN})$

Where $y_{FM}$ is the FM output and $y_{DNN}$ is the deep network output.

Python
class DeepFM(nn.Module):
    """
    DeepFM: FM + DNN for automatic feature interactions.

    FM component: 2nd-order feature interactions (efficient)
    DNN component: Higher-order feature interactions
    """

    def __init__(
        self,
        field_dims: list[int],  # Number of unique values per field
        embedding_dim: int = 16,
        mlp_dims: list[int] = [256, 128, 64],
    ):
        super().__init__()

        self.num_fields = len(field_dims)
        self.embedding_dim = embedding_dim

        # Embedding tables for each field
        self.embeddings = nn.ModuleList([
            nn.Embedding(dim, embedding_dim)
            for dim in field_dims
        ])

        # First-order weights
        self.first_order = nn.ModuleList([
            nn.Embedding(dim, 1)
            for dim in field_dims
        ])

        # Global bias
        self.bias = nn.Parameter(torch.zeros(1))

        # DNN
        dnn_input_dim = self.num_fields * embedding_dim
        layers = [nn.Linear(dnn_input_dim, mlp_dims[0]), nn.ReLU(), nn.Dropout(0.2)]
        for i in range(len(mlp_dims) - 1):
            layers.extend([
                nn.Linear(mlp_dims[i], mlp_dims[i + 1]),
                nn.ReLU(),
                nn.Dropout(0.2)
            ])
        layers.append(nn.Linear(mlp_dims[-1], 1))
        self.dnn = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, num_fields) - field indices

        Returns:
            predictions: (B,)
        """
        # Get embeddings for each field
        emb_list = [
            self.embeddings[i](x[:, i])
            for i in range(self.num_fields)
        ]
        emb_stack = torch.stack(emb_list, dim=1)  # (B, num_fields, emb_dim)

        # First-order term
        first_order = sum([
            self.first_order[i](x[:, i]).squeeze(-1)
            for i in range(self.num_fields)
        ])

        # FM second-order: (sum of embeddings)^2 - sum of (embeddings^2)
        # This efficiently computes all pairwise interactions
        sum_of_emb = emb_stack.sum(dim=1)  # (B, emb_dim)
        sum_of_square = (emb_stack ** 2).sum(dim=1)  # (B, emb_dim)
        square_of_sum = sum_of_emb ** 2  # (B, emb_dim)

        fm_second_order = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)

        # DNN
        dnn_input = emb_stack.view(x.size(0), -1)  # (B, num_fields * emb_dim)
        dnn_out = self.dnn(dnn_input).squeeze(-1)

        # Combine
        output = self.bias + first_order + fm_second_order + dnn_out

        return torch.sigmoid(output)

DCN: Deep & Cross Network

DeepFM automates 2nd-order feature interactions. But what about 3rd-order? 4th-order? Higher-order interactions can be crucial: "young male users in California who browse on mobile in the evening" is a 5-way interaction.

DCN (KDD 2017) introduces the Cross Network, which explicitly models feature crosses of bounded degree at each layer.

The key innovation: Cross layers

A cross layer computes: $x_{l+1} = x_0 \, (w_l^\top x_l) + b_l + x_l$

Where:

  • $x_0$ is the original input (always included!)
  • $x_l$ is the output of layer $l$
  • $w_l^\top x_l$ is a learned scalar, so $x_0 (w_l^\top x_l)$ rescales every input feature
  • The $+ x_l$ is a residual connection

Why this captures high-order interactions:

  • Layer 1 captures 2nd-order interactions: $x_0 (w_1^\top x_0)$ contains all pairs
  • Layer 2 captures 3rd-order: $x_0 (w_2^\top x_1)$, where $x_1$ already contains the pairs
  • Layer $l$ captures up to $(l+1)$-order interactions

The beauty: each layer adds only $O(d)$ parameters (where $d$ is the input dimension), not $d^2$. This is polynomial complexity in interaction order, not exponential.
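The degree claim is easy to verify by hand on a 2-dimensional input. This sketch runs one cross layer (bias set to zero for readability) and checks it against the explicit degree-2 polynomial expansion:

```python
import numpy as np

x0 = np.array([2.0, 3.0])   # input features [a, b]
w = np.array([0.5, -1.0])   # cross-layer weight vector (bias omitted)

# One cross layer: x1 = x0 * (w · x0) + x0
x1 = x0 * (w @ x0) + x0

# Hand expansion: every coordinate is a degree-2 polynomial in (a, b)
a, b = x0
expected = np.array([a * (w[0] * a + w[1] * b) + a,   # a^2, ab terms
                     b * (w[0] * a + w[1] * b) + b])  # ab, b^2 terms
print(np.allclose(x1, expected))  # True
```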

DCN architecture:

Code
         ┌────────────────────────────────────────┐
         │                DCN                      │
         │                                         │
         │   ┌──────────┐      ┌──────────────┐  │
         │   │  Cross   │      │    Deep      │  │
         │   │  Network │      │   Network    │  │
         │   │  (explicit│      │   (implicit  │  │
         │   │  crosses) │      │   crosses)   │  │
         │   └────┬─────┘      └──────┬───────┘  │
         │        │                   │           │
         │        └─────────┬─────────┘           │
         │                  ↓                     │
         │          Concatenate & MLP             │
         │                  ↓                     │
         │            Prediction                  │
         └────────────────────────────────────────┘

Cross Network vs Deep Network:

Aspect            Cross Network                      Deep Network
Interactions      Explicit, bounded-degree           Implicit, learned
Parameters        Linear in depth                    Quadratic in width
Interpretability  Can analyze which crosses matter   Black box
Expressiveness    All crosses up to degree L         Arbitrary nonlinear

DCN-v2 (2020 improvement):

The original DCN uses a vector $w_l$ for each layer. DCN-v2 uses a full matrix $W_l$, making each cross layer a low-rank approximation of all pairwise interactions. This significantly improves performance.

Python
class CrossNetwork(nn.Module):
    """
    Cross Network: Explicit bounded-degree feature crossing.

    Each layer computes:
    x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l

    This captures up to (l+1)-order feature interactions at layer l.
    """

    def __init__(self, input_dim: int, num_layers: int = 3):
        super().__init__()

        self.num_layers = num_layers
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(input_dim))
            for _ in range(num_layers)
        ])
        self.biases = nn.ParameterList([
            nn.Parameter(torch.zeros(input_dim))
            for _ in range(num_layers)
        ])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x0: (B, input_dim) - initial features

        Returns:
            output: (B, input_dim) - crossed features
        """
        x = x0
        for i in range(self.num_layers):
            # x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l
            xw = (x * self.weights[i]).sum(dim=1, keepdim=True)
            x = x0 * xw + self.biases[i] + x

        return x

class DCN(nn.Module):
    """Deep & Cross Network."""

    def __init__(
        self,
        input_dim: int,
        cross_layers: int = 3,
        deep_dims: list[int] = [256, 128, 64],
    ):
        super().__init__()

        self.cross_net = CrossNetwork(input_dim, cross_layers)

        # Deep network
        layers = [nn.Linear(input_dim, deep_dims[0]), nn.ReLU()]
        for i in range(len(deep_dims) - 1):
            layers.extend([nn.Linear(deep_dims[i], deep_dims[i + 1]), nn.ReLU()])
        self.deep_net = nn.Sequential(*layers)

        # Combine cross and deep
        self.output = nn.Linear(input_dim + deep_dims[-1], 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cross_out = self.cross_net(x)
        deep_out = self.deep_net(x)

        combined = torch.cat([cross_out, deep_out], dim=1)
        return torch.sigmoid(self.output(combined))

Part V: Two-Tower Models for Retrieval

The Retrieval Problem

At scale (millions of items), you can't score every item for every user. The solution: Two-tower models for efficient retrieval.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TWO-TOWER ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHALLENGE:                                                              │
│  ───────────                                                             │
│  Score all 10M items for each user?                                     │
│  At 100M users × 10M items = 10^15 computations = IMPOSSIBLE           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: DECOUPLE USER AND ITEM ENCODING                              │
│  ───────────────────────────────────────────                             │
│                                                                          │
│  USER TOWER              ITEM TOWER                                     │
│  ───────────             ───────────                                    │
│  User features    →      Item features    →                             │
│  Neural network   →      Neural network   →                             │
│  User embedding   →      Item embedding   →                             │
│       ↓                       ↓                                         │
│       └───────── DOT PRODUCT ─┘                                        │
│                      ↓                                                   │
│                   Score                                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ─────────────                                                           │
│  Item embeddings are INDEPENDENT of users                               │
│  → Pre-compute ALL item embeddings offline                              │
│  → Store in vector index (FAISS, ScaNN, HNSW)                          │
│  → At serving: Compute user embedding, find nearest items              │
│  → Complexity: O(log N) instead of O(N)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
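The offline/online split can be sketched with brute-force dot products standing in for an ANN index such as FAISS (all embeddings here are random placeholders for real tower outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 10_000, 64

# OFFLINE: run the item tower over the whole catalog once, store the matrix
item_embeddings = rng.normal(size=(num_items, dim)).astype(np.float32)

# ONLINE: one user-tower forward pass, then nearest-neighbor search
user_embedding = rng.normal(size=dim).astype(np.float32)
scores = item_embeddings @ user_embedding  # (num_items,) dot products
top_k = np.argsort(-scores)[:10]           # exact top-10; ANN approximates this

print(top_k.shape)  # (10,)
```

A real system swaps the argsort for an approximate index, trading a tiny recall loss for logarithmic-time lookups.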

YouTube DNN

YouTube's Deep Neural Network (RecSys 2016) was foundational—perhaps the most influential industrial recommendation paper. It revealed how YouTube serves billions of recommendations daily with sub-100ms latency.

The core problem YouTube faced:

YouTube has over 800 million videos and 2 billion logged-in users monthly. Scoring every video for every user is computationally impossible (10^15+ operations). The solution: a two-stage funnel architecture.

Code
Stage 1: CANDIDATE GENERATION (what we call "retrieval")
─────────────────────────────────────────────────────────
800M videos → Neural network → ~1000 candidates
Goal: High recall (don't miss relevant videos)
Latency budget: ~10ms

Stage 2: RANKING (what we call "scoring")
─────────────────────────────────────────────────────────
1000 candidates → Complex model → ~50 ranked videos
Goal: High precision (order correctly)
Latency budget: ~50ms

The candidate generation model (what this code implements):

The key insight is treating recommendation as an extreme multiclass classification problem. Given the user's context (watch history, search history, demographics), predict which video they'll watch next—out of millions of classes.

Input features for the user tower:

  1. Watch history: The user's last N watched videos. Instead of using a complex sequence model, YouTube simply averages the embeddings of watched videos. This works surprisingly well and is fast.

  2. Search history: Similarly averaged embeddings of search queries (tokenized and embedded).

  3. Demographics: Age bucket, gender, geographic region.

  4. Context features: Time since last watch (crucial for freshness), device type.

The "example age" trick:

YouTube discovered that neural networks learn to predict popular videos because they appear more often in training data. To counteract this freshness bias, they include the video's age (time since upload) as a feature during training. At serving time, they set this to zero—telling the model "assume all videos were just uploaded"—which dramatically improved recommendation freshness.

Training with negative sampling:

With millions of classes, computing the full softmax is impossible. YouTube uses sampled softmax: for each positive example, sample a few thousand negative classes from the video corpus (weighted by popularity). The learned embedding space places similar users and items close together.
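A minimal numpy sketch of the idea (sizes and the random negatives are illustrative; real systems sample negatives by popularity and correct the logits for the sampling probability):

```python
import numpy as np

def sampled_softmax_loss(user_emb, pos_emb, neg_embs):
    """Cross-entropy over the positive plus K sampled negatives,
    instead of the full multi-million-item softmax."""
    logits = np.concatenate([[user_emb @ pos_emb], neg_embs @ user_emb])
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # the positive sits at index 0

rng = np.random.default_rng(1)
d, K = 32, 1000                         # embedding dim, negatives per example
user = rng.normal(size=d)
pos = user + 0.1 * rng.normal(size=d)   # positive correlated with the user
negs = rng.normal(size=(K, d))          # stand-in for popularity sampling

loss = sampled_softmax_loss(user, pos, negs)
print(loss > 0)  # True: a proper cross-entropy over 1 + K classes
```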

Why average pooling works:

Averaging embeddings loses order information but captures "interest topics." A user who watched [sports, sports, cooking] ends up in a similar embedding region to another sports+cooking fan. For coarse-grained retrieval, this is sufficient—ranking handles fine-grained preferences.

Python
class YouTubeDNN(nn.Module):
    """
    YouTube's Deep Neural Network for Recommendations.

    User tower:
    - Watch history (average of video embeddings)
    - Search history
    - Demographics
    - Context (time, device)

    Item tower:
    - Video embeddings
    """

    def __init__(
        self,
        num_videos: int,
        embedding_dim: int = 256,
        hidden_dims: list[int] = [1024, 512, 256],
    ):
        super().__init__()

        # Video embeddings (shared between history and candidates)
        self.video_embedding = nn.Embedding(num_videos, embedding_dim)

        # User tower
        user_input_dim = embedding_dim * 2 + 32  # History + search + demographics
        layers = [nn.Linear(user_input_dim, hidden_dims[0]), nn.ReLU()]
        for i in range(len(hidden_dims) - 1):
            layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.ReLU()])
        self.user_tower = nn.Sequential(*layers)

        # Demographics embedding
        self.age_embedding = nn.Embedding(10, 16)  # Age buckets
        self.gender_embedding = nn.Embedding(3, 16)

    def encode_user(
        self,
        watch_history: torch.Tensor,  # (B, max_history)
        watch_mask: torch.Tensor,  # (B, max_history) - 1 for valid, 0 for padding
        search_history: torch.Tensor,  # (B, max_search)
        search_mask: torch.Tensor,
        age: torch.Tensor,  # (B,)
        gender: torch.Tensor,  # (B,)
    ) -> torch.Tensor:
        """Encode user into embedding."""

        # Average watch history embeddings
        watch_emb = self.video_embedding(watch_history)  # (B, H, D)
        watch_emb = (watch_emb * watch_mask.unsqueeze(-1)).sum(dim=1)
        watch_emb = watch_emb / watch_mask.sum(dim=1, keepdim=True).clamp(min=1)

        # Average search history (simplified - could use separate embeddings)
        search_emb = self.video_embedding(search_history)
        search_emb = (search_emb * search_mask.unsqueeze(-1)).sum(dim=1)
        search_emb = search_emb / search_mask.sum(dim=1, keepdim=True).clamp(min=1)

        # Demographics
        demo_emb = torch.cat([
            self.age_embedding(age),
            self.gender_embedding(gender)
        ], dim=1)

        # Combine and transform
        user_features = torch.cat([watch_emb, search_emb, demo_emb], dim=1)
        user_embedding = self.user_tower(user_features)

        return user_embedding

    def encode_items(self, item_ids: torch.Tensor) -> torch.Tensor:
        """Encode items into embeddings."""
        return self.video_embedding(item_ids)

    def forward(self, user_embedding: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        """Score user-item pairs."""
        item_emb = self.encode_items(item_ids)
        scores = (user_embedding.unsqueeze(1) * item_emb).sum(dim=-1)
        return scores

DSSM: Deep Structured Semantic Models

DSSM (Microsoft, 2013) pioneered the two-tower approach for search and remains influential today. Understanding DSSM is essential because modern retrieval systems (including embedding-based search and RAG) are direct descendants of this architecture.

The problem DSSM solves:

Traditional search systems rely on lexical matching: the query "NY hotels" finds documents containing those exact words. But what about "New York accommodations" or "Manhattan lodging"? Users and documents can express the same concept with completely different words—the vocabulary mismatch problem.

DSSM's solution: Semantic matching in embedding space

Instead of matching words, learn to map queries and documents into a shared semantic space where similar meanings have similar vectors, regardless of exact wording:

Code
Query: "NY hotels"           Document: "Manhattan lodging options"
       ↓                              ↓
   Query Tower                    Doc Tower
       ↓                              ↓
   [0.8, 0.3, ...]    ←cosine→   [0.75, 0.35, ...]
                         ↓
                   High similarity!

The architecture in detail:

  1. Word hashing (character trigrams): Raw text is first converted to a bag of character trigrams. "hotel" → {"#ho", "hot", "ote", "tel", "el#"}. This reduces vocabulary from millions of words to ~30K trigrams, handles misspellings, and generalizes to unseen words.

  2. Query tower: Trigram features → Dense layer (30K→300) → Tanh → Dense (300→300) → Tanh → Dense (300→128). The final 128-dim vector is the query embedding.

  3. Document tower: Same architecture as query tower (but separate parameters). Documents are also converted to 128-dim embeddings.

  4. Similarity scoring: Cosine similarity between normalized query and document embeddings.
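Word hashing (step 1) is easy to sketch in code. A minimal, hypothetical implementation — the `char_trigrams`/`word_hash` names and the lowercase/whitespace tokenization are illustrative choices, not details from the DSSM paper:

```python
from collections import Counter

def char_trigrams(word: str) -> list[str]:
    """Character trigrams with boundary markers, e.g. 'hotel' -> ['#ho', 'hot', 'ote', 'tel', 'el#']."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash(text: str, vocab: dict[str, int]) -> Counter:
    """Bag-of-trigrams representation: sparse counts indexed by a trigram vocabulary."""
    counts = Counter()
    for word in text.lower().split():
        for tri in char_trigrams(word):
            if tri in vocab:  # out-of-vocabulary trigrams are dropped
                counts[vocab[tri]] += 1
    return counts

print(char_trigrams("hotel"))  # ['#ho', 'hot', 'ote', 'tel', 'el#']
```

Note how a misspelling like "hotle" still shares most trigrams with "hotel" — this is the robustness the paper highlights.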

Why Tanh activation?

DSSM uses Tanh (not ReLU) throughout. The bounded output [-1, 1] helps with cosine similarity computation and provides more stable gradients. Later work showed ReLU can work too, but Tanh remains common in embedding models.

Training with contrastive loss:

For each query q, we have one positive document d^+ (clicked) and several negative documents d^-_1, d^-_2, ..., d^-_k (not clicked). The loss maximizes:

P(d^+|q) = \frac{\exp(\gamma \cdot \cos(q, d^+))}{\exp(\gamma \cdot \cos(q, d^+)) + \sum_{i=1}^{k} \exp(\gamma \cdot \cos(q, d^-_i))}

Where \gamma is a temperature parameter (typically ~10). This softmax-style loss pushes the query embedding closer to positive documents and farther from negative documents.
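This objective is a cross-entropy over \gamma-scaled cosine similarities, with the positive document at a fixed index. A minimal PyTorch sketch — the `dssm_loss` name, tensor shapes, and the \gamma = 10 default are assumptions; embeddings are assumed already L2-normalized:

```python
import torch
import torch.nn.functional as F

def dssm_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
              neg_emb: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """
    Softmax contrastive loss over one positive and k negatives.
    query_emb: (B, D), pos_emb: (B, D), neg_emb: (B, k, D) -- all L2-normalized.
    """
    pos_sim = (query_emb * pos_emb).sum(dim=-1, keepdim=True)          # (B, 1)
    neg_sim = torch.bmm(neg_emb, query_emb.unsqueeze(-1)).squeeze(-1)  # (B, k)
    logits = gamma * torch.cat([pos_sim, neg_sim], dim=1)              # (B, 1+k)
    # The positive is always at index 0, so the target class is 0 for every row
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```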

Negative sampling strategies:

  • Random negatives: Sample from the document corpus. Easy but may include false negatives.
  • In-batch negatives: Use other queries' positive documents as negatives. Efficient and provides harder negatives.
  • Hard negatives: Mine documents that are similar but not relevant (e.g., from the same topic but not answering the query).
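In-batch negatives in particular cost almost nothing to implement: one matrix multiply yields a (B × B) similarity matrix whose diagonal holds each query's positive. A sketch (the function name and \gamma default are illustrative; embeddings assumed L2-normalized):

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                            gamma: float = 10.0) -> torch.Tensor:
    """
    query_emb, doc_emb: (B, D); doc_emb[i] is query i's clicked document.
    Every other row's document serves as a negative for query i.
    """
    logits = gamma * query_emb @ doc_emb.T   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```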

Why DSSM is still relevant:

Modern sentence transformers, dense retrievers (DPR, Contriever), and embedding APIs (OpenAI, Cohere) all use the same fundamental architecture: encode query and document with separate towers, score via similarity, train with contrastive loss.

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSM(nn.Module):
    """
    Deep Structured Semantic Model.
    Originally for query-document matching, adapted for recommendations.
    """

    def __init__(
        self,
        query_input_dim: int,
        doc_input_dim: int,
        hidden_dims: list[int] = [300, 300, 128],
    ):
        super().__init__()

        # Query tower
        q_layers = [nn.Linear(query_input_dim, hidden_dims[0]), nn.Tanh()]
        for i in range(len(hidden_dims) - 1):
            q_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
        self.query_tower = nn.Sequential(*q_layers)

        # Document tower
        d_layers = [nn.Linear(doc_input_dim, hidden_dims[0]), nn.Tanh()]
        for i in range(len(hidden_dims) - 1):
            d_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
        self.doc_tower = nn.Sequential(*d_layers)

    def forward(self, query_features: torch.Tensor, doc_features: torch.Tensor) -> torch.Tensor:
        """
        Args:
            query_features: (B, query_dim)
            doc_features: (B, num_docs, doc_dim)

        Returns:
            scores: (B, num_docs) - cosine similarity scores
        """
        query_emb = self.query_tower(query_features)  # (B, hidden)
        query_emb = F.normalize(query_emb, dim=-1)

        doc_emb = self.doc_tower(doc_features)  # (B, num_docs, hidden)
        doc_emb = F.normalize(doc_emb, dim=-1)

        # Cosine similarity
        scores = (query_emb.unsqueeze(1) * doc_emb).sum(dim=-1)

        return scores

Serving Two-Tower Models

Training a two-tower model is only half the battle. The real challenge is serving it at scale—delivering personalized recommendations to millions of users in milliseconds. This section covers the production infrastructure that makes it possible.

The serving challenge:

At serving time, given a user, you need to find the most similar items from potentially billions of candidates. Even with decoupled towers, a naive linear scan is too slow:

Code
10M items × 256 dimensions × 1 float op = 2.56 billion operations per user
At 10K QPS = 25.6 trillion operations per second = IMPOSSIBLE

The solution: Approximate Nearest Neighbor (ANN) search

Instead of exact nearest neighbor search (O(N)), we use approximate methods (O(log N) or O(1)) that find "close enough" neighbors with high probability. The key insight: we can trade a small amount of accuracy for massive speedup.

Popular ANN indexes:

  1. FAISS (Facebook AI Similarity Search):

    • Most popular library for dense vector search
    • Multiple index types: Flat (exact), IVF (inverted file), HNSW (graph-based), PQ (product quantization)
    • GPU support for batch processing
    • Used by: Meta, Spotify, Pinterest
  2. ScaNN (Google):

    • Anisotropic vector quantization—learns quantization optimized for MIPS (maximum inner product search)
    • 2x faster than HNSW at same recall
    • Used internally at Google/YouTube
  3. HNSW (Hierarchical Navigable Small World):

    • Graph-based index with hierarchical structure
    • Excellent recall-latency tradeoff
    • Available in: Faiss, Milvus, Qdrant, Weaviate

Index building workflow:

Code
OFFLINE (daily/weekly):
───────────────────────────────────────────────────────────────
1. Export all item IDs from catalog
2. Run item tower inference on all items → item embeddings
3. Normalize embeddings (for cosine similarity)
4. Build ANN index from embeddings
5. Deploy index to serving infrastructure

ONLINE (per request):
───────────────────────────────────────────────────────────────
1. Receive user request
2. Fetch user features (watch history, demographics, etc.)
3. Run user tower inference → user embedding (typically ~5-10ms)
4. Query ANN index for top-K items → candidate set (typically ~1-5ms)
5. Pass candidates to ranking model

Embedding normalization for cosine similarity:

FAISS's most efficient indexes use inner product (dot product) similarity. To get cosine similarity, we normalize all vectors to unit length:

\cos(u, v) = \frac{u \cdot v}{\|u\| \|v\|} = u_{norm} \cdot v_{norm}

After normalization, inner product equals cosine similarity.
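A quick numeric check of this identity with NumPy:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
u_norm = u / np.linalg.norm(u)
v_norm = v / np.linalg.norm(v)

# Inner product of unit vectors equals the cosine similarity of the originals
assert np.isclose(cosine, u_norm @ v_norm)
```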

Production considerations:

  1. Index sharding: For billions of items, shard the index across multiple machines. Route queries to all shards, merge results.

  2. Index updates: Item embeddings change as you retrain. Options:

    • Full rebuild (safest, but expensive)
    • Incremental updates (add new items, but index quality degrades over time)
    • Hybrid: Periodic full rebuilds + incremental updates
  3. Filtering: Often you need to filter candidates (e.g., exclude already-watched items, region restrictions). Options:

    • Pre-filter: Build separate indexes per filter dimension (expensive)
    • Post-filter: Retrieve more candidates, filter afterward (wasteful)
    • In-query filter: Some indexes (Milvus, Qdrant) support filtered search
  4. Caching: User embeddings can be cached for repeat visitors (typically with short TTL as preferences change).
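Of the filtering options above, post-filtering is the simplest to sketch: over-fetch from the index, drop excluded items, truncate to k. The function name and over-fetch multiplier below are illustrative choices, not a standard API:

```python
import numpy as np

def retrieve_with_post_filter(index, user_emb: np.ndarray, k: int,
                              excluded: set, overfetch: int = 4):
    """Over-fetch from an ANN index, then drop excluded items (e.g. already watched)."""
    scores, ids = index.search(user_emb.reshape(1, -1), k * overfetch)
    keep = [(i, s) for i, s in zip(ids[0], scores[0]) if i not in excluded]
    return keep[:k]  # may return fewer than k if the exclusion set is large
```

The wastefulness is visible here: for every request you pay for `overfetch * k` index lookups but keep at most k results.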

Python
import faiss
import numpy as np
import torch

class TwoTowerServing:
    """
    Serving infrastructure for two-tower models.
    """

    def __init__(self, model: YouTubeDNN, num_items: int, embedding_dim: int):
        self.model = model
        self.embedding_dim = embedding_dim

        # Pre-compute all item embeddings
        print("Computing item embeddings...")
        item_ids = torch.arange(num_items)
        with torch.no_grad():
            self.item_embeddings = model.encode_items(item_ids).numpy()

        # Build FAISS index for fast retrieval
        print("Building FAISS index...")
        self.index = faiss.IndexFlatIP(embedding_dim)  # Inner product
        faiss.normalize_L2(self.item_embeddings)  # For cosine similarity
        self.index.add(self.item_embeddings)

    def retrieve(self, user_embedding: np.ndarray, k: int = 100) -> tuple[np.ndarray, np.ndarray]:
        """
        Retrieve top-k items for user.

        Args:
            user_embedding: (embedding_dim,)
            k: Number of items to retrieve

        Returns:
            item_ids: (k,)
            scores: (k,)
        """
        # Normalize user embedding
        user_embedding = user_embedding / np.linalg.norm(user_embedding)
        user_embedding = user_embedding.reshape(1, -1)

        # Search
        scores, item_ids = self.index.search(user_embedding, k)

        return item_ids[0], scores[0]

    def recommend(self, user_features: dict, k: int = 100) -> list[int]:
        """Full recommendation pipeline."""

        # Encode user
        with torch.no_grad():
            user_emb = self.model.encode_user(**user_features).numpy()[0]

        # Retrieve candidates
        item_ids, scores = self.retrieve(user_emb, k)

        return item_ids.tolist()

Part VI: Sequential Models

The models we've seen so far treat user-item interactions as a bag—they don't consider the order. But order matters! A user who watched [Inception, Interstellar, The Martian] likely wants more sci-fi, while [The Martian, Cast Away, Into the Wild] suggests survival films. Sequential models capture these temporal dynamics.

GRU4Rec: RNNs for Session-Based Recommendation

GRU4Rec (ICLR 2016 Workshop) was the first successful application of deep learning to session-based recommendation, fundamentally changing how we model user behavior.

The session-based recommendation problem:

In many scenarios, you don't have long-term user profiles:

  • Anonymous users (no login)
  • New users with no history (cold start)
  • E-commerce sessions where intent changes rapidly

Instead, you only have the current session: a sequence of items the user interacted with in the past few minutes/hours. The task: predict the next item they'll interact with.

Code
Session: [Product A] → [Product B] → [Product C] → ???
                                                    ↑
                                         Predict next item

Why RNNs for sequences?

Recurrent Neural Networks (RNNs) are designed for sequential data. They maintain a hidden state that evolves as they process each element:

Code
Input:   Item₁    Item₂    Item₃    Item₄
           ↓        ↓        ↓        ↓
         [GRU] → [GRU] → [GRU] → [GRU]
           ↓        ↓        ↓        ↓
Hidden:   h₁   →   h₂   →   h₃   →   h₄
                                       ↓
                                   Prediction

Each hidden state h_t encodes the "summary" of the session so far. The final hidden state captures the user's current intent.

Why GRU instead of LSTM?

GRU4Rec uses Gated Recurrent Units (GRU) rather than LSTMs:

  1. Fewer parameters: GRU has 2 gates vs LSTM's 3, meaning faster training
  2. Similar performance: For recommendation sequences (typically short), GRU performs comparably to LSTM
  3. Simpler: Easier to implement and debug

The GRU update equations:

z_t = \sigma(W_z x_t + U_z h_{t-1}) (update gate: how much to update)
r_t = \sigma(W_r x_t + U_r h_{t-1}) (reset gate: how much to forget)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) (candidate hidden state)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t (final hidden state)
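The four update equations map directly onto code. A single-step GRU cell for reference (bias terms omitted, matching the equations above; in practice `nn.GRU` handles all of this internally):

```python
import torch

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU update. W_* are (hidden, input) matrices, U_* are (hidden, hidden)."""
    z = torch.sigmoid(x_t @ W_z.T + h_prev @ U_z.T)           # update gate
    r = torch.sigmoid(x_t @ W_r.T + h_prev @ U_r.T)           # reset gate
    h_tilde = torch.tanh(x_t @ W_h.T + (r * h_prev) @ U_h.T)  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # interpolate old and new
```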

Training with session-parallel mini-batches:

A key contribution of GRU4Rec is how they handle batching. Sessions have variable lengths, so naive batching is inefficient. Their solution:

  1. Sort sessions by length
  2. Create a sliding window across sessions
  3. When a session ends, replace it with a new one and reset its hidden state

This allows efficient GPU utilization while maintaining session independence.
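A simplified sketch of session-parallel batching (the generator name and data format are assumptions; sessions are lists of item IDs with at least two items each, and the reset mask tells the trainer which hidden-state rows to zero):

```python
def session_parallel_batches(sessions: list[list[int]], batch_size: int):
    """Yield (inputs, targets, reset_lanes): one (input, next-item) pair per lane per step.
    When a lane's session ends, the next unused session takes over that lane."""
    assert len(sessions) >= batch_size
    lane_session = list(range(batch_size))  # which session each lane is walking
    cursor = [0] * batch_size               # position within each lane's session
    next_session = batch_size
    while True:
        inputs, targets, reset = [], [], []
        for lane in range(batch_size):
            s = sessions[lane_session[lane]]
            inputs.append(s[cursor[lane]])
            targets.append(s[cursor[lane] + 1])
            cursor[lane] += 1
            if cursor[lane] + 1 >= len(s):      # session exhausted
                if next_session >= len(sessions):
                    return                      # no sessions left; stop
                lane_session[lane] = next_session
                cursor[lane] = 0
                reset.append(lane)              # this lane's hidden state must be zeroed
                next_session += 1
        yield inputs, targets, reset
```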

BPR loss for recommendation:

GRU4Rec uses BPR (Bayesian Personalized Ranking) loss rather than cross-entropy:

L_{BPR} = -\sum_{(s, i, j)} \log \sigma(r_{s,i} - r_{s,j})

Where i is the actual next item and j is a negative sample. This ranking loss directly optimizes for correct ordering.

TOP1 loss (GRU4Rec's innovation):

The paper also introduces TOP1 loss, which regularizes scores:

L_{TOP1} = \frac{1}{N_s} \sum_j \left[ \sigma(r_{s,j} - r_{s,i}) + \sigma(r_{s,j}^2) \right]

The second term penalizes large scores, improving stability.
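Both losses are a few lines in PyTorch. A sketch assuming `pos_scores` holds r_{s,i} and `neg_scores` holds the sampled r_{s,j} (names and shapes are illustrative):

```python
import torch

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """BPR: push the positive item's score above each negative's.
    pos_scores: (B, 1), neg_scores: (B, k)."""
    return -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-10).mean()

def top1_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """TOP1: ranking term plus a score-regularization term on the negatives."""
    rank = torch.sigmoid(neg_scores - pos_scores)
    reg = torch.sigmoid(neg_scores ** 2)
    return (rank + reg).mean()
```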

Practical considerations:

  • Sequence length: Most value comes from recent items. 20-50 items is typical.
  • Item embeddings: Shared between input and output (parameter tying).
  • Dropout: Applied to inputs and between layers. Critical for preventing overfitting.
  • Hidden size: 100-500 dimensions typically. Larger helps for diverse catalogs.

Python
class GRU4Rec(nn.Module):
    """
    GRU-based Session Recommendation.
    Models user's current session as a sequence.
    """

    def __init__(
        self,
        num_items: int,
        embedding_dim: int = 100,
        hidden_dim: int = 100,
        num_layers: int = 1,
        dropout: float = 0.2,
    ):
        super().__init__()

        self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)

        self.gru = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        self.output = nn.Linear(hidden_dim, num_items)

    def forward(self, session: torch.Tensor) -> torch.Tensor:
        """
        Args:
            session: (B, seq_len) - item IDs in session

        Returns:
            scores: (B, num_items) - scores for next item
        """
        # Embed items
        emb = self.item_embedding(session)  # (B, L, D)

        # GRU
        output, _ = self.gru(emb)  # (B, L, H)

        # Use last hidden state to predict next item
        last_hidden = output[:, -1, :]  # (B, H)
        scores = self.output(last_hidden)  # (B, num_items)

        return scores

    def predict_next(self, session: torch.Tensor, k: int = 20) -> torch.Tensor:
        """Get top-k next items."""
        scores = self.forward(session)
        top_k = torch.topk(scores, k, dim=-1)
        return top_k.indices

Caser: CNNs for Sequential Patterns

Caser (WSDM 2018) challenged the assumption that RNNs are the best architecture for sequences. Instead, Caser uses Convolutional Neural Networks to capture sequential patterns—and often outperforms RNN-based methods.

The key insight: Local patterns matter

Consider a shopping sequence: [Laptop] → [Laptop Bag] → [Mouse] → [Keyboard]. The patterns here are mostly local—consecutive items are related. You don't need to remember what happened 50 items ago; the last few items determine the next purchase.

CNNs excel at capturing local patterns through sliding windows (convolutions). This makes them surprisingly effective for short-term sequential recommendation.

Two types of patterns Caser captures:

Caser uses two distinct types of convolutions to capture different pattern types:

Code
HORIZONTAL CONVOLUTIONS (union-level patterns):
──────────────────────────────────────────────────────────────
Capture which COMBINATIONS of recent items predict the next item.
Filter slides HORIZONTALLY across the embedding matrix.

Example: Filter height = 2 captures bigrams
         [Laptop, Bag] → likely to buy accessories

  Embedding matrix (L items × D dimensions):
  ┌─────────────────────┐
  │ Laptop    [0.8 0.2] │ ← Filter slides
  │ Bag       [0.7 0.3] │   horizontally
  │ Mouse     [0.6 0.4] │
  │ Keyboard  [0.5 0.5] │
  └─────────────────────┘

VERTICAL CONVOLUTIONS (point-level patterns):
──────────────────────────────────────────────────────────────
Capture how INDIVIDUAL items influence the next item.
Filter slides VERTICALLY, aggregating information across features.

Example: "Laptop" (regardless of what followed) → predict accessories
         This captures "skip patterns" where a single item signals intent.

  Filter slides vertically across ALL positions:
  ┌─────────────────────┐
  │ Laptop    [0.8 0.2] │
  │ Bag       [0.7 0.3] │  ← Same filter applied
  │ Mouse     [0.6 0.4] │     to each row
  │ Keyboard  [0.5 0.5] │
  └─────────────────────┘

Architecture breakdown:

  1. Input: User's last L items (e.g., L=5), embedded into D dimensions → (L × D) matrix

  2. Horizontal convolutions: Multiple filters with different heights (1, 2, 3, ..., L)

    • Height 1: Unigrams (single item patterns)
    • Height 2: Bigrams (consecutive pairs)
    • Height L: The entire sequence
    • Each filter produces a 1D output, then max-pooled
  3. Vertical convolutions: Filters with height L (full sequence) and width 1

    • Slides across embedding dimensions
    • Captures per-dimension patterns across all items
  4. User embedding: Added to capture long-term preferences (not just session)

  5. Output: Concatenate [horizontal features, vertical features, user embedding] → MLP → item scores

Why CNNs work for sequences:

  • Parallelization: All convolutions run in parallel (unlike RNNs which are sequential)
  • Local pattern capture: Explicit modeling of n-gram patterns
  • Position sensitivity: Unlike bag-of-words, convolutions respect order
  • Faster training: 2-3x faster than GRU4Rec in practice

The math behind horizontal convolutions:

For a filter F of height h and width D (the embedding dimension), applied to embedding matrix E:

o_i = \text{ReLU}(\sum_{j=0}^{h-1} \sum_{k=0}^{D-1} F_{j,k} \cdot E_{i+j, k} + b)

Then max-pool over all positions: \hat{o} = \max_i(o_i)

When to choose Caser over GRU4Rec:

  • Short sequences: Caser shines when recent history (last 5-10 items) dominates
  • Clear local patterns: E-commerce, music playlists with strong sequential dependencies
  • Need for speed: 2-3x faster training, similar inference latency
  • Explicit n-grams: When you want to interpret which item combinations matter

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Caser(nn.Module):
    """
    Convolutional Sequence Embedding for Sequential Recommendation.
    Uses CNNs to capture point-level and union-level patterns.
    """

    def __init__(
        self,
        num_items: int,
        num_users: int,
        embedding_dim: int = 50,
        max_seq_len: int = 5,
        num_horizontal_filters: int = 16,
        num_vertical_filters: int = 4,
    ):
        super().__init__()

        self.embedding_dim = embedding_dim
        self.max_seq_len = max_seq_len

        # Embeddings
        self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)
        self.user_embedding = nn.Embedding(num_users, embedding_dim)

        # Horizontal convolutions (capture union-level patterns)
        # Different filter heights for different n-gram sizes
        self.horizontal_convs = nn.ModuleList([
            nn.Conv2d(1, num_horizontal_filters, (i + 1, embedding_dim))
            for i in range(max_seq_len)
        ])

        # Vertical convolution (capture point-level patterns)
        self.vertical_conv = nn.Conv2d(1, num_vertical_filters, (max_seq_len, 1))

        # Output
        horizontal_out = num_horizontal_filters * max_seq_len
        vertical_out = num_vertical_filters * embedding_dim
        self.fc = nn.Linear(horizontal_out + vertical_out + embedding_dim, num_items)

    def forward(self, user: torch.Tensor, sequence: torch.Tensor) -> torch.Tensor:
        """
        Args:
            user: (B,) user IDs
            sequence: (B, max_seq_len) item sequence

        Returns:
            scores: (B, num_items)
        """
        B = user.size(0)

        # Embed sequence
        seq_emb = self.item_embedding(sequence)  # (B, L, D)
        seq_emb = seq_emb.unsqueeze(1)  # (B, 1, L, D) - add channel dim

        # Horizontal convolutions
        h_outs = []
        for conv in self.horizontal_convs:
            h = F.relu(conv(seq_emb))  # (B, filters, L-k+1, 1)
            h = h.squeeze(-1)  # (B, filters, L-k+1)
            h = F.max_pool1d(h, h.size(-1))  # (B, filters, 1)
            h_outs.append(h.squeeze(-1))
        horizontal = torch.cat(h_outs, dim=1)  # (B, filters * L)

        # Vertical convolution
        v = F.relu(self.vertical_conv(seq_emb))  # (B, filters, 1, D)
        vertical = v.view(B, -1)  # (B, filters * D)

        # User embedding
        user_emb = self.user_embedding(user)  # (B, D)

        # Combine
        combined = torch.cat([horizontal, vertical, user_emb], dim=1)
        scores = self.fc(combined)

        return scores

The Transition to Attention

GRU4Rec and Caser showed that sequence matters. But they had limitations:

  • GRU: Sequential processing (slow), vanishing gradients
  • Caser: Fixed receptive field, limited long-range modeling

This set the stage for attention-based models (SASRec, BERT4Rec) covered in the next post.


Part VII: Production Systems (Pre-Transformer Era)

How Tech Giants Built Their RecSys

Understanding how major companies evolved their recommendation systems provides crucial context. Each faced unique scale challenges and made different architectural choices.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│          PRODUCTION RECSYS EVOLUTION BY COMPANY                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  NETFLIX (2006-2017)                                                    │
│  ───────────────────                                                     │
│  2006: Netflix Prize launches (MF era begins)                           │
│  2009: BellKor's Pragmatic Chaos wins with 107-model ensemble          │
│  2012: Moves to hybrid: MF + content features + context                │
│  2016: Deep learning experiments begin                                  │
│  Key lesson: Ensembles beat single models at scale                      │
│                                                                          │
│  AMAZON (2003-2017)                                                     │
│  ─────────────────────                                                   │
│  2003: Item-based CF paper (foundational)                               │
│  2007: DSSTNE (Deep Scalable Sparse Tensor Network Engine)             │
│  2013: Hybrid CF + content for "Customers who bought..."               │
│  Key lesson: Item-item similarity scales better than user-user          │
│                                                                          │
│  YOUTUBE (2010-2018)                                                    │
│  ───────────────────                                                     │
│  2010: Video recommendations via click-through rates                    │
│  2016: Deep Neural Networks paper (two-tower revolution)               │
│  2018: Adds watch time prediction, not just CTR                        │
│  Key lesson: Optimize for engagement, not just clicks                   │
│                                                                          │
│  SPOTIFY (2014-2018)                                                    │
│  ──────────────────────                                                  │
│  2014: Discover Weekly launches (CF + audio features)                  │
│  2016: Adds NLP on playlists ("chill vibes" → similar tracks)         │
│  2018: BaRT (Bandits for Recommendations) for exploration             │
│  Key lesson: Audio content features enable cold-start                   │
│                                                                          │
│  LINKEDIN (2012-2018)                                                   │
│  ─────────────────────                                                   │
│  2012: People You May Know (graph-based CF)                            │
│  2015: GLMix (Generalized Linear Mixed Models)                         │
│  2017: Photon-ML for feature engineering                               │
│  Key lesson: Professional graph structure is gold                       │
│                                                                          │
│  PINTEREST (2015-2018)                                                  │
│  ───────────────────────                                                 │
│  2015: Related Pins via visual similarity (CNN features)               │
│  2017: PinSage (graph neural networks at scale)                        │
│  2018: Multi-task learning (engagement + saves)                        │
│  Key lesson: Visual features dominate in image-first platforms         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Netflix: The Prize That Changed Everything

The Netflix Prize ($1M, 2006-2009) was a watershed moment for recommendation systems—it brought collaborative filtering from an academic curiosity to a mainstream machine learning problem, attracting thousands of teams worldwide and spawning techniques still used today.

The challenge:

Netflix released 100 million movie ratings (1-5 stars) from 480,000 users on 18,000 movies. The goal: improve their existing recommendation system (Cinematch, a basic neighborhood CF) by 10% in RMSE. The prize seemed impossible at the time—Netflix's own engineers had been stuck at 0.9514 RMSE for years.

Why the Prize mattered:

  1. Open data at scale: For the first time, researchers had access to real industrial-scale recommendation data
  2. Clear metric: RMSE provided an objective benchmark that everyone optimized for
  3. Competition drove innovation: Teams shared insights, combined approaches, and pushed boundaries
  4. Industry attention: The $1M prize attracted researchers who had never worked on recommendations

The evolution of winning approaches:

Code
Year 1 (2007): Single-model innovations
────────────────────────────────────────
- SVD with bias terms
- Temporal dynamics (preferences change over time)
- Best single model: ~0.88 RMSE

Year 2 (2008): Feature engineering
────────────────────────────────────────
- Implicit feedback (what users DIDN'T rate matters)
- Neighborhood models with shrinkage
- Best: ~0.87 RMSE

Year 3 (2009): Ensemble everything
────────────────────────────────────────
- 100+ model ensembles
- Stacking and blending
- Final winner: 0.8567 RMSE (10.06% improvement)

Key technical innovations from the Prize:

  1. SVD++ (implicit feedback): Users who rated many horror movies are horror fans, even if individual ratings vary. SVD++ added "implicit" factors based on which items a user interacted with—not just how they rated.

  2. Temporal dynamics: User preferences drift over time. A user who loved rom-coms in 2005 might prefer thrillers in 2008. Time-aware models captured this.

  3. Regularization: L2 regularization (weight decay) on latent factors was crucial. Without it, models memorized training data.

  4. Neighborhood + latent factor hybrids: The best results combined global latent factor models (matrix factorization) with local neighborhood models (kNN).
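To make innovation 1 concrete, here is the standard SVD++ prediction rule as a small sketch (variable names are illustrative): the user vector is the explicit factor p_u plus a normalized sum of implicit-feedback factors y_j over the items the user interacted with.

```python
import numpy as np

def svdpp_predict(mu: float, b_u: float, b_i: float,
                  q_i: np.ndarray, p_u: np.ndarray,
                  y_implicit: np.ndarray) -> float:
    """
    SVD++ rating prediction:
        r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^{-1/2} * sum_j y_j)
    y_implicit: (|N(u)|, F) implicit-feedback factors for items the user touched.
    """
    implicit = y_implicit.sum(axis=0) / np.sqrt(len(y_implicit))
    return mu + b_u + b_i + q_i @ (p_u + implicit)
```

Even a user who rated nothing in a genre gets pulled toward it if they interacted with many of its items — that is the implicit-feedback effect in one line.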

Why Netflix didn't deploy the winning solution:

The winning 107-model ensemble was too complex for production. Netflix's actual deployed system used a simpler blend of ~3-5 models. The Prize's real value was the techniques and talent it surfaced—many participants joined Netflix.

The lesson for practitioners:

Single-model improvements matter, but ensembles almost always win competitions. However, production systems favor simplicity. The Prize demonstrated both truths simultaneously.

Python
# Simplified version of BellKor's winning approach
class NetflixPrizeEnsemble:
    """
    Key insight from Netflix Prize: Blend many models.
    The winning solution combined 107 models.
    """

    def __init__(self):
        self.models = [
            # Matrix factorization variants
            SVDModel(factors=100),
            SVDModel(factors=200),
            SVDPlusPlusModel(factors=100),  # SVD with implicit feedback

            # Neighborhood models
            ItemKNN(k=20),
            ItemKNN(k=50),

            # Restricted Boltzmann Machines
            RBMModel(hidden_units=100),

            # Time-aware models
            TimeSVDModel(factors=100),  # Captures preference drift
        ]

        # Learned blending weights
        self.weights = None

    def fit(self, train_data, val_data):
        """Train all models and learn blending weights."""
        # Train each model
        for model in self.models:
            model.fit(train_data)

        # Get predictions on validation set
        val_predictions = np.array([
            model.predict(val_data)
            for model in self.models
        ]).T  # (num_samples, num_models)

        # Learn optimal weights via linear regression
        self.weights = self._learn_weights(val_predictions, val_data.ratings)

    def predict(self, user_item_pairs):
        """Blended prediction."""
        predictions = np.array([
            model.predict(user_item_pairs)
            for model in self.models
        ]).T

        return predictions @ self.weights
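The `_learn_weights` step can be as simple as a least-squares fit on held-out predictions (a sketch; the actual Prize blends used more elaborate ridge and constrained regressions):

```python
import numpy as np

def learn_blend_weights(val_predictions: np.ndarray, val_ratings: np.ndarray) -> np.ndarray:
    """
    val_predictions: (num_samples, num_models) - each model's predictions on validation data
    val_ratings: (num_samples,) - true ratings
    Returns per-model blend weights minimizing squared blending error.
    """
    weights, *_ = np.linalg.lstsq(val_predictions, val_ratings, rcond=None)
    return weights
```

If one model already predicts the validation ratings exactly, the fit puts all the weight on it and none on the rest.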

Netflix lessons:

  • SVD++ with implicit feedback was crucial (views, not just ratings)
  • Time-aware models captured preference drift
  • Blending provided 10%+ improvement over single models
  • Real deployment was much simpler than the prize-winning solution

YouTube: Two-Tower at Scale

YouTube's 2016 paper defined how to do deep learning recommendations at billion-scale. This paper is arguably the most influential industrial recommendation paper ever written—it established patterns that every major tech company now follows.

The scale challenge YouTube faced:

  • 800+ million videos in the catalog
  • 2+ billion logged-in users monthly
  • 500+ hours of video uploaded every minute
  • Latency budget: <200ms total for the entire recommendation flow

Scoring every video for every user is computationally impossible: 800M × 2B = 1.6 × 10^18 operations per day.

The solution: Two-stage architecture

YouTube's key architectural insight was splitting the problem into two stages, each optimized for different goals:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    YOUTUBE'S TWO-STAGE ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STAGE 1: CANDIDATE GENERATION                                          │
│  ─────────────────────────────                                          │
│  Input:  User context (history, demographics, time)                     │
│  Output: ~1000 candidate videos (from 800M)                             │
│  Model:  Deep Neural Network (two-tower)                                │
│  Goal:   HIGH RECALL - don't miss relevant videos                       │
│  Budget: ~10ms                                                           │
│                                                                          │
│  Features (intentionally sparse for speed):                             │
│  - Watch history (averaged embeddings of last 50 videos)                │
│  - Search history (averaged query embeddings)                           │
│  - Age, gender, geographic region                                       │
│  - Time since last watch (freshness signal)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STAGE 2: RANKING                                                       │
│  ────────────                                                            │
│  Input:  ~1000 candidates + rich features                               │
│  Output: ~50 ranked videos for display                                  │
│  Model:  Feature-rich neural network                                    │
│  Goal:   HIGH PRECISION - order candidates correctly                    │
│  Budget: ~100ms                                                          │
│                                                                          │
│  Features (much richer, expensive to compute):                          │
│  - Video metadata (title, description, category, duration)              │
│  - Video quality signals (watch time, likes, comments)                  │
│  - User-video interaction history (has user seen similar?)              │
│  - Contextual (time of day, device, page location)                      │
│  - Freshness (video age, creator's recent performance)                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why this architecture works:

  1. Candidate generation can be pre-computed. Video embeddings don't change per-user, so they're computed offline and indexed. Only user embedding computation is online (~1ms).

  2. Ranking uses expensive features that would be impossible to compute for all videos, but feasible for 1000 candidates.

  3. Different objectives: Candidate generation optimizes for recall (any potentially relevant video), ranking optimizes for engagement (watch time).
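A minimal sketch of the stage-1 retrieval step under these assumptions (brute-force dot-product scoring over toy embeddings; production systems replace the scan with an ANN index such as ScaNN or Faiss, and the catalog is far larger than the 100k rows used here):

```python
import numpy as np

# Hypothetical precomputed video embeddings (built offline); 100k stands in for 800M.
rng = np.random.default_rng(0)
video_embs = rng.normal(size=(100_000, 64)).astype(np.float32)
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)

def retrieve(user_emb: np.ndarray, k: int = 1000) -> np.ndarray:
    """Stage-1 retrieval: dot-product scores, then top-k via argpartition."""
    scores = video_embs @ (user_emb / np.linalg.norm(user_emb))
    top = np.argpartition(-scores, k)[:k]    # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]     # sort only the k winners

candidates = retrieve(rng.normal(size=64).astype(np.float32))
```

Only the user embedding is computed online; the video matrix never changes per request, which is exactly what makes the ~10ms budget feasible.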

The "watch time" insight:

YouTube's ranking model predicts expected watch time, not click probability. This was crucial:

  • Clicks are gameable: Clickbait titles/thumbnails generate clicks but disappoint users
  • Watch time = actual engagement: If users watch the video, it was genuinely interesting
  • Better business alignment: YouTube's ad revenue correlates with watch time, not clicks

Training the ranking model:

The model is trained using weighted logistic regression, with watch time encoded in the example weights. For each impression:

  • Positive example: User watched the video. Weight = watch time in seconds.
  • Negative example: User didn't click. Weight = 1.

This gives longer-watched videos more influence on the model; at serving time, the learned odds approximate expected watch time.
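A minimal numpy sketch of this weighted logistic loss (an illustration of the trick, not YouTube's implementation): positive impressions carry their watch time as the example weight, negatives carry weight 1, and exponentiating a logit recovers the learned odds.

```python
import numpy as np

def weighted_watch_time_loss(logits, watched, watch_time):
    """Weighted logistic loss: positives weighted by seconds watched,
    negatives weighted 1 (the paper's weighted logistic regression)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    w = np.where(watched == 1, watch_time, 1.0)
    eps = 1e-12
    return -np.mean(w * (watched * np.log(p + eps)
                         + (1 - watched) * np.log(1 - p + eps)))

def expected_watch_time(logit):
    # The learned odds e^logit approximate expected watch time when CTR is small.
    return np.exp(logit)

loss = weighted_watch_time_loss(
    logits=np.array([2.0, -1.0, 0.5]),
    watched=np.array([1, 0, 1]),
    watch_time=np.array([120.0, 0.0, 30.0]),
)
```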

The business rules layer:

After ranking, YouTube applies business rules:

  • Diversity: Don't show 10 videos from the same channel
  • Freshness: Mix in recent uploads
  • Fairness: Ensure smaller creators get exposure
  • Brand safety: Filter inappropriate content
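One such rule—capping videos per channel while preserving rank order—can be sketched as a greedy pass (toy IDs; not YouTube's actual policy layer):

```python
from collections import Counter

def cap_per_channel(ranked, max_per_channel=2):
    """Greedy pass over the ranked list: keep order, cap items per channel."""
    counts = Counter()
    out = []
    for video_id, channel in ranked:
        if counts[channel] < max_per_channel:
            out.append(video_id)
            counts[channel] += 1
    return out

ranked = [("v1", "chA"), ("v2", "chA"), ("v3", "chA"), ("v4", "chB")]
print(cap_per_channel(ranked))  # → ['v1', 'v2', 'v4']: chA already used its two slots
```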
Python
class YouTubeProductionSystem:
    """
    YouTube's production recommendation architecture (2016).

    Key insight: Split into RETRIEVAL (fast, broad) and RANKING (slow, precise).
    """

    def __init__(self):
        # Candidate generation: Two-tower DNN
        self.candidate_generator = YouTubeDNN(
            num_videos=1_000_000_000,  # Billion videos
            embedding_dim=256,
        )

        # Ranking: Feature-rich model
        self.ranker = YouTubeRanker(
            video_features=['duration', 'freshness', 'channel_ctr'],
            user_features=['watch_history', 'search_history', 'demographics'],
            context_features=['time_of_day', 'device', 'page'],
        )

        # ANN index for retrieval (built offline from precomputed video embeddings)
        self.ann_index = None

    def serve(self, user_context: dict, num_results: int = 20) -> list:
        """Full serving pipeline."""

        # Stage 1: Candidate Generation (~10ms budget)
        # Retrieve ~1000 candidates from the full catalog
        user_embedding = self.candidate_generator.encode_user(user_context)
        candidates = self.ann_index.search(user_embedding, k=1000)

        # Stage 2: Ranking (100ms budget)
        # Score candidates with rich features
        video_features = self.fetch_features(candidates)
        scores = self.ranker.score(user_context, candidates, video_features)

        # Stage 3: Business Rules (10ms budget)
        # Diversity, freshness, creator fairness
        final_list = self.apply_business_rules(candidates, scores)

        return final_list[:num_results]

class YouTubeRanker(nn.Module):
    """
    YouTube's ranking model optimizes for WATCH TIME, not clicks.

    Key insight: Clicks are easy to game. Watch time = real engagement.
    """

    def __init__(self, video_features, user_features, context_features):
        super().__init__()

        # Feature processing
        self.video_encoder = FeatureEncoder(video_features)
        self.user_encoder = FeatureEncoder(user_features)
        self.context_encoder = FeatureEncoder(context_features)

        # Prediction heads
        combined_dim = 512
        self.watch_time_head = nn.Sequential(
            nn.Linear(combined_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, user_features, video_features, context_features):
        """Predict expected watch time."""
        user_emb = self.user_encoder(user_features)
        video_emb = self.video_encoder(video_features)
        context_emb = self.context_encoder(context_features)

        combined = torch.cat([user_emb, video_emb, context_emb], dim=-1)

        # Predict watch time (not binary click)
        watch_time = self.watch_time_head(combined)

        return watch_time

YouTube lessons:

  • Two-stage architecture is essential at scale
  • Optimize for watch time, not clicks (better engagement metric)
  • Fresh content needs special handling (cold-start)
  • "Example age" feature: feed the training example's age as an input and set it to zero at serving, so the model learns freshness without favoring stale data

Amazon: Item-Based CF at Scale

Amazon's 2003 paper "Amazon.com Recommendations: Item-to-Item Collaborative Filtering" is one of the most influential recommendation papers ever published. It established item-based collaborative filtering as the go-to approach for e-commerce and shaped how we think about scalable recommendations.

The problem with user-based CF at Amazon's scale:

Before 2003, most recommendation systems used user-based collaborative filtering: "Find users similar to you, recommend what they liked." This approach hit a wall at Amazon's scale:

Code
User-Based CF Scalability Problem:
─────────────────────────────────────────────────────────────────────────
Users:  ~50 million (in 2003, now 300M+)
Items:  ~10 million products

To recommend for user U:
1. Compare U to ALL 50M users → 50M similarity calculations
2. For each similar user, iterate their purchases → expensive

This must happen in REAL-TIME (<100ms). Impossible.

Amazon's insight: Items are more stable than users

Item-based CF flips the approach: instead of "users like you liked X," it's "items similar to X are Y, Z."

Code
Item-Based CF:
─────────────────────────────────────────────────────────────────────────
Pre-compute: For each item, find similar items (based on co-purchase)
             This is OFFLINE. Can take hours. Run nightly.

Online: Given user's purchase history [A, B, C]
        Look up pre-computed similar items for A, B, C
        Aggregate and rank
        Total: O(purchase_history_size × k) = O(100 × 50) = 5000 ops

5000 ops vs 50,000,000 ops. That's the difference.

Why items are more stable:

  1. User preferences drift: A user might love sci-fi this year, thrillers next year
  2. Item relationships are stable: "Laptops are bought with laptop bags" doesn't change
  3. New users appear constantly: 1000s of new users per minute. Recomputing user-user similarity is expensive.
  4. Items change slowly: New products launch, but most catalog is stable

Computing item similarity:

The key question: How do we define "similar items"? Amazon uses purchase co-occurrence:

\text{sim}(i, j) = \frac{|\text{Users who bought both } i \text{ and } j|}{|\text{Users who bought } i| \times |\text{Users who bought } j|}

This is a close relative of cosine similarity over the user-item purchase matrix, viewed from the item perspective (true cosine divides by the square root of the denominator product, so this variant dampens popular items more aggressively).
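As a toy illustration (made-up 4×4 purchase matrix), both the co-purchase counts and an item-item cosine can be computed directly:

```python
import numpy as np

# Rows = users, columns = items; 1 = purchased.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
])

counts = R.sum(axis=0)      # number of buyers per item
co = R.T @ R                # co-purchase counts |U_i ∩ U_j|
cosine = co / np.sqrt(np.outer(counts, counts))  # cosine over binary columns
np.fill_diagonal(cosine, 0.0)
```

Items 0 and 1 share two buyers out of 2 and 3 total, giving cosine 2/√6 ≈ 0.82—the strongest off-diagonal entry.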

The "Customers who bought this also bought" pattern:

This is item-based CF in action:

  1. User views product X
  2. Look up pre-computed similar items for X
  3. Display top-k as "Customers who bought this also bought..."

This simple pattern drives a significant portion of Amazon's revenue.

Implementation considerations:

  1. Sparsity handling: Most users buy few items. Use co-occurrence counts, not Pearson correlation.
  2. Popular item bias: Very popular items co-occur with everything. Dampen by dividing by item popularity.
  3. Temporal filtering: Recent purchases matter more. Decay older co-occurrences.
  4. Category constraints: "Laptop bag" is similar to "Laptop," but maybe don't recommend both.
Python
class AmazonItemCF:
    """
    Amazon's item-to-item collaborative filtering.

    Key insight: Item similarity is more stable than user similarity.
    "Customers who bought X also bought Y"
    """

    def __init__(self):
        self.item_similarity = {}  # item_id -> [(similar_item, score)]

    def compute_similarities_offline(self, purchase_matrix):
        """
        Pre-compute all item-item similarities.
        Run daily/weekly as batch job.
        """
        num_items = purchase_matrix.shape[1]

        # Pre-compute each item's buyer set once, instead of re-deriving
        # it inside the pairwise loop.
        item_users = [
            set(np.where(purchase_matrix[:, i] > 0)[0])
            for i in range(num_items)
        ]

        for i in range(num_items):
            similarities = []
            for j in range(num_items):
                if i == j:
                    continue

                # Jaccard similarity (or cosine)
                intersection = len(item_users[i] & item_users[j])
                union = len(item_users[i] | item_users[j])

                if union > 0:
                    sim = intersection / union
                    if sim > 0.01:  # Threshold for storage
                        similarities.append((j, sim))

            # Keep top-k similar items
            similarities.sort(key=lambda x: x[1], reverse=True)
            self.item_similarity[i] = similarities[:100]

    def recommend(self, user_purchase_history: list, n: int = 10) -> list:
        """
        Real-time recommendation based on user's purchases.
        """
        # Aggregate similar items from purchase history
        purchased = set(user_purchase_history)
        candidate_scores = {}

        for purchased_item in user_purchase_history:
            if purchased_item not in self.item_similarity:
                continue

            for similar_item, similarity in self.item_similarity[purchased_item]:
                if similar_item in purchased:
                    continue  # Don't recommend already purchased

                candidate_scores[similar_item] = (
                    candidate_scores.get(similar_item, 0.0) + similarity
                )

        # Sort by score
        ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)

        return [item_id for item_id, _ in ranked[:n]]

Amazon lessons:

  • Pre-computation makes real-time serving feasible
  • Purchase signals are stronger than views/clicks
  • "Frequently bought together" = item-item in same basket
  • Simple methods at scale beat complex methods that don't scale

Spotify: Audio Features for Cold-Start

Spotify's recommendation system is a masterclass in hybrid approaches—combining collaborative filtering with content-based features to solve one of the hardest problems in music recommendation: the cold-start problem for new songs.

The music cold-start problem:

Music platforms face a brutal cold-start challenge:

  • 40,000+ new tracks uploaded daily to Spotify
  • New songs have zero listening data
  • Artists depend on algorithmic discovery for exposure
  • Users expect fresh music in playlists like Discover Weekly

Traditional CF fails here: how do you recommend a song that no one has listened to yet?

Spotify's solution: Multiple signal sources

Spotify doesn't rely on a single model—they combine multiple signals, each addressing different scenarios:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SPOTIFY'S MULTI-SIGNAL ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SIGNAL 1: COLLABORATIVE FILTERING                                      │
│  ───────────────────────────────────                                    │
│  When it works: Songs with listening history                            │
│  Method: Matrix factorization on user-track listens                     │
│  Strength: Captures taste clusters (indie rock fans, K-pop fans)        │
│                                                                          │
│  SIGNAL 2: AUDIO CONTENT FEATURES                                       │
│  ─────────────────────────────────                                       │
│  When it works: ANY song (even brand new uploads)                       │
│  Method: CNN on mel-spectrograms → audio embedding                      │
│  Features: Tempo, energy, danceability, acousticness, valence           │
│  Strength: Enables cold-start recommendations                           │
│                                                                          │
│  SIGNAL 3: PLAYLIST CO-OCCURRENCE (NLP)                                 │
│  ─────────────────────────────────────────                               │
│  When it works: Songs that appear in user playlists                     │
│  Method: Word2Vec on playlists (songs = "words", playlists = "docs")   │
│  Strength: Captures context ("chill vibes" playlists = similar mood)   │
│                                                                          │
│  SIGNAL 4: EDITORIAL KNOWLEDGE                                          │
│  ─────────────────────────────────                                       │
│  When it works: Curated playlists, genre classifications                │
│  Method: Human-labeled tags, genre graphs                               │
│  Strength: Handles edge cases, ensures quality                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Audio content features: The cold-start savior

Spotify's audio analysis pipeline extracts features directly from the audio waveform:

  1. Low-level features: Raw spectrograms, MFCCs (mel-frequency cepstral coefficients)
  2. Mid-level features: Beat, tempo, key, mode, time signature
  3. High-level features: Genre probability distribution, mood, energy

These are computed at upload time—before anyone listens. A new song immediately gets an embedding based on how it sounds, enabling recommendations like: "This new track sounds like songs you've loved before."
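A toy sketch of the first step—turning a raw waveform into a fixed-size spectral feature vector. Real pipelines use mel filterbanks (e.g. via librosa); the coarse band pooling here is a stand-in for that:

```python
import numpy as np

def band_energies(waveform, n_fft=1024, hop=512, n_bands=16):
    """STFT magnitudes pooled into coarse frequency bands, averaged over time,
    yielding a fixed-size clip embedding regardless of track length."""
    frames = np.stack([
        waveform[i:i + n_fft] * np.hanning(n_fft)
        for i in range(0, len(waveform) - n_fft + 1, hop)
    ])
    mags = np.abs(np.fft.rfft(frames, axis=1))[:, :512]   # drop Nyquist bin
    per_frame = mags.reshape(len(frames), n_bands, -1).mean(axis=2)
    return np.log1p(per_frame.mean(axis=0))

t = np.arange(22050) / 22050.0
emb = band_energies(np.sin(2 * np.pi * 440 * t))  # one second of A4
```

A pure 440 Hz tone concentrates its energy in the lowest band—the embedding reflects how the audio sounds, with zero listening data required.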

Playlist-based Word2Vec:

Spotify treats playlists as "documents" and songs as "words." Using Word2Vec:

  • Songs that frequently appear in the same playlists get similar embeddings
  • A song in many "workout" playlists ends up near other workout songs
  • Captures semantic relationships that pure audio can't (e.g., "party anthems" might span genres)
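Training real Word2Vec requires a library such as gensim; a minimal co-occurrence sketch of the same idea (hypothetical song IDs) shows why playlist context acts as an implicit collaborative signal:

```python
from collections import defaultdict
from itertools import combinations

playlists = [
    ["song_a", "song_b", "song_c"],   # hypothetical "workout" playlist
    ["song_a", "song_b", "song_d"],
    ["song_c", "song_e"],
]

# Count how often each pair of songs shares a playlist ("document").
co = defaultdict(int)
for pl in playlists:
    for s, t in combinations(sorted(set(pl)), 2):
        co[(s, t)] += 1
        co[(t, s)] += 1

def similar_songs(song, n=3):
    scores = {t: c for (s, t), c in co.items() if s == song}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(similar_songs("song_a"))  # 'song_b' ranks first: co-occurs in two playlists
```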

Discover Weekly: The flagship product

Discover Weekly (30 personalized songs every Monday) combines all signals:

  1. Find your taste clusters: Use CF to identify what genres/artists you like
  2. Explore similar users' discoveries: What did similar users recently discover?
  3. Audio similarity expansion: Find audio-similar tracks to your favorites
  4. Freshness filter: Exclude songs you've already heard
  5. Diversity injection: Ensure variety across the 30 tracks

The result: ~40 million users listen to Discover Weekly, with >5 billion track streams.

Bandit exploration for new music:

Spotify uses multi-armed bandits to balance exploitation (recommend what you'll like) vs exploration (surface new tracks to gather data). New tracks with uncertain quality get some exposure to "test" them, then exploitation takes over based on observed engagement.
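A minimal Beta-Bernoulli Thompson-sampling sketch of that explore/exploit balance (toy track IDs and engagement rates, not Spotify's system): uncertain new tracks draw wide posterior samples and so win some exposure, while observed engagement gradually concentrates traffic.

```python
import random

class ThompsonTrackBandit:
    """Beta-Bernoulli Thompson sampling: each track keeps (successes+1, failures+1)
    counts; picking the max of one posterior sample per track balances
    exploration of uncertain tracks against exploitation of proven ones."""
    def __init__(self, track_ids):
        self.stats = {t: [1, 1] for t in track_ids}

    def pick(self):
        return max(self.stats, key=lambda t: random.betavariate(*self.stats[t]))

    def update(self, track, engaged):
        self.stats[track][0 if engaged else 1] += 1

random.seed(0)
bandit = ThompsonTrackBandit(["new_track", "proven_hit"])
for _ in range(200):
    t = bandit.pick()
    # Simulated feedback: only 'proven_hit' ever engages (60% of plays).
    bandit.update(t, engaged=(t == "proven_hit" and random.random() < 0.6))
```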

Python
class SpotifyHybridRecSys:
    """
    Spotify's hybrid approach: CF + Audio Content.

    Key insight: New songs have no listening data,
    but we can analyze the audio itself.
    """

    def __init__(self):
        # Collaborative filtering component
        self.cf_model = MatrixFactorization(num_factors=100)

        # Audio analysis component (CNN on spectrograms)
        self.audio_model = AudioCNN()

        # Playlist-based NLP
        self.playlist_nlp = PlaylistWord2Vec()

    def get_track_embedding(self, track_id: str, track_metadata: dict) -> np.ndarray:
        """
        Hybrid embedding combining multiple signals.
        """
        embeddings = []

        # CF embedding (if track has enough listens)
        if self.has_cf_embedding(track_id):
            cf_emb = self.cf_model.get_item_embedding(track_id)
            embeddings.append(cf_emb)

        # Audio embedding (always available)
        audio_features = self.audio_model.extract_features(
            track_metadata['audio_file']
        )
        embeddings.append(audio_features)

        # Playlist co-occurrence embedding
        if self.has_playlist_data(track_id):
            playlist_emb = self.playlist_nlp.get_track_embedding(track_id)
            embeddings.append(playlist_emb)

        # Weighted combination (learned weights)
        return self.combine_embeddings(embeddings)

    def discover_weekly(self, user_id: str) -> list:
        """
        Generate Discover Weekly playlist.
        Mix of CF recommendations + audio-similar discoveries.
        """
        # CF: Songs similar users liked
        cf_candidates = self.cf_model.recommend(user_id, n=100)

        # Audio: Songs audio-similar to user's favorites
        user_favorites = self.get_user_top_tracks(user_id)
        audio_candidates = []
        for fav in user_favorites[:10]:
            similar = self.audio_model.find_similar(fav, n=20)
            audio_candidates.extend(similar)

        # Blend and diversify
        final_playlist = self.blend_and_diversify(
            cf_candidates,
            audio_candidates,
            target_length=30,
        )

        return final_playlist

Spotify lessons:

  • Audio features (tempo, energy, danceability) enable cold-start
  • Playlist co-occurrence = implicit collaborative signal
  • Discover Weekly balances exploitation (CF) + exploration (audio similarity)
  • Release Radar uses freshness + artist affinity

Pinterest: Visual Similarity at Scale

Pinterest is an image-first platform—users come to discover visual ideas for home decor, fashion, recipes, and more. This makes Pinterest unique: visual features aren't just "nice to have," they're the primary signal for recommendation quality.

Why visual features dominate on Pinterest:

On Pinterest, two pins might have:

  • Identical metadata: "living room decor"
  • Completely different visual appeal: modern minimalist vs. bohemian maximalist

A user saving minimalist pins wants more minimalist—not just "living room decor." Visual similarity captures this nuance that metadata cannot.

The scale challenge:

Pinterest has:

  • 300+ billion saved pins
  • 5+ billion boards
  • 500+ million monthly users
  • Billions of images to process

Computing visual similarity across billions of images is a massive computational undertaking.

Pinterest's multi-modal approach:

Pinterest combines several signals, with visual features as the foundation:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PINTEREST'S MULTI-MODAL ARCHITECTURE                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SIGNAL 1: VISUAL FEATURES (CNN-based)                                  │
│  ───────────────────────────────────────                                │
│  Method: Pre-trained ResNet/VGG → 4096-dim embedding                    │
│  Use case: Visual similarity, style matching                            │
│  Example: "Show more pins that LOOK like this"                          │
│                                                                          │
│  SIGNAL 2: GRAPH FEATURES (PinSage)                                     │
│  ─────────────────────────────────────                                  │
│  Method: Graph Neural Network on pin-board-user graph                   │
│  Use case: Contextual similarity, taste clustering                      │
│  Example: "Pins saved to similar boards"                                │
│                                                                          │
│  SIGNAL 3: TEXT FEATURES                                                │
│  ────────────────────────                                               │
│  Method: BERT embeddings on pin descriptions                            │
│  Use case: Semantic search, query understanding                         │
│  Example: "modern kitchen ideas"                                        │
│                                                                          │
│  SIGNAL 4: ENGAGEMENT FEATURES                                          │
│  ──────────────────────────────                                         │
│  Method: Historical CTR, save rate, closeup rate                        │
│  Use case: Quality scoring, ranking                                     │
│  Example: Filter low-engagement pins                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

PinSage: Graph Neural Networks at Billion Scale

PinSage (SIGKDD 2018) was a breakthrough—the first GNN deployed at billion scale. Traditional GNNs aggregate information from every neighbor of a node, which is intractable when popular pins sit on millions of boards. PinSage's innovations:

  1. Random walk sampling: Instead of all neighbors, sample important neighbors via random walks
  2. Importance pooling: Weight neighbors by visit frequency in random walks
  3. Producer-consumer architecture: Separate graph traversal from neural network computation
  4. Mini-batch training: Train on subgraphs, not the full graph

The result: GNN embeddings for 3 billion pins, computed in hours.

The graph structure:

Pinterest's graph is bipartite:

  • Pins (images)
  • Boards (collections of pins)

Two pins are "connected" if they appear on the same board. PinSage learns embeddings where pins saved to similar boards end up close in embedding space—even if they look visually different.
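A toy sketch of the random-walk neighborhood idea on a tiny pin-board graph (hypothetical IDs; PinSage's real walks are longer and the counts feed importance pooling): visit frequency stands in for neighbor importance.

```python
import random
from collections import Counter, defaultdict

# Toy bipartite graph: boards -> pins they contain.
board_pins = {
    "b1": ["p1", "p2", "p3"],
    "b2": ["p2", "p3"],
    "b3": ["p3", "p4"],
}
pin_boards = defaultdict(list)
for b, pins in board_pins.items():
    for p in pins:
        pin_boards[p].append(b)

def important_neighbors(start, walks=500, length=2, top=2):
    """Pin -> board -> pin random walks; visit counts approximate importance."""
    visits = Counter()
    for _ in range(walks):
        p = start
        for _ in range(length):
            b = random.choice(pin_boards[p])
            p = random.choice(board_pins[b])
        if p != start:
            visits[p] += 1
    return [p for p, _ in visits.most_common(top)]

random.seed(0)
neighbors = important_neighbors("p3")
```

Pin p2 shares two boards with p3, so it dominates the visit counts—without any visual feature being involved.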

Visual search: "More like this"

Pinterest's visual search lets users:

  1. Tap on any object in an image
  2. See visually similar pins

This requires:

  • Object detection (identify what the user tapped)
  • Feature extraction (embed the detected object)
  • ANN search (find similar embeddings)

The system processes millions of visual searches per day.

Diversification: Avoiding the "same image" problem

Visual similarity can be too precise—showing 20 nearly identical images is useless. Pinterest uses Maximal Marginal Relevance (MMR) to balance relevance and diversity:

\text{MMR} = \arg\max_{d \in R \setminus S} \left[ \lambda \cdot \text{sim}(d, q) - (1 - \lambda) \cdot \max_{s \in S} \text{sim}(d, s) \right]

Where:

  • R = candidate pins
  • S = already selected pins
  • q = query/seed pin
  • λ = relevance/diversity tradeoff
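A greedy implementation sketch of MMR over embedding vectors (toy 2-D data; Pinterest's production version operates on its pin embeddings):

```python
import numpy as np

def mmr_select(query, candidates, k=5, lam=0.7):
    """Greedily pick items maximizing lam*rel(d,q) - (1-lam)*max redundancy."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            rel = cos(candidates[i], query)
            red = max((cos(candidates[i], candidates[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
cands = [np.array([1.0, 0.0]), np.array([0.99, 0.1]), np.array([0.0, 1.0])]
picked = mmr_select(query, cands, k=2, lam=0.3)  # near-duplicate gets skipped
```

With a low λ, the second slot goes to the dissimilar candidate rather than the near-duplicate of the first pick—exactly the "don't show 20 identical pins" behavior.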
Python
class PinterestVisualRecSys:
    """
    Pinterest's visual recommendation system.

    Key insight: In an image platform, visual similarity
    is often more important than collaborative signals.
    """

    def __init__(self):
        # Visual feature extractor (pre-trained CNN)
        self.visual_encoder = torchvision.models.resnet50(pretrained=True)
        self.visual_encoder = nn.Sequential(*list(self.visual_encoder.children())[:-1])

        # Graph neural network for pin-board relationships
        self.pinsage = PinSage(
            embedding_dim=256,
            num_layers=3,
        )

    def get_pin_embedding(self, pin_id: str, pin_image: np.ndarray) -> np.ndarray:
        """
        Combine visual and graph embeddings.
        """
        # Visual features from image
        with torch.no_grad():
            image_tensor = self.preprocess(pin_image)
            visual_emb = self.visual_encoder(image_tensor).squeeze()

        # Graph features from pin-board-user graph
        graph_emb = self.pinsage.get_embedding(pin_id)

        # Concatenate
        return np.concatenate([visual_emb, graph_emb])

    def related_pins(self, pin_id: str, pin_image: np.ndarray, n: int = 20) -> list:
        """
        Find visually and contextually similar pins.
        """
        query_emb = self.get_pin_embedding(pin_id, pin_image)

        # ANN search
        similar_pins = self.ann_index.search(query_emb, k=n * 2)

        # Diversify (don't show 20 nearly identical images)
        diversified = self.maximal_marginal_relevance(similar_pins, n)

        return diversified

class PinSage(nn.Module):
    """
    PinSage: Graph Neural Network for Pinterest.
    Paper: https://arxiv.org/abs/1806.01973

    Key innovation: Random walk sampling for scalable GNN.
    """

    def __init__(self, embedding_dim: int, num_layers: int):
        super().__init__()

        self.layers = nn.ModuleList([
            PinSageConv(embedding_dim)
            for _ in range(num_layers)
        ])

    def forward(self, pin_features, neighbor_features, neighbor_weights):
        """
        Aggregate neighbor information.

        Unlike standard GNN: Uses importance-sampled neighbors
        from random walks, not all neighbors.
        """
        h = pin_features

        for layer in self.layers:
            # Aggregate from sampled neighbors
            neighbor_agg = self.aggregate(neighbor_features, neighbor_weights)

            # Combine with self
            h = layer(h, neighbor_agg)

        return h

Pinterest lessons:

  • Visual features from CNNs are first-class signals
  • PinSage enabled GNNs at billion scale via sampling
  • Diversification is crucial (avoid "more of the same")
  • Board context provides implicit interest clustering

ByteDance/TikTok: Real-Time Learning

TikTok's recommendation system is legendary—often cited as the most engaging recommendation algorithm ever built. A new user gets eerily personalized content within minutes. The secret? Real-time learning that updates models as users interact, rather than waiting for daily batch training.

Why real-time learning matters for TikTok:

TikTok's content cycle is extremely fast:

  • Videos go viral in hours, not days
  • Trends emerge and die within a single day
  • User preferences shift based on current mood
  • New creators need exposure immediately

A model trained yesterday is already stale. Traditional batch training (daily/weekly) can't keep up.

The speed advantage:

Code
Traditional Recommendation Pipeline:
───────────────────────────────────────────────────────────────────────────
Day 1: User likes cat videos
Day 2: Batch training incorporates yesterday's data
Day 3: Model deployed with updated weights
Day 4: User finally sees more cat videos

→ 3-day lag between signal and adaptation

TikTok's Real-Time Pipeline:
───────────────────────────────────────────────────────────────────────────
Second 0: User likes cat video
Second 1: Event streamed to Kafka
Second 2: Online model update
Second 60: Model synced to serving
Second 61: User sees more cat videos

→ ~1 minute lag

The Monolith architecture (2022 paper):

ByteDance published their system in "Monolith: Real Time Recommendation System With Collisionless Embedding Table". Key innovations:

  1. Collisionless embedding tables: Traditional embedding tables use hashing (item_id % table_size), which causes collisions—distinct items share a row. Monolith replaces this with a dynamically growing cuckoo hashmap that gives every ID its own row—no collisions, better personalization.

  2. Online training from streaming data: Events (watches, likes, shares) stream to Kafka. Training workers consume events and update model parameters in real-time.

  3. Parameter server architecture: Model parameters live in a distributed parameter server. Training pushes gradients, serving pulls latest weights.

  4. Time-window features: Features like "user engagement in last 5 minutes" are computed on streaming data, not batch.
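An illustrative contrast between a hashed table and a collisionless one (a simplification—Monolith's real table is a cuckoo hashmap with eviction and feature expiry):

```python
import numpy as np

class HashedEmbedding:
    """Classic fixed-size table: rows indexed by item_id % size, so distinct
    IDs can collide onto the same row."""
    def __init__(self, size, dim, seed=0):
        self.table = np.random.default_rng(seed).normal(size=(size, dim))
        self.size = size

    def lookup(self, item_id):
        return self.table[item_id % self.size]

class CollisionlessEmbedding:
    """Monolith-style idea: one row per observed ID, table grows on demand."""
    def __init__(self, dim, seed=0):
        self.rows, self.dim = {}, dim
        self.rng = np.random.default_rng(seed)

    def lookup(self, item_id):
        if item_id not in self.rows:
            self.rows[item_id] = self.rng.normal(size=self.dim)
        return self.rows[item_id]

hashed = HashedEmbedding(size=100, dim=8)
grows = CollisionlessEmbedding(dim=8)
# IDs 7 and 107 share a row in the hashed table, but not the collisionless one.
```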

The For You Page (FYP) algorithm:

TikTok's FYP combines:

  1. Candidate generation: Multiple retrievers (embedding similarity, trending, following, etc.) generate ~1000 candidates

  2. Ranking model: A multi-task neural network predicts:

    • Watch time probability
    • Like probability
    • Share probability
    • Comment probability
    • Follow probability
  3. Exploration: New content gets "exploration traffic" to gather engagement signals. Low-engagement content gets filtered quickly.

  4. Business rules: Creator fairness, content diversity, safety filters

Why TikTok "gets you" so fast:

The real-time system enables extremely fast adaptation:

  • Swipe patterns: Which videos you watch 5+ seconds vs. skip immediately
  • Engagement signals: Likes, comments, shares, follows
  • Implicit signals: Rewatch, slow-scroll, tap to unmute

Even 5-10 interactions give the model strong signal about your current interests.

The addiction machine critique:

TikTok's effectiveness has drawn criticism—the algorithm is so good at engaging users that it can be addictive. The same real-time learning that enables personalization also enables behavioral manipulation. This is a real ethical consideration for anyone building recommendation systems.

Python
class ByteDanceRealTimeLearning:
    """
    ByteDance's approach: Update models in real-time
    as users interact.

    Traditional: Train daily/weekly → deploy
    ByteDance: Train continuously → sync every minute
    """

    def __init__(self):
        # Streaming data pipeline
        self.kafka_consumer = KafkaConsumer('user_actions')

        # Online training
        self.model = RealTimeModel()
        self.parameter_server = ParameterServer()

        # Feature store (low-latency)
        self.feature_store = RedisFeatureStore()

    def online_training_loop(self):
        """
        Continuous training from streaming events.
        """
        for event in self.kafka_consumer:
            # Event: {user_id, video_id, action, timestamp}

            # Create training example
            user_features = self.feature_store.get(event['user_id'])
            video_features = self.feature_store.get(event['video_id'])

            label = 1 if event['action'] == 'watch_complete' else 0

            # Online update
            loss = self.model.train_step(user_features, video_features, label)

            # Sync to parameter server (batched)
            if self.should_sync():
                self.parameter_server.push_gradients(self.model.get_gradients())

    def serve(self, user_id: str, candidate_videos: list) -> list:
        """
        Serve with latest model parameters.
        """
        # Pull latest parameters (sync every ~1 minute)
        self.model.load_from_parameter_server(self.parameter_server)

        # Score candidates
        scores = self.model.predict(user_id, candidate_videos)

        return sorted(zip(candidate_videos, scores), key=lambda x: x[1], reverse=True)

ByteDance/TikTok lessons:

  • Real-time learning captures trending content immediately
  • Collision-less embeddings handle billion-scale item catalogs
  • Watch time prediction > click prediction
  • Short video = fast feedback loop = faster model iteration

LinkedIn: Professional Graph Structure

LinkedIn's recommendation challenges are fundamentally different from entertainment platforms. On Netflix or TikTok, the goal is engagement. On LinkedIn, recommendations must be professionally relevant—a bad connection recommendation isn't just annoying, it's embarrassing. "Do I know this person? Should I know them? Will they think I'm spam?"

What makes professional recommendations different:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│        ENTERTAINMENT vs. PROFESSIONAL RECOMMENDATIONS                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ENTERTAINMENT (Netflix, Spotify, TikTok)                               │
│  ─────────────────────────────────────────                               │
│  Goal: Maximize engagement                                               │
│  Mistake cost: User skips, no big deal                                   │
│  Signal: Implicit (watch time, skips)                                    │
│  Graph: Weak (users don't know each other)                               │
│                                                                          │
│  PROFESSIONAL (LinkedIn)                                                │
│  ────────────────────────                                                │
│  Goal: Maximize professional value                                       │
│  Mistake cost: Embarrassment, spam perception                            │
│  Signal: Explicit (connections are deliberate)                           │
│  Graph: Strong (users KNOW each other in real life)                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The professional graph is gold:

LinkedIn's most valuable asset is the connection graph. It encodes real-world professional relationships:

  • Same company: Colleagues who work together
  • Same school: Alumni who share educational background
  • Same industry: Professionals in the same field
  • Career trajectory: Job transitions that predict future moves

This graph structure is far more predictive than any content-based signal.

People You May Know (PYMK):

PYMK is LinkedIn's most important recommendation surface—it drives network growth. The algorithm combines multiple signals:

  1. Mutual connections: The single most predictive feature. If you and X share 5+ mutual connections, you likely know each other.

  2. Shared affiliations:

    • Same current company (colleagues)
    • Same past company (former colleagues)
    • Same school (classmates, alumni)
  3. Profile similarity:

    • Same job title/function
    • Same industry
    • Same location
  4. Graph structure:

    • 2nd degree connections (friends of friends)
    • Clustering coefficient (are your friends friends with each other?)

The "triangle closing" pattern:

The strongest PYMK signal is closing triangles. If A→B and A→C, then B→C is likely:

Code
     A
    / \
   B   C
    \?/

A knows B (connection)
A knows C (connection)
B probably knows C (recommend!)

This simple heuristic drives a huge portion of PYMK recommendations.

GLMix: Generalized Linear Mixed Models

LinkedIn developed GLMix (2015) to personalize at scale. The insight: some effects are global (industry trends), some are per-user (individual preferences), some are per-item (company popularity).

GLMix combines:

  • Fixed effects: Global model parameters (same for all users)
  • Random effects: Per-user and per-item parameters

This allows personalization without overfitting on sparse data.
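
A toy sketch of the GLMix idea: a global fixed-effect model plus per-user and per-item random-effect offsets. The class and field names are illustrative, not LinkedIn's implementation:

```python
import numpy as np

class ToyGLMix:
    """Fixed effects (global weights) + random effects (per-user vectors, per-item biases)."""

    def __init__(self, n_features: int):
        self.global_w = np.zeros(n_features)  # fixed effects: shared by all users
        self.user_w = {}                      # random effects: one deviation vector per user
        self.item_bias = {}                   # random effects: one bias per item

    def score(self, user_id, item_id, x: np.ndarray) -> float:
        u_w = self.user_w.get(user_id, np.zeros_like(self.global_w))
        b_i = self.item_bias.get(item_id, 0.0)
        # Global trend + personal deviation + item popularity
        return float(self.global_w @ x + u_w @ x + b_i)

model = ToyGLMix(n_features=3)
model.global_w = np.array([0.5, 0.2, -0.1])
model.user_w["alice"] = np.array([0.3, 0.0, 0.0])  # alice weights feature 0 more
model.item_bias["job42"] = 0.2                     # a popular posting

x = np.array([1.0, 1.0, 1.0])
print(model.score("alice", "job42", x))  # ~1.1
print(model.score("bob", "job99", x))    # ~0.6 (cold user/item: global model only)
```

This shows why random effects resist overfitting on sparse data: users or items with no history simply fall back to the global fixed-effect score.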

Job recommendations: A different problem

Job recommendations have unique constraints:

  • Jobs expire: Unlike songs, jobs disappear when filled
  • Location matters: Users won't relocate for every job
  • Seniority matching: Entry-level vs. senior roles
  • Career trajectory: Recommend realistic next steps

LinkedIn's job recommendation combines:

  • Skills matching (job requirements vs. user skills)
  • Career path modeling (what jobs do similar people take next?)
  • Company affinity (has user engaged with this company?)
  • Freshness (prefer recently posted jobs)
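
The signal mix above could be sketched as a weighted score; the weights, field names, and decay window below are hypothetical, not LinkedIn's actual formula:

```python
from datetime import datetime, timezone

def job_score(user: dict, job: dict, now: datetime) -> float:
    """Hypothetical job-ranking score combining the four signals above."""
    # Skills matching: Jaccard overlap between required and user skills
    required, skills = set(job["required_skills"]), set(user["skills"])
    skills_match = len(required & skills) / len(required | skills) if required | skills else 0.0

    # Seniority matching: penalize level mismatch (0 = entry ... 3 = senior)
    seniority = 1.0 - min(abs(user["level"] - job["level"]) / 3.0, 1.0)

    # Freshness: decay linearly to zero over 30 days
    age_days = (now - job["posted_at"]).days
    freshness = max(0.0, 1.0 - age_days / 30.0)

    # Company affinity: has the user engaged with this company before?
    affinity = 1.0 if job["company"] in user["followed_companies"] else 0.0

    return 0.4 * skills_match + 0.25 * seniority + 0.2 * freshness + 0.15 * affinity

user = {"skills": ["python", "ml"], "level": 1, "followed_companies": {"acme"}}
job = {"required_skills": ["python", "ml", "sql"], "level": 1,
       "company": "acme", "posted_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
print(job_score(user, job, datetime(2024, 1, 8, tzinfo=timezone.utc)))
```

The hard freshness cutoff encodes the "jobs expire" constraint: a stale posting scores zero on that term no matter how good the match.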
Python
class LinkedInPYMK:
    """
    LinkedIn's People You May Know (PYMK).

    Key insight: Professional connections follow patterns
    (same company, same school, same industry).
    """

    def __init__(self):
        self.graph = ProfessionalGraph()
        self.embedding_model = GraphEmbedding()

    def people_you_may_know(self, user_id: str, n: int = 20) -> list:
        """
        Recommend connections based on graph structure.
        """
        candidates = []

        # 2nd degree connections (friends of friends)
        friends = self.graph.get_connections(user_id)
        for friend in friends:
            friends_of_friend = self.graph.get_connections(friend)
            for fof in friends_of_friend:
                if fof != user_id and fof not in friends:
                    candidates.append(fof)

        # Score by multiple signals
        scored = []
        for candidate in set(candidates):
            score = self.compute_pymk_score(user_id, candidate)
            scored.append((candidate, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [c for c, _ in scored[:n]]

    def compute_pymk_score(self, user_id: str, candidate_id: str) -> float:
        """
        Multi-factor scoring for PYMK.
        """
        score = 0

        # Mutual connections
        mutual = self.graph.mutual_connections(user_id, candidate_id)
        score += len(mutual) * 10

        # Same current company
        if self.same_company(user_id, candidate_id):
            score += 50

        # Same school
        if self.same_school(user_id, candidate_id):
            score += 30

        # Same industry
        if self.same_industry(user_id, candidate_id):
            score += 20

        # Profile similarity (skills, experience)
        similarity = self.profile_similarity(user_id, candidate_id)
        score += similarity * 10

        return score

LinkedIn lessons:

  • Graph structure is a powerful signal for professional networks
  • Multi-factor models (GLMix) handle heterogeneous features
  • Mutual connections are the strongest predictor
  • Different recommendation contexts need different models (jobs vs. people vs. content)

Part VIII: Evaluation Metrics

Understanding how to evaluate recommendation systems is as important as building them. Different metrics capture different aspects of quality, and the right choice depends on your use case.

Ranking Metrics

Most recommendation systems produce ranked lists. These metrics evaluate ranking quality:

Precision@K: What fraction of the top-K recommendations were relevant?

\text{Precision@K} = \frac{|\text{relevant items in top K}|}{K}

Recall@K: What fraction of all relevant items appeared in top K?

\text{Recall@K} = \frac{|\text{relevant items in top K}|}{|\text{all relevant items}|}

Hit Rate@K (HR@K): Did at least one relevant item appear in top K? Binary version of Recall.

NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality, giving higher scores when relevant items appear earlier.

\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}

Where IDCG is the DCG of the ideal (perfect) ranking.

MRR (Mean Reciprocal Rank): Average of reciprocal ranks of first relevant item.

\text{MRR} = \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{\text{rank}_u}

MAP (Mean Average Precision): Average of precision at each relevant item's position.

Python
import numpy as np
from typing import List, Set

def precision_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Precision@K: fraction of top-K that are relevant."""
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / k

def recall_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Recall@K: fraction of relevant items in top K."""
    if len(relevant) == 0:
        return 0.0
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """NDCG@K: normalized discounted cumulative gain."""
    recommended_k = recommended[:k]

    # DCG
    dcg = 0.0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because position is 1-indexed

    # IDCG (ideal: all relevant items at top)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0

def mrr(recommended_lists: List[List[int]], relevant_sets: List[Set[int]]) -> float:
    """Mean Reciprocal Rank across users."""
    reciprocal_ranks = []

    for recommended, relevant in zip(recommended_lists, relevant_sets):
        for rank, item in enumerate(recommended, 1):
            if item in relevant:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)

    return np.mean(reciprocal_ranks)

def map_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Average Precision@K."""
    recommended_k = recommended[:k]
    precisions = []

    hits = 0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            hits += 1
            precisions.append(hits / (i + 1))

    return np.mean(precisions) if precisions else 0.0

# Example usage
recommended = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # Top 10 recommendations
relevant = {3, 7, 15, 22, 25}  # Ground truth relevant items

print(f"Precision@10: {precision_at_k(recommended, relevant, 10):.3f}")  # 0.300
print(f"Recall@10: {recall_at_k(recommended, relevant, 10):.3f}")        # 0.600
print(f"NDCG@10: {ndcg_at_k(recommended, relevant, 10):.3f}")            # ~0.5
print(f"MAP@10: {map_at_k(recommended, relevant, 10):.3f}")              # ~0.4

Beyond Accuracy: Diversity and Coverage

Accuracy metrics don't capture everything. A system that recommends only popular items might have high precision but provide poor user experience.

Coverage: What fraction of items ever get recommended?

\text{Coverage} = \frac{|\text{items recommended to at least one user}|}{|\text{all items}|}

Diversity (Intra-List): How different are items within a recommendation list?

\text{Diversity} = \frac{1}{|L|(|L|-1)} \sum_{i \neq j} (1 - \text{sim}(i, j))

Novelty: Are we recommending items users haven't seen before?

Serendipity: Are we recommending surprising but relevant items?
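
Coverage and intra-list diversity are straightforward to compute offline. A minimal sketch, assuming items are represented by embedding vectors and similarity is cosine:

```python
import numpy as np

def coverage(recommended_lists: list, n_items: int) -> float:
    """Fraction of the catalog that appears in at least one user's list."""
    seen = set()
    for recs in recommended_lists:
        seen.update(recs)
    return len(seen) / n_items

def intra_list_diversity(recs: list, embeddings: dict) -> float:
    """Average pairwise (1 - cosine similarity) within one recommendation list."""
    vecs = [embeddings[i] / np.linalg.norm(embeddings[i]) for i in recs]
    total, pairs = 0.0, 0
    for i in range(len(vecs)):
        for j in range(len(vecs)):
            if i != j:
                total += 1.0 - float(vecs[i] @ vecs[j])
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy catalog of 10 items; items 0 and 2 are identical, item 1 is orthogonal
embeddings = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}
print(coverage([[0, 1], [1, 2]], n_items=10))       # 0.3
print(intra_list_diversity([0, 1, 2], embeddings))  # ~0.667
```

A popularity-biased recommender scores poorly on both: the same few items appear everywhere (low coverage) and lists of near-duplicates have low intra-list diversity.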

Online vs Offline Evaluation

Offline evaluation uses historical data to measure predictive accuracy. It's fast and cheap but has limitations:

  • Only measures prediction of past behavior
  • Can't capture exploration value
  • Misses business impact

Online evaluation (A/B testing) measures actual user behavior:

  • Click-through rate (CTR)
  • Conversion rate
  • Watch time / session duration
  • Return visits
  • Revenue per user

The offline-online gap: Models that win offline often don't win online. Selection bias in historical data, feedback loops, and missing features all contribute.

Best practice: Use offline metrics for rapid iteration during development, but always validate with A/B tests before production deployment.


Part IX: Practical Comparison

When to Use What

Code
Method                 Best For                           Pros                             Cons
─────────────────────  ─────────────────────────────────  ───────────────────────────────  ───────────────────────────
Item-based CF          Small catalogs, explainability     Simple, interpretable            Doesn't scale
Matrix Factorization   Cold start items (with content)    Efficient, proven                Linear only
BPR                    Implicit feedback ranking          Optimizes ranking directly       Needs negative sampling
NCF                    When non-linearity helps           Flexible                         Not always better than MF
Wide & Deep            Apps with rich features            Memorization + generalization    Needs feature engineering
DeepFM                 CTR prediction                     Automatic feature crosses        Compute intensive
Two-Tower              Large-scale retrieval              Fast serving (ANN)               Decoupled = less expressive
GRU4Rec                Session-based                      Captures order                   Slow, short-term focus
Transformers           Long sequences, SOTA               Best accuracy                    Expensive

Production Recommendations

  1. Start simple: Matrix factorization or item-based CF is often 80% of the value
  2. Two-tower for scale: When you have millions of items, you need ANN retrieval
  3. Add complexity gradually: NCF → DeepFM → Transformers, with A/B tests at each step
  4. Consider latency: More complex ≠ better if you can't serve it fast enough



Conclusion

Recommendation systems have evolved dramatically over three decades:

  1. 1990s-2000s: Collaborative filtering and content-based methods
  2. 2006-2015: Matrix factorization and the Netflix Prize revolution
  3. 2016-2018: Deep learning enters with NCF, Wide & Deep, and sequential models
  4. 2018+: Transformers take over (see next post)

Each era built on the last. Matrix factorization introduced embeddings. Deep learning added non-linearity. Transformers enabled direct modeling of long-range dependencies.

Understanding this evolution isn't just historical—it's practical. Many production systems still use "older" techniques where they excel. The best recommendation engineers know when to use matrix factorization, when to add deep learning, and when to go all-in on transformers.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
