Recommendation Systems: From Collaborative Filtering to Deep Learning
In-depth journey through recommendation system architectures. From the Netflix Prize and matrix factorization to neural collaborative filtering and two-tower models—understand the foundations before the transformer revolution.
The Evolution of Recommendation Systems
Recommendation systems are among the most impactful applications of machine learning. They power what you watch on Netflix, what you buy on Amazon, what you listen to on Spotify, and what you scroll on TikTok. Understanding their evolution—from simple heuristics to trillion-parameter transformers—is essential for any ML engineer.
This post covers the foundational architectures that came before the transformer revolution. These methods remain crucial: many production systems still use them, and modern approaches build upon their insights.
┌─────────────────────────────────────────────────────────────────────────┐
│ RECOMMENDATION SYSTEMS TIMELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1990s CONTENT-BASED & COLLABORATIVE FILTERING │
│ ───── Nearest neighbors, TF-IDF, item-item similarity │
│ │
│ 2006-2009 MATRIX FACTORIZATION ERA (Netflix Prize) │
│ ───────── SVD, ALS, BPR - latent factors revolutionize CF │
│ │
│ 2016 DEEP LEARNING ENTERS RECSYS │
│ ───── GRU4Rec, Wide & Deep, Neural Collaborative Filtering │
│ │
│ 2017-2018 TWO-TOWER MODELS & FEATURE INTERACTIONS │
│ ───────── DSSM, YouTube DNN, DeepFM, DCN │
│ │
│ 2018+ ATTENTION & TRANSFORMERS │
│ ───── SASRec, BERT4Rec, HSTU (see next post) │
│ │
│ 2024+ GENERATIVE & LLM-POWERED │
│ ───── TIGER, OneRec, LLM augmentation (see GenAI post) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part I: Classical Recommendation Methods
The Three Paradigms
Before diving into algorithms, understand the three fundamental approaches:
┌─────────────────────────────────────────────────────────────────────────┐
│ THREE PARADIGMS OF RECOMMENDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CONTENT-BASED FILTERING │
│ ───────────────────────────── │
│ │
│ "Recommend items similar to what you've liked before" │
│ │
│ User liked: Action movies with Tom Cruise │
│ Recommend: More action movies, more Tom Cruise movies │
│ │
│ Pros: No cold-start for items, explainable │
│ Cons: Limited discovery, requires item features │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. COLLABORATIVE FILTERING │
│ ─────────────────────────────── │
│ │
│ "Users who liked X also liked Y" │
│ │
│ User A liked: [Inception, Interstellar, The Matrix] │
│ User B liked: [Inception, Interstellar, Arrival] │
│ → Recommend Arrival to User A, The Matrix to User B │
│ │
│ Pros: Discovers unexpected items, no item features needed │
│ Cons: Cold-start problem, popularity bias │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. HYBRID METHODS │
│ ────────────────── │
│ │
│ Combine content and collaborative signals │
│ │
│ Examples: │
│ • Use content for cold items, CF for warm items │
│ • Weighted ensemble of both approaches │
│ • Joint models (modern deep learning) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
User-Based Collaborative Filtering
User-based collaborative filtering was the original recommender algorithm, pioneered at MIT Media Lab and GroupLens in the mid-1990s. The core intuition is beautifully simple: if two users agreed in the past, they'll probably agree in the future.
How it works step-by-step:
1. Build the user-item matrix: Rows are users, columns are items, values are ratings (or 0 for no interaction). A user who rated 100 movies has 100 non-zero entries in their row.
2. Compute user-user similarity: For each pair of users, calculate how similar their rating patterns are. The most common metric is cosine similarity—it measures the angle between two rating vectors, ignoring magnitude (a user who rates everything 5 stars vs 3 stars can still be similar in preferences).
3. Find k-nearest neighbors: For a target user, find the k users with highest similarity scores. These are the user's "taste neighbors."
4. Predict ratings: To predict what the target user would rate an unseen item, take a weighted average of how the neighbors rated it. Users more similar to the target get higher weight.
The similarity formula:

$$\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui}\, r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2}\, \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}$$

Where $I_{uv}$ is the set of items both users rated, and $r_{ui}$ is user $u$'s rating for item $i$.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_k(u)} \text{sim}(u, v)\, r_{vi}}{\sum_{v \in N_k(u)} \text{sim}(u, v)}$$

This is a weighted average: neighbors who are more similar contribute more to the prediction.
Why it often fails in practice:
- Sparsity: With millions of users and items, the user-item matrix is 99%+ zeros. Finding users who overlap on rated items is hard.
- Scalability: Computing all pairwise user similarities is O(n² × m) where n is users and m is items. This doesn't scale.
- Cold start: New users have no ratings, so they have no neighbors.
Despite these limitations, understanding user-based CF is essential—it introduces the fundamental concepts that all modern recommenders build upon.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class UserBasedCF:
    """
    User-based collaborative filtering.
    Find similar users → recommend their items.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.user_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
                     0 = no interaction, otherwise = rating/score
        """
        self.user_item_matrix = ratings

        # Compute user-user similarity:
        # cosine similarity on rating vectors
        self.user_similarity = cosine_similarity(ratings)

        # Zero out self-similarity
        np.fill_diagonal(self.user_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""
        # Find users who rated this item
        item_ratings = self.user_item_matrix[:, item_id]
        users_who_rated = np.where(item_ratings > 0)[0]

        if len(users_who_rated) == 0:
            return 0  # No data

        # Get similarities to these users
        similarities = self.user_similarity[user_id, users_who_rated]

        # Take top-k
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_users = users_who_rated[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average of their ratings
        if top_k_sims.sum() == 0:
            return item_ratings[top_k_users].mean()

        weighted_sum = (top_k_sims * item_ratings[top_k_users]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations for user."""
        # Items user hasn't interacted with
        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        # Predict scores for unseen items
        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]

        # Sort by score
        scores.sort(key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in scores[:n]]
```
Problems with User-Based CF:
- Scalability: Computing user-user similarity is O(n²) in users
- Sparsity: Most user pairs have no common items
- Volatility: User preferences change; similarity matrix becomes stale
Item-Based Collaborative Filtering
In 2003, Amazon published a paper that revolutionized e-commerce recommendations. Their key insight: instead of finding similar users, find similar items. This seemingly small change solves most of user-based CF's problems.
Why item-based is better than user-based:
1. Stability: Item relationships are more stable than user relationships. "The Matrix" and "Inception" will always be similar sci-fi movies. But a user's taste can change—they might suddenly start watching documentaries.
2. Scalability: You have fewer items than users (typically 10-100x fewer). Computing item-item similarity is O(m²) where m is items, vs O(n²) for users. More importantly, item similarities can be precomputed offline.
3. Sparsity is less severe: Any two popular items have many users in common (who rated both). But two random users might have rated zero items in common.
4. Explainability: "We recommend X because you bought Y, and customers who bought Y also bought X" is intuitive. User-based explanations ("we recommend X because user #847592 liked it") are creepy.
How it works:
1. Build the item-item similarity matrix: For each pair of items, compute similarity based on which users rated them. Two items are similar if the same users tend to rate them similarly.
2. For a user's prediction: Look at items the user has already rated. For each candidate item, find its similarity to the user's rated items. The prediction is a weighted sum.
The item similarity formula:

$$\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\, \sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}$$

Where $U_{ij}$ is the set of users who rated both items.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{j \in N_k(i; u)} \text{sim}(i, j)\, r_{uj}}{\sum_{j \in N_k(i; u)} \text{sim}(i, j)}$$

For item $i$, look at the $k$ most similar items that the user has rated, and take a weighted average of those ratings.
Production considerations:
- Precompute offline: Item similarities rarely change. Compute once daily/weekly.
- Truncate to top-k: Don't store all m² similarities. Keep only top-100 neighbors per item.
- Incremental updates: When a new item gets enough ratings, compute its similarities without recomputing everything.
This is still Amazon's core algorithm today, enhanced with deep learning features.
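The "truncate to top-k" trick above can be sketched in a few lines. `truncate_similarity` is a hypothetical helper name, and `k=100` mirrors the rule of thumb mentioned in the list:

```python
import numpy as np

def truncate_similarity(sim: np.ndarray, k: int = 100) -> np.ndarray:
    """Keep only the top-k neighbors per item; zero out the rest.

    Illustrative sketch: instead of storing all m^2 similarities,
    keep a (sparse) top-k neighbor list per item.
    """
    truncated = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        # Indices of the k largest similarities for item i
        top_k = np.argsort(sim[i])[-k:]
        truncated[i, top_k] = sim[i, top_k]
    return truncated
```

In production the result would typically be stored as an inverted list (item → top-k neighbors with scores) rather than a dense matrix.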
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class ItemBasedCF:
    """
    Item-based collaborative filtering (Amazon's approach).
    Find similar items → recommend based on user's history.

    Key insight: Item similarities are more stable than user similarities.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.item_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
        """
        self.user_item_matrix = ratings

        # Compute item-item similarity.
        # Transpose: each row is an item's rating profile across users
        self.item_similarity = cosine_similarity(ratings.T)

        # Zero out self-similarity
        np.fill_diagonal(self.item_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""
        # Find items the user has rated
        user_ratings = self.user_item_matrix[user_id]
        rated_items = np.where(user_ratings > 0)[0]

        if len(rated_items) == 0:
            return 0

        # Similarity of target item to user's rated items
        similarities = self.item_similarity[item_id, rated_items]

        # Take top-k similar items
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_items = rated_items[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average
        if top_k_sims.sum() == 0:
            return user_ratings[top_k_items].mean()

        weighted_sum = (top_k_sims * user_ratings[top_k_items]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations."""
        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]
        scores.sort(key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in scores[:n]]
```
Why Item-Based is Better:
- Stability: Items don't change; their similarity is stable
- Scalability: Pre-compute item similarities offline
- Explainability: "Because you liked X, and Y is similar to X"
Part II: Matrix Factorization (The Netflix Prize Era)
The $1 Million Challenge
In 2006, Netflix offered $1 million to anyone who could beat their recommendation algorithm by 10%. This sparked a revolution in collaborative filtering.
The key insight: Latent factors. Instead of computing explicit item-item or user-user similarity, learn low-dimensional representations that capture underlying preferences.
┌─────────────────────────────────────────────────────────────────────────┐
│ MATRIX FACTORIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: │
│ ───────── │
│ │
│ Rating Matrix R (num_users × num_items): │
│ │
│ Item1 Item2 Item3 Item4 ... │
│ User1 [ 5 ? 3 ? ...] │
│ User2 [ ? 4 ? 2 ...] │
│ User3 [ 4 ? 5 ? ...] │
│ ... │
│ │
│ Most entries are missing (? = unknown) │
│ Goal: Predict the missing entries │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: FACTORIZE R ≈ P × Q^T │
│ ───────────────────────────────── │
│ │
│ P: User matrix (num_users × k) │
│ Each row = user's latent preferences │
│ │
│ Q: Item matrix (num_items × k) │
│ Each row = item's latent attributes │
│ │
│ Prediction: r̂_ui = p_u · q_i (dot product) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTUITION: │
│ ─────────── │
│ │
│ k latent factors might represent: │
│ • Factor 1: How much the movie is action vs drama │
│ • Factor 2: How mainstream vs indie │
│ • Factor 3: How visually stunning vs dialogue-driven │
│ • ... │
│ │
│ User embedding captures: "I like action, mainstream, visual" │
│ Item embedding captures: "This movie is action, indie, visual" │
│ Dot product: How well do they match? │
│ │
└─────────────────────────────────────────────────────────────────────────┘
SVD and ALS
Matrix factorization is the technique that won the Netflix Prize. The idea is elegant: decompose the sparse user-item rating matrix into two smaller, dense matrices that capture latent factors.
Wait, what is SVD exactly?
First, let's clarify the terminology. Singular Value Decomposition (SVD) is a fundamental theorem from linear algebra:
Any matrix $R$ can be decomposed as:

$$R = U \Sigma V^T$$

Where:
- $U$ = orthonormal matrix (users × k) — left singular vectors
- $\Sigma$ = diagonal matrix (k × k) — singular values, in decreasing order
- $V^T$ = orthonormal matrix (k × items) — right singular vectors
Classical SVD gives the optimal low-rank approximation of any matrix (Eckart-Young theorem). If you keep only the top-k singular values, you get the closest rank-k matrix to the original.
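As a quick sanity check of the Eckart-Young property, here is a minimal NumPy sketch on a random dense matrix (nothing recommendation-specific; the sizes are arbitrary):

```python
import numpy as np

# A small dense matrix (classical SVD needs complete data)
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 4))

# Thin SVD: U (6x4), s (4,), Vt (4x4)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, R)  # exact reconstruction

# Rank-2 truncation: keep the top-2 singular values
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error of the best rank-k approximation
# equals the root-sum-square of the discarded singular values
err = np.linalg.norm(R - R_k)
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```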
The problem: Classical SVD doesn't work for recommendations
Netflix's rating matrix is 99.9% empty (users rate <1% of movies). Classical SVD:
- Requires a complete matrix (no missing values)
- Would treat missing ratings as zeros (wrong!)
- Is computationally expensive (O(min(m,n)²·max(m,n)) for m×n matrix)
Netflix Matrix Reality:
─────────────────────────────────────────────────────────────────────────
Users: 480,000
Movies: 18,000
Possible ratings: 8.6 billion
Actual ratings: 100 million (~1.2% filled)
If we treat missing = 0, SVD thinks users HATE 99% of movies.
That's obviously wrong—they just haven't seen them.
The recommendation "SVD" is actually something different
What the Netflix Prize community calls "SVD" is actually regularized matrix factorization optimized only on observed entries. It's inspired by SVD but doesn't compute the actual decomposition.
The key insight: Only minimize error on ratings that exist:

$$\min_{P, Q} \sum_{(u,i) \in \text{observed}} (r_{ui} - p_u \cdot q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right)$$

This is fundamentally different from classical SVD, which considers all entries.
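A toy NumPy sketch of that objective, with 0 marking missing entries (the shapes, values, and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5., 0., 3.],
              [0., 4., 0.],
              [4., 0., 5.]])   # 0 = missing, NOT "hated"
mask = R > 0                    # observed entries only

P = rng.normal(0, 0.1, (3, 2))  # user factors (3 users, k=2)
Q = rng.normal(0, 0.1, (3, 2))  # item factors (3 items, k=2)
lam = 0.02                      # regularization strength

pred = P @ Q.T
# Squared error summed ONLY over observed entries, plus L2 regularization
loss = ((R - pred)[mask] ** 2).sum() + lam * ((P ** 2).sum() + (Q ** 2).sum())
```

Gradient descent (or ALS) on this masked loss never touches the missing cells, which is exactly what classical SVD cannot do.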
The core idea:
Instead of storing explicit similarities, learn a low-dimensional representation for each user and item. A user is represented by a vector of k numbers (their "taste profile"), and an item by another k-vector (its "attribute profile"). The predicted rating is just the dot product of these vectors.
Why latent factors work:
Think of the k dimensions as hidden attributes the algorithm discovers. For movies:
- Dimension 1 might capture "how action-packed vs dramatic"
- Dimension 2 might capture "how mainstream vs indie"
- Dimension 3 might capture "how serious vs comedic"
- etc.
A user who loves action but hates comedy would have [high, ?, low, ...] in their vector. An action comedy would have [high, ?, high, ...]. Their dot product captures partial compatibility.
The magic: we never define what the dimensions mean. The algorithm discovers useful latent factors automatically from patterns in the data. When you examine learned factors, they often align with interpretable concepts—but sometimes capture subtle patterns humans wouldn't notice.
The math:
Given rating matrix $R$ (users × items), find matrices $P$ (users × k) and $Q$ (items × k) such that:

$$R \approx P Q^T$$

The prediction for user $u$ and item $i$ is:

$$\hat{r}_{ui} = p_u \cdot q_i$$
Adding biases (crucial for accuracy):

Raw dot products miss important patterns: some users rate everything high (generous raters), some items are universally popular (hit movies). The improved model adds biases:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$$

Where:
- $\mu$ = global average rating (e.g., 3.5 on Netflix)
- $b_u$ = user bias (how much user $u$ deviates from average)
- $b_i$ = item bias (how much item $i$ deviates from average)
- $p_u \cdot q_i$ = user-item interaction (personalization beyond biases)
Example decomposition:
Predicting Alice's rating for "Inception":
─────────────────────────────────────────────────────────────────────────
Global average μ = 3.5
Alice's bias b_alice = +0.3 (she rates slightly above average)
Inception's bias b_inception = +0.8 (it's a popular, well-rated movie)
Alice's latent vector p_alice = [0.8, -0.2, 0.5] (likes action, dislikes romance, likes complex plots)
Inception's latent vector q_inception = [0.9, -0.1, 0.7] (action, minimal romance, complex plot)
Dot product: 0.8×0.9 + (-0.2)×(-0.1) + 0.5×0.7 = 0.72 + 0.02 + 0.35 = 1.09
Prediction: 3.5 + 0.3 + 0.8 + 1.09 = 5.69 → clipped to 5.0 (max rating)
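The arithmetic above checks out in a few lines (same illustrative numbers as in the example):

```python
import numpy as np

mu, b_alice, b_inception = 3.5, 0.3, 0.8
p_alice = np.array([0.8, -0.2, 0.5])
q_inception = np.array([0.9, -0.1, 0.7])

interaction = float(p_alice @ q_inception)      # 0.72 + 0.02 + 0.35 = 1.09
raw = mu + b_alice + b_inception + interaction  # 5.69
pred = min(raw, 5.0)                            # clip to the rating scale
```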
Two training approaches:
1. SGD (Stochastic Gradient Descent):
Iterate over observed ratings one at a time, compute gradient, update parameters:
```
For each observed rating r_ui:
    error = r_ui - (μ + b_u + b_i + p_u · q_i)
    b_u ← b_u + η(error - λ·b_u)
    b_i ← b_i + η(error - λ·b_i)
    p_u ← p_u + η(error·q_i - λ·p_u)
    q_i ← q_i + η(error·p_u - λ·q_i)
```

Pros: Simple, works well, low memory.
Cons: Sequential (hard to parallelize), sensitive to learning rate.
2. ALS (Alternating Least Squares):
The key insight: If we fix $Q$, solving for $P$ is a least squares problem with a closed-form solution. Vice versa.

Repeat until convergence:

```
1. Fix Q, solve for each p_u: p_u = (Q^T Q + λI)^{-1} Q^T r_u
2. Fix P, solve for each q_i: q_i = (P^T P + λI)^{-1} P^T r_i
```

Pros: Embarrassingly parallel (each user/item is independent), guaranteed convergence, works great for implicit feedback.
Cons: More complex, requires a matrix inversion per user/item.
ALS became the standard for large-scale systems (Spark MLlib, etc.) because you can distribute user updates across thousands of machines.
The loss function:

$$\min_{P, Q, b} \sum_{(u,i) \in \text{observed}} (r_{ui} - \mu - b_u - b_i - p_u \cdot q_i)^2 + \lambda \left( b_u^2 + b_i^2 + \|p_u\|^2 + \|q_i\|^2 \right)$$

The regularization term $\lambda$ prevents overfitting to the sparse observed ratings—critical because we're learning millions of parameters from limited data.
How SVD++ Won the Netflix Prize
The Netflix Prize wasn't won by basic SVD—it required several crucial innovations developed primarily by Yehuda Koren and his collaborators at AT&T Labs. Understanding these reveals what matters in production recommendation systems.
The challenge:
On October 2, 2006, Netflix released 100,480,507 ratings from 480,189 users on 17,770 movies. The goal: beat their existing system (Cinematch, RMSE = 0.9525 on test set) by 10% to achieve RMSE ≤ 0.8572. The prize: $1 million. Within a week, a team had already beaten Cinematch.
Netflix Prize Timeline (verified from competition records):
─────────────────────────────────────────────────────────────────────────
Oct 2006 (start): Cinematch baseline = 0.9525 RMSE
Nov 2006 (1 month in): Best = ~0.9042 (~5% improvement)
Nov 2007 (Progress Prize #1): KorBell = 0.8712 (8.43% improvement)
→ $50,000 prize, 2000+ hours of work
2008 (Progress Prize #2): BellKor = 0.8616 (~9.5% improvement)
Jun 26, 2009: BellKor's Pragmatic Chaos = 0.8558 (10.05%)
Jul 25, 2009: "The Ensemble" matches at 0.8554 (10.09%)
Sep 21, 2009 (WINNER): BellKor's Pragmatic Chaos = 0.8567 (10.06%)
The final 1.5% improvement took 2+ years of work by thousands of teams!
The photo finish: On the final day, two teams crossed the 10% threshold within minutes of each other. "The Ensemble" actually achieved a slightly better RMSE (0.8554 vs 0.8558), but BellKor's Pragmatic Chaos had submitted 20 minutes earlier. Per competition rules, BellKor won.
Innovation 1: SVD++ (Implicit Feedback)
Published by Koren in "Factorization Meets the Neighborhood" (KDD 2008), SVD++ was the breakthrough that made BellKor a frontrunner.
Basic SVD only uses explicit ratings. But there's more signal: the mere fact that a user rated an item carries information, regardless of whether the rating was high or low. A user who rated 50 horror movies is signaling interest in horror.
The SVD++ prediction formula:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \Big( p_u + \lvert I_u \rvert^{-1/2} \sum_{j \in I_u} y_j \Big)$$

Breaking down each component:

| Symbol | Meaning | Intuition |
|---|---|---|
| $\mu$ | Global mean rating | Baseline (e.g., 3.5 stars) |
| $b_u$ | User bias | "Alice rates 0.3 stars above average" |
| $b_i$ | Item bias | "Inception is rated 0.8 stars above average" |
| $p_u$ | User latent factors | Explicit preferences from ratings given |
| $q_i$ | Item latent factors | Item characteristics |
| $y_j$ | Implicit item factors | What rating item $j$ reveals about user taste |
| $I_u$ | Items user rated | Set of all items rated (regardless of rating value) |
| $\lvert I_u \rvert^{-1/2}$ | Normalization factor | Keeps the implicit sum on the same scale for light and heavy raters |
Why implicit feedback helps:
The $\sum_{j \in I_u} y_j$ term captures "what kind of user rates these specific items." Even without knowing the rating values, we learn something:
- A user who rated many Marvel movies → probably likes superhero films
- A user who rated many foreign films → probably enjoys non-Hollywood cinema
- A user who rated both → has diverse taste
The $y_j$ vectors learn these implicit signals automatically during training.
Performance: SVD++ alone achieved RMSE of 0.8911, while standard SVD achieved 0.8914. The combination with other techniques made the real difference.
Innovation 2: Temporal Dynamics (timeSVD++)
Published in "Collaborative Filtering with Temporal Dynamics" (KDD 2009), this paper showed that user preferences and item perceptions drift over time.
Koren discovered striking patterns in the Netflix data:
- Rating scale shift: In early 2004, mean ratings jumped from ~3.4 to ~3.6 stars (Netflix changed their UI)
- Movie aging effect: Older movies consistently receive higher ratings than newer ones
- User preference drift: Individual users' tastes evolve over months/years
The timeSVD++ model makes parameters time-dependent:
Time-dependent parameters:
─────────────────────────────────────────────────────────────────────────
User bias over time: b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_u = static user bias (doesn't change)
α_u = user-specific slope (some users' ratings drift faster)
dev_u(t) = sign(t - t̄_u) · |t - t̄_u|^β (deviation from user's mean time)
b_{u,Bin(t)} = bias for time period (weekly/monthly buckets)
Item bias over time: b_i(t) = b_i + b_{i,Bin(t)}
Movies don't have inherent drift, but perception changes over time
User factors over time: p_u(t) = p_u + α_u · dev_u(t)
User preferences evolve (someone who liked comedies in 2005 might
prefer dramas in 2008)
Dramatic impact: A timeSVD++ model with only 10 latent factors outperformed standard SVD with 200 factors. With 200 factors, timeSVD++ achieved RMSE = 0.8762—a massive improvement.
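A minimal sketch of the time-dependent user bias, assuming weekly bins and $\beta = 0.4$; the `user_bias_at` helper and all parameter values are illustrative, not fitted to real data:

```python
import numpy as np

def dev_u(t: float, t_mean: float, beta: float = 0.4) -> float:
    """Deviation of time t from the user's mean rating date: sign(t - t̄_u)·|t - t̄_u|^β."""
    return np.sign(t - t_mean) * abs(t - t_mean) ** beta

def user_bias_at(t, b_u, alpha_u, t_mean, bin_bias, bin_size=7.0):
    """b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)} with weekly bins.

    bin_bias is a hypothetical per-bin bias array (learned in training).
    """
    b = int(t // bin_size)  # which weekly bucket t falls into
    return b_u + alpha_u * dev_u(t, t_mean) + bin_bias[b]

# Illustrative values: bias drifts below/above the static b_u around t_mean
bin_bias = np.zeros(60)
early = user_bias_at(10.0, b_u=0.2, alpha_u=0.01, t_mean=100.0, bin_bias=bin_bias)
late = user_bias_at(300.0, b_u=0.2, alpha_u=0.01, t_mean=100.0, bin_bias=bin_bias)
```

With a positive slope, the drift term pushes the bias down before the user's mean rating date and up after it.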
Innovation 3: Restricted Boltzmann Machines (RBMs)
While matrix factorization dominated, Ruslan Salakhutdinov's RBM approach provided a fundamentally different model that captured complementary patterns.
RBMs are probabilistic neural networks that learn a joint distribution over user ratings. Unlike SVD which learns point estimates, RBMs model uncertainty in preferences.
Performance comparison:
- SVD alone: RMSE = 0.8914
- RBM alone: RMSE = 0.8990
- SVD + RBM blend: RMSE = 0.88
The key insight: models with different inductive biases make different errors. RBMs captured patterns that SVD missed (and vice versa), making their combination more powerful than either alone.
Innovation 4: Neighborhood Models
Despite deep learning hype, simple item-item similarity models captured patterns that latent factor models missed. The winning solutions combined both:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u + \lvert R^k(i;u) \rvert^{-1/2} \sum_{j \in R^k(i;u)} w_{ij} \, (r_{uj} - b_{uj})$$

Where $R^k(i;u)$ is the set of $k$ items most similar to $i$ that user $u$ has rated, $b_{uj} = \mu + b_u + b_j$ is the baseline estimate, and $w_{ij}$ are learned item-item weights.
This hybrid captured:
- Global patterns (latent factors): "Users who like action movies tend to like this"
- Local patterns (neighborhood): "Users who liked Inception also liked Interstellar"
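A sketch of that combined prediction; `hybrid_predict` and all inputs below are hypothetical stand-ins for parameters a real system would learn:

```python
import numpy as np

def hybrid_predict(mu, b_u, b_i, p_u, q_i, neighbors, w, r_u, baselines):
    """Latent-factor prediction plus a learned-neighborhood correction.

    neighbors: indices of the k items most similar to i that the user rated
    w:         learned item-item weights, indexed by item id
    r_u:       the user's ratings, indexed by item id
    baselines: baseline estimates b_uj = mu + b_u + b_j, indexed by item id
    """
    # Global pattern: biases + latent-factor interaction
    latent = mu + b_u + b_i + float(p_u @ q_i)
    if len(neighbors) == 0:
        return latent
    # Local pattern: weighted residuals of the neighboring items,
    # normalized by |R^k(i;u)|^(-1/2)
    residuals = r_u[neighbors] - baselines[neighbors]
    correction = (w[neighbors] * residuals).sum() / np.sqrt(len(neighbors))
    return latent + correction
```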
Innovation 5: Massive Ensembling
The final winning submission from BellKor's Pragmatic Chaos (a merger of BellKor, Pragmatic Theory, and BigChaos teams) combined hundreds of individual predictors:
BellKor's Pragmatic Chaos ensemble composition:
─────────────────────────────────────────────────────────────────────────
MATRIX FACTORIZATION VARIANTS:
- Standard SVD (multiple k values: 20, 50, 100, 200, 500, 2000)
- SVD++ with implicit feedback
- timeSVD++ with temporal dynamics
- Asymmetric factor models
RESTRICTED BOLTZMANN MACHINES:
- Standard RBM
- Conditional RBM (with temporal features)
- Factored RBM
NEIGHBORHOOD MODELS:
- Item-item collaborative filtering
- User-user collaborative filtering
- Pearson correlation variants
- Cosine similarity variants
k-NEAREST NEIGHBORS:
- Various k values and distance metrics
- With and without significance weighting
REGRESSION MODELS:
- Linear regression on movie/user features
- Gradient Boosted Decision Trees
BLENDING:
- Linear regression on predictor outputs
- Neural network stacking (single hidden layer)
- Gradient Boosted Trees for nonlinear combination
Total: 100+ individual models, yielding 800+ predictors after parameter variations
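The linear-regression blending at the bottom of that list is easy to demo with synthetic predictors (all data here is simulated, not Netflix's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
truth = rng.uniform(1, 5, n)            # held-out true ratings (simulated)

# Three hypothetical predictors with different error patterns
preds = np.column_stack([
    truth + rng.normal(0, 0.9, n),      # stand-in for "SVD"
    truth + rng.normal(0, 1.0, n),      # stand-in for "RBM"
    truth + rng.normal(0, 1.1, n),      # stand-in for "kNN"
])

# Linear-regression blend: least-squares weights on a probe set
X = np.column_stack([preds, np.ones(n)])  # add intercept column
weights, *_ = np.linalg.lstsq(X, truth, rcond=None)
blend = X @ weights

rmse = lambda p: np.sqrt(((p - truth) ** 2).mean())
```

Because the predictors make independent errors, the blend's RMSE on the probe set beats every individual predictor, which is the whole point of ensembling diverse models.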
The bitter lesson:
The winning ensemble was never fully deployed by Netflix. Here's what actually happened:
1. 2007 Progress Prize solution was deployed: The simpler SVD+RBM combination from the first Progress Prize was productionized by Netflix engineers, as it was already significantly better than Cinematch.
2. Grand Prize ensemble was too complex: Netflix evaluated the final ensemble offline but concluded "the additional accuracy gains did not seem to justify the engineering effort needed to bring them into a production environment."
3. Business context changed: By 2009, Netflix was shifting from DVD rentals to streaming. The rating prediction problem became less central than other recommendation challenges (what to show on the homepage, etc.).
The real value of the Prize:
- Academic papers: Koren's work on SVD++, timeSVD++, and matrix factorization became foundational—his "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009) has 45,000+ citations
- Talent pipeline: Many participants joined Netflix and other tech companies
- Proof of concept: Demonstrated that recommendations could be dramatically improved with ML
- Open research: Spurred massive investment in recommendation system research
What practitioners should take from the Netflix Prize:
1. Biases explain most variance: The $\mu + b_u + b_i$ terms often matter more than the latent factors $p_u \cdot q_i$. Always include them.
2. Implicit feedback is gold: What users interact with (even without explicit ratings) is hugely informative. Modern systems rely almost entirely on implicit signals.
3. Temporal dynamics matter: User preferences drift. Systems that don't account for time become stale.
4. Ensemble diversity beats ensemble size: RBM + SVD worked because they made different errors. 100 SVD variants wouldn't help as much.
5. Production ≠ Competition: The winning 800-predictor ensemble scored 10.06%, but a simple 3-model blend might achieve 9.5% and actually be deployable.
6. The final 1% is exponentially harder: Going from 0% to 8% took one team 1 year. Going from 8% to 10% took the entire community 2 more years.
Sources:
- Netflix Prize Wikipedia
- The BellKor Solution to the Netflix Grand Prize
- Matrix Factorization Techniques for Recommender Systems (Koren et al., 2009)
- Collaborative Filtering with Temporal Dynamics (Koren, KDD 2009)
```python
import torch
import torch.nn as nn


class MatrixFactorization(nn.Module):
    """
    Basic matrix factorization for collaborative filtering.
    R ≈ P × Q^T + biases
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        num_factors: int = 50,
    ):
        super().__init__()
        # User and item embeddings (latent factors)
        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)

        # Bias terms (important for accuracy)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

        # Initialize
        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        """
        Predict ratings for user-item pairs.

        Args:
            user_ids: (batch_size,)
            item_ids: (batch_size,)
        Returns:
            predictions: (batch_size,)
        """
        # Get embeddings
        user_emb = self.user_factors(user_ids)  # (B, k)
        item_emb = self.item_factors(item_ids)  # (B, k)

        # Dot product
        dot = (user_emb * item_emb).sum(dim=1)  # (B,)

        # Add biases
        user_b = self.user_bias(user_ids).squeeze(-1)  # (B,)
        item_b = self.item_bias(item_ids).squeeze(-1)  # (B,)

        return self.global_bias + user_b + item_b + dot


def train_mf_sgd(model, train_data, epochs=20, lr=0.01, reg=0.02):
    """
    Train matrix factorization with SGD.

    Args:
        train_data: List of (user_id, item_id, rating) tuples
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)

    for epoch in range(epochs):
        total_loss = 0
        for user_id, item_id, rating in train_data:
            user_id = torch.tensor([user_id])
            item_id = torch.tensor([item_id])
            rating = torch.tensor([rating], dtype=torch.float)

            # Forward
            pred = model(user_id, item_id)

            # MSE loss
            loss = (pred - rating) ** 2

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(train_data):.4f}")
```
```python
import numpy as np


# Alternating Least Squares (ALS) - Better for implicit feedback
class ALS:
    """
    Alternating Least Squares for matrix factorization
    (implicit-feedback variant of Hu, Koren & Volinsky, 2008).
    Better for large-scale implicit feedback (views, clicks).

    Key idea: Fix users, solve for items (closed-form).
    Then fix items, solve for users. Repeat.
    """

    def __init__(self, num_factors: int = 50, reg: float = 0.01, iterations: int = 15):
        self.k = num_factors
        self.reg = reg
        self.iterations = iterations

    def fit(self, interactions: np.ndarray):
        """
        Args:
            interactions: (num_users, num_items) binary or count matrix
        """
        num_users, num_items = interactions.shape

        # Initialize factors randomly
        self.user_factors = np.random.normal(0, 0.01, (num_users, self.k))
        self.item_factors = np.random.normal(0, 0.01, (num_items, self.k))

        # Binary preference matrix: did the user interact at all?
        self.preferences = (interactions > 0).astype(float)

        # Confidence matrix (for implicit feedback)
        # Higher counts = higher confidence
        C = 1 + 40 * interactions  # Confidence scaling (alpha = 40)

        for iteration in range(self.iterations):
            # Fix items, solve for users
            self._solve_users(C)
            # Fix users, solve for items
            self._solve_items(C)
            print(f"Iteration {iteration}: Loss = {self._compute_loss(C):.4f}")

    def _solve_users(self, C):
        """Solve for user factors with items fixed."""
        Y = self.item_factors
        YtY = Y.T @ Y
        reg_I = self.reg * np.eye(self.k)

        for u in range(C.shape[0]):
            # Weighted least squares: (Y^T C_u Y + λI) p_u = Y^T C_u pref_u
            # Using Y^T C_u Y = Y^T Y + Y^T (C_u - I) Y to avoid double-counting
            A = YtY + Y.T @ np.diag(C[u] - 1) @ Y + reg_I
            b = Y.T @ (C[u] * self.preferences[u])
            self.user_factors[u] = np.linalg.solve(A, b)

    def _solve_items(self, C):
        """Solve for item factors with users fixed."""
        X = self.user_factors
        XtX = X.T @ X
        reg_I = self.reg * np.eye(self.k)

        for i in range(C.shape[1]):
            A = XtX + X.T @ np.diag(C[:, i] - 1) @ X + reg_I
            b = X.T @ (C[:, i] * self.preferences[:, i])
            self.item_factors[i] = np.linalg.solve(A, b)

    def _compute_loss(self, C):
        """Confidence-weighted squared error on binary preferences."""
        pred = self.user_factors @ self.item_factors.T
        diff = self.preferences - pred
        return np.sum(C * diff ** 2)
```
SVD++: The Netflix Prize Winning Algorithm
SVD++ extends basic matrix factorization by incorporating implicit feedback—the fact that a user rated an item carries information beyond the rating value itself. This was one of the key innovations that helped win the Netflix Prize.
The core insight:
A user who rated 50 horror movies signals interest in horror, regardless of whether those ratings were 2 stars or 5 stars. SVD++ captures this by adding a term that sums up the "implicit factors" of all items a user has rated.
The prediction formula:

r̂_ui = μ + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j)
The implementation below follows Koren's original formulation from "Factorization Meets the Neighborhood" (KDD 2008):
import torch
import torch.nn as nn
import numpy as np
from collections import defaultdict
class SVDPlusPlus(nn.Module):
"""
SVD++ (Koren, KDD 2008) - The core algorithm behind the Netflix Prize win.
Key innovation: Combines explicit ratings with implicit feedback.
The mere fact that a user rated an item (regardless of rating value)
provides signal about user preferences.
Prediction formula:
r̂_ui = μ + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j)
Where:
- μ: global mean rating
- b_u: user bias
- b_i: item bias
- p_u: user latent factors (from explicit ratings)
- q_i: item latent factors
- y_j: implicit item factors (what rating item j reveals about user taste)
- I_u: set of items user u has rated (implicit signal)
"""
def __init__(
self,
num_users: int,
num_items: int,
num_factors: int = 50,
global_mean: float = 0.0,
):
super().__init__()
self.num_users = num_users
self.num_items = num_items
self.num_factors = num_factors
self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)
# Bias terms (μ + b_u + b_i explains most variance!)
self.user_bias = nn.Embedding(num_users, 1)
self.item_bias = nn.Embedding(num_items, 1)
# Explicit factor matrices (p_u and q_i)
self.user_factors = nn.Embedding(num_users, num_factors) # p_u
self.item_factors = nn.Embedding(num_items, num_factors) # q_i
# IMPLICIT factor matrix (y_j) - THE KEY SVD++ INNOVATION
# Each item has an implicit factor that captures "what kind of user rates this"
self.implicit_item_factors = nn.Embedding(num_items, num_factors) # y_j
# Initialize with small random values (important for convergence)
self._init_weights()
def _init_weights(self):
nn.init.zeros_(self.user_bias.weight)
nn.init.zeros_(self.item_bias.weight)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
nn.init.normal_(self.implicit_item_factors.weight, std=0.01)
def forward(
self,
user_ids: torch.Tensor,
item_ids: torch.Tensor,
user_item_lists: list[list[int]],
) -> torch.Tensor:
"""
Predict ratings for user-item pairs.
Args:
user_ids: (batch_size,) user indices
item_ids: (batch_size,) item indices
user_item_lists: For each user in batch, list of ALL items they've rated
This is the implicit feedback signal!
Returns:
predictions: (batch_size,) predicted ratings
"""
batch_size = user_ids.shape[0]
# Biases
b_u = self.user_bias(user_ids).squeeze(-1) # (B,)
b_i = self.item_bias(item_ids).squeeze(-1) # (B,)
# Explicit user/item factors
p_u = self.user_factors(user_ids) # (B, k)
q_i = self.item_factors(item_ids) # (B, k)
# IMPLICIT FEEDBACK TERM: Σ y_j for all items user has rated
# This is what makes SVD++ different from basic SVD!
implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)
for idx, items_rated in enumerate(user_item_lists):
if len(items_rated) > 0:
# Get implicit factors for all items this user rated
items_tensor = torch.tensor(items_rated, device=user_ids.device)
y_j = self.implicit_item_factors(items_tensor) # (num_rated, k)
# Sum and normalize by sqrt(|I_u|) to prevent bias toward active users
normalization = 1.0 / np.sqrt(len(items_rated))
implicit_sum[idx] = y_j.sum(dim=0) * normalization
# Combine explicit and implicit: p_u + |I_u|^(-1/2) Σ y_j
user_representation = p_u + implicit_sum # (B, k)
# Final prediction: μ + b_u + b_i + q_i^T (p_u + implicit)
interaction = (q_i * user_representation).sum(dim=-1) # (B,)
prediction = self.global_mean + b_u + b_i + interaction
return prediction
def train_svdpp(
model: SVDPlusPlus,
train_data: list[tuple], # List of (user_id, item_id, rating)
user_items: dict[int, list[int]], # user_id -> list of item_ids they rated
epochs: int = 20,
lr: float = 0.007,
reg: float = 0.02,
):
"""
Train SVD++ using SGD (following Koren's original approach).
Args:
model: SVD++ model
train_data: List of (user_id, item_id, rating) tuples
user_items: Dict mapping user_id to list of all items they've rated
epochs: Number of training epochs
lr: Learning rate (Koren used 0.007)
reg: Regularization strength (Koren used 0.02)
"""
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)
for epoch in range(epochs):
total_loss = 0.0
np.random.shuffle(train_data)
for user_id, item_id, rating in train_data:
# Get implicit feedback for this user
items_rated = user_items.get(user_id, [])
# Convert to tensors
user_tensor = torch.tensor([user_id])
item_tensor = torch.tensor([item_id])
rating_tensor = torch.tensor([rating], dtype=torch.float32)
# Forward pass
pred = model(user_tensor, item_tensor, [items_rated])
# MSE loss
loss = (pred - rating_tensor) ** 2
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
rmse = np.sqrt(total_loss / len(train_data))
print(f"Epoch {epoch + 1}/{epochs}: RMSE = {rmse:.4f}")
# Example usage:
"""
# Prepare data
train_data = [(0, 5, 4.0), (0, 10, 3.5), (1, 5, 5.0), ...] # (user, item, rating)
# Build user_items dict for implicit feedback
user_items = defaultdict(list)
for user_id, item_id, _ in train_data:
user_items[user_id].append(item_id)
# Compute global mean
global_mean = np.mean([r for _, _, r in train_data])
# Initialize and train
model = SVDPlusPlus(
num_users=10000,
num_items=5000,
num_factors=50,
global_mean=global_mean,
)
train_svdpp(model, train_data, user_items, epochs=20)
"""
SVD++ Training: The Exact Gradient Derivation
Understanding exactly how SVD++ learns is crucial. Here we derive the gradients from first principles—this is what Koren did in his original implementation.
The Loss Function (Regularized Squared Error):

For a single observed rating (u, i, r_ui), the loss is:

L = (r_ui − r̂_ui)² + λ₁(b_u² + b_i²) + λ₂(‖p_u‖² + ‖q_i‖² + Σ_{j∈I_u} ‖y_j‖²)

Where:
- First term: Prediction error (what we want to minimize)
- Second term: Regularization on biases (prevents extreme user/item biases)
- Third term: Regularization on latent factors (prevents overfitting)

Define the prediction error:

e_ui = r_ui − r̂_ui

Gradient Derivation for Each Parameter:

Taking partial derivatives of L with respect to each parameter (dropping the constant factor of 2, which is absorbed into the learning rate):

1. User bias gradient: ∂L/∂b_u = −e_ui + λ₁ b_u
2. Item bias gradient: ∂L/∂b_i = −e_ui + λ₁ b_i
3. User factor gradient: ∂L/∂p_u = −e_ui · q_i + λ₂ p_u
4. Item factor gradient: ∂L/∂q_i = −e_ui · (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j) + λ₂ q_i
5. Implicit factor gradient (for each item j ∈ I_u that user u rated): ∂L/∂y_j = −e_ui · |I_u|^(-1/2) · q_i + λ₂ y_j

The SGD Update Rules:

Using gradient descent with learning rate γ, we update parameters in the opposite direction of the gradient:
For each observed rating (u, i, r_ui):
# 1. Compute prediction
implicit_sum = (1/√|I_u|) * Σ_{j ∈ I_u} y_j
r̂_ui = μ + b_u + b_i + q_i · (p_u + implicit_sum)
# 2. Compute error
e_ui = r_ui - r̂_ui
# 3. Update biases (move toward reducing error)
b_u ← b_u + γ · (e_ui - λ₁ · b_u)
b_i ← b_i + γ · (e_ui - λ₁ · b_i)
# 4. Update user factors
p_u ← p_u + γ · (e_ui · q_i - λ₂ · p_u)
# 5. Update item factors (note: uses p_u + implicit, not just p_u)
q_i ← q_i + γ · (e_ui · (p_u + implicit_sum) - λ₂ · q_i)
# 6. Update ALL implicit factors for items this user rated
for j in I_u:
y_j ← y_j + γ · (e_ui · (1/√|I_u|) · q_i - λ₂ · y_j)
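These update rules can be exercised numerically. A toy single-step check (illustrative values, not Netflix data):

```python
import numpy as np

# Toy parameters
k = 3
gamma, lam1, lam2 = 0.007, 0.005, 0.015
mu = 3.6
b_u, b_i = 0.1, -0.05
p_u = np.full(k, 0.01)
q_i = np.full(k, 0.02)
I_u = [0, 1]                                  # items this user has rated
y = {0: np.full(k, 0.01), 1: np.full(k, -0.01)}

def predict(b_u, b_i, p_u, q_i, y):
    norm = 1.0 / np.sqrt(len(I_u))
    implicit = norm * sum(y[j] for j in I_u)
    return mu + b_u + b_i + q_i @ (p_u + implicit), implicit, norm

pred, implicit, norm = predict(b_u, b_i, p_u, q_i, y)
e = 5.0 - pred                                # observed rating: 5 stars

# One SGD step mirroring the rules above (all updates use pre-step values)
b_u2 = b_u + gamma * (e - lam1 * b_u)
b_i2 = b_i + gamma * (e - lam1 * b_i)
p_u2 = p_u + gamma * (e * q_i - lam2 * p_u)
q_i2 = q_i + gamma * (e * (p_u + implicit) - lam2 * q_i)
y2 = {j: y[j] + gamma * (e * norm * q_i - lam2 * y[j]) for j in I_u}

pred2, _, _ = predict(b_u2, b_i2, p_u2, q_i2, y2)
```

After one step, the new prediction moves toward the observed rating, so the absolute error shrinks.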
Why updating y_j is expensive:

For each training example, we must update y_j for every item the user has rated. If a user rated 100 movies, that's 100 vector updates per training example. This makes SVD++ significantly slower than basic SVD (roughly 3-5x slower in practice).
Koren's Exact Hyperparameters (from the Netflix Prize):
| Parameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | γ | 0.007 | For factors p_u, q_i, y_j |
| Bias learning rate | γ₁ | 0.005 | For biases b_u, b_i |
| Factor regularization | λ₂ | 0.015 | Prevents factor overfitting |
| Bias regularization | λ₁ | 0.005 | Lower than factors |
| Learning rate decay | - | 0.9× per epoch | Multiply by 0.9 after each epoch |
| Number of epochs | - | 20-30 | More epochs with decay |
| Number of factors | k | 50-200 | 200 factors for final submission |
The Learning Rate Decay Schedule:
Koren found that decaying the learning rate was crucial for convergence:
def train_svdpp_with_decay(
model,
train_data,
user_items,
epochs: int = 30,
initial_lr: float = 0.007,
lr_decay: float = 0.9,
reg_factors: float = 0.015,
reg_biases: float = 0.005,
):
"""
Train SVD++ with Koren's exact hyperparameters and learning rate decay.
"""
current_lr = initial_lr
for epoch in range(epochs):
total_se = 0.0 # Squared error sum
# Shuffle training data each epoch (important for SGD!)
np.random.shuffle(train_data)
for user_id, item_id, rating in train_data:
items_rated = user_items[user_id]
num_rated = len(items_rated)
if num_rated == 0:
continue
# Compute implicit sum
norm_factor = 1.0 / np.sqrt(num_rated)
implicit_sum = norm_factor * sum(model.y[j] for j in items_rated)
# Prediction
pred = (model.global_mean +
model.b_u[user_id] +
model.b_i[item_id] +
np.dot(model.q[item_id], model.p[user_id] + implicit_sum))
# Clamp prediction to valid range
pred = max(1.0, min(5.0, pred))
# Error
err = rating - pred
total_se += err ** 2
# Update biases
model.b_u[user_id] += current_lr * (err - reg_biases * model.b_u[user_id])
model.b_i[item_id] += current_lr * (err - reg_biases * model.b_i[item_id])
# Cache for efficiency
user_implicit = model.p[user_id] + implicit_sum
# Update user factors
model.p[user_id] += current_lr * (err * model.q[item_id] - reg_factors * model.p[user_id])
# Update item factors
model.q[item_id] += current_lr * (err * user_implicit - reg_factors * model.q[item_id])
# Update implicit factors (expensive!)
for j in items_rated:
model.y[j] += current_lr * (err * norm_factor * model.q[item_id] - reg_factors * model.y[j])
# Compute RMSE
rmse = np.sqrt(total_se / len(train_data))
print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}, LR = {current_lr:.6f}")
# Decay learning rate
current_lr *= lr_decay
Convergence Behavior:
Typical training progression on Netflix data:
─────────────────────────────────────────────────────────────────────────
Epoch 1: RMSE = 0.9234 LR = 0.007000 (big initial improvement)
Epoch 2: RMSE = 0.9089 LR = 0.006300
Epoch 3: RMSE = 0.9012 LR = 0.005670
Epoch 5: RMSE = 0.8967 LR = 0.004592
Epoch 10: RMSE = 0.8923 LR = 0.002824 (slowing down)
Epoch 15: RMSE = 0.8911 LR = 0.001736
Epoch 20: RMSE = 0.8907 LR = 0.001067
Epoch 25: RMSE = 0.8905 LR = 0.000656 (nearly converged)
Epoch 30: RMSE = 0.8904 LR = 0.000403
Final RMSE ≈ 0.8904 (on training set)
Probe RMSE ≈ 0.8911 (on held-out validation)
Why the Gradient Updates Work:
Let's trace through one example intuitively:
Scenario: User Alice (u=42) rates movie "Inception" (i=1701) with 5 stars.
Alice has previously rated: [Matrix, Interstellar, Blade Runner] (all sci-fi)
Our model predicts: 4.2 stars
Error: e = 5.0 - 4.2 = +0.8 (we under-predicted!)
Updates (with γ=0.007):
1. b_u (Alice's bias) goes UP by 0.007 × 0.8 = +0.0056
→ "Alice tends to rate higher than we thought"
2. b_i (Inception's bias) goes UP by +0.0056
→ "Inception is better than we thought"
3. p_u (Alice's taste vector) moves TOWARD q_inception
→ "Alice likes what Inception represents"
4. q_i (Inception's attribute vector) moves TOWARD (p_alice + implicit)
→ "Inception matches users like Alice"
5. For each j in [Matrix, Interstellar, Blade Runner]:
y_j moves TOWARD q_inception (scaled by 1/√3)
→ "Users who rate Matrix/Interstellar/Blade Runner also like Inception"
The implicit feedback update (step 5) is the key insight: we're learning that the act of rating sci-fi movies predicts liking Inception, even before we look at the rating values.
timeSVD++: Adding Temporal Dynamics
timeSVD++ extends SVD++ by making parameters time-dependent. User preferences drift over time, and movie perceptions change as they age. This was described in Koren's "Collaborative Filtering with Temporal Dynamics" (KDD 2009).
Key discoveries from the Netflix data:
- Rating scale shift (early 2004): Mean ratings jumped from ~3.4 to ~3.6 stars when Netflix changed their UI
- Movie aging effect: Older movies consistently receive higher ratings
- User preference drift: Individual tastes evolve over months/years
The time-dependent prediction formula:

r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + |I_u|^(-1/2) Σ_{j∈I_u} y_j)

Where the time-dependent terms are:

b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_i(t) = b_i + b_{i,Bin(t)}
p_u(t) = p_u + α_u · dev_u(t)
dev_u(t) = sign(t − t̄_u) · |t − t̄_u|^β
class TimeSVDPlusPlus(nn.Module):
"""
timeSVD++ (Koren, KDD 2009) - The full Netflix Prize winning model.
Extends SVD++ with temporal dynamics:
- User biases drift over time
- Item biases change as movies age
- User preferences evolve
A timeSVD++ model with 10 factors outperforms standard SVD with 200 factors!
Time-dependent prediction:
r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + |I_u|^(-1/2) Σ y_j)
Where:
b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_i(t) = b_i + b_{i,Bin(t)}
p_u(t) = p_u + α_u · dev_u(t)
"""
def __init__(
self,
num_users: int,
num_items: int,
num_factors: int = 50,
num_time_bins: int = 30, # e.g., 30 bins for monthly granularity
global_mean: float = 0.0,
beta: float = 0.4, # Power for dev function (Koren used 0.4)
):
super().__init__()
self.num_factors = num_factors
self.num_time_bins = num_time_bins
self.beta = beta
self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)
# Static biases and factors (same as SVD++)
self.user_bias = nn.Embedding(num_users, 1)
self.item_bias = nn.Embedding(num_items, 1)
self.user_factors = nn.Embedding(num_users, num_factors)
self.item_factors = nn.Embedding(num_items, num_factors)
self.implicit_item_factors = nn.Embedding(num_items, num_factors)
# TEMPORAL COMPONENTS - THE timeSVD++ INNOVATION
# α_u: User-specific drift rate (some users' tastes change faster)
self.user_alpha = nn.Embedding(num_users, 1)
# b_{u,Bin(t)}: User bias per time bin
self.user_bias_time = nn.Embedding(num_users * num_time_bins, 1)
# b_{i,Bin(t)}: Item bias per time bin (movies perceived differently over time)
self.item_bias_time = nn.Embedding(num_items * num_time_bins, 1)
# User factor drift (p_u changes over time)
self.user_factors_alpha = nn.Embedding(num_users, num_factors)
# Store user mean times (computed from training data)
self.register_buffer('user_mean_time', torch.zeros(num_users))
self._init_weights()
def _init_weights(self):
nn.init.zeros_(self.user_bias.weight)
nn.init.zeros_(self.item_bias.weight)
nn.init.zeros_(self.user_alpha.weight)
nn.init.zeros_(self.user_bias_time.weight)
nn.init.zeros_(self.item_bias_time.weight)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
nn.init.normal_(self.implicit_item_factors.weight, std=0.01)
nn.init.normal_(self.user_factors_alpha.weight, std=0.01)
def compute_dev(self, user_ids: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
"""
Compute dev_u(t) = sign(t - t̄_u) · |t - t̄_u|^β
This captures how far a rating is from the user's "typical" time.
Users who rate early vs late in their Netflix membership differ.
"""
mean_times = self.user_mean_time[user_ids]
diff = times - mean_times
sign = torch.sign(diff)
dev = sign * (torch.abs(diff) ** self.beta)
return dev
def forward(
self,
user_ids: torch.Tensor,
item_ids: torch.Tensor,
times: torch.Tensor, # Normalized time values (0 to num_time_bins-1)
time_bins: torch.Tensor, # Discretized time bin indices
user_item_lists: list[list[int]],
) -> torch.Tensor:
"""
Predict ratings with temporal dynamics.
Args:
user_ids: (batch_size,) user indices
item_ids: (batch_size,) item indices
times: (batch_size,) continuous time values
time_bins: (batch_size,) discretized time bin indices
user_item_lists: For each user, list of items they've rated
Returns:
predictions: (batch_size,) predicted ratings
"""
batch_size = user_ids.shape[0]
# === STATIC COMPONENTS (same as SVD++) ===
b_u_static = self.user_bias(user_ids).squeeze(-1)
b_i_static = self.item_bias(item_ids).squeeze(-1)
p_u_static = self.user_factors(user_ids)
q_i = self.item_factors(item_ids)
# === TEMPORAL BIAS COMPONENTS ===
# dev_u(t) for this batch
dev = self.compute_dev(user_ids, times) # (B,)
# α_u · dev_u(t) - user-specific drift
alpha_u = self.user_alpha(user_ids).squeeze(-1) # (B,)
user_drift = alpha_u * dev # (B,)
# b_{u,Bin(t)} - user bias for this time bin
user_time_indices = user_ids * self.num_time_bins + time_bins
b_u_time = self.user_bias_time(user_time_indices).squeeze(-1) # (B,)
# b_{i,Bin(t)} - item bias for this time bin
item_time_indices = item_ids * self.num_time_bins + time_bins
b_i_time = self.item_bias_time(item_time_indices).squeeze(-1) # (B,)
# Full time-dependent biases:
# b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
# b_i(t) = b_i + b_{i,Bin(t)}
b_u_t = b_u_static + user_drift + b_u_time
b_i_t = b_i_static + b_i_time
# === TEMPORAL USER FACTORS ===
# p_u(t) = p_u + α_u · dev_u(t)
p_u_alpha = self.user_factors_alpha(user_ids) # (B, k)
p_u_t = p_u_static + p_u_alpha * dev.unsqueeze(-1) # (B, k)
# === IMPLICIT FEEDBACK (same as SVD++) ===
implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)
for idx, items_rated in enumerate(user_item_lists):
if len(items_rated) > 0:
items_tensor = torch.tensor(items_rated, device=user_ids.device)
y_j = self.implicit_item_factors(items_tensor)
normalization = 1.0 / np.sqrt(len(items_rated))
implicit_sum[idx] = y_j.sum(dim=0) * normalization
# === FINAL PREDICTION ===
# r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + implicit)
user_representation = p_u_t + implicit_sum
interaction = (q_i * user_representation).sum(dim=-1)
prediction = self.global_mean + b_u_t + b_i_t + interaction
return prediction
def precompute_user_mean_times(
train_data: list[tuple], # (user_id, item_id, rating, time)
num_users: int,
) -> torch.Tensor:
"""
Compute t̄_u (mean rating time) for each user.
Required for the dev_u(t) function in timeSVD++.
"""
user_times = defaultdict(list)
for user_id, _, _, time in train_data:
user_times[user_id].append(time)
mean_times = torch.zeros(num_users)
for user_id, times in user_times.items():
mean_times[user_id] = np.mean(times)
return mean_times
# Example usage:
"""
# Training data now includes timestamps
train_data = [
(user_id, item_id, rating, timestamp),
...
]
# Convert timestamps to time bins (e.g., weeks or months)
def timestamp_to_bin(ts, min_ts, max_ts, num_bins=30):
normalized = (ts - min_ts) / (max_ts - min_ts)
return int(normalized * (num_bins - 1))
# Precompute user mean times
user_mean_times = precompute_user_mean_times(train_data, num_users)
# Initialize model
model = TimeSVDPlusPlus(
num_users=480189, # Netflix had 480K users
num_items=17770, # Netflix had 17.7K movies
num_factors=50,
num_time_bins=30,
global_mean=3.6,
)
model.user_mean_time = user_mean_times
# Training loop similar to SVD++ but with time information
"""
Performance comparison on Netflix:
| Model | Factors | RMSE | Notes |
|---|---|---|---|
| SVD (baseline) | 200 | 0.8914 | No implicit, no time |
| SVD++ | 200 | 0.8911 | With implicit feedback |
| timeSVD++ | 10 | < 0.8914 | 10 factors beats 200! |
| timeSVD++ | 200 | 0.8762 | Best single model |
The key insight: temporal dynamics are more important than model capacity. A small model that understands time beats a large model that doesn't.
BPR: Bayesian Personalized Ranking
Most real-world recommendation data is implicit: clicks, views, purchases, time spent—not explicit 1-5 star ratings. This fundamentally changes the problem. With explicit ratings, you predict "user U would rate item I 4.2 stars." With implicit feedback, you only know what users did interact with, not what they didn't like.
The key insight of BPR:
Don't predict absolute scores. Instead, learn to rank items correctly. If user U clicked on item I but not item J, then I should rank higher than J for user U. This pairwise comparison is the core of BPR.
The training approach:
- Sample a user
- Sample a positive item (user interacted with it)
- Sample a negative item (user didn't interact with it)
- Train the model so that score(u, i) > score(u, j)

The BPR loss function:

L_BPR = −Σ_{(u,i,j)} ln σ(x̂_ui − x̂_uj) + λ‖Θ‖²

Where σ is the sigmoid function and x̂_ui is the model's score for user u on item i. This loss pushes the model to score positive items higher than negative items, with a soft margin determined by the sigmoid.
Why BPR works better for implicit feedback:
- Point-wise losses fail: MSE on 0/1 labels treats all non-interactions as "dislike." But a user might love an item they just haven't seen yet.
- Pairwise comparison is natural: We're confident the user prefers clicked items over random ones. We're not confident they hate random ones.
- Ranking is what matters: In production, we show the top-k items. Getting the ranking right matters more than exact scores.
Negative sampling strategies:
The choice of negative samples dramatically affects learning:
- Uniform random: Simple but wastes training on easy negatives
- Popularity-based: Sample popular items as negatives (harder negatives)
- In-batch negatives: Use other users' positives as negatives (efficient)
- Hard negative mining: Find items the model incorrectly ranks high
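To make the strategies concrete, here is a minimal popularity-based sampler; the count^0.75 smoothing is a common heuristic borrowed from word2vec, and the names are illustrative:

```python
import numpy as np

def sample_negative(user_pos: set, item_counts: np.ndarray,
                    rng: np.random.Generator) -> int:
    """Draw a negative item with probability proportional to count^0.75,
    rejecting items the user has already interacted with."""
    probs = item_counts.astype(float) ** 0.75
    probs /= probs.sum()
    while True:
        item = int(rng.choice(len(item_counts), p=probs))
        if item not in user_pos:
            return item

rng = np.random.default_rng(0)
counts = np.array([100, 50, 10, 1])   # interactions per item
neg = sample_negative({0}, counts, rng)
```

Popular items make harder negatives: the model cannot rank them low by popularity alone, so it must learn genuine user-specific preferences.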
class BPR(nn.Module):
"""
Bayesian Personalized Ranking.
Optimizes: user prefers positive item over negative item.
Key insight: We don't need absolute scores, just relative ranking.
"""
def __init__(self, num_users: int, num_items: int, num_factors: int = 50):
super().__init__()
self.user_factors = nn.Embedding(num_users, num_factors)
self.item_factors = nn.Embedding(num_items, num_factors)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
def forward(self, user_ids, pos_item_ids, neg_item_ids):
"""
Compute BPR loss for triplets.
Args:
user_ids: (batch_size,)
pos_item_ids: (batch_size,) - items user interacted with
neg_item_ids: (batch_size,) - sampled negative items
"""
user_emb = self.user_factors(user_ids)
pos_emb = self.item_factors(pos_item_ids)
neg_emb = self.item_factors(neg_item_ids)
# Scores
pos_score = (user_emb * pos_emb).sum(dim=1)
neg_score = (user_emb * neg_emb).sum(dim=1)
# BPR loss: -log(sigmoid(pos - neg))
loss = -torch.log(torch.sigmoid(pos_score - neg_score) + 1e-8)
return loss.mean()
def recommend(self, user_id: int, n: int = 10) -> list[int]:
"""Get top-n items for user."""
user_emb = self.user_factors.weight[user_id] # (k,)
all_scores = self.item_factors.weight @ user_emb # (num_items,)
top_items = torch.topk(all_scores, n).indices.tolist()
return top_items
def train_bpr(model, train_interactions, epochs=20, lr=0.01):
"""
Train BPR with negative sampling.
Args:
train_interactions: List of (user_id, item_id) positive pairs
"""
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
num_items = model.item_factors.num_embeddings
# Build user->items dict for negative sampling
user_items = {}
for user, item in train_interactions:
if user not in user_items:
user_items[user] = set()
user_items[user].add(item)
for epoch in range(epochs):
total_loss = 0
np.random.shuffle(train_interactions)
for user, pos_item in train_interactions:
# Sample negative item (not in user's history)
neg_item = np.random.randint(0, num_items)
while neg_item in user_items[user]:
neg_item = np.random.randint(0, num_items)
user_t = torch.tensor([user])
pos_t = torch.tensor([pos_item])
neg_t = torch.tensor([neg_item])
loss = model(user_t, pos_t, neg_t)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch}: Loss = {total_loss / len(train_interactions):.4f}")
Matrix Factorization Legacy:
- Netflix Prize winner used an ensemble of 107 models, many MF-based
- Fundamental insight: Embeddings capture latent preferences
- This idea persists in all modern deep learning RecSys
Part III: Deep Learning for Recommendations
The Limitations of Matrix Factorization
Matrix factorization is linear: prediction = dot product of embeddings. It cannot capture complex, non-linear interactions between user preferences and item attributes.
MF limitation:
If user likes [action, comedy] and item is [action, romance],
MF computes: action*action + comedy*romance = positive (maybe wrong)
We need non-linear combinations:
"User likes action AND comedy together" (not just sum of factors)
Neural Collaborative Filtering (NCF)
Matrix factorization uses a simple dot product to combine user and item embeddings. But a dot product is a linear operation—it can only model linear interactions between latent factors. What if the true user-item relationship is nonlinear?
NCF (WWW 2017) answers this by replacing the dot product with a neural network. The key insight: let the network learn an arbitrary interaction function between user and item representations.
The architecture:
- Embedding layer: Map user ID → user embedding, item ID → item embedding (same as MF)
- Interaction layers: Instead of dot product, concatenate embeddings and pass through MLPs
- Output layer: Predict probability of interaction
Two interaction approaches combined:
NCF actually combines two sub-networks:
- GMF (Generalized Matrix Factorization): element-wise product of embeddings (like MF, but with learnable output weights)
- MLP: concatenate embeddings and pass through hidden layers

The final prediction combines both:

ŷ_ui = σ(h^T [φ_GMF ; φ_MLP])

where φ_GMF and φ_MLP are the two branches' outputs, [;] denotes concatenation, and h is a learned output layer.
Why this matters:
- Nonlinear interactions: A user might like action movies AND comedies, but hate action-comedies. Linear models can't capture this. Neural networks can.
- Automatic feature learning: MLPs learn what combinations of latent factors matter, instead of assuming simple dot product.
- Flexible capacity: Add more layers or neurons to increase model capacity.
The trade-off:
NCF is more expressive but also more prone to overfitting. Matrix factorization has an elegant closed-form interpretation; NCF is a black box. In practice, the gains from NCF over well-tuned MF are often modest (5-10%), and MF remains competitive for many applications.
class NCF(nn.Module):
"""
Neural Collaborative Filtering.
Replace dot product with learnable MLP.
Architecture:
- GMF branch: Generalized Matrix Factorization (element-wise product)
- MLP branch: Multi-layer perceptron for non-linear interactions
- Combine: Concatenate and predict
"""
def __init__(
self,
num_users: int,
num_items: int,
gmf_dim: int = 32,
mlp_dims: list[int] = [64, 32, 16],
):
super().__init__()
# GMF embeddings
self.gmf_user = nn.Embedding(num_users, gmf_dim)
self.gmf_item = nn.Embedding(num_items, gmf_dim)
# MLP embeddings (separate from GMF)
self.mlp_user = nn.Embedding(num_users, mlp_dims[0] // 2)
self.mlp_item = nn.Embedding(num_items, mlp_dims[0] // 2)
# MLP layers
layers = []
for i in range(len(mlp_dims) - 1):
layers.append(nn.Linear(mlp_dims[i], mlp_dims[i + 1]))
layers.append(nn.ReLU())
self.mlp = nn.Sequential(*layers)
# Final prediction layer
self.predict = nn.Linear(gmf_dim + mlp_dims[-1], 1)
self._init_weights()
def _init_weights(self):
for m in self.modules():
if isinstance(m, nn.Embedding):
nn.init.normal_(m.weight, std=0.01)
elif isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
nn.init.zeros_(m.bias)
def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
"""
Args:
user_ids: (batch_size,)
item_ids: (batch_size,)
Returns:
predictions: (batch_size,)
"""
# GMF branch: element-wise product
gmf_user_emb = self.gmf_user(user_ids)
gmf_item_emb = self.gmf_item(item_ids)
gmf_out = gmf_user_emb * gmf_item_emb # (B, gmf_dim)
# MLP branch: concatenate and transform
mlp_user_emb = self.mlp_user(user_ids)
mlp_item_emb = self.mlp_item(item_ids)
mlp_input = torch.cat([mlp_user_emb, mlp_item_emb], dim=1)
mlp_out = self.mlp(mlp_input) # (B, mlp_dims[-1])
# Combine GMF and MLP
combined = torch.cat([gmf_out, mlp_out], dim=1)
# Predict
pred = self.predict(combined).squeeze(-1)  # squeeze(-1) keeps batch dim even for B=1
return torch.sigmoid(pred) # For binary (click/no-click)
NCF Controversy: A 2020 RecSys paper showed that with proper hyperparameters, simple dot products can match NCF. The lesson: Don't overcomplicate unless you have evidence non-linearity helps.
Wide & Deep Learning
Google's Wide & Deep (2016) paper introduced one of the most influential architectures in industrial recommender systems. It powers Google Play app recommendations and has been adopted across the industry.
The core insight: Memorization vs Generalization
Recommendation systems face a fundamental tension:
-
Memorization: Learn specific patterns from historical data. "Users who installed Netflix also install Hulu." This captures rare but valuable co-occurrences.
-
Generalization: Learn broad patterns that apply to new situations. "Users who like streaming apps might like any new streaming app." This handles new items and sparse data.
Linear models (like logistic regression) are great at memorization through feature crosses, but require manual feature engineering. Deep neural networks automatically learn features and generalize well, but can over-generalize and miss important specific patterns.
Wide & Deep combines both:
┌─────────────────────────────────────────┐
│ Wide & Deep │
│ │
│ Wide (Linear) Deep (DNN) │
│ ↓ ↓ │
│ Memorization Generalization │
│ ↓ ↓ │
│ └───────┬───────┘ │
│ ↓ │
│ Combined │
│ Prediction │
└─────────────────────────────────────────┘
The Wide component:

A linear model on raw and cross-product features:

y_wide = w^T [x, φ(x)] + b

Where the φ_k(x) are cross-product transformations like:

- AND(installed_app=netflix, impression_app=hulu)
- AND(user_country=US, app_language=english)

These must be manually engineered based on domain knowledge.
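A cross-product feature is just a logical AND over field conditions; a minimal encoder might look like this (field names illustrative):

```python
def cross_feature(example: dict, conditions: dict) -> int:
    """Cross-product feature: 1 iff every (field, value) condition holds."""
    return int(all(example.get(field) == value
                   for field, value in conditions.items()))

ex = {"installed_app": "netflix", "impression_app": "hulu", "user_country": "US"}
fired = cross_feature(ex, {"installed_app": "netflix", "impression_app": "hulu"})
missed = cross_feature(ex, {"installed_app": "spotify"})
```

In production, each such cross becomes one sparse input dimension for the wide linear model, so the model can memorize that exact co-occurrence.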
The Deep component:
A standard feedforward network on dense embeddings, where each hidden layer computes a^(l+1) = f(W^(l) a^(l) + b^(l)) with activation f (typically ReLU).
Categorical features are embedded; continuous features are normalized. The network automatically learns feature interactions.
The combined prediction:

P(Y = 1 | x) = σ(w_wide^T [x, φ(x)] + w_deep^T a^(l_f) + b)

Both components are trained jointly end-to-end.
Why this works in practice:
- Best of both worlds: Wide captures specific user-item affinities; deep generalizes to new combinations
- Handles cold start: New items with features but no history can get recommendations through the deep component
- Production-friendly: The wide component can use any feature engineering you already have
class WideAndDeep(nn.Module):
"""
Wide & Deep Learning for Recommendations.
Wide component: Linear model for memorization
- Captures specific patterns: "users who bought X also bought Y"
Deep component: DNN for generalization
- Captures general patterns: "users who like action movies"
"""
def __init__(
self,
num_wide_features: int,
num_deep_features: int,
deep_dims: list[int] = [256, 128, 64],
):
super().__init__()
# Wide: Linear model on cross-product features
self.wide = nn.Linear(num_wide_features, 1)
# Deep: MLP on dense features
layers = [nn.Linear(num_deep_features, deep_dims[0]), nn.ReLU()]
for i in range(len(deep_dims) - 1):
layers.append(nn.Linear(deep_dims[i], deep_dims[i + 1]))
layers.append(nn.ReLU())
layers.append(nn.Linear(deep_dims[-1], 1))
self.deep = nn.Sequential(*layers)
def forward(self, wide_features: torch.Tensor, deep_features: torch.Tensor):
"""
Args:
wide_features: (B, num_wide_features) - sparse cross-product features
deep_features: (B, num_deep_features) - dense embeddings
"""
wide_out = self.wide(wide_features)
deep_out = self.deep(deep_features)
return torch.sigmoid(wide_out + deep_out)
Wide & Deep at Google:
- Used for Google Play app recommendations
- Wide: Manually engineered cross-features (e.g., app × installed_apps)
- Deep: Embeddings for categorical features
Part IV: Feature Interaction Models
DeepFM: Automating Feature Crosses
Wide & Deep's limitation: you must manually engineer the cross features for the wide component. This requires domain expertise and experimentation. What if we could automatically learn all useful feature interactions?
DeepFM (IJCAI 2017) solves this by replacing Wide & Deep's manual crosses with Factorization Machines (FM), which automatically model all pairwise feature interactions.
Understanding Factorization Machines:
FM models all second-order feature interactions:

ŷ_FM(x) = w₀ + Σᵢ wᵢ xᵢ + Σᵢ Σ_{j>i} ⟨vᵢ, vⱼ⟩ xᵢ xⱼ

Where:
- wᵢ are first-order weights (like linear regression)
- vᵢ is the latent vector for feature i
- ⟨vᵢ, vⱼ⟩ is the interaction strength between features i and j

The key insight: instead of learning interaction weights w_ij directly (intractable for sparse data), learn latent vectors and compute interactions via dot products. This makes sparse interactions learnable.
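One detail worth knowing (from Rendle's original FM paper): the pairwise sum looks O(n²) but reduces to O(kn) via the identity Σ_{i<j} ⟨vᵢ, vⱼ⟩xᵢxⱼ = ½ Σ_f [(Σᵢ v_{if} xᵢ)² − Σᵢ v_{if}² xᵢ²]. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
V = rng.normal(size=(n, k))   # latent vectors v_i
x = rng.normal(size=n)        # feature values

# Naive O(n^2 k) pairwise sum
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Linear-time identity: 0.5 * (sum-of-squares trick per latent dimension)
vx = V * x[:, None]                                       # (n, k)
fast = 0.5 * ((vx.sum(axis=0) ** 2 - (vx ** 2).sum(axis=0)).sum())

assert np.isclose(naive, fast)
```

This is the form DeepFM implements in practice; without it, FM over millions of sparse features would be infeasible.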
The DeepFM architecture:
┌────────────────────────────────────────┐
│ DeepFM │
│ │
│ ┌─────────┐ ┌─────────────┐ │
│ │ FM │ │ DNN │ │
│ │ (auto │ │ (high-order │ │
│ │ 2nd │ │ interactions│ │
│ │ order) │ │ ) │ │
│ └────┬────┘ └──────┬──────┘ │
│ │ │ │
│ └──────────┬──────────┘ │
│ ↓ │
│ Final Prediction │
└────────────────────────────────────────┘
FM component captures all 2nd-order interactions automatically:
- user_age × item_category
- user_country × item_language
- hour_of_day × item_type
- ... all pairs, without manual feature engineering
DNN component captures higher-order interactions through deep layers.
Why DeepFM beats Wide & Deep:
- No manual feature engineering: FM automatically learns which pairs interact
- Shared embeddings: FM and DNN share the same embedding layer, reducing parameters and improving learning
- End-to-end learning: All interactions are learned jointly
The math:

$$\hat{y} = \sigma(y_{FM} + y_{DNN})$$

Where $y_{FM}$ is the FM output and $y_{DNN}$ is the deep network output.
class DeepFM(nn.Module):
"""
DeepFM: FM + DNN for automatic feature interactions.
FM component: 2nd-order feature interactions (efficient)
DNN component: Higher-order feature interactions
"""
def __init__(
self,
field_dims: list[int], # Number of unique values per field
embedding_dim: int = 16,
mlp_dims: list[int] = [256, 128, 64],
):
super().__init__()
self.num_fields = len(field_dims)
self.embedding_dim = embedding_dim
# Embedding tables for each field
self.embeddings = nn.ModuleList([
nn.Embedding(dim, embedding_dim)
for dim in field_dims
])
# First-order weights
self.first_order = nn.ModuleList([
nn.Embedding(dim, 1)
for dim in field_dims
])
# Global bias
self.bias = nn.Parameter(torch.zeros(1))
# DNN
dnn_input_dim = self.num_fields * embedding_dim
layers = [nn.Linear(dnn_input_dim, mlp_dims[0]), nn.ReLU(), nn.Dropout(0.2)]
for i in range(len(mlp_dims) - 1):
layers.extend([
nn.Linear(mlp_dims[i], mlp_dims[i + 1]),
nn.ReLU(),
nn.Dropout(0.2)
])
layers.append(nn.Linear(mlp_dims[-1], 1))
self.dnn = nn.Sequential(*layers)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: (B, num_fields) - field indices
Returns:
predictions: (B,)
"""
# Get embeddings for each field
emb_list = [
self.embeddings[i](x[:, i])
for i in range(self.num_fields)
]
emb_stack = torch.stack(emb_list, dim=1) # (B, num_fields, emb_dim)
# First-order term
first_order = sum(
    self.first_order[i](x[:, i]).squeeze(-1)  # (B,)
    for i in range(self.num_fields)
)
# FM second-order: (sum of embeddings)^2 - sum of (embeddings^2)
# This efficiently computes all pairwise interactions
sum_of_emb = emb_stack.sum(dim=1) # (B, emb_dim)
sum_of_square = (emb_stack ** 2).sum(dim=1) # (B, emb_dim)
square_of_sum = sum_of_emb ** 2 # (B, emb_dim)
fm_second_order = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
# DNN
dnn_input = emb_stack.view(x.size(0), -1) # (B, num_fields * emb_dim)
dnn_out = self.dnn(dnn_input).squeeze(-1)
# Combine
output = self.bias + first_order + fm_second_order + dnn_out
return torch.sigmoid(output)
DCN: Deep & Cross Network
DeepFM automates 2nd-order feature interactions. But what about 3rd-order? 4th-order? Higher-order interactions can be crucial: "young male users in California who browse on mobile in the evening" is a 5-way interaction.
DCN (KDD 2017) introduces the Cross Network, which explicitly models feature crosses of bounded degree at each layer.
The key innovation: Cross layers
A cross layer computes:

$$x_{l+1} = x_0 \, (x_l^\top w_l) + b_l + x_l$$

Where:
- $x_0$ is the original input (always included!)
- $x_l$ is the output of layer $l$
- $x_l^\top w_l$ is a scalar, so the first term rescales $x_0$ element-wise
- The trailing $+\,x_l$ is a residual connection

Why this captures high-order interactions:
- Layer 1 captures 2nd-order interactions: $x_0(x_0^\top w_0)$ contains all pairs $x_{0,i}\,x_{0,j}$
- Layer 2 captures 3rd-order: $x_0(x_1^\top w_1)$, where $x_1$ already contains all pairs
- Layer $l$ captures up to $(l{+}1)$-order interactions

The beauty: each layer adds only $2d$ parameters, $w_l$ and $b_l$ (where $d$ is the input dimension), not $O(d^2)$. This is polynomial complexity in interaction order, not exponential.
DCN architecture:
┌────────────────────────────────────────┐
│ DCN │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Cross │ │ Deep │ │
│ │ Network │ │ Network │ │
│ │ (explicit│ │ (implicit │ │
│ │ crosses) │ │ crosses) │ │
│ └────┬─────┘ └──────┬───────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ↓ │
│ Concatenate & MLP │
│ ↓ │
│ Prediction │
└────────────────────────────────────────┘
Cross Network vs Deep Network:
| Aspect | Cross Network | Deep Network |
|---|---|---|
| Interactions | Explicit, bounded-degree | Implicit, learned |
| Parameters | Linear in depth | Quadratic in width |
| Interpretability | Can analyze which crosses matter | Black box |
| Expressiveness | All crosses up to degree L | Arbitrary nonlinear |
DCN-v2 (2020 improvement):
The original DCN uses a weight vector $w_l$ per layer. DCN-v2 uses a full matrix $W_l$ (in practice factored as a low-rank product $U_l V_l^\top$), making each cross layer a low-rank approximation of all pairwise interactions. This significantly improves performance.
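A minimal sketch of a DCN-v2-style cross layer with a low-rank factored weight matrix, written in NumPy for clarity (dimensions, rank, and initialization are illustrative):

```python
import numpy as np

def cross_layer_v2(x0, xl, U, V, b):
    """DCN-v2-style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl, W = U @ V.T."""
    return x0 * (U @ (V.T @ xl) + b) + xl  # elementwise product with x0

rng = np.random.default_rng(0)
d, r = 8, 2  # input dim, rank (r << d)
x0 = rng.normal(size=d)
U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
b = np.zeros(d)

x = x0
for _ in range(3):  # three stacked cross layers, each reusing x0
    x = cross_layer_v2(x0, x, U, V, b)

assert x.shape == (d,)
```

Each layer here costs $2dr + d$ parameters instead of the $d^2 + d$ of a full matrix, which is the low-rank trade-off the paper exploits.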
class CrossNetwork(nn.Module):
"""
Cross Network: Explicit bounded-degree feature crossing.
Each layer computes:
x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l
This captures l-th order feature interactions at layer l.
"""
def __init__(self, input_dim: int, num_layers: int = 3):
super().__init__()
self.num_layers = num_layers
self.weights = nn.ParameterList([
nn.Parameter(torch.randn(input_dim))
for _ in range(num_layers)
])
self.biases = nn.ParameterList([
nn.Parameter(torch.zeros(input_dim))
for _ in range(num_layers)
])
def forward(self, x0: torch.Tensor) -> torch.Tensor:
"""
Args:
x0: (B, input_dim) - initial features
Returns:
output: (B, input_dim) - crossed features
"""
x = x0
for i in range(self.num_layers):
# x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l
xw = (x * self.weights[i]).sum(dim=1, keepdim=True)
x = x0 * xw + self.biases[i] + x
return x
class DCN(nn.Module):
"""Deep & Cross Network."""
def __init__(
self,
input_dim: int,
cross_layers: int = 3,
deep_dims: list[int] = [256, 128, 64],
):
super().__init__()
self.cross_net = CrossNetwork(input_dim, cross_layers)
# Deep network
layers = [nn.Linear(input_dim, deep_dims[0]), nn.ReLU()]
for i in range(len(deep_dims) - 1):
layers.extend([nn.Linear(deep_dims[i], deep_dims[i + 1]), nn.ReLU()])
self.deep_net = nn.Sequential(*layers)
# Combine cross and deep
self.output = nn.Linear(input_dim + deep_dims[-1], 1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
cross_out = self.cross_net(x)
deep_out = self.deep_net(x)
combined = torch.cat([cross_out, deep_out], dim=1)
return torch.sigmoid(self.output(combined))
Part V: Two-Tower Models for Retrieval
The Retrieval Problem
At scale (millions of items), you can't score every item for every user. The solution: Two-tower models for efficient retrieval.
┌─────────────────────────────────────────────────────────────────────────┐
│ TWO-TOWER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHALLENGE: │
│ ─────────── │
│ Score all 10M items for each user? │
│ At 100M users × 10M items = 10^15 computations = IMPOSSIBLE │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: DECOUPLE USER AND ITEM ENCODING │
│ ─────────────────────────────────────────── │
│ │
│ USER TOWER ITEM TOWER │
│ ─────────── ─────────── │
│ User features → Item features → │
│ Neural network → Neural network → │
│ User embedding → Item embedding → │
│ ↓ ↓ │
│ └───────── DOT PRODUCT ─┘ │
│ ↓ │
│ Score │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY INSIGHT: │
│ ───────────── │
│ Item embeddings are INDEPENDENT of users │
│ → Pre-compute ALL item embeddings offline │
│ → Store in vector index (FAISS, ScaNN, HNSW) │
│ → At serving: Compute user embedding, find nearest items │
│ → Complexity: O(log N) instead of O(N) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
YouTube DNN
YouTube's Deep Neural Network (RecSys 2016) was foundational—perhaps the most influential industrial recommendation paper. It revealed how YouTube serves billions of recommendations daily with sub-100ms latency.
The core problem YouTube faced:
YouTube has over 800 million videos and 2 billion logged-in users monthly. Scoring every video for every user is computationally impossible (10^15+ operations). The solution: a two-stage funnel architecture.
Stage 1: CANDIDATE GENERATION (what we call "retrieval")
─────────────────────────────────────────────────────────
800M videos → Neural network → ~1000 candidates
Goal: High recall (don't miss relevant videos)
Latency budget: ~10ms
Stage 2: RANKING (what we call "scoring")
─────────────────────────────────────────────────────────
1000 candidates → Complex model → ~50 ranked videos
Goal: High precision (order correctly)
Latency budget: ~50ms
The candidate generation model (what this code implements):
The key insight is treating recommendation as an extreme multiclass classification problem. Given the user's context (watch history, search history, demographics), predict which video they'll watch next—out of millions of classes.
Input features for the user tower:
- Watch history: The user's last N watched videos. Instead of using a complex sequence model, YouTube simply averages the embeddings of watched videos. This works surprisingly well and is fast.
- Search history: Similarly averaged embeddings of search queries (tokenized and embedded).
- Demographics: Age bucket, gender, geographic region.
- Context features: Time since last watch (crucial for freshness), device type.
The "example age" trick:
YouTube discovered that neural networks learn to predict popular videos because they appear more often in training data. To counteract this freshness bias, they include the video's age (time since upload) as a feature during training. At serving time, they set this to zero—telling the model "assume all videos were just uploaded"—which dramatically improved recommendation freshness.
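A minimal sketch of the trick, with hypothetical timestamps and a made-up helper name:

```python
SECONDS_PER_DAY = 86_400

def example_age(now_ts: float, upload_ts: float, serving: bool = False) -> float:
    """Video age in days as a training feature; forced to zero at serving time."""
    if serving:
        return 0.0  # "pretend every video was just uploaded"
    return max(0.0, (now_ts - upload_ts) / SECONDS_PER_DAY)

# During training, the model sees the true age and can explain away
# popularity that is merely due to a video being older.
assert example_age(1_000 * SECONDS_PER_DAY, 990 * SECONDS_PER_DAY) == 10.0
# At serving time, the feature is zeroed out.
assert example_age(1.0, 0.0, serving=True) == 0.0
```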
Training with negative sampling:
With millions of classes, computing the full softmax is impossible. YouTube uses sampled softmax: for each positive example, sample a few thousand negative classes from the video corpus (weighted by popularity). The learned embedding space places similar users and items close together.
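The popularity-weighted negative sampling step can be sketched in plain Python (the corpus and play counts below are made up; production systems sample thousands of negatives per positive, not five):

```python
import random

# Hypothetical video corpus with play counts serving as popularity weights.
corpus = {"v1": 500, "v2": 300, "v3": 150, "v4": 50}

def sample_negatives(positive: str, n: int, rng: random.Random) -> list[str]:
    """Sample n negative classes, weighted by popularity, excluding the positive."""
    candidates = [v for v in corpus if v != positive]
    weights = [corpus[v] for v in candidates]
    return rng.choices(candidates, weights=weights, k=n)

rng = random.Random(42)
negs = sample_negatives("v1", 5, rng)
assert len(negs) == 5 and "v1" not in negs
```

The softmax is then computed only over the positive class plus these sampled negatives, with a correction term for the sampling probabilities.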
Why average pooling works:
Averaging embeddings loses order information but captures "interest topics." A user who watched [sports, sports, cooking] ends up in a similar embedding region to another sports+cooking fan. For coarse-grained retrieval, this is sufficient—ranking handles fine-grained preferences.
class YouTubeDNN(nn.Module):
"""
YouTube's Deep Neural Network for Recommendations.
User tower:
- Watch history (average of video embeddings)
- Search history
- Demographics
- Context (time, device)
Item tower:
- Video embeddings
"""
def __init__(
self,
num_videos: int,
embedding_dim: int = 256,
hidden_dims: list[int] = [1024, 512, 256],
):
super().__init__()
# Video embeddings (shared between history and candidates)
self.video_embedding = nn.Embedding(num_videos, embedding_dim)
# User tower
user_input_dim = embedding_dim * 2 + 32 # History + search + demographics
layers = [nn.Linear(user_input_dim, hidden_dims[0]), nn.ReLU()]
for i in range(len(hidden_dims) - 1):
layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.ReLU()])
self.user_tower = nn.Sequential(*layers)
# Demographics embedding
self.age_embedding = nn.Embedding(10, 16) # Age buckets
self.gender_embedding = nn.Embedding(3, 16)
def encode_user(
self,
watch_history: torch.Tensor, # (B, max_history)
watch_mask: torch.Tensor, # (B, max_history) - 1 for valid, 0 for padding
search_history: torch.Tensor, # (B, max_search)
search_mask: torch.Tensor,
age: torch.Tensor, # (B,)
gender: torch.Tensor, # (B,)
) -> torch.Tensor:
"""Encode user into embedding."""
# Average watch history embeddings
watch_emb = self.video_embedding(watch_history) # (B, H, D)
watch_emb = (watch_emb * watch_mask.unsqueeze(-1)).sum(dim=1)
watch_emb = watch_emb / watch_mask.sum(dim=1, keepdim=True).clamp(min=1)
# Average search history (simplified - could use separate embeddings)
search_emb = self.video_embedding(search_history)
search_emb = (search_emb * search_mask.unsqueeze(-1)).sum(dim=1)
search_emb = search_emb / search_mask.sum(dim=1, keepdim=True).clamp(min=1)
# Demographics
demo_emb = torch.cat([
self.age_embedding(age),
self.gender_embedding(gender)
], dim=1)
# Combine and transform
user_features = torch.cat([watch_emb, search_emb, demo_emb], dim=1)
user_embedding = self.user_tower(user_features)
return user_embedding
def encode_items(self, item_ids: torch.Tensor) -> torch.Tensor:
"""Encode items into embeddings."""
return self.video_embedding(item_ids)
def forward(self, user_embedding: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
"""Score user-item pairs."""
item_emb = self.encode_items(item_ids)
scores = (user_embedding.unsqueeze(1) * item_emb).sum(dim=-1)
return scores
DSSM: Deep Structured Semantic Models
DSSM (Microsoft, 2013) pioneered the two-tower approach for search and remains influential today. Understanding DSSM is essential because modern retrieval systems (including embedding-based search and RAG) are direct descendants of this architecture.
The problem DSSM solves:
Traditional search systems rely on lexical matching: the query "NY hotels" finds documents containing those exact words. But what about "New York accommodations" or "Manhattan lodging"? Users and documents can express the same concept with completely different words—the vocabulary mismatch problem.
DSSM's solution: Semantic matching in embedding space
Instead of matching words, learn to map queries and documents into a shared semantic space where similar meanings have similar vectors, regardless of exact wording:
Query: "NY hotels" Document: "Manhattan lodging options"
↓ ↓
Query Tower Doc Tower
↓ ↓
[0.8, 0.3, ...] ←cosine→ [0.75, 0.35, ...]
↓
High similarity!
The architecture in detail:
- Word hashing (character trigrams): Raw text is first converted to a bag of character trigrams. "hotel" → {"#ho", "hot", "ote", "tel", "el#"}. This reduces vocabulary from millions of words to ~30K trigrams, handles misspellings, and generalizes to unseen words.
- Query tower: Trigram features → Dense layer (30K→300) → Tanh → Dense (300→300) → Tanh → Dense (300→128). The final 128-dim vector is the query embedding.
- Document tower: Same architecture as the query tower (but separate parameters). Documents are also converted to 128-dim embeddings.
- Similarity scoring: Cosine similarity between the normalized query and document embeddings.
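The word-hashing step is simple enough to sketch in plain Python:

```python
def char_trigrams(word: str) -> set[str]:
    """DSSM word hashing: boundary-marked character trigrams."""
    marked = f"#{word}#"
    return {marked[i:i + 3] for i in range(len(marked) - 2)}

assert char_trigrams("hotel") == {"#ho", "hot", "ote", "tel", "el#"}
# A misspelling still shares most trigrams with the correct word,
# which is why word hashing is robust to typos:
assert len(char_trigrams("hotel") & char_trigrams("hotle")) >= 2
```

Each word becomes a sparse multi-hot vector over the ~30K trigram vocabulary, which is what the query and document towers consume.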
Why Tanh activation?
DSSM uses Tanh (not ReLU) throughout. The bounded output [-1, 1] helps with cosine similarity computation and provides more stable gradients. Later work showed ReLU can work too, but Tanh remains common in embedding models.
Training with contrastive loss:
For each query $Q$, we have one positive document $D^+$ (clicked) and several negative documents $D^-$ (not clicked). The loss maximizes:

$$P(D^+ \mid Q) = \frac{\exp(\gamma \cos(Q, D^+))}{\sum_{D' \in \{D^+\} \cup \{D^-\}} \exp(\gamma \cos(Q, D'))}$$

Where $\gamma$ is a temperature parameter (typically ~10). This softmax-style loss pushes the query embedding closer to positive documents and farther from negative documents.
Negative sampling strategies:
- Random negatives: Sample from the document corpus. Easy but may include false negatives.
- In-batch negatives: Use other queries' positive documents as negatives. Efficient and provides harder negatives.
- Hard negatives: Mine documents that are similar but not relevant (e.g., from the same topic but not answering the query).
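In-batch negatives can be sketched with plain Python: each query's positive document doubles as a negative for every other query, so a batch of B pairs yields a B×B similarity matrix trained with softmax cross-entropy on the diagonal (the similarity numbers below are toy values):

```python
import math

def in_batch_softmax_loss(sim: list[list[float]], temperature: float = 10.0) -> float:
    """Mean -log softmax of the diagonal of a (query x doc) similarity matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [temperature * s for s in row]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / len(sim)

# Toy cosines: diagonal = matched pairs, off-diagonal = in-batch negatives.
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.1, 0.4, 0.7]]
loss = in_batch_softmax_loss(sim)
well = in_batch_softmax_loss([[0.99, -0.9], [-0.9, 0.99]])
assert well < loss  # better-separated embeddings -> lower loss
```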
Why DSSM is still relevant:
Modern sentence transformers, dense retrievers (DPR, Contriever), and embedding APIs (OpenAI, Cohere) all use the same fundamental architecture: encode query and document with separate towers, score via similarity, train with contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DSSM(nn.Module):
"""
Deep Structured Semantic Model.
Originally for query-document matching, adapted for recommendations.
"""
def __init__(
self,
query_input_dim: int,
doc_input_dim: int,
hidden_dims: list[int] = [300, 300, 128],
):
super().__init__()
# Query tower
q_layers = [nn.Linear(query_input_dim, hidden_dims[0]), nn.Tanh()]
for i in range(len(hidden_dims) - 1):
q_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
self.query_tower = nn.Sequential(*q_layers)
# Document tower
d_layers = [nn.Linear(doc_input_dim, hidden_dims[0]), nn.Tanh()]
for i in range(len(hidden_dims) - 1):
d_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
self.doc_tower = nn.Sequential(*d_layers)
def forward(self, query_features: torch.Tensor, doc_features: torch.Tensor) -> torch.Tensor:
"""
Args:
query_features: (B, query_dim)
doc_features: (B, num_docs, doc_dim)
Returns:
scores: (B, num_docs) - cosine similarity scores
"""
query_emb = self.query_tower(query_features) # (B, hidden)
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = self.doc_tower(doc_features) # (B, num_docs, hidden)
doc_emb = F.normalize(doc_emb, dim=-1)
# Cosine similarity
scores = (query_emb.unsqueeze(1) * doc_emb).sum(dim=-1)
return scores
Serving Two-Tower Models
Training a two-tower model is only half the battle. The real challenge is serving it at scale—delivering personalized recommendations to millions of users in milliseconds. This section covers the production infrastructure that makes it possible.
The serving challenge:
At serving time, given a user, you need to find the most similar items from potentially billions of candidates. Even with decoupled towers, a naive linear scan is too slow:
10M items × 256 dimensions × 1 float op = 2.56 billion operations per user
At 10K QPS = 25.6 trillion operations per second = IMPOSSIBLE
The solution: Approximate Nearest Neighbor (ANN) search
Instead of exact nearest neighbor search (O(N)), we use approximate methods (O(log N) or O(1)) that find "close enough" neighbors with high probability. The key insight: we can trade a small amount of accuracy for massive speedup.
Popular ANN indexes:
- FAISS (Facebook AI Similarity Search):
  - Most popular library for dense vector search
  - Multiple index types: Flat (exact), IVF (inverted file), HNSW (graph-based), PQ (product quantization)
  - GPU support for batch processing
  - Used by: Meta, Spotify, Pinterest
- ScaNN (Google):
  - Anisotropic vector quantization—learns quantization optimized for MIPS (maximum inner product search)
  - 2x faster than HNSW at same recall
  - Used internally at Google/YouTube
- HNSW (Hierarchical Navigable Small World):
  - Graph-based index with hierarchical structure
  - Excellent recall-latency tradeoff
  - Available in: Faiss, Milvus, Qdrant, Weaviate
Index building workflow:
OFFLINE (daily/weekly):
───────────────────────────────────────────────────────────────
1. Export all item IDs from catalog
2. Run item tower inference on all items → item embeddings
3. Normalize embeddings (for cosine similarity)
4. Build ANN index from embeddings
5. Deploy index to serving infrastructure
ONLINE (per request):
───────────────────────────────────────────────────────────────
1. Receive user request
2. Fetch user features (watch history, demographics, etc.)
3. Run user tower inference → user embedding (typically ~5-10ms)
4. Query ANN index for top-K items → candidate set (typically ~1-5ms)
5. Pass candidates to ranking model
Embedding normalization for cosine similarity:
FAISS's most efficient indexes use inner product (dot product) similarity. To get cosine similarity, we normalize all vectors to unit length:

$$\hat{v} = \frac{v}{\lVert v \rVert_2}$$

After normalization, inner product equals cosine similarity.
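A quick sanity check of that equivalence in plain Python (toy vectors):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

u, v = [3.0, 4.0], [1.0, 2.0]
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
inner_product = dot(normalize(u), normalize(v))
assert abs(cosine - inner_product) < 1e-12
```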
Production considerations:
- Index sharding: For billions of items, shard the index across multiple machines. Route queries to all shards, merge results.
- Index updates: Item embeddings change as you retrain. Options:
  - Full rebuild (safest, but expensive)
  - Incremental updates (add new items, but index quality degrades over time)
  - Hybrid: Periodic full rebuilds + incremental updates
- Filtering: Often you need to filter candidates (e.g., exclude already-watched items, region restrictions). Options:
  - Pre-filter: Build separate indexes per filter dimension (expensive)
  - Post-filter: Retrieve more candidates, filter afterward (wasteful)
  - In-query filter: Some indexes (Milvus, Qdrant) support filtered search
- Caching: User embeddings can be cached for repeat visitors (typically with short TTL as preferences change).
import faiss
import numpy as np
class TwoTowerServing:
"""
Serving infrastructure for two-tower models.
"""
def __init__(self, model: YouTubeDNN, num_items: int, embedding_dim: int):
self.model = model
self.embedding_dim = embedding_dim
# Pre-compute all item embeddings
print("Computing item embeddings...")
item_ids = torch.arange(num_items)
with torch.no_grad():
self.item_embeddings = model.encode_items(item_ids).numpy()
# Build FAISS index for fast retrieval
print("Building FAISS index...")
self.index = faiss.IndexFlatIP(embedding_dim) # Inner product
faiss.normalize_L2(self.item_embeddings) # For cosine similarity
self.index.add(self.item_embeddings)
def retrieve(self, user_embedding: np.ndarray, k: int = 100) -> tuple[np.ndarray, np.ndarray]:
"""
Retrieve top-k items for user.
Args:
user_embedding: (embedding_dim,)
k: Number of items to retrieve
Returns:
item_ids: (k,)
scores: (k,)
"""
# Normalize user embedding
user_embedding = user_embedding / np.linalg.norm(user_embedding)
user_embedding = user_embedding.reshape(1, -1)
# Search
scores, item_ids = self.index.search(user_embedding, k)
return item_ids[0], scores[0]
def recommend(self, user_features: dict, k: int = 100) -> list[int]:
"""Full recommendation pipeline."""
# Encode user
with torch.no_grad():
user_emb = self.model.encode_user(**user_features).numpy()[0]
# Retrieve candidates
item_ids, scores = self.retrieve(user_emb, k)
return item_ids.tolist()
Part VI: Sequential Models
The models we've seen so far treat user-item interactions as a bag—they don't consider the order. But order matters! A user who watched [Inception, Interstellar, The Martian] likely wants more sci-fi, while [The Martian, Cast Away, Into the Wild] suggests survival films. Sequential models capture these temporal dynamics.
GRU4Rec: RNNs for Session-Based Recommendation
GRU4Rec (ICLR 2016 Workshop) was the first successful application of deep learning to session-based recommendation, fundamentally changing how we model user behavior.
The session-based recommendation problem:
In many scenarios, you don't have long-term user profiles:
- Anonymous users (no login)
- New users with no history (cold start)
- E-commerce sessions where intent changes rapidly
Instead, you only have the current session: a sequence of items the user interacted with in the past few minutes/hours. The task: predict the next item they'll interact with.
Session: [Product A] → [Product B] → [Product C] → ???
↑
Predict next item
Why RNNs for sequences?
Recurrent Neural Networks (RNNs) are designed for sequential data. They maintain a hidden state that evolves as they process each element:
Input: Item₁ Item₂ Item₃ Item₄
↓ ↓ ↓ ↓
[GRU] → [GRU] → [GRU] → [GRU]
↓ ↓ ↓ ↓
Hidden: h₁ → h₂ → h₃ → h₄
↓
Prediction
Each hidden state encodes the "summary" of the session so far. The final hidden state captures the user's current intent.
Why GRU instead of LSTM?
GRU4Rec uses Gated Recurrent Units (GRU) rather than LSTMs:
- Fewer parameters: GRU has 2 gates vs LSTM's 3, meaning faster training
- Similar performance: For recommendation sequences (typically short), GRU performs comparably to LSTM
- Simpler: Easier to implement and debug
The GRU update equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \quad \text{(update gate: how much to update)}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \quad \text{(reset gate: how much to forget)}$$
$$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})) \quad \text{(candidate hidden state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(final hidden state)}$$
Training with session-parallel mini-batches:
A key contribution of GRU4Rec is how they handle batching. Sessions have variable lengths, so naive batching is inefficient. Their solution:
- Sort sessions by length
- Create a sliding window across sessions
- When a session ends, replace it with a new one and reset its hidden state
This allows efficient GPU utilization while maintaining session independence.
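The bookkeeping can be sketched in plain Python (sessions are assumed to have at least two items; padding id 0 marks exhausted slots):

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset_flags) in GRU4Rec's session-parallel style.

    Each of the batch_size "slots" walks through one session; when a session
    ends, the slot is refilled with the next session and flagged so its
    hidden state gets reset. Exhausted slots emit padding id 0.
    """
    assert len(sessions) >= batch_size
    slot_sess = list(range(batch_size))  # which session each slot holds
    slot_pos = [0] * batch_size          # position within that session
    next_sess = batch_size
    finished = [False] * batch_size
    while not all(finished):
        inputs, targets, resets = [], [], []
        for i in range(batch_size):
            if finished[i]:
                inputs.append(0); targets.append(0); resets.append(True)
                continue
            s, p = slot_sess[i], slot_pos[i]
            inputs.append(sessions[s][p])
            targets.append(sessions[s][p + 1])
            resets.append(p == 0)  # reset hidden state at session start
            slot_pos[i] += 1
            if slot_pos[i] + 1 >= len(sessions[s]):  # session exhausted
                if next_sess < len(sessions):
                    slot_sess[i], slot_pos[i] = next_sess, 0
                    next_sess += 1
                else:
                    finished[i] = True
        yield inputs, targets, resets

batches = list(session_parallel_batches([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 2))
assert batches[0] == ([1, 4], [2, 5], [True, True])
assert len(batches) == 4
```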
BPR loss for recommendation:
GRU4Rec uses BPR (Bayesian Personalized Ranking) loss rather than cross-entropy:
$$L_{BPR} = -\frac{1}{N_S} \sum_{j=1}^{N_S} \log \sigma(\hat{r}_i - \hat{r}_j)$$

Where $\hat{r}_i$ is the score of the actual next item $i$ and $\hat{r}_j$ is the score of a negative sample $j$. This ranking loss directly optimizes for correct ordering.
TOP1 loss (GRU4Rec's innovation):
The paper also introduces TOP1 loss, which regularizes scores:

$$L_{TOP1} = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma(\hat{r}_j - \hat{r}_i) + \sigma(\hat{r}_j^2)$$

The second term penalizes large negative-sample scores, improving stability.
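Both losses are easy to sketch in plain Python, for one positive score and a list of negative-sample scores:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(r_pos: float, r_negs: list[float]) -> float:
    """BPR: -mean log sigmoid(r_pos - r_neg) over negative samples."""
    return -sum(math.log(sigmoid(r_pos - r_j)) for r_j in r_negs) / len(r_negs)

def top1_loss(r_pos: float, r_negs: list[float]) -> float:
    """TOP1: mean of sigmoid(r_neg - r_pos) + sigmoid(r_neg^2) over negatives."""
    return sum(sigmoid(r_j - r_pos) + sigmoid(r_j ** 2) for r_j in r_negs) / len(r_negs)

# Ranking the positive item higher lowers both losses.
assert bpr_loss(3.0, [-1.0, 0.0]) < bpr_loss(0.0, [-1.0, 0.0])
assert top1_loss(3.0, [-1.0, 0.0]) < top1_loss(0.0, [-1.0, 0.0])
```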
Practical considerations:
- Sequence length: Most value comes from recent items. 20-50 items is typical.
- Item embeddings: Shared between input and output (parameter tying).
- Dropout: Applied to inputs and between layers. Critical for preventing overfitting.
- Hidden size: 100-500 dimensions typically. Larger helps for diverse catalogs.
class GRU4Rec(nn.Module):
"""
GRU-based Session Recommendation.
Models user's current session as a sequence.
"""
def __init__(
self,
num_items: int,
embedding_dim: int = 100,
hidden_dim: int = 100,
num_layers: int = 1,
dropout: float = 0.2,
):
super().__init__()
self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)
self.gru = nn.GRU(
input_size=embedding_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
)
self.output = nn.Linear(hidden_dim, num_items)
def forward(self, session: torch.Tensor) -> torch.Tensor:
"""
Args:
session: (B, seq_len) - item IDs in session
Returns:
scores: (B, num_items) - scores for next item
"""
# Embed items
emb = self.item_embedding(session) # (B, L, D)
# GRU
output, _ = self.gru(emb) # (B, L, H)
# Use last hidden state to predict next item
last_hidden = output[:, -1, :] # (B, H)
scores = self.output(last_hidden) # (B, num_items)
return scores
def predict_next(self, session: torch.Tensor, k: int = 20) -> torch.Tensor:
"""Get top-k next items."""
scores = self.forward(session)
top_k = torch.topk(scores, k, dim=-1)
return top_k.indices
Caser: CNNs for Sequential Patterns
Caser (WSDM 2018) challenged the assumption that RNNs are the best architecture for sequences. Instead, Caser uses Convolutional Neural Networks to capture sequential patterns—and often outperforms RNN-based methods.
The key insight: Local patterns matter
Consider a shopping sequence: [Laptop] → [Laptop Bag] → [Mouse] → [Keyboard]. The patterns here are mostly local—consecutive items are related. You don't need to remember what happened 50 items ago; the last few items determine the next purchase.
CNNs excel at capturing local patterns through sliding windows (convolutions). This makes them surprisingly effective for short-term sequential recommendation.
Two types of patterns Caser captures:
Caser uses two distinct types of convolutions to capture different pattern types:
HORIZONTAL CONVOLUTIONS (union-level patterns):
──────────────────────────────────────────────────────────────
Capture which COMBINATIONS of recent items predict the next item.
Filter slides HORIZONTALLY across the embedding matrix.
Example: Filter height = 2 captures bigrams
[Laptop, Bag] → likely to buy accessories
Embedding matrix (L items × D dimensions):
┌─────────────────────┐
│ Laptop [0.8 0.2] │ ← Filter slides
│ Bag [0.7 0.3] │ horizontally
│ Mouse [0.6 0.4] │
│ Keyboard [0.5 0.5] │
└─────────────────────┘
VERTICAL CONVOLUTIONS (point-level patterns):
──────────────────────────────────────────────────────────────
Capture how INDIVIDUAL items influence the next item.
Filter slides VERTICALLY, aggregating information across features.
Example: "Laptop" (regardless of what followed) → predict accessories
This captures "skip patterns" where a single item signals intent.
Filter slides vertically across ALL positions:
┌─────────────────────┐
│ Laptop [0.8 0.2] │
│ Bag [0.7 0.3] │ ← Same filter applied
│ Mouse [0.6 0.4] │ to each row
│ Keyboard [0.5 0.5] │
└─────────────────────┘
Architecture breakdown:
- Input: User's last L items (e.g., L=5), embedded into D dimensions → (L × D) matrix
- Horizontal convolutions: Multiple filters with different heights (1, 2, 3, ..., L)
  - Height 1: Unigrams (single item patterns)
  - Height 2: Bigrams (consecutive pairs)
  - Height L: The entire sequence
  - Each filter produces a 1D output, then max-pooled
- Vertical convolutions: Filters with height L (full sequence) and width 1
  - Slides across embedding dimensions
  - Captures per-dimension patterns across all items
- User embedding: Added to capture long-term preferences (not just session)
- Output: Concatenate [horizontal features, vertical features, user embedding] → MLP → item scores
Why CNNs work for sequences:
- Parallelization: All convolutions run in parallel (unlike RNNs which are sequential)
- Local pattern capture: Explicit modeling of n-gram patterns
- Position sensitivity: Unlike bag-of-words, convolutions respect order
- Faster training: 2-3x faster than GRU4Rec in practice
The math behind horizontal convolutions:

For a filter $F \in \mathbb{R}^{h \times d}$ of height $h$ and width $d$ (the embedding dimension), applied to the embedding matrix $E \in \mathbb{R}^{L \times d}$:

$$c_i = \phi\big(E_{i:i+h-1} \odot F\big), \quad i = 1, \dots, L - h + 1$$

Then max-pool over all positions:

$$c = \max\{c_1, \dots, c_{L-h+1}\}$$

Where $\phi$ is the activation function and $\odot$ denotes the sum of element-wise products (a sliding inner product).
When to choose Caser over GRU4Rec:
- Short sequences: Caser shines when recent history (last 5-10 items) dominates
- Clear local patterns: E-commerce, music playlists with strong sequential dependencies
- Need for speed: 2-3x faster training, similar inference latency
- Explicit n-grams: When you want to interpret which item combinations matter
import torch
import torch.nn as nn
import torch.nn.functional as F
class Caser(nn.Module):
"""
Convolutional Sequence Embedding for Sequential Recommendation.
Uses CNNs to capture point-level and union-level patterns.
"""
def __init__(
self,
num_items: int,
num_users: int,
embedding_dim: int = 50,
max_seq_len: int = 5,
num_horizontal_filters: int = 16,
num_vertical_filters: int = 4,
):
super().__init__()
self.embedding_dim = embedding_dim
self.max_seq_len = max_seq_len
# Embeddings
self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)
self.user_embedding = nn.Embedding(num_users, embedding_dim)
# Horizontal convolutions (capture union-level patterns)
# Different filter heights for different n-gram sizes
self.horizontal_convs = nn.ModuleList([
nn.Conv2d(1, num_horizontal_filters, (i + 1, embedding_dim))
for i in range(max_seq_len)
])
# Vertical convolution (capture point-level patterns)
self.vertical_conv = nn.Conv2d(1, num_vertical_filters, (max_seq_len, 1))
# Output
horizontal_out = num_horizontal_filters * max_seq_len
vertical_out = num_vertical_filters * embedding_dim
self.fc = nn.Linear(horizontal_out + vertical_out + embedding_dim, num_items)
def forward(self, user: torch.Tensor, sequence: torch.Tensor) -> torch.Tensor:
"""
Args:
user: (B,) user IDs
sequence: (B, max_seq_len) item sequence
Returns:
scores: (B, num_items)
"""
B = user.size(0)
# Embed sequence
seq_emb = self.item_embedding(sequence) # (B, L, D)
seq_emb = seq_emb.unsqueeze(1) # (B, 1, L, D) - add channel dim
# Horizontal convolutions
h_outs = []
for conv in self.horizontal_convs:
h = F.relu(conv(seq_emb)) # (B, filters, L-k+1, 1)
h = h.squeeze(-1) # (B, filters, L-k+1)
h = F.max_pool1d(h, h.size(-1)) # (B, filters, 1)
h_outs.append(h.squeeze(-1))
horizontal = torch.cat(h_outs, dim=1) # (B, filters * L)
# Vertical convolution
v = F.relu(self.vertical_conv(seq_emb)) # (B, filters, 1, D)
vertical = v.view(B, -1) # (B, filters * D)
# User embedding
user_emb = self.user_embedding(user) # (B, D)
# Combine
combined = torch.cat([horizontal, vertical, user_emb], dim=1)
scores = self.fc(combined)
return scores
The Transition to Attention
GRU4Rec and Caser showed that sequence matters. But they had limitations:
- GRU: Sequential processing (slow), vanishing gradients
- Caser: Fixed receptive field, limited long-range modeling
This set the stage for attention-based models (SASRec, BERT4Rec) covered in the next post.
Part VII: Production Systems (Pre-Transformer Era)
How Tech Giants Built Their RecSys
Understanding how major companies evolved their recommendation systems provides crucial context. Each faced unique scale challenges and made different architectural choices.
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION RECSYS EVOLUTION BY COMPANY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NETFLIX (2006-2017) │
│ ─────────────────── │
│ 2006: Netflix Prize launches (MF era begins) │
│ 2009: BellKor's Pragmatic Chaos wins with 107-model ensemble │
│ 2012: Moves to hybrid: MF + content features + context │
│ 2016: Deep learning experiments begin │
│ Key lesson: Ensembles beat single models at scale │
│ │
│ AMAZON (2003-2017) │
│ ───────────────────── │
│ 2003: Item-based CF paper (foundational) │
│    2013: Hybrid CF + content for "Customers who bought..."              │
│    2016: DSSTNE (Deep Scalable Sparse Tensor Network Engine)            │
│ Key lesson: Item-item similarity scales better than user-user │
│ │
│ YOUTUBE (2010-2018) │
│ ─────────────────── │
│ 2010: Video recommendations via click-through rates │
│ 2016: Deep Neural Networks paper (two-tower revolution) │
│ 2018: Adds watch time prediction, not just CTR │
│ Key lesson: Optimize for engagement, not just clicks │
│ │
│ SPOTIFY (2014-2018) │
│ ────────────────────── │
│ 2014: Discover Weekly launches (CF + audio features) │
│ 2016: Adds NLP on playlists ("chill vibes" → similar tracks) │
│ 2018: BaRT (Bandits for Recommendations) for exploration │
│ Key lesson: Audio content features enable cold-start │
│ │
│ LINKEDIN (2012-2018) │
│ ───────────────────── │
│ 2012: People You May Know (graph-based CF) │
│ 2015: GLMix (Generalized Linear Mixed Models) │
│ 2017: Photon-ML for feature engineering │
│ Key lesson: Professional graph structure is gold │
│ │
│ PINTEREST (2015-2018) │
│ ─────────────────────── │
│ 2015: Related Pins via visual similarity (CNN features) │
│ 2017: PinSage (graph neural networks at scale) │
│ 2018: Multi-task learning (engagement + saves) │
│ Key lesson: Visual features dominate in image-first platforms │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Netflix: The Prize That Changed Everything
The Netflix Prize ($1M, 2006-2009) was a watershed moment for recommendation systems—it brought collaborative filtering from an academic curiosity to a mainstream machine learning problem, attracting thousands of teams worldwide and spawning techniques still used today.
The challenge:
Netflix released 100 million movie ratings (1-5 stars) from 480,000 users on 18,000 movies. The goal: beat their in-house recommender (Cinematch, a basic neighborhood CF system) by 10% in RMSE. The target seemed nearly unreachable at the time: Cinematch had been stuck at 0.9514 RMSE for years.
Why the Prize mattered:
- Open data at scale: For the first time, researchers had access to real industrial-scale recommendation data
- Clear metric: RMSE provided an objective benchmark that everyone optimized for
- Competition drove innovation: Teams shared insights, combined approaches, and pushed boundaries
- Industry attention: The $1M prize attracted researchers who had never worked on recommendations
The evolution of winning approaches:
Year 1 (2007): Single-model innovations
────────────────────────────────────────
- SVD with bias terms
- Temporal dynamics (preferences change over time)
- Best single model: ~0.88 RMSE
Year 2 (2008): Feature engineering
────────────────────────────────────────
- Implicit feedback (what users DIDN'T rate matters)
- Neighborhood models with shrinkage
- Best: ~0.87 RMSE
Year 3 (2009): Ensemble everything
────────────────────────────────────────
- 100+ model ensembles
- Stacking and blending
- Final winner: 0.8567 RMSE (10.09% improvement)
Key technical innovations from the Prize:
- SVD++ (implicit feedback): Users who rated many horror movies are horror fans, even if individual ratings vary. SVD++ added "implicit" factors based on which items a user interacted with—not just how they rated.
- Temporal dynamics: User preferences drift over time. A user who loved rom-coms in 2005 might prefer thrillers in 2008. Time-aware models captured this.
- Regularization: L2 regularization (weight decay) on latent factors was crucial. Without it, models memorized training data.
- Neighborhood + latent factor hybrids: The best results combined global latent factor models (matrix factorization) with local neighborhood models (kNN).
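The first of these, SVD++, has a compact prediction rule (Koren, KDD 2008). A minimal numpy sketch, with toy randomly initialized factors standing in for learned ones:

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y, interacted):
    """SVD++ prediction:
    r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^{-1/2} * sum_{j in N(u)} y_j)
    where N(u) is the set of items the user interacted with and y_j are
    the implicit item factors - interaction counts, not rating values."""
    implicit = y[interacted].sum(axis=0) / np.sqrt(len(interacted))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

# Toy factors (in practice all of these are learned by SGD)
rng = np.random.default_rng(0)
k = 8
y = rng.normal(scale=0.1, size=(100, k))   # implicit factors for 100 items
p_u = rng.normal(scale=0.1, size=k)        # user's explicit factors
q_i = rng.normal(scale=0.1, size=k)        # target item's factors
r_hat = svdpp_predict(3.6, 0.2, -0.1, p_u, q_i, y, [3, 17, 42])
```

The implicit term shifts the user vector toward the items they touched, so two users with identical ratings but different browsing histories get different predictions.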
Why Netflix didn't deploy the winning solution:
The winning 107-model ensemble was too complex for production. Netflix's actual deployed system used a simpler blend of ~3-5 models. The Prize's real value was the techniques and talent it surfaced—many participants joined Netflix.
The lesson for practitioners:
Single-model improvements matter, but ensembles almost always win competitions. However, production systems favor simplicity. The Prize demonstrated both truths simultaneously.
# Simplified version of BellKor's winning approach
class NetflixPrizeEnsemble:
"""
Key insight from Netflix Prize: Blend many models.
The winning solution combined 107 models.
"""
def __init__(self):
self.models = [
# Matrix factorization variants
SVDModel(factors=100),
SVDModel(factors=200),
SVDPlusPlusModel(factors=100), # SVD with implicit feedback
# Neighborhood models
ItemKNN(k=20),
ItemKNN(k=50),
# Restricted Boltzmann Machines
RBMModel(hidden_units=100),
# Time-aware models
TimeSVDModel(factors=100), # Captures preference drift
]
# Learned blending weights
self.weights = None
def fit(self, train_data, val_data):
"""Train all models and learn blending weights."""
# Train each model
for model in self.models:
model.fit(train_data)
# Get predictions on validation set
val_predictions = np.array([
model.predict(val_data)
for model in self.models
]).T # (num_samples, num_models)
# Learn optimal weights via linear regression
self.weights = self._learn_weights(val_predictions, val_data.ratings)
def predict(self, user_item_pairs):
"""Blended prediction."""
predictions = np.array([
model.predict(user_item_pairs)
for model in self.models
]).T
return predictions @ self.weights
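The `_learn_weights` step above is left abstract. A minimal ridge-regression version could look like this (the function name is ours; the winning teams used fancier blending, including gradient boosted trees):

```python
import numpy as np

def learn_blend_weights(val_predictions, val_ratings, l2=1e-3):
    """Ridge-regression blend: solve (P^T P + l2*I) w = P^T r for the
    weights that best combine per-model predictions on validation data.
    val_predictions: (num_samples, num_models)."""
    P, r = val_predictions, val_ratings
    k = P.shape[1]
    return np.linalg.solve(P.T @ P + l2 * np.eye(k), P.T @ r)

# Toy check: two models whose average equals the true ratings
P = np.array([[4.0, 2.0], [3.0, 5.0], [1.0, 3.0]])
r = P.mean(axis=1)
w = learn_blend_weights(P, r)
```

On this toy data the learned weights come out near [0.5, 0.5], i.e. the blend recovers the averaging that generated the targets.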
Netflix lessons:
- SVD++ with implicit feedback was crucial (views, not just ratings)
- Time-aware models captured preference drift
- Blending provided 10%+ improvement over single models
- Real deployment was much simpler than the prize-winning solution
YouTube: Two-Tower at Scale
YouTube's 2016 paper defined how to do deep learning recommendations at billion-scale. This paper is arguably the most influential industrial recommendation paper ever written—it established patterns that every major tech company now follows.
The scale challenge YouTube faced:
- 800+ million videos in the catalog
- 2+ billion logged-in users monthly
- 500+ hours of video uploaded every minute
- Latency budget: <200ms total for the entire recommendation flow
Scoring every video for every user is computationally impossible: 800M videos × 2B users ≈ 1.6 × 10^18 user-video pairs.
The solution: Two-stage architecture
YouTube's key architectural insight was splitting the problem into two stages, each optimized for different goals:
┌─────────────────────────────────────────────────────────────────────────┐
│ YOUTUBE'S TWO-STAGE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: CANDIDATE GENERATION │
│ ───────────────────────────── │
│ Input: User context (history, demographics, time) │
│ Output: ~1000 candidate videos (from 800M) │
│ Model: Deep Neural Network (two-tower) │
│ Goal: HIGH RECALL - don't miss relevant videos │
│ Budget: ~10ms │
│ │
│ Features (intentionally sparse for speed): │
│ - Watch history (averaged embeddings of last 50 videos) │
│ - Search history (averaged query embeddings) │
│ - Age, gender, geographic region │
│ - Time since last watch (freshness signal) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STAGE 2: RANKING │
│ ──────────── │
│ Input: ~1000 candidates + rich features │
│ Output: ~50 ranked videos for display │
│ Model: Feature-rich neural network │
│ Goal: HIGH PRECISION - order candidates correctly │
│ Budget: ~100ms │
│ │
│ Features (much richer, expensive to compute): │
│ - Video metadata (title, description, category, duration) │
│ - Video quality signals (watch time, likes, comments) │
│ - User-video interaction history (has user seen similar?) │
│ - Contextual (time of day, device, page location) │
│ - Freshness (video age, creator's recent performance) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why this architecture works:
- Candidate generation can be pre-computed. Video embeddings don't change per-user, so they're computed offline and indexed. Only user embedding computation is online (~1ms).
- Ranking uses expensive features that would be impossible to compute for all videos, but feasible for 1000 candidates.
- Different objectives: Candidate generation optimizes for recall (any potentially relevant video), ranking optimizes for engagement (watch time).
The "watch time" insight:
YouTube's ranking model predicts expected watch time, not click probability. This was crucial:
- Clicks are gameable: Clickbait titles/thumbnails generate clicks but disappoint users
- Watch time = actual engagement: If users watch the video, it was genuinely interesting
- Better business alignment: YouTube's ad revenue correlates with watch time, not clicks
Training the ranking model:
The model is trained with weighted logistic regression over watch time. For each impression:
- Positive example: User watched the video. Weight = watch time in seconds.
- Negative example: User didn't click. Weight = 1.
This gives longer-watched videos more influence on the model.
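The weighting scheme can be sketched as a weighted cross-entropy (the paper further observes that under this weighting the learned odds approximate expected watch time; the function below is our illustration, not YouTube's code):

```python
import numpy as np

def weighted_logistic_loss(logits, clicked, watch_time):
    """Weighted cross-entropy as in YouTube's ranking setup: clicked
    impressions are weighted by watch time in seconds, unclicked ones
    get unit weight, so long watches pull harder on the model."""
    p = 1.0 / (1.0 + np.exp(-logits))
    weights = np.where(clicked == 1, watch_time, 1.0)
    ce = -np.where(clicked == 1, np.log(p), np.log(1.0 - p))
    return float(np.sum(weights * ce) / np.sum(weights))

clicked = np.array([1, 1, 0, 0])
watch_time = np.array([120.0, 5.0, 0.0, 0.0])
# A model that scores the long watch highest gets a lower loss than one
# that gets the order backwards
good = weighted_logistic_loss(np.array([3.0, 1.0, -3.0, -2.0]), clicked, watch_time)
bad = weighted_logistic_loss(np.array([-3.0, -1.0, 3.0, 2.0]), clicked, watch_time)
```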
The business rules layer:
After ranking, YouTube applies business rules:
- Diversity: Don't show 10 videos from the same channel
- Freshness: Mix in recent uploads
- Fairness: Ensure smaller creators get exposure
- Brand safety: Filter inappropriate content
class YouTubeProductionSystem:
"""
YouTube's production recommendation architecture (2016).
Key insight: Split into RETRIEVAL (fast, broad) and RANKING (slow, precise).
"""
def __init__(self):
# Candidate generation: Two-tower DNN
self.candidate_generator = YouTubeDNN(
num_videos=1_000_000_000, # Billion videos
embedding_dim=256,
)
# Ranking: Feature-rich model
self.ranker = YouTubeRanker(
video_features=['duration', 'freshness', 'channel_ctr'],
user_features=['watch_history', 'search_history', 'demographics'],
context_features=['time_of_day', 'device', 'page'],
)
# ANN index for retrieval
self.ann_index = None
def serve(self, user_context: dict, num_results: int = 20) -> list:
"""Full serving pipeline."""
        # Stage 1: Candidate Generation (~10ms budget)
# Retrieve ~1000 candidates from billions of videos
user_embedding = self.candidate_generator.encode_user(user_context)
candidates = self.ann_index.search(user_embedding, k=1000)
# Stage 2: Ranking (100ms budget)
# Score candidates with rich features
video_features = self.fetch_features(candidates)
scores = self.ranker.score(user_context, candidates, video_features)
# Stage 3: Business Rules (10ms budget)
# Diversity, freshness, creator fairness
final_list = self.apply_business_rules(candidates, scores)
return final_list[:num_results]
class YouTubeRanker(nn.Module):
"""
YouTube's ranking model optimizes for WATCH TIME, not clicks.
Key insight: Clicks are easy to game. Watch time = real engagement.
"""
def __init__(self, video_features, user_features, context_features):
super().__init__()
# Feature processing
self.video_encoder = FeatureEncoder(video_features)
self.user_encoder = FeatureEncoder(user_features)
self.context_encoder = FeatureEncoder(context_features)
# Prediction heads
combined_dim = 512
self.watch_time_head = nn.Sequential(
nn.Linear(combined_dim, 256),
nn.ReLU(),
nn.Linear(256, 1),
)
def forward(self, user_features, video_features, context_features):
"""Predict expected watch time."""
user_emb = self.user_encoder(user_features)
video_emb = self.video_encoder(video_features)
context_emb = self.context_encoder(context_features)
combined = torch.cat([user_emb, video_emb, context_emb], dim=-1)
# Predict watch time (not binary click)
watch_time = self.watch_time_head(combined)
return watch_time
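The `ann_index.search` call in the serving pipeline above stands in for an approximate nearest neighbor library (Faiss, ScaNN, or similar). Over a small catalog, the exact version it approximates is plain maximum inner product search; a sketch:

```python
import numpy as np

def topk_inner_product(user_emb, video_embs, k):
    """Exact maximum inner product search: what ann_index.search
    approximates. O(num_videos * dim), fine for a demo, hopeless for
    hundreds of millions of videos - hence ANN indexes."""
    scores = video_embs @ user_emb
    idx = np.argpartition(-scores, k)[:k]    # top-k indices, unordered
    return idx[np.argsort(-scores[idx])]     # order the top-k by score

rng = np.random.default_rng(1)
videos = rng.normal(size=(10_000, 64))       # precomputed video embeddings
user = rng.normal(size=64)                   # online user embedding
candidates = topk_inner_product(user, videos, k=100)
```

This split (offline item embeddings, online user embedding, fast index lookup) is the serving pattern that makes two-tower retrieval practical.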
YouTube lessons:
- Two-stage architecture is essential at scale
- Optimize for watch time, not clicks (better engagement metric)
- Fresh content needs special handling (cold-start)
- The "example age" feature handles freshness: the model sees each training example's age as an input, which is zeroed out at serving time, so it learns viewers' preference for fresh content without going stale
Amazon: Item-Based CF at Scale
Amazon's 2003 paper "Amazon.com Recommendations: Item-to-Item Collaborative Filtering" is one of the most influential recommendation papers ever published. It established item-based collaborative filtering as the go-to approach for e-commerce and shaped how we think about scalable recommendations.
The problem with user-based CF at Amazon's scale:
Before 2003, most recommendation systems used user-based collaborative filtering: "Find users similar to you, recommend what they liked." This approach hit a wall at Amazon's scale:
User-Based CF Scalability Problem:
─────────────────────────────────────────────────────────────────────────
Users: ~50 million (in 2003, now 300M+)
Items: ~10 million products
To recommend for user U:
1. Compare U to ALL 50M users → 50M similarity calculations
2. For each similar user, iterate their purchases → expensive
This must happen in REAL-TIME (<100ms). Impossible.
Amazon's insight: Items are more stable than users
Item-based CF flips the approach: instead of "users like you liked X," it's "items similar to X are Y, Z."
Item-Based CF:
─────────────────────────────────────────────────────────────────────────
Pre-compute: For each item, find similar items (based on co-purchase)
This is OFFLINE. Can take hours. Run nightly.
Online: Given user's purchase history [A, B, C]
Look up pre-computed similar items for A, B, C
Aggregate and rank
Total: O(purchase_history_size × k) = O(100 × 50) = 5000 ops
5000 ops vs 50,000,000 ops. That's the difference.
Why items are more stable:
- User preferences drift: A user might love sci-fi this year, thrillers next year
- Item relationships are stable: "Laptops are bought with laptop bags" doesn't change
- New users appear constantly: 1000s of new users per minute. Recomputing user-user similarity is expensive.
- Items change slowly: New products launch, but most catalog is stable
Computing item similarity:
The key question: How do we define "similar items"? Amazon uses purchase co-occurrence:
This is essentially cosine similarity over the user-item purchase matrix, viewed from the item perspective.
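That cosine can be computed in one vectorized pass over a binary purchase matrix; the sqrt-popularity normalization in the denominator is also the dampening that keeps blockbuster items from matching everything (a sketch, not Amazon's implementation):

```python
import numpy as np

def item_cosine_similarity(R):
    """Item-item cosine over a binary user-item purchase matrix R
    (users x items). co[i, j] counts users who bought both i and j;
    dividing by sqrt(pop_i * pop_j) dampens popular-item bias."""
    co = R.T @ R                                  # co-purchase counts
    pop = np.sqrt(np.diag(co))                    # sqrt of item popularity
    sim = co / np.outer(pop, pop).clip(min=1e-12)
    np.fill_diagonal(sim, 0.0)                    # an item isn't its own neighbor
    return sim

R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
S = item_cosine_similarity(R)
```

At real scale this runs as a sparse batch job; the dense matmul here is only for illustration.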
The "Customers who bought this also bought" pattern:
This is item-based CF in action:
- User views product X
- Look up pre-computed similar items for X
- Display top-k as "Customers who bought this also bought..."
This simple pattern drives a significant portion of Amazon's revenue.
Implementation considerations:
- Sparsity handling: Most users buy few items. Use co-occurrence counts, not Pearson correlation.
- Popular item bias: Very popular items co-occur with everything. Dampen by dividing by item popularity.
- Temporal filtering: Recent purchases matter more. Decay older co-occurrences.
- Category constraints: "Laptop bag" is similar to "Laptop," but maybe don't recommend both.
class AmazonItemCF:
"""
Amazon's item-to-item collaborative filtering.
Key insight: Item similarity is more stable than user similarity.
"Customers who bought X also bought Y"
"""
def __init__(self):
self.item_similarity = {} # item_id -> [(similar_item, score)]
def compute_similarities_offline(self, purchase_matrix):
"""
Pre-compute all item-item similarities.
Run daily/weekly as batch job.
"""
num_items = purchase_matrix.shape[1]
for i in range(num_items):
# Users who bought item i
users_i = set(np.where(purchase_matrix[:, i] > 0)[0])
similarities = []
for j in range(num_items):
if i == j:
continue
# Users who bought item j
users_j = set(np.where(purchase_matrix[:, j] > 0)[0])
# Jaccard similarity (or cosine)
intersection = len(users_i & users_j)
union = len(users_i | users_j)
if union > 0:
sim = intersection / union
if sim > 0.01: # Threshold for storage
similarities.append((j, sim))
# Keep top-k similar items
similarities.sort(key=lambda x: x[1], reverse=True)
self.item_similarity[i] = similarities[:100]
def recommend(self, user_purchase_history: list, n: int = 10) -> list:
"""
Real-time recommendation based on user's purchases.
"""
# Aggregate similar items from purchase history
candidate_scores = {}
for purchased_item in user_purchase_history:
if purchased_item not in self.item_similarity:
continue
for similar_item, similarity in self.item_similarity[purchased_item]:
if similar_item in user_purchase_history:
continue # Don't recommend already purchased
if similar_item not in candidate_scores:
candidate_scores[similar_item] = 0
candidate_scores[similar_item] += similarity
# Sort by score
ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
return [item_id for item_id, _ in ranked[:n]]
Amazon lessons:
- Pre-computation makes real-time serving feasible
- Purchase signals are stronger than views/clicks
- "Frequently bought together" = item-item in same basket
- Simple methods at scale beat complex methods that don't scale
Spotify: Audio Features for Cold-Start
Spotify's recommendation system is a masterclass in hybrid approaches—combining collaborative filtering with content-based features to solve one of the hardest problems in music recommendation: the cold-start problem for new songs.
The music cold-start problem:
Music platforms face a brutal cold-start challenge:
- 40,000+ new tracks uploaded daily to Spotify
- New songs have zero listening data
- Artists depend on algorithmic discovery for exposure
- Users expect fresh music in playlists like Discover Weekly
Traditional CF fails here: how do you recommend a song that no one has listened to yet?
Spotify's solution: Multiple signal sources
Spotify doesn't rely on a single model—they combine multiple signals, each addressing different scenarios:
┌─────────────────────────────────────────────────────────────────────────┐
│ SPOTIFY'S MULTI-SIGNAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SIGNAL 1: COLLABORATIVE FILTERING │
│ ─────────────────────────────────── │
│ When it works: Songs with listening history │
│ Method: Matrix factorization on user-track listens │
│ Strength: Captures taste clusters (indie rock fans, K-pop fans) │
│ │
│ SIGNAL 2: AUDIO CONTENT FEATURES │
│ ───────────────────────────────── │
│ When it works: ANY song (even brand new uploads) │
│ Method: CNN on mel-spectrograms → audio embedding │
│ Features: Tempo, energy, danceability, acousticness, valence │
│ Strength: Enables cold-start recommendations │
│ │
│ SIGNAL 3: PLAYLIST CO-OCCURRENCE (NLP) │
│ ───────────────────────────────────────── │
│ When it works: Songs that appear in user playlists │
│ Method: Word2Vec on playlists (songs = "words", playlists = "docs") │
│ Strength: Captures context ("chill vibes" playlists = similar mood) │
│ │
│ SIGNAL 4: EDITORIAL KNOWLEDGE │
│ ───────────────────────────────── │
│ When it works: Curated playlists, genre classifications │
│ Method: Human-labeled tags, genre graphs │
│ Strength: Handles edge cases, ensures quality │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Audio content features: The cold-start savior
Spotify's audio analysis pipeline extracts features directly from the audio waveform:
- Low-level features: Raw spectrograms, MFCCs (mel-frequency cepstral coefficients)
- Mid-level features: Beat, tempo, key, mode, time signature
- High-level features: Genre probability distribution, mood, energy
These are computed at upload time—before anyone listens. A new song immediately gets an embedding based on how it sounds, enabling recommendations like: "This new track sounds like songs you've loved before."
Playlist-based Word2Vec:
Spotify treats playlists as "documents" and songs as "words." Using Word2Vec:
- Songs that frequently appear in the same playlists get similar embeddings
- A song in many "workout" playlists ends up near other workout songs
- Captures semantic relationships that pure audio can't (e.g., "party anthems" might span genres)
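Skip-gram Word2Vec implicitly factorizes a shifted PMI matrix of co-occurrence counts (Levy & Goldberg, 2014), so as a self-contained stand-in we can build track embeddings from playlist co-occurrences with PPMI plus a truncated eigendecomposition. The playlists and track names below are toy data:

```python
import numpy as np
from itertools import combinations

playlists = [  # toy "documents": each playlist is a bag of tracks
    ["trackA", "trackB", "trackC"],
    ["trackA", "trackB"],
    ["trackD", "trackE"],
    ["trackD", "trackE", "trackC"],
]
tracks = sorted({t for p in playlists for t in p})
idx = {t: i for i, t in enumerate(tracks)}

# Symmetric co-occurrence counts within each playlist
C = np.zeros((len(tracks), len(tracks)))
for p in playlists:
    for a, b in combinations(p, 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Positive PMI, then a truncated eigendecomposition -> dense embeddings
total = C.sum()
row = C.sum(axis=1, keepdims=True)
pmi = np.log(C * total / (row @ row.T) + 1e-12)
ppmi = np.maximum(pmi, 0.0)
vals, vecs = np.linalg.eigh(ppmi)            # eigenvalues in ascending order
top = np.argsort(vals)[::-1][:2]             # keep the two largest
emb = vecs[:, top] * np.sqrt(vals[top])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
```

Tracks that always share playlists (trackA, trackB) end up with near-identical embeddings, while tracks that never co-occur (trackA, trackD) land far apart.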
Discover Weekly: The flagship product
Discover Weekly (30 personalized songs every Monday) combines all signals:
- Find your taste clusters: Use CF to identify what genres/artists you like
- Explore similar users' discoveries: What did similar users recently discover?
- Audio similarity expansion: Find audio-similar tracks to your favorites
- Freshness filter: Exclude songs you've already heard
- Diversity injection: Ensure variety across the 30 tracks
The result: ~40 million users listen to Discover Weekly, with >5 billion track streams.
Bandit exploration for new music:
Spotify uses multi-armed bandits to balance exploitation (recommend what you'll like) vs exploration (surface new tracks to gather data). New tracks with uncertain quality get some exposure to "test" them, then exploitation takes over based on observed engagement.
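The explore/exploit mechanic can be sketched with Beta-Bernoulli Thompson sampling; this is a generic bandit illustration, not Spotify's actual BaRT internals:

```python
import numpy as np

class TrackBandit:
    """Thompson sampling over tracks. Each track keeps a
    Beta(wins + 1, losses + 1) posterior over its engagement rate; we
    play the track whose sampled rate is highest, so uncertain new
    tracks get tested and proven ones get exploited."""

    def __init__(self, num_tracks, seed=0):
        self.wins = np.zeros(num_tracks)
        self.losses = np.zeros(num_tracks)
        self.rng = np.random.default_rng(seed)

    def pick(self):
        samples = self.rng.beta(self.wins + 1, self.losses + 1)
        return int(np.argmax(samples))

    def update(self, track, engaged):
        if engaged:
            self.wins[track] += 1
        else:
            self.losses[track] += 1

# Simulate: track 2 has the highest true engagement rate; exposure
# gradually concentrates on it while the others still get sampled
true_rates = [0.1, 0.2, 0.6]
bandit = TrackBandit(3, seed=42)
world = np.random.default_rng(7)
for _ in range(2000):
    t = bandit.pick()
    bandit.update(t, world.random() < true_rates[t])
plays = bandit.wins + bandit.losses
```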
class SpotifyHybridRecSys:
"""
Spotify's hybrid approach: CF + Audio Content.
Key insight: New songs have no listening data,
but we can analyze the audio itself.
"""
def __init__(self):
# Collaborative filtering component
self.cf_model = MatrixFactorization(num_factors=100)
# Audio analysis component (CNN on spectrograms)
self.audio_model = AudioCNN()
# Playlist-based NLP
self.playlist_nlp = PlaylistWord2Vec()
def get_track_embedding(self, track_id: str, track_metadata: dict) -> np.ndarray:
"""
Hybrid embedding combining multiple signals.
"""
embeddings = []
# CF embedding (if track has enough listens)
if self.has_cf_embedding(track_id):
cf_emb = self.cf_model.get_item_embedding(track_id)
embeddings.append(cf_emb)
# Audio embedding (always available)
audio_features = self.audio_model.extract_features(
track_metadata['audio_file']
)
embeddings.append(audio_features)
# Playlist co-occurrence embedding
if self.has_playlist_data(track_id):
playlist_emb = self.playlist_nlp.get_track_embedding(track_id)
embeddings.append(playlist_emb)
# Weighted combination (learned weights)
return self.combine_embeddings(embeddings)
def discover_weekly(self, user_id: str) -> list:
"""
Generate Discover Weekly playlist.
Mix of CF recommendations + audio-similar discoveries.
"""
# CF: Songs similar users liked
cf_candidates = self.cf_model.recommend(user_id, n=100)
# Audio: Songs audio-similar to user's favorites
user_favorites = self.get_user_top_tracks(user_id)
audio_candidates = []
for fav in user_favorites[:10]:
similar = self.audio_model.find_similar(fav, n=20)
audio_candidates.extend(similar)
# Blend and diversify
final_playlist = self.blend_and_diversify(
cf_candidates,
audio_candidates,
target_length=30,
)
return final_playlist
Spotify lessons:
- Audio features (tempo, energy, danceability) enable cold-start
- Playlist co-occurrence = implicit collaborative signal
- Discover Weekly balances exploitation (CF) + exploration (audio similarity)
- Release Radar uses freshness + artist affinity
Pinterest: Visual Similarity at Scale
Pinterest is an image-first platform—users come to discover visual ideas for home decor, fashion, recipes, and more. This makes Pinterest unique: visual features aren't just "nice to have," they're the primary signal for recommendation quality.
Why visual features dominate on Pinterest:
On Pinterest, two pins might have:
- Identical metadata: "living room decor"
- Completely different visual appeal: modern minimalist vs. bohemian maximalist
A user saving minimalist pins wants more minimalist—not just "living room decor." Visual similarity captures this nuance that metadata cannot.
The scale challenge:
Pinterest has:
- 300+ billion saved pins
- 5+ billion boards
- 500+ million monthly users
- Billions of images to process
Computing visual similarity for billions of images is computationally massive.
Pinterest's multi-modal approach:
Pinterest combines several signals, with visual features as the foundation:
┌─────────────────────────────────────────────────────────────────────────┐
│ PINTEREST'S MULTI-MODAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SIGNAL 1: VISUAL FEATURES (CNN-based) │
│ ─────────────────────────────────────── │
│ Method: Pre-trained ResNet/VGG → 4096-dim embedding │
│ Use case: Visual similarity, style matching │
│ Example: "Show more pins that LOOK like this" │
│ │
│ SIGNAL 2: GRAPH FEATURES (PinSage) │
│ ───────────────────────────────────── │
│ Method: Graph Neural Network on pin-board-user graph │
│ Use case: Contextual similarity, taste clustering │
│ Example: "Pins saved to similar boards" │
│ │
│ SIGNAL 3: TEXT FEATURES │
│ ──────────────────────── │
│ Method: BERT embeddings on pin descriptions │
│ Use case: Semantic search, query understanding │
│ Example: "modern kitchen ideas" │
│ │
│ SIGNAL 4: ENGAGEMENT FEATURES │
│ ────────────────────────────── │
│ Method: Historical CTR, save rate, closeup rate │
│ Use case: Quality scoring, ranking │
│ Example: Filter low-engagement pins │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PinSage: Graph Neural Networks at Billion Scale
PinSage (KDD 2018) was a breakthrough—the first GNN deployed at billion scale. Traditional GNNs aggregate information from every neighbor of every node, which is intractable for high-degree nodes in a web-scale graph. PinSage's innovations:
- Random walk sampling: Instead of all neighbors, sample important neighbors via random walks
- Importance pooling: Weight neighbors by visit frequency in random walks
- Producer-consumer architecture: Separate graph traversal from neural network computation
- Mini-batch training: Train on subgraphs, not the full graph
The result: GNN embeddings for 3 billion pins, computed in hours.
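The random-walk sampling step can be sketched in plain Python (function and graph names are ours, and the real system runs walks at a very different scale):

```python
import random
from collections import Counter

def important_neighbors(graph, start, num_walks=200, walk_len=3,
                        restart_p=0.5, top_k=3, seed=0):
    """PinSage-style neighbor sampling: run short random walks with
    restart from `start`; the most-visited nodes become its "important
    neighbors", weighted by normalized visit frequency (the counts that
    importance pooling uses)."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            if rng.random() < restart_p:
                node = start              # teleport back to the seed
            else:
                node = rng.choice(graph[node])
                visits[node] += 1
    visits.pop(start, None)               # the seed is not its own neighbor
    total = sum(visits.values())
    return [(n, c / total) for n, c in visits.most_common(top_k)]

# Toy bipartite pin-board graph as adjacency lists
graph = {
    "pin1": ["boardA", "boardB"], "pin2": ["boardA"],
    "pin3": ["boardB", "boardC"],
    "boardA": ["pin1", "pin2"], "boardB": ["pin1", "pin3"],
    "boardC": ["pin3"],
}
neighbors = important_neighbors(graph, "pin1")
```

Sampling a fixed number of walk-weighted neighbors per node is what turns the O(all neighbors) aggregation into a constant-cost operation.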
The graph structure:
Pinterest's graph is bipartite:
- Pins (images)
- Boards (collections of pins)
Two pins are "connected" if they appear on the same board. PinSage learns embeddings where pins saved to similar boards end up close in embedding space—even if they look visually different.
Visual search: "More like this"
Pinterest's visual search lets users:
- Tap on any object in an image
- See visually similar pins
This requires:
- Object detection (identify what the user tapped)
- Feature extraction (embed the detected object)
- ANN search (find similar embeddings)
The system processes millions of visual searches per day.
Diversification: Avoiding the "same image" problem
Visual similarity can be too precise—showing 20 nearly identical images is useless. Pinterest uses Maximal Marginal Relevance (MMR) to balance relevance and diversity:

MMR = argmax_{d ∈ C \ S} [ λ · sim(d, q) − (1 − λ) · max_{s ∈ S} sim(d, s) ]

Where:
- C = candidate pins
- S = already selected pins
- q = query/seed pin
- λ = relevance/diversity tradeoff
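Greedy MMR selection takes only a few lines; the sketch below works on precomputed similarity scores (names are illustrative):

```python
import numpy as np

def mmr_select(sim_to_query, sim_matrix, k, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * sim(d, q) - (1 - lam) * max over already-selected sim(d, s)."""
    n = len(sim_to_query)
    selected, remaining = [], set(range(n))
    while remaining and len(selected) < k:
        def mmr_score(d):
            redundancy = max((sim_matrix[d][s] for s in selected), default=0.0)
            return lam * sim_to_query[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Pins 0 and 1 are near-duplicates; MMR should not rank them back-to-back
sim_q = np.array([0.95, 0.94, 0.80])
sim_m = np.array([[1.0, 0.99, 0.1],
                  [0.99, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
order = mmr_select(sim_q, sim_m, k=3)
```

With λ = 0.7 the toy example returns [0, 2, 1]: after picking pin 0, its near-duplicate pin 1 is penalized and the visually different pin 2 jumps ahead.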
class PinterestVisualRecSys:
"""
Pinterest's visual recommendation system.
Key insight: In an image platform, visual similarity
is often more important than collaborative signals.
"""
def __init__(self):
# Visual feature extractor (pre-trained CNN)
self.visual_encoder = torchvision.models.resnet50(pretrained=True)
self.visual_encoder = nn.Sequential(*list(self.visual_encoder.children())[:-1])
# Graph neural network for pin-board relationships
self.pinsage = PinSage(
embedding_dim=256,
num_layers=3,
)
def get_pin_embedding(self, pin_id: str, pin_image: np.ndarray) -> np.ndarray:
"""
Combine visual and graph embeddings.
"""
# Visual features from image
with torch.no_grad():
image_tensor = self.preprocess(pin_image)
visual_emb = self.visual_encoder(image_tensor).squeeze()
# Graph features from pin-board-user graph
graph_emb = self.pinsage.get_embedding(pin_id)
# Concatenate
return np.concatenate([visual_emb, graph_emb])
def related_pins(self, pin_id: str, pin_image: np.ndarray, n: int = 20) -> list:
"""
Find visually and contextually similar pins.
"""
query_emb = self.get_pin_embedding(pin_id, pin_image)
# ANN search
similar_pins = self.ann_index.search(query_emb, k=n * 2)
# Diversify (don't show 20 nearly identical images)
diversified = self.maximal_marginal_relevance(similar_pins, n)
return diversified
class PinSage(nn.Module):
"""
PinSage: Graph Neural Network for Pinterest.
Paper: https://arxiv.org/abs/1806.01973
Key innovation: Random walk sampling for scalable GNN.
"""
def __init__(self, embedding_dim: int, num_layers: int):
super().__init__()
self.layers = nn.ModuleList([
PinSageConv(embedding_dim)
for _ in range(num_layers)
])
def forward(self, pin_features, neighbor_features, neighbor_weights):
"""
Aggregate neighbor information.
Unlike standard GNN: Uses importance-sampled neighbors
from random walks, not all neighbors.
"""
h = pin_features
for layer in self.layers:
# Aggregate from sampled neighbors
neighbor_agg = self.aggregate(neighbor_features, neighbor_weights)
# Combine with self
h = layer(h, neighbor_agg)
return h
Pinterest lessons:
- Visual features from CNNs are first-class signals
- PinSage enabled GNNs at billion scale via sampling
- Diversification is crucial (avoid "more of the same")
- Board context provides implicit interest clustering
ByteDance/TikTok: Real-Time Learning
TikTok's recommendation system is legendary—often cited as the most engaging recommendation algorithm ever built. A new user gets eerily personalized content within minutes. The secret? Real-time learning that updates models as users interact, rather than waiting for daily batch training.
Why real-time learning matters for TikTok:
TikTok's content cycle is extremely fast:
- Videos go viral in hours, not days
- Trends emerge and die within a single day
- User preferences shift based on current mood
- New creators need exposure immediately
A model trained yesterday is already stale. Traditional batch training (daily/weekly) can't keep up.
The speed advantage:
Traditional Recommendation Pipeline:
───────────────────────────────────────────────────────────────────────────
Day 1: User likes cat videos
Day 2: Batch training incorporates yesterday's data
Day 3: Model deployed with updated weights
Day 4: User finally sees more cat videos
→ 3-day lag between signal and adaptation
TikTok's Real-Time Pipeline:
───────────────────────────────────────────────────────────────────────────
Second 0: User likes cat video
Second 1: Event streamed to Kafka
Second 2: Online model update
Second 60: Model synced to serving
Second 61: User sees more cat videos
→ ~1 minute lag
The Monolith architecture (2022 paper):
ByteDance published their system in "Monolith: Real Time Recommendation System With Collisionless Embedding Table". Key innovations:
- Collisionless embedding tables: Traditional embedding tables use hashing (item_id % table_size), which causes collisions. Monolith uses expressive IDs that grow dynamically—no collisions, better personalization.
- Online training from streaming data: Events (watches, likes, shares) stream to Kafka. Training workers consume events and update model parameters in real-time.
- Parameter server architecture: Model parameters live in a distributed parameter server. Training pushes gradients, serving pulls latest weights.
- Time-window features: Features like "user engagement in last 5 minutes" are computed on streaming data, not batch.
The For You Page (FYP) algorithm:
TikTok's FYP combines:
- Candidate generation: Multiple retrievers (embedding similarity, trending, following, etc.) generate ~1000 candidates
- Ranking model: A multi-task neural network predicts:
  - Watch time probability
  - Like probability
  - Share probability
  - Comment probability
  - Follow probability
- Exploration: New content gets "exploration traffic" to gather engagement signals. Low-engagement content gets filtered quickly.
- Business rules: Creator fairness, content diversity, safety filters
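A common way to collapse several task heads into one ranking score is a weighted combination of their predicted probabilities; the function and weights below are purely illustrative, not TikTok's actual formula:

```python
def fyp_rank_score(preds, weights=None):
    """Weighted sum over multi-task head outputs. Rarer, higher-intent
    actions (share, follow) typically get larger weights than the
    easy ones; these numbers are made up for illustration."""
    weights = weights or {"watch": 1.0, "like": 0.5, "share": 0.8,
                          "comment": 0.4, "follow": 1.2}
    return sum(weights[h] * preds[h] for h in weights)

video_a = {"watch": 0.9, "like": 0.2, "share": 0.05, "comment": 0.1, "follow": 0.01}
video_b = {"watch": 0.4, "like": 0.1, "share": 0.02, "comment": 0.05, "follow": 0.0}
```

Here video_a, with stronger predicted engagement on every head, outranks video_b.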
Why TikTok "gets you" so fast:
The real-time system enables extremely fast adaptation:
- Swipe patterns: Which videos you watch 5+ seconds vs. skip immediately
- Engagement signals: Likes, comments, shares, follows
- Implicit signals: Rewatch, slow-scroll, tap to unmute
Even 5-10 interactions give the model strong signal about your current interests.
The addiction machine critique:
TikTok's effectiveness has drawn criticism—the algorithm is so good at engaging users that it can be addictive. The same real-time learning that enables personalization also enables behavioral manipulation. This is a real ethical consideration for anyone building recommendation systems.
```python
class ByteDanceRealTimeLearning:
    """
    ByteDance's approach: update models in real-time as users interact.

    Traditional: train daily/weekly -> deploy
    ByteDance:   train continuously -> sync every minute
    """
    def __init__(self):
        # Streaming data pipeline
        self.kafka_consumer = KafkaConsumer('user_actions')
        # Online training
        self.model = RealTimeModel()
        self.parameter_server = ParameterServer()
        # Feature store (low-latency)
        self.feature_store = RedisFeatureStore()

    def online_training_loop(self):
        """Continuous training from streaming events."""
        for event in self.kafka_consumer:
            # Event: {user_id, video_id, action, timestamp}
            # Create training example
            user_features = self.feature_store.get(event['user_id'])
            video_features = self.feature_store.get(event['video_id'])
            label = 1 if event['action'] == 'watch_complete' else 0

            # Online update
            loss = self.model.train_step(user_features, video_features, label)

            # Sync to parameter server (batched)
            if self.should_sync():
                self.parameter_server.push_gradients(self.model.get_gradients())

    def serve(self, user_id: str, candidate_videos: list) -> list:
        """Serve with the latest model parameters."""
        # Pull latest parameters (sync every ~1 minute)
        self.model.load_from_parameter_server(self.parameter_server)

        # Score candidates
        scores = self.model.predict(user_id, candidate_videos)
        return sorted(zip(candidate_videos, scores), key=lambda x: x[1], reverse=True)
```
ByteDance/TikTok lessons:
- Real-time learning captures trending content immediately
- Collision-less embeddings handle billion-scale item catalogs
- Watch time prediction > click prediction
- Short video = fast feedback loop = faster model iteration
LinkedIn: Professional Graph Structure
LinkedIn's recommendation challenges are fundamentally different from entertainment platforms. On Netflix or TikTok, the goal is engagement. On LinkedIn, recommendations must be professionally relevant—a bad connection recommendation isn't just annoying, it's embarrassing. "Do I know this person? Should I know them? Will they think I'm spam?"
What makes professional recommendations different:
┌─────────────────────────────────────────────────────────────────────────┐
│ ENTERTAINMENT vs. PROFESSIONAL RECOMMENDATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ENTERTAINMENT (Netflix, Spotify, TikTok) │
│ ───────────────────────────────────────── │
│ Goal: Maximize engagement │
│ Mistake cost: User skips, no big deal │
│ Signal: Implicit (watch time, skips) │
│ Graph: Weak (users don't know each other) │
│ │
│ PROFESSIONAL (LinkedIn) │
│ ──────────────────────── │
│ Goal: Maximize professional value │
│ Mistake cost: Embarrassment, spam perception │
│ Signal: Explicit (connections are deliberate) │
│ Graph: Strong (users KNOW each other in real life) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The professional graph is gold:
LinkedIn's most valuable asset is the connection graph. It encodes real-world professional relationships:
- Same company: Colleagues who work together
- Same school: Alumni who share educational background
- Same industry: Professionals in the same field
- Career trajectory: Job transitions that predict future moves
This graph structure is far more predictive than any content-based signal.
People You May Know (PYMK):
PYMK is LinkedIn's most important recommendation surface—it drives network growth. The algorithm combines multiple signals:
- Mutual connections: The single most predictive feature. If you and X share 5+ mutual connections, you likely know each other.
- Shared affiliations:
  - Same current company (colleagues)
  - Same past company (former colleagues)
  - Same school (classmates, alumni)
- Profile similarity:
  - Same job title/function
  - Same industry
  - Same location
- Graph structure:
  - 2nd degree connections (friends of friends)
  - Clustering coefficient (are your friends friends with each other?)
The "triangle closing" pattern:
The strongest PYMK signal is closing triangles. If A→B and A→C, then B→C is likely:
```
      A
     / \
    B   C
     \?/
```

A knows B (connection)
A knows C (connection)
B probably knows C (recommend!)
This simple heuristic drives a huge portion of PYMK recommendations.
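A minimal sketch of triangle closing via common-neighbor counting—an illustration of the heuristic itself, not LinkedIn's production code:

```python
def triangle_closing_candidates(graph: dict, user: str) -> dict:
    """Score 2nd-degree candidates by how many triangles connecting
    them would close (i.e., mutual-connection count).

    graph maps each user to their set of direct connections.
    """
    friends = graph.get(user, set())
    counts = {}
    for friend in friends:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in friends:
                # Each path user -> friend -> fof is one open triangle
                counts[fof] = counts.get(fof, 0) + 1
    return counts  # candidate -> number of mutual connections

graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}
print(triangle_closing_candidates(graph, "A"))  # {'D': 2}
```

D is reachable from A through both B and C, so recommending the A–D connection would close two triangles at once—exactly the pattern the diagram above describes.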
GLMix: Generalized Linear Mixed Models
LinkedIn developed GLMix (2015) to personalize at scale. The insight: some effects are global (industry trends), some are per-user (individual preferences), some are per-item (company popularity).
GLMix combines:
- Fixed effects: Global model parameters (same for all users)
- Random effects: Per-user and per-item parameters
This allows personalization without overfitting on sparse data.
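A toy sketch of the GLMix decomposition. For simplicity it assumes the fixed and random effects share one feature vector; the real model applies each effect to its own feature set, and the numbers below are invented for illustration:

```python
import numpy as np

def glmix_score(x, global_w, user_w, item_w):
    """GLMix-style score: fixed (global) coefficients plus per-user
    and per-item random-effect coefficients, passed through a sigmoid."""
    logit = x @ (global_w + user_w + item_w)
    return 1.0 / (1.0 + np.exp(-logit))

x = np.array([1.0, 0.5])          # e.g., [bias, skill-match strength]
global_w = np.array([0.2, 1.0])   # fixed effect: everyone values skill match
user_w = np.array([0.1, 0.3])     # this user values skill match more than average
item_w = np.array([-0.4, 0.0])    # this job draws fewer clicks than average

p = glmix_score(x, global_w, user_w, item_w)
print(round(p, 3))  # 0.634
```

The regularization story is the key: random effects for a user with little data shrink toward zero, so sparse users fall back to the global model instead of overfitting.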
Job recommendations: A different problem
Job recommendations have unique constraints:
- Jobs expire: Unlike songs, jobs disappear when filled
- Location matters: Users won't relocate for every job
- Seniority matching: Entry-level vs. senior roles
- Career trajectory: Recommend realistic next steps
LinkedIn's job recommendation combines:
- Skills matching (job requirements vs. user skills)
- Career path modeling (what jobs do similar people take next?)
- Company affinity (has user engaged with this company?)
- Freshness (prefer recently posted jobs)
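The four signals above can be sketched as one scoring function. All feature names and weights here are assumptions for illustration, not LinkedIn's actual model:

```python
from datetime import datetime

def job_score(user: dict, job: dict, now: datetime) -> float:
    """Illustrative job-recommendation score combining skills match,
    seniority fit, company affinity, and freshness."""
    # Skills: fraction of required skills the user has
    required = job["required_skills"]
    skills_match = len(user["skills"] & required) / max(len(required), 1)

    # Seniority: heavily penalize mismatched levels
    seniority_fit = 1.0 if user["seniority"] == job["seniority"] else 0.3

    # Affinity: has the user engaged with this company before?
    company_affinity = 1.0 if job["company"] in user["engaged_companies"] else 0.0

    # Freshness: linear decay to zero over 30 days (jobs expire)
    days_old = (now - job["posted"]).days
    freshness = max(0.0, 1.0 - days_old / 30.0)

    return (0.5 * skills_match + 0.2 * seniority_fit
            + 0.1 * company_affinity + 0.2 * freshness)

user = {"skills": {"python", "ml"}, "seniority": "senior",
        "engaged_companies": {"Acme"}}
job = {"required_skills": {"python", "ml", "sql"}, "seniority": "senior",
       "company": "Acme", "posted": datetime(2025, 1, 1)}

print(round(job_score(user, job, datetime(2025, 1, 6)), 3))  # 0.8
```

Note how freshness is a first-class term rather than a tiebreaker—unlike songs or movies, a stale job posting has near-zero value.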
```python
class LinkedInPYMK:
    """
    LinkedIn's People You May Know (PYMK).

    Key insight: professional connections follow patterns
    (same company, same school, same industry).
    """
    def __init__(self):
        self.graph = ProfessionalGraph()
        self.embedding_model = GraphEmbedding()

    def people_you_may_know(self, user_id: str, n: int = 20) -> list:
        """Recommend connections based on graph structure."""
        candidates = []

        # 2nd degree connections (friends of friends)
        friends = self.graph.get_connections(user_id)
        for friend in friends:
            for fof in self.graph.get_connections(friend):
                if fof != user_id and fof not in friends:
                    candidates.append(fof)

        # Score by multiple signals
        scored = []
        for candidate in set(candidates):
            score = self.compute_pymk_score(user_id, candidate)
            scored.append((candidate, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [c for c, _ in scored[:n]]

    def compute_pymk_score(self, user_id: str, candidate_id: str) -> float:
        """Multi-factor scoring for PYMK."""
        score = 0.0

        # Mutual connections (strongest signal)
        mutual = self.graph.mutual_connections(user_id, candidate_id)
        score += len(mutual) * 10

        # Same current company
        if self.same_company(user_id, candidate_id):
            score += 50

        # Same school
        if self.same_school(user_id, candidate_id):
            score += 30

        # Same industry
        if self.same_industry(user_id, candidate_id):
            score += 20

        # Profile similarity (skills, experience)
        score += self.profile_similarity(user_id, candidate_id) * 10

        return score
```
LinkedIn lessons:
- Graph structure is a powerful signal for professional networks
- Multi-factor models (GLMix) handle heterogeneous features
- Mutual connections are the strongest predictor
- Different recommendation contexts need different models (jobs vs. people vs. content)
Part VIII: Evaluation Metrics
Understanding how to evaluate recommendation systems is as important as building them. Different metrics capture different aspects of quality, and the right choice depends on your use case.
Ranking Metrics
Most recommendation systems produce ranked lists. These metrics evaluate ranking quality:
Precision@K: What fraction of the top-K recommendations were relevant?
Recall@K: What fraction of all relevant items appeared in top K?
Hit Rate@K (HR@K): Did at least one relevant item appear in top K? Binary version of Recall.
NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality, giving higher scores when relevant items appear earlier. DCG@K = Σ_{i=1..K} rel_i / log2(i + 1), and NDCG@K = DCG@K / IDCG@K, where IDCG is the DCG of the ideal (perfect) ranking.
MRR (Mean Reciprocal Rank): Average of reciprocal ranks of first relevant item.
MAP (Mean Average Precision): Average of precision at each relevant item's position.
```python
import numpy as np
from typing import List, Set

def precision_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Precision@K: fraction of top-K that are relevant."""
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / k

def recall_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Recall@K: fraction of relevant items in top K."""
    if len(relevant) == 0:
        return 0.0
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """NDCG@K: normalized discounted cumulative gain."""
    recommended_k = recommended[:k]

    # DCG
    dcg = 0.0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because positions are 1-indexed

    # IDCG (ideal: all relevant items at the top)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0

def mrr(recommended_lists: List[List[int]], relevant_sets: List[Set[int]]) -> float:
    """Mean Reciprocal Rank across users."""
    reciprocal_ranks = []
    for recommended, relevant in zip(recommended_lists, relevant_sets):
        for rank, item in enumerate(recommended, 1):
            if item in relevant:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)
    return np.mean(reciprocal_ranks)

def ap_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Average Precision@K, normalized by min(|relevant|, k).
    MAP@K is this value averaged across users."""
    recommended_k = recommended[:k]
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    denom = min(len(relevant), k)
    return score / denom if denom > 0 else 0.0

# Example usage
recommended = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # Top 10 recommendations
relevant = {3, 7, 15, 22, 25}                      # Ground-truth relevant items

print(f"Precision@10: {precision_at_k(recommended, relevant, 10):.3f}")  # 0.300
print(f"Recall@10: {recall_at_k(recommended, relevant, 10):.3f}")        # 0.600
print(f"NDCG@10: {ndcg_at_k(recommended, relevant, 10):.3f}")            # 0.467
print(f"AP@10: {ap_at_k(recommended, relevant, 10):.3f}")                # 0.275
```
Beyond Accuracy: Diversity and Coverage
Accuracy metrics don't capture everything. A system that recommends only popular items might have high precision but provide poor user experience.
Coverage: What fraction of items ever get recommended?
Diversity (Intra-List): How different are items within a recommendation list?
Novelty: Are we recommending items users haven't seen before?
Serendipity: Are we recommending surprising but relevant items?
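Coverage and intra-list diversity are straightforward to compute. A sketch, defining diversity as one minus the mean pairwise cosine similarity of item embeddings (one common choice among several):

```python
import numpy as np

def catalog_coverage(all_recommendations: list, catalog_size: int) -> float:
    """Fraction of the catalog appearing in at least one rec list."""
    recommended = set()
    for rec_list in all_recommendations:
        recommended.update(rec_list)
    return len(recommended) / catalog_size

def intra_list_diversity(item_embeddings) -> float:
    """1 - mean pairwise cosine similarity within one rec list."""
    E = np.asarray(item_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarity matrix
    n = len(E)
    # Mean of off-diagonal entries (exclude each item's self-similarity of 1)
    mean_sim = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - mean_sim

recs = [[1, 2, 3], [2, 3, 4], [1, 5, 9]]
print(catalog_coverage(recs, catalog_size=10))  # 0.6 -> items {1,2,3,4,5,9}

# Orthogonal embeddings -> maximally diverse list
print(intra_list_diversity([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

Tracking these alongside NDCG exposes the popularity-bias failure mode: a model can gain precision while coverage collapses onto a few head items.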
Online vs Offline Evaluation
Offline evaluation uses historical data to measure predictive accuracy. It's fast and cheap but has limitations:
- Only measures prediction of past behavior
- Can't capture exploration value
- Misses business impact
Online evaluation (A/B testing) measures actual user behavior:
- Click-through rate (CTR)
- Conversion rate
- Watch time / session duration
- Return visits
- Revenue per user
The offline-online gap: Models that win offline often don't win online. Selection bias in historical data, feedback loops, and missing features all contribute.
Best practice: Use offline metrics for rapid iteration during development, but always validate with A/B tests before production deployment.
Part IX: Practical Comparison
When to Use What
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Item-based CF | Small catalogs, explainability | Simple, interpretable | Doesn't scale |
| Matrix Factorization | General-purpose CF at scale | Efficient, proven | Linear only; cold start needs content features |
| BPR | Implicit feedback ranking | Optimizes ranking directly | Needs negative sampling |
| NCF | When non-linearity helps | Flexible | Not always better than MF |
| Wide & Deep | Apps with rich features | Memorization + generalization | Needs feature engineering |
| DeepFM | CTR prediction | Automatic feature crosses | Compute intensive |
| Two-Tower | Large-scale retrieval | Fast serving (ANN) | Decoupled = less expressive |
| GRU4Rec | Session-based | Captures order | Slow, short-term focus |
| Transformers | Long sequences, SOTA | Best accuracy | Expensive |
Production Recommendations
- Start simple: Matrix factorization or item-based CF is often 80% of the value
- Two-tower for scale: When you have millions of items, you need ANN retrieval
- Add complexity gradually: NCF → DeepFM → Transformers, with A/B tests at each step
- Consider latency: More complex ≠ better if you can't serve it fast enough
Conclusion
Recommendation systems have evolved dramatically over three decades:
- 1990s-2000s: Collaborative filtering and content-based methods
- 2006-2015: Matrix factorization and the Netflix Prize revolution
- 2016-2018: Deep learning enters with NCF, Wide & Deep, and sequential models
- 2018+: Transformers take over (see next post)
Each era built on the last. Matrix factorization introduced embeddings. Deep learning added non-linearity. Transformers enabled direct modeling of long-range dependencies.
Understanding this evolution isn't just historical—it's practical. Many production systems still use "older" techniques where they excel. The best recommendation engineers know when to use matrix factorization, when to add deep learning, and when to go all-in on transformers.