Recommendation Systems: From Collaborative Filtering to Deep Learning
In-depth journey through recommendation system architectures. From the Netflix Prize and matrix factorization to neural collaborative filtering and two-tower models—understand the foundations before the transformer revolution.
The Evolution of Recommendation Systems
Recommendation systems are among the most impactful applications of machine learning. They power what you watch on Netflix, what you buy on Amazon, what you listen to on Spotify, and what you scroll on TikTok. Understanding their evolution—from simple heuristics to trillion-parameter transformers—is essential for any ML engineer.
This post covers the foundational architectures that came before the transformer revolution. These methods remain crucial: many production systems still use them, and modern approaches build upon their insights.
┌─────────────────────────────────────────────────────────────────────────┐
│ RECOMMENDATION SYSTEMS TIMELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1990s CONTENT-BASED & COLLABORATIVE FILTERING │
│ ───── Nearest neighbors, TF-IDF, item-item similarity │
│ │
│ 2006-2009 MATRIX FACTORIZATION ERA (Netflix Prize) │
│ ───────── SVD, ALS, BPR - latent factors revolutionize CF │
│ │
│ 2016 DEEP LEARNING ENTERS RECSYS │
│ ───── GRU4Rec, Wide & Deep, Neural Collaborative Filtering │
│ │
│ 2017-2018 TWO-TOWER MODELS & FEATURE INTERACTIONS │
│ ───────── DSSM, YouTube DNN, DeepFM, DCN │
│ │
│ 2018+ ATTENTION & TRANSFORMERS │
│ ───── SASRec, BERT4Rec, HSTU (see next post) │
│ │
│ 2024+ GENERATIVE & LLM-POWERED │
│ ───── TIGER, OneRec, LLM augmentation (see GenAI post) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part I: Classical Recommendation Methods
The Three Paradigms
Before diving into algorithms, understand the three fundamental approaches:
┌─────────────────────────────────────────────────────────────────────────┐
│ THREE PARADIGMS OF RECOMMENDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CONTENT-BASED FILTERING │
│ ───────────────────────────── │
│ │
│ "Recommend items similar to what you've liked before" │
│ │
│ User liked: Action movies with Tom Cruise │
│ Recommend: More action movies, more Tom Cruise movies │
│ │
│ Pros: No cold-start for items, explainable │
│ Cons: Limited discovery, requires item features │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. COLLABORATIVE FILTERING │
│ ─────────────────────────────── │
│ │
│ "Users who liked X also liked Y" │
│ │
│ User A liked: [Inception, Interstellar, The Matrix] │
│ User B liked: [Inception, Interstellar, Arrival] │
│ → Recommend Arrival to User A, The Matrix to User B │
│ │
│ Pros: Discovers unexpected items, no item features needed │
│ Cons: Cold-start problem, popularity bias │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. HYBRID METHODS │
│ ────────────────── │
│ │
│ Combine content and collaborative signals │
│ │
│ Examples: │
│ • Use content for cold items, CF for warm items │
│ • Weighted ensemble of both approaches │
│ • Joint models (modern deep learning) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
User-Based Collaborative Filtering
User-based collaborative filtering was the original recommender algorithm, pioneered at MIT Media Lab and GroupLens in the mid-1990s. The core intuition is beautifully simple: if two users agreed in the past, they'll probably agree in the future.
How it works step-by-step:
1. Build the user-item matrix: Rows are users, columns are items, values are ratings (or 0 for no interaction). A user who rated 100 movies has 100 non-zero entries in their row.
2. Compute user-user similarity: For each pair of users, calculate how similar their rating patterns are. The most common metric is cosine similarity—it measures the angle between two rating vectors, ignoring magnitude (a user who rates everything 5 stars vs 3 stars can still be similar in preferences).
3. Find k-nearest neighbors: For a target user, find the k users with highest similarity scores. These are the user's "taste neighbors."
4. Predict ratings: To predict what the target user would rate an unseen item, take a weighted average of how the neighbors rated it. Users more similar to the target get higher weight.
The similarity formula:

$$\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui}\, r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2}\, \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}$$

Where $I_{uv}$ is the set of items both users rated, and $r_{ui}$ is user $u$'s rating for item $i$.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_k(u)} \text{sim}(u, v)\, r_{vi}}{\sum_{v \in N_k(u)} \text{sim}(u, v)}$$

This is a weighted average: neighbors who are more similar contribute more to the prediction.
Why it often fails in practice:
- Sparsity: With millions of users and items, the user-item matrix is 99%+ zeros. Finding users who overlap on rated items is hard.
- Scalability: Computing all pairwise user similarities is O(n² × m) where n is users and m is items. This doesn't scale.
- Cold start: New users have no ratings, so they have no neighbors.
Despite these limitations, understanding user-based CF is essential—it introduces the fundamental concepts that all modern recommenders build upon.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class UserBasedCF:
    """
    User-based collaborative filtering.
    Find similar users → recommend their items.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.user_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
                     0 = no interaction, otherwise = rating/score
        """
        self.user_item_matrix = ratings

        # Compute user-user similarity:
        # cosine similarity on rating vectors
        self.user_similarity = cosine_similarity(ratings)

        # Zero out self-similarity
        np.fill_diagonal(self.user_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""
        # Find users who rated this item
        item_ratings = self.user_item_matrix[:, item_id]
        users_who_rated = np.where(item_ratings > 0)[0]

        if len(users_who_rated) == 0:
            return 0  # No data

        # Get similarities to these users
        similarities = self.user_similarity[user_id, users_who_rated]

        # Take top-k
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_users = users_who_rated[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average of their ratings
        if top_k_sims.sum() == 0:
            return item_ratings[top_k_users].mean()

        weighted_sum = (top_k_sims * item_ratings[top_k_users]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations for user."""
        # Items user hasn't interacted with
        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        # Predict scores for unseen items
        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]

        # Sort by score
        scores.sort(key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in scores[:n]]
```
Problems with User-Based CF:
- Scalability: Computing user-user similarity is O(n²) in users
- Sparsity: Most user pairs have no common items
- Volatility: User preferences change; similarity matrix becomes stale
Item-Based Collaborative Filtering
In 2003, Amazon published a paper that revolutionized e-commerce recommendations. Their key insight: instead of finding similar users, find similar items. This seemingly small change solves most of user-based CF's problems.
Why item-based is better than user-based:
1. Stability: Item relationships are more stable than user relationships. "The Matrix" and "Inception" will always be similar sci-fi movies. But a user's taste can change—they might suddenly start watching documentaries.
2. Scalability: You have fewer items than users (typically 10-100x fewer). Computing item-item similarity is O(m²) where m is items, vs O(n²) for users. More importantly, item similarities can be precomputed offline.
3. Sparsity is less severe: Any two popular items have many users in common (who rated both). But two random users might have rated zero items in common.
4. Explainability: "We recommend X because you bought Y, and customers who bought Y also bought X" is intuitive. User-based explanations ("we recommend X because user #847592 liked it") are creepy.
How it works:
1. Build the item-item similarity matrix: For each pair of items, compute similarity based on which users rated them. Two items are similar if the same users tend to rate them similarly.
2. For a user's prediction: Look at items the user has already rated. For each candidate item, find its similarity to the user's rated items. The prediction is a weighted sum.
The item similarity formula:

$$\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\, \sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}$$

Where $U_{ij}$ is the set of users who rated both items.

The prediction formula:

$$\hat{r}_{ui} = \frac{\sum_{j \in N_k(i; u)} \text{sim}(i, j)\, r_{uj}}{\sum_{j \in N_k(i; u)} \text{sim}(i, j)}$$

For item $i$, look at the $k$ most similar items that the user has rated, and take a weighted average of those ratings.
Production considerations:
- Precompute offline: Item similarities rarely change. Compute once daily/weekly.
- Truncate to top-k: Don't store all m² similarities. Keep only top-100 neighbors per item.
- Incremental updates: When a new item gets enough ratings, compute its similarities without recomputing everything.
This is still Amazon's core algorithm today, enhanced with deep learning features.
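The "truncate to top-k" trick above can be sketched in a few lines. `truncate_similarity` is a hypothetical helper name, and `k=100` mirrors the rule of thumb mentioned in the list:

```python
import numpy as np

def truncate_similarity(sim: np.ndarray, k: int = 100) -> np.ndarray:
    """Keep only the top-k neighbors per item; zero out the rest.

    Illustrative sketch: instead of storing all m^2 similarities,
    keep a (sparse) top-k neighbor list per item.
    """
    truncated = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        # Indices of the k largest similarities for item i
        top_k = np.argsort(sim[i])[-k:]
        truncated[i, top_k] = sim[i, top_k]
    return truncated
```

In production the result would typically be stored as an inverted list (item → top-k neighbors with scores) rather than a dense matrix.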
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class ItemBasedCF:
    """
    Item-based collaborative filtering (Amazon's approach).
    Find similar items → recommend based on user's history.

    Key insight: Item similarities are more stable than user similarities.
    """

    def __init__(self, k_neighbors: int = 20):
        self.k = k_neighbors
        self.user_item_matrix = None
        self.item_similarity = None

    def fit(self, ratings: np.ndarray):
        """
        Args:
            ratings: (num_users, num_items) matrix
        """
        self.user_item_matrix = ratings

        # Compute item-item similarity.
        # Transpose: each row is an item's rating profile across users
        self.item_similarity = cosine_similarity(ratings.T)

        # Zero out self-similarity
        np.fill_diagonal(self.item_similarity, 0)

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating for user-item pair."""
        # Find items the user has rated
        user_ratings = self.user_item_matrix[user_id]
        rated_items = np.where(user_ratings > 0)[0]

        if len(rated_items) == 0:
            return 0

        # Similarity of target item to user's rated items
        similarities = self.item_similarity[item_id, rated_items]

        # Take top-k similar items
        top_k_idx = np.argsort(similarities)[-self.k:]
        top_k_items = rated_items[top_k_idx]
        top_k_sims = similarities[top_k_idx]

        # Weighted average
        if top_k_sims.sum() == 0:
            return user_ratings[top_k_items].mean()

        weighted_sum = (top_k_sims * user_ratings[top_k_items]).sum()
        return weighted_sum / top_k_sims.sum()

    def recommend(self, user_id: int, n: int = 10) -> list[int]:
        """Get top-n recommendations."""
        user_ratings = self.user_item_matrix[user_id]
        unseen_items = np.where(user_ratings == 0)[0]

        scores = [
            (item_id, self.predict(user_id, item_id))
            for item_id in unseen_items
        ]
        scores.sort(key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in scores[:n]]
```
Why Item-Based is Better:
- Stability: Items don't change; their similarity is stable
- Scalability: Pre-compute item similarities offline
- Explainability: "Because you liked X, and Y is similar to X"
Part II: Matrix Factorization (The Netflix Prize Era)
The $1 Million Challenge
In 2006, Netflix offered $1 million to anyone who could beat their recommendation algorithm by 10%. This sparked a revolution in collaborative filtering.
The key insight: Latent factors. Instead of computing explicit item-item or user-user similarity, learn low-dimensional representations that capture underlying preferences.
┌─────────────────────────────────────────────────────────────────────────┐
│ MATRIX FACTORIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: │
│ ───────── │
│ │
│ Rating Matrix R (num_users × num_items): │
│ │
│ Item1 Item2 Item3 Item4 ... │
│ User1 [ 5 ? 3 ? ...] │
│ User2 [ ? 4 ? 2 ...] │
│ User3 [ 4 ? 5 ? ...] │
│ ... │
│ │
│ Most entries are missing (? = unknown) │
│ Goal: Predict the missing entries │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: FACTORIZE R ≈ P × Q^T │
│ ───────────────────────────────── │
│ │
│ P: User matrix (num_users × k) │
│ Each row = user's latent preferences │
│ │
│ Q: Item matrix (num_items × k) │
│ Each row = item's latent attributes │
│ │
│ Prediction: r̂_ui = p_u · q_i (dot product) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTUITION: │
│ ─────────── │
│ │
│ k latent factors might represent: │
│ • Factor 1: How much the movie is action vs drama │
│ • Factor 2: How mainstream vs indie │
│ • Factor 3: How visually stunning vs dialogue-driven │
│ • ... │
│ │
│ User embedding captures: "I like action, mainstream, visual" │
│ Item embedding captures: "This movie is action, indie, visual" │
│ Dot product: How well do they match? │
│ │
└─────────────────────────────────────────────────────────────────────────┘
SVD and ALS
Matrix factorization is the technique that won the Netflix Prize. The idea is elegant: decompose the sparse user-item rating matrix into two smaller, dense matrices that capture latent factors.
Wait, what is SVD exactly?
First, let's clarify the terminology. Singular Value Decomposition (SVD) is a fundamental theorem from linear algebra:
Any matrix $R$ can be decomposed as:

$$R = U \Sigma V^T$$

Where:
- $U$ = orthonormal matrix (users × k) — left singular vectors
- $\Sigma$ = diagonal matrix (k × k) — singular values, in decreasing order
- $V^T$ = orthonormal matrix (k × items) — right singular vectors
Classical SVD gives the optimal low-rank approximation of any matrix (Eckart-Young theorem). If you keep only the top-k singular values, you get the closest rank-k matrix to the original.
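As a quick sanity check of the Eckart-Young property, here is a minimal NumPy sketch on a random dense matrix (nothing recommendation-specific; the sizes are arbitrary):

```python
import numpy as np

# A small dense matrix (classical SVD needs complete data)
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 4))

# Thin SVD: U (6x4), s (4,), Vt (4x4)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, R)  # exact reconstruction

# Rank-2 truncation: keep the top-2 singular values
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error of the best rank-k approximation
# equals the root-sum-square of the discarded singular values
err = np.linalg.norm(R - R_k)
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```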
The problem: Classical SVD doesn't work for recommendations
Netflix's rating matrix is 99.9% empty (users rate <1% of movies). Classical SVD:
- Requires a complete matrix (no missing values)
- Would treat missing ratings as zeros (wrong!)
- Is computationally expensive (O(min(m,n)²·max(m,n)) for m×n matrix)
Netflix Matrix Reality:
─────────────────────────────────────────────────────────────────────────
Users: 480,000
Movies: 18,000
Possible ratings: 8.6 billion
Actual ratings: 100 million (~1.2% filled)
If we treat missing = 0, SVD thinks users HATE 99% of movies.
That's obviously wrong—they just haven't seen them.
The recommendation "SVD" is actually something different
What the Netflix Prize community calls "SVD" is actually regularized matrix factorization optimized only on observed entries. It's inspired by SVD but doesn't compute the actual decomposition.
The key insight: Only minimize error on ratings that exist:

$$\min_{P, Q} \sum_{(u,i) \in \text{observed}} (r_{ui} - p_u \cdot q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right)$$

This is fundamentally different from classical SVD, which considers all entries.
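A toy NumPy sketch of that objective, with 0 marking missing entries (the shapes, values, and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5., 0., 3.],
              [0., 4., 0.],
              [4., 0., 5.]])   # 0 = missing, NOT "hated"
mask = R > 0                    # observed entries only

P = rng.normal(0, 0.1, (3, 2))  # user factors (3 users, k=2)
Q = rng.normal(0, 0.1, (3, 2))  # item factors (3 items, k=2)
lam = 0.02                      # regularization strength

pred = P @ Q.T
# Squared error summed ONLY over observed entries, plus L2 regularization
loss = ((R - pred)[mask] ** 2).sum() + lam * ((P ** 2).sum() + (Q ** 2).sum())
```

Gradient descent (or ALS) on this masked loss never touches the missing cells, which is exactly what classical SVD cannot do.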
The core idea:
Instead of storing explicit similarities, learn a low-dimensional representation for each user and item. A user is represented by a vector of k numbers (their "taste profile"), and an item by another k-vector (its "attribute profile"). The predicted rating is just the dot product of these vectors.
Why latent factors work:
Think of the k dimensions as hidden attributes the algorithm discovers. For movies:
- Dimension 1 might capture "how action-packed vs dramatic"
- Dimension 2 might capture "how mainstream vs indie"
- Dimension 3 might capture "how serious vs comedic"
- etc.
A user who loves action but hates comedy would have [high, ?, low, ...] in their vector. An action comedy would have [high, ?, high, ...]. Their dot product captures partial compatibility.
The magic: we never define what the dimensions mean. The algorithm discovers useful latent factors automatically from patterns in the data. When you examine learned factors, they often align with interpretable concepts—but sometimes capture subtle patterns humans wouldn't notice.
The math:
Given rating matrix $R$ (users × items), find matrices $P$ (users × k) and $Q$ (items × k) such that:

$$R \approx P Q^T$$

The prediction for user $u$ and item $i$ is:

$$\hat{r}_{ui} = p_u \cdot q_i$$
Adding biases (crucial for accuracy):

Raw dot products miss important patterns: some users rate everything high (generous raters), some items are universally popular (hit movies). The improved model adds biases:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u \cdot q_i$$

Where:
- $\mu$ = global average rating (e.g., 3.5 on Netflix)
- $b_u$ = user bias (how much user $u$ deviates from average)
- $b_i$ = item bias (how much item $i$ deviates from average)
- $p_u \cdot q_i$ = user-item interaction (personalization beyond biases)
Example decomposition:
Predicting Alice's rating for "Inception":
─────────────────────────────────────────────────────────────────────────
Global average μ = 3.5
Alice's bias b_alice = +0.3 (she rates slightly above average)
Inception's bias b_inception = +0.8 (it's a popular, well-rated movie)
Alice's latent vector p_alice = [0.8, -0.2, 0.5] (likes action, dislikes romance, likes complex plots)
Inception's latent vector q_inception = [0.9, -0.1, 0.7] (action, minimal romance, complex plot)
Dot product: 0.8×0.9 + (-0.2)×(-0.1) + 0.5×0.7 = 0.72 + 0.02 + 0.35 = 1.09
Prediction: 3.5 + 0.3 + 0.8 + 1.09 = 5.69 → clipped to 5.0 (max rating)
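The arithmetic above checks out in a few lines (same illustrative numbers as in the example):

```python
import numpy as np

mu, b_alice, b_inception = 3.5, 0.3, 0.8
p_alice = np.array([0.8, -0.2, 0.5])
q_inception = np.array([0.9, -0.1, 0.7])

interaction = float(p_alice @ q_inception)      # 0.72 + 0.02 + 0.35 = 1.09
raw = mu + b_alice + b_inception + interaction  # 5.69
pred = min(raw, 5.0)                            # clip to the rating scale
```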
Two training approaches:
1. SGD (Stochastic Gradient Descent):
Iterate over observed ratings one at a time, compute gradient, update parameters:
```
For each observed rating r_ui:
    error = r_ui - (μ + b_u + b_i + p_u · q_i)
    b_u ← b_u + η(error - λ·b_u)
    b_i ← b_i + η(error - λ·b_i)
    p_u ← p_u + η(error·q_i - λ·p_u)
    q_i ← q_i + η(error·p_u - λ·q_i)
```

Pros: Simple, works well, low memory.
Cons: Sequential (hard to parallelize), sensitive to learning rate.
2. ALS (Alternating Least Squares):
The key insight: If we fix $Q$, solving for $P$ is a least squares problem with a closed-form solution. Vice versa.

Repeat until convergence:

```
1. Fix Q, solve for each p_u: p_u = (Q^T Q + λI)^{-1} Q^T r_u
2. Fix P, solve for each q_i: q_i = (P^T P + λI)^{-1} P^T r_i
```

Pros: Embarrassingly parallel (each user/item is independent), guaranteed convergence, works great for implicit feedback.
Cons: More complex, requires a matrix inversion per user/item.
ALS became the standard for large-scale systems (Spark MLlib, etc.) because you can distribute user updates across thousands of machines.
The loss function:

$$\min_{P, Q, b} \sum_{(u,i) \in \text{observed}} (r_{ui} - \mu - b_u - b_i - p_u \cdot q_i)^2 + \lambda \left( b_u^2 + b_i^2 + \|p_u\|^2 + \|q_i\|^2 \right)$$

The regularization term $\lambda$ prevents overfitting to the sparse observed ratings—critical because we're learning millions of parameters from limited data.
How SVD++ Won the Netflix Prize
The Netflix Prize wasn't won by basic SVD—it required several crucial innovations developed primarily by Yehuda Koren and his collaborators at AT&T Labs. Understanding these reveals what matters in production recommendation systems.
The challenge:
On October 2, 2006, Netflix released 100,480,507 ratings from 480,189 users on 17,770 movies. The goal: beat their existing system (Cinematch, RMSE = 0.9525 on test set) by 10% to achieve RMSE ≤ 0.8572. The prize: $1 million. Within a week, a team had already beaten Cinematch.
Netflix Prize Timeline (verified from competition records):
─────────────────────────────────────────────────────────────────────────
Oct 2006 (start): Cinematch baseline = 0.9525 RMSE
Nov 2006 (1 month in): Best = ~0.9042 (~5% improvement)
Nov 2007 (Progress Prize #1): KorBell = 0.8712 (8.43% improvement)
→ $50,000 prize, 2000+ hours of work
2008 (Progress Prize #2): BellKor = 0.8616 (~9.5% improvement)
Jun 26, 2009: BellKor's Pragmatic Chaos = 0.8558 (10.05%)
Jul 25, 2009: "The Ensemble" matches at 0.8554 (10.09%)
Sep 21, 2009 (WINNER): BellKor's Pragmatic Chaos = 0.8567 (10.06%)
The final 1.5% improvement took 2+ years of work by thousands of teams!
The photo finish: On the final day, two teams crossed the 10% threshold within minutes of each other. "The Ensemble" actually achieved a slightly better RMSE (0.8554 vs 0.8558), but BellKor's Pragmatic Chaos had submitted 20 minutes earlier. Per competition rules, BellKor won.
Innovation 1: SVD++ (Implicit Feedback)
Published by Koren in "Factorization Meets the Neighborhood" (KDD 2008), SVD++ was the breakthrough that made BellKor a frontrunner.
Basic SVD only uses explicit ratings. But there's more signal: the mere fact that a user rated an item carries information, regardless of whether the rating was high or low. A user who rated 50 horror movies is signaling interest in horror.
The SVD++ prediction formula:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \Big( p_u + \lvert I_u \rvert^{-1/2} \sum_{j \in I_u} y_j \Big)$$

Breaking down each component:

| Symbol | Meaning | Intuition |
|---|---|---|
| $\mu$ | Global mean rating | Baseline (e.g., 3.5 stars) |
| $b_u$ | User bias | "Alice rates 0.3 stars above average" |
| $b_i$ | Item bias | "Inception is rated 0.8 stars above average" |
| $p_u$ | User latent factors | Explicit preferences from ratings given |
| $q_i$ | Item latent factors | Item characteristics |
| $y_j$ | Implicit item factors | What rating item $j$ reveals about user taste |
| $I_u$ | Items user rated | Set of all items rated (regardless of rating value) |
| $\lvert I_u \rvert^{-1/2}$ | Normalization factor | Keeps the implicit sum on the same scale for light and heavy raters |
Why implicit feedback helps:
The $\sum_{j \in I_u} y_j$ term captures "what kind of user rates these specific items." Even without knowing the rating values, we learn something:
- A user who rated many Marvel movies → probably likes superhero films
- A user who rated many foreign films → probably enjoys non-Hollywood cinema
- A user who rated both → has diverse taste
The $y_j$ vectors learn these implicit signals automatically during training.
Performance: SVD++ alone achieved RMSE of 0.8911, while standard SVD achieved 0.8914. The combination with other techniques made the real difference.
Innovation 2: Temporal Dynamics (timeSVD++)
Published in "Collaborative Filtering with Temporal Dynamics" (KDD 2009), this paper showed that user preferences and item perceptions drift over time.
Koren discovered striking patterns in the Netflix data:
- Rating scale shift: In early 2004, mean ratings jumped from ~3.4 to ~3.6 stars (Netflix changed their UI)
- Movie aging effect: Older movies consistently receive higher ratings than newer ones
- User preference drift: Individual users' tastes evolve over months/years
The timeSVD++ model makes parameters time-dependent:
Time-dependent parameters:
─────────────────────────────────────────────────────────────────────────
User bias over time: b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_u = static user bias (doesn't change)
α_u = user-specific slope (some users' ratings drift faster)
dev_u(t) = sign(t - t̄_u) · |t - t̄_u|^β (deviation from user's mean time)
b_{u,Bin(t)} = bias for time period (weekly/monthly buckets)
Item bias over time: b_i(t) = b_i + b_{i,Bin(t)}
Movies don't have inherent drift, but perception changes over time
User factors over time: p_u(t) = p_u + α_u · dev_u(t)
User preferences evolve (someone who liked comedies in 2005 might
prefer dramas in 2008)
Dramatic impact: A timeSVD++ model with only 10 latent factors outperformed standard SVD with 200 factors. With 200 factors, timeSVD++ achieved RMSE = 0.8762—a massive improvement.
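A minimal sketch of the time-dependent user bias, assuming weekly bins and $\beta = 0.4$; the `user_bias_at` helper and all parameter values are illustrative, not fitted to real data:

```python
import numpy as np

def dev_u(t: float, t_mean: float, beta: float = 0.4) -> float:
    """Deviation of time t from the user's mean rating date: sign(t - t̄_u)·|t - t̄_u|^β."""
    return np.sign(t - t_mean) * abs(t - t_mean) ** beta

def user_bias_at(t, b_u, alpha_u, t_mean, bin_bias, bin_size=7.0):
    """b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)} with weekly bins.

    bin_bias is a hypothetical per-bin bias array (learned in training).
    """
    b = int(t // bin_size)  # which weekly bucket t falls into
    return b_u + alpha_u * dev_u(t, t_mean) + bin_bias[b]

# Illustrative values: bias drifts below/above the static b_u around t_mean
bin_bias = np.zeros(60)
early = user_bias_at(10.0, b_u=0.2, alpha_u=0.01, t_mean=100.0, bin_bias=bin_bias)
late = user_bias_at(300.0, b_u=0.2, alpha_u=0.01, t_mean=100.0, bin_bias=bin_bias)
```

With a positive slope, the drift term pushes the bias down before the user's mean rating date and up after it.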
Innovation 3: Restricted Boltzmann Machines (RBMs)
While matrix factorization dominated, Ruslan Salakhutdinov's RBM approach provided a fundamentally different model that captured complementary patterns.
RBMs are probabilistic neural networks that learn a joint distribution over user ratings. Unlike SVD which learns point estimates, RBMs model uncertainty in preferences.
Performance comparison:
- SVD alone: RMSE = 0.8914
- RBM alone: RMSE = 0.8990
- SVD + RBM blend: RMSE = 0.88
The key insight: models with different inductive biases make different errors. RBMs captured patterns that SVD missed (and vice versa), making their combination more powerful than either alone.
Innovation 4: Neighborhood Models
Despite deep learning hype, simple item-item similarity models captured patterns that latent factor models missed. The winning solutions combined both:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u + \lvert R^k(i;u) \rvert^{-1/2} \sum_{j \in R^k(i;u)} w_{ij} \, (r_{uj} - b_{uj})$$

Where $R^k(i;u)$ is the set of $k$ items most similar to $i$ that user $u$ has rated, $b_{uj} = \mu + b_u + b_j$ is the baseline estimate, and $w_{ij}$ are learned item-item weights.
This hybrid captured:
- Global patterns (latent factors): "Users who like action movies tend to like this"
- Local patterns (neighborhood): "Users who liked Inception also liked Interstellar"
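A sketch of that combined prediction; `hybrid_predict` and all inputs below are hypothetical stand-ins for parameters a real system would learn:

```python
import numpy as np

def hybrid_predict(mu, b_u, b_i, p_u, q_i, neighbors, w, r_u, baselines):
    """Latent-factor prediction plus a learned-neighborhood correction.

    neighbors: indices of the k items most similar to i that the user rated
    w:         learned item-item weights, indexed by item id
    r_u:       the user's ratings, indexed by item id
    baselines: baseline estimates b_uj = mu + b_u + b_j, indexed by item id
    """
    # Global pattern: biases + latent-factor interaction
    latent = mu + b_u + b_i + float(p_u @ q_i)
    if len(neighbors) == 0:
        return latent
    # Local pattern: weighted residuals of the neighboring items,
    # normalized by |R^k(i;u)|^(-1/2)
    residuals = r_u[neighbors] - baselines[neighbors]
    correction = (w[neighbors] * residuals).sum() / np.sqrt(len(neighbors))
    return latent + correction
```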
Innovation 5: Massive Ensembling
The final winning submission from BellKor's Pragmatic Chaos (a merger of BellKor, Pragmatic Theory, and BigChaos teams) combined hundreds of individual predictors:
BellKor's Pragmatic Chaos ensemble composition:
─────────────────────────────────────────────────────────────────────────
MATRIX FACTORIZATION VARIANTS:
- Standard SVD (multiple k values: 20, 50, 100, 200, 500, 2000)
- SVD++ with implicit feedback
- timeSVD++ with temporal dynamics
- Asymmetric factor models
RESTRICTED BOLTZMANN MACHINES:
- Standard RBM
- Conditional RBM (with temporal features)
- Factored RBM
NEIGHBORHOOD MODELS:
- Item-item collaborative filtering
- User-user collaborative filtering
- Pearson correlation variants
- Cosine similarity variants
k-NEAREST NEIGHBORS:
- Various k values and distance metrics
- With and without significance weighting
REGRESSION MODELS:
- Linear regression on movie/user features
- Gradient Boosted Decision Trees
BLENDING:
- Linear regression on predictor outputs
- Neural network stacking (single hidden layer)
- Gradient Boosted Trees for nonlinear combination
Total: 100+ individual models, yielding 800+ predictors after parameter variations
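The linear-regression blending at the bottom of that list is easy to demo with synthetic predictors (all data here is simulated, not Netflix's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
truth = rng.uniform(1, 5, n)            # held-out true ratings (simulated)

# Three hypothetical predictors with different error patterns
preds = np.column_stack([
    truth + rng.normal(0, 0.9, n),      # stand-in for "SVD"
    truth + rng.normal(0, 1.0, n),      # stand-in for "RBM"
    truth + rng.normal(0, 1.1, n),      # stand-in for "kNN"
])

# Linear-regression blend: least-squares weights on a probe set
X = np.column_stack([preds, np.ones(n)])  # add intercept column
weights, *_ = np.linalg.lstsq(X, truth, rcond=None)
blend = X @ weights

rmse = lambda p: np.sqrt(((p - truth) ** 2).mean())
```

Because the predictors make independent errors, the blend's RMSE on the probe set beats every individual predictor, which is the whole point of ensembling diverse models.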
The bitter lesson:
The winning ensemble was never fully deployed by Netflix. Here's what actually happened:
1. 2007 Progress Prize solution was deployed: The simpler SVD+RBM combination from the first Progress Prize was productionized by Netflix engineers, as it was already significantly better than Cinematch.
2. Grand Prize ensemble was too complex: Netflix evaluated the final ensemble offline but concluded "the additional accuracy gains did not seem to justify the engineering effort needed to bring them into a production environment."
3. Business context changed: By 2009, Netflix was shifting from DVD rentals to streaming. The rating prediction problem became less central than other recommendation challenges (what to show on the homepage, etc.).
The real value of the Prize:
- Academic papers: Koren's work on SVD++, timeSVD++, and matrix factorization became foundational—his "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009) has 45,000+ citations
- Talent pipeline: Many participants joined Netflix and other tech companies
- Proof of concept: Demonstrated that recommendations could be dramatically improved with ML
- Open research: Spurred massive investment in recommendation system research
What practitioners should take from the Netflix Prize:
1. Biases explain most variance: The $\mu + b_u + b_i$ terms often matter more than the latent factors $p_u \cdot q_i$. Always include them.
2. Implicit feedback is gold: What users interact with (even without explicit ratings) is hugely informative. Modern systems rely almost entirely on implicit signals.
3. Temporal dynamics matter: User preferences drift. Systems that don't account for time become stale.
4. Ensemble diversity beats ensemble size: RBM + SVD worked because they made different errors. 100 SVD variants wouldn't help as much.
5. Production ≠ Competition: The winning 800-predictor ensemble scored 10.06%, but a simple 3-model blend might achieve 9.5% and actually be deployable.
6. The final 1% is exponentially harder: Going from 0% to 8% took one team 1 year. Going from 8% to 10% took the entire community 2 more years.
Sources:
- Netflix Prize Wikipedia
- The BellKor Solution to the Netflix Grand Prize
- Matrix Factorization Techniques for Recommender Systems (Koren et al., 2009)
- Collaborative Filtering with Temporal Dynamics (Koren, KDD 2009)
```python
import torch
import torch.nn as nn


class MatrixFactorization(nn.Module):
    """
    Basic matrix factorization for collaborative filtering.
    R ≈ P × Q^T + biases
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        num_factors: int = 50,
    ):
        super().__init__()
        # User and item embeddings (latent factors)
        self.user_factors = nn.Embedding(num_users, num_factors)
        self.item_factors = nn.Embedding(num_items, num_factors)

        # Bias terms (important for accuracy)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

        # Initialize
        nn.init.normal_(self.user_factors.weight, std=0.01)
        nn.init.normal_(self.item_factors.weight, std=0.01)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        """
        Predict ratings for user-item pairs.

        Args:
            user_ids: (batch_size,)
            item_ids: (batch_size,)
        Returns:
            predictions: (batch_size,)
        """
        # Get embeddings
        user_emb = self.user_factors(user_ids)  # (B, k)
        item_emb = self.item_factors(item_ids)  # (B, k)

        # Dot product
        dot = (user_emb * item_emb).sum(dim=1)  # (B,)

        # Add biases
        user_b = self.user_bias(user_ids).squeeze(-1)  # (B,)
        item_b = self.item_bias(item_ids).squeeze(-1)  # (B,)

        return self.global_bias + user_b + item_b + dot


def train_mf_sgd(model, train_data, epochs=20, lr=0.01, reg=0.02):
    """
    Train matrix factorization with SGD.

    Args:
        train_data: List of (user_id, item_id, rating) tuples
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)

    for epoch in range(epochs):
        total_loss = 0
        for user_id, item_id, rating in train_data:
            user_id = torch.tensor([user_id])
            item_id = torch.tensor([item_id])
            rating = torch.tensor([rating], dtype=torch.float)

            # Forward
            pred = model(user_id, item_id)

            # MSE loss
            loss = (pred - rating) ** 2

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(train_data):.4f}")
```
```python
import numpy as np


# Alternating Least Squares (ALS) - Better for implicit feedback
class ALS:
    """
    Alternating Least Squares for matrix factorization
    (implicit-feedback variant of Hu, Koren & Volinsky, 2008).
    Better for large-scale implicit feedback (views, clicks).

    Key idea: Fix users, solve for items (closed-form).
    Then fix items, solve for users. Repeat.
    """

    def __init__(self, num_factors: int = 50, reg: float = 0.01, iterations: int = 15):
        self.k = num_factors
        self.reg = reg
        self.iterations = iterations

    def fit(self, interactions: np.ndarray):
        """
        Args:
            interactions: (num_users, num_items) binary or count matrix
        """
        num_users, num_items = interactions.shape

        # Initialize factors randomly
        self.user_factors = np.random.normal(0, 0.01, (num_users, self.k))
        self.item_factors = np.random.normal(0, 0.01, (num_items, self.k))

        # Binary preference matrix: did the user interact at all?
        self.preferences = (interactions > 0).astype(float)

        # Confidence matrix (for implicit feedback)
        # Higher counts = higher confidence
        C = 1 + 40 * interactions  # Confidence scaling (alpha = 40)

        for iteration in range(self.iterations):
            # Fix items, solve for users
            self._solve_users(C)
            # Fix users, solve for items
            self._solve_items(C)
            print(f"Iteration {iteration}: Loss = {self._compute_loss(C):.4f}")

    def _solve_users(self, C):
        """Solve for user factors with items fixed."""
        Y = self.item_factors
        YtY = Y.T @ Y
        reg_I = self.reg * np.eye(self.k)

        for u in range(C.shape[0]):
            # Weighted least squares: (Y^T C_u Y + λI) p_u = Y^T C_u pref_u
            # Using Y^T C_u Y = Y^T Y + Y^T (C_u - I) Y to avoid double-counting
            A = YtY + Y.T @ np.diag(C[u] - 1) @ Y + reg_I
            b = Y.T @ (C[u] * self.preferences[u])
            self.user_factors[u] = np.linalg.solve(A, b)

    def _solve_items(self, C):
        """Solve for item factors with users fixed."""
        X = self.user_factors
        XtX = X.T @ X
        reg_I = self.reg * np.eye(self.k)

        for i in range(C.shape[1]):
            A = XtX + X.T @ np.diag(C[:, i] - 1) @ X + reg_I
            b = X.T @ (C[:, i] * self.preferences[:, i])
            self.item_factors[i] = np.linalg.solve(A, b)

    def _compute_loss(self, C):
        """Confidence-weighted squared error on binary preferences."""
        pred = self.user_factors @ self.item_factors.T
        diff = self.preferences - pred
        return np.sum(C * diff ** 2)
```
SVD++: The Netflix Prize Winning Algorithm
SVD++ extends basic matrix factorization by incorporating implicit feedback—the fact that a user rated an item carries information beyond the rating value itself. This was one of the key innovations that helped win the Netflix Prize.
The core insight:
A user who rated 50 horror movies signals interest in horror, regardless of whether those ratings were 2 stars or 5 stars. SVD++ captures this by adding a term that sums up the "implicit factors" of all items a user has rated.
The prediction formula:

r̂_ui = μ + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j)
The implementation below follows Koren's original formulation from "Factorization Meets the Neighborhood" (KDD 2008):
import torch
import torch.nn as nn
import numpy as np
from collections import defaultdict
class SVDPlusPlus(nn.Module):
"""
SVD++ (Koren, KDD 2008) - The core algorithm behind the Netflix Prize win.
Key innovation: Combines explicit ratings with implicit feedback.
The mere fact that a user rated an item (regardless of rating value)
provides signal about user preferences.
Prediction formula:
r̂_ui = μ + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j)
Where:
- μ: global mean rating
- b_u: user bias
- b_i: item bias
- p_u: user latent factors (from explicit ratings)
- q_i: item latent factors
- y_j: implicit item factors (what rating item j reveals about user taste)
- I_u: set of items user u has rated (implicit signal)
"""
def __init__(
self,
num_users: int,
num_items: int,
num_factors: int = 50,
global_mean: float = 0.0,
):
super().__init__()
self.num_users = num_users
self.num_items = num_items
self.num_factors = num_factors
self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)
# Bias terms (μ + b_u + b_i explains most variance!)
self.user_bias = nn.Embedding(num_users, 1)
self.item_bias = nn.Embedding(num_items, 1)
# Explicit factor matrices (p_u and q_i)
self.user_factors = nn.Embedding(num_users, num_factors) # p_u
self.item_factors = nn.Embedding(num_items, num_factors) # q_i
# IMPLICIT factor matrix (y_j) - THE KEY SVD++ INNOVATION
# Each item has an implicit factor that captures "what kind of user rates this"
self.implicit_item_factors = nn.Embedding(num_items, num_factors) # y_j
# Initialize with small random values (important for convergence)
self._init_weights()
def _init_weights(self):
nn.init.zeros_(self.user_bias.weight)
nn.init.zeros_(self.item_bias.weight)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
nn.init.normal_(self.implicit_item_factors.weight, std=0.01)
def forward(
self,
user_ids: torch.Tensor,
item_ids: torch.Tensor,
user_item_lists: list[list[int]],
) -> torch.Tensor:
"""
Predict ratings for user-item pairs.
Args:
user_ids: (batch_size,) user indices
item_ids: (batch_size,) item indices
user_item_lists: For each user in batch, list of ALL items they've rated
This is the implicit feedback signal!
Returns:
predictions: (batch_size,) predicted ratings
"""
batch_size = user_ids.shape[0]
# Biases
b_u = self.user_bias(user_ids).squeeze(-1) # (B,)
b_i = self.item_bias(item_ids).squeeze(-1) # (B,)
# Explicit user/item factors
p_u = self.user_factors(user_ids) # (B, k)
q_i = self.item_factors(item_ids) # (B, k)
# IMPLICIT FEEDBACK TERM: Σ y_j for all items user has rated
# This is what makes SVD++ different from basic SVD!
implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)
for idx, items_rated in enumerate(user_item_lists):
if len(items_rated) > 0:
# Get implicit factors for all items this user rated
items_tensor = torch.tensor(items_rated, device=user_ids.device)
y_j = self.implicit_item_factors(items_tensor) # (num_rated, k)
# Sum and normalize by sqrt(|I_u|) to prevent bias toward active users
normalization = 1.0 / np.sqrt(len(items_rated))
implicit_sum[idx] = y_j.sum(dim=0) * normalization
# Combine explicit and implicit: p_u + |I_u|^(-1/2) Σ y_j
user_representation = p_u + implicit_sum # (B, k)
# Final prediction: μ + b_u + b_i + q_i^T (p_u + implicit)
interaction = (q_i * user_representation).sum(dim=-1) # (B,)
prediction = self.global_mean + b_u + b_i + interaction
return prediction
def train_svdpp(
model: SVDPlusPlus,
train_data: list[tuple], # List of (user_id, item_id, rating)
user_items: dict[int, list[int]], # user_id -> list of item_ids they rated
epochs: int = 20,
lr: float = 0.007,
reg: float = 0.02,
):
"""
Train SVD++ using SGD (following Koren's original approach).
Args:
model: SVD++ model
train_data: List of (user_id, item_id, rating) tuples
user_items: Dict mapping user_id to list of all items they've rated
epochs: Number of training epochs
lr: Learning rate (Koren used 0.007)
reg: Regularization strength (Koren used 0.02)
"""
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=reg)
for epoch in range(epochs):
total_loss = 0.0
np.random.shuffle(train_data)
for user_id, item_id, rating in train_data:
# Get implicit feedback for this user
items_rated = user_items.get(user_id, [])
# Convert to tensors
user_tensor = torch.tensor([user_id])
item_tensor = torch.tensor([item_id])
rating_tensor = torch.tensor([rating], dtype=torch.float32)
# Forward pass
pred = model(user_tensor, item_tensor, [items_rated])
# MSE loss
loss = (pred - rating_tensor) ** 2
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
rmse = np.sqrt(total_loss / len(train_data))
print(f"Epoch {epoch + 1}/{epochs}: RMSE = {rmse:.4f}")
# Example usage:
"""
# Prepare data
train_data = [(0, 5, 4.0), (0, 10, 3.5), (1, 5, 5.0), ...] # (user, item, rating)
# Build user_items dict for implicit feedback
user_items = defaultdict(list)
for user_id, item_id, _ in train_data:
user_items[user_id].append(item_id)
# Compute global mean
global_mean = np.mean([r for _, _, r in train_data])
# Initialize and train
model = SVDPlusPlus(
num_users=10000,
num_items=5000,
num_factors=50,
global_mean=global_mean,
)
train_svdpp(model, train_data, user_items, epochs=20)
"""
SVD++ Training: The Exact Gradient Derivation
Understanding exactly how SVD++ learns is crucial. Here we derive the gradients from first principles—this is what Koren did in his original implementation.
The Loss Function (Regularized Squared Error):

For a single observed rating (u, i, r_ui), the loss is:

L = (r_ui − r̂_ui)² + λ₁(b_u² + b_i²) + λ₂(‖p_u‖² + ‖q_i‖² + Σ_{j∈I_u} ‖y_j‖²)

Where:
- First term: Prediction error (what we want to minimize)
- Second term: Regularization on biases (prevents extreme user/item biases)
- Third term: Regularization on latent factors (prevents overfitting)

Define the prediction error:

e_ui = r_ui − r̂_ui

Gradient Derivation for Each Parameter:

Taking partial derivatives of L with respect to each parameter (dropping the constant factor of 2, which is absorbed into the learning rate):

1. User bias gradient: ∂L/∂b_u = −e_ui + λ₁ b_u
2. Item bias gradient: ∂L/∂b_i = −e_ui + λ₁ b_i
3. User factor gradient: ∂L/∂p_u = −e_ui · q_i + λ₂ p_u
4. Item factor gradient: ∂L/∂q_i = −e_ui · (p_u + |I_u|^(-1/2) Σ_{j∈I_u} y_j) + λ₂ q_i
5. Implicit factor gradient (for each item j ∈ I_u that user u rated): ∂L/∂y_j = −e_ui · |I_u|^(-1/2) · q_i + λ₂ y_j

The SGD Update Rules:

Using gradient descent with learning rate γ, we update parameters in the opposite direction of the gradient:
For each observed rating (u, i, r_ui):
# 1. Compute prediction
implicit_sum = (1/√|I_u|) * Σ_{j ∈ I_u} y_j
r̂_ui = μ + b_u + b_i + q_i · (p_u + implicit_sum)
# 2. Compute error
e_ui = r_ui - r̂_ui
# 3. Update biases (move toward reducing error)
b_u ← b_u + γ · (e_ui - λ₁ · b_u)
b_i ← b_i + γ · (e_ui - λ₁ · b_i)
# 4. Update user factors
p_u ← p_u + γ · (e_ui · q_i - λ₂ · p_u)
# 5. Update item factors (note: uses p_u + implicit, not just p_u)
q_i ← q_i + γ · (e_ui · (p_u + implicit_sum) - λ₂ · q_i)
# 6. Update ALL implicit factors for items this user rated
for j in I_u:
y_j ← y_j + γ · (e_ui · (1/√|I_u|) · q_i - λ₂ · y_j)
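These update rules can be exercised numerically. A toy single-step check (illustrative values, not Netflix data):

```python
import numpy as np

# Toy parameters
k = 3
gamma, lam1, lam2 = 0.007, 0.005, 0.015
mu = 3.6
b_u, b_i = 0.1, -0.05
p_u = np.full(k, 0.01)
q_i = np.full(k, 0.02)
I_u = [0, 1]                                  # items this user has rated
y = {0: np.full(k, 0.01), 1: np.full(k, -0.01)}

def predict(b_u, b_i, p_u, q_i, y):
    norm = 1.0 / np.sqrt(len(I_u))
    implicit = norm * sum(y[j] for j in I_u)
    return mu + b_u + b_i + q_i @ (p_u + implicit), implicit, norm

pred, implicit, norm = predict(b_u, b_i, p_u, q_i, y)
e = 5.0 - pred                                # observed rating: 5 stars

# One SGD step mirroring the rules above (all updates use pre-step values)
b_u2 = b_u + gamma * (e - lam1 * b_u)
b_i2 = b_i + gamma * (e - lam1 * b_i)
p_u2 = p_u + gamma * (e * q_i - lam2 * p_u)
q_i2 = q_i + gamma * (e * (p_u + implicit) - lam2 * q_i)
y2 = {j: y[j] + gamma * (e * norm * q_i - lam2 * y[j]) for j in I_u}

pred2, _, _ = predict(b_u2, b_i2, p_u2, q_i2, y2)
```

After one step, the new prediction moves toward the observed rating, so the absolute error shrinks.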
Why updating y_j is expensive:

For each training example, we must update y_j for every item the user has rated. If a user rated 100 movies, that's 100 vector updates per training example. This makes SVD++ significantly slower than basic SVD (roughly 3-5x slower in practice).
Koren's Exact Hyperparameters (from the Netflix Prize):
| Parameter | Symbol | Value | Notes |
|---|---|---|---|
| Learning rate | γ | 0.007 | For factors p_u, q_i, y_j |
| Bias learning rate | γ₁ | 0.005 | For biases b_u, b_i |
| Factor regularization | λ₂ | 0.015 | Prevents factor overfitting |
| Bias regularization | λ₁ | 0.005 | Lower than factors |
| Learning rate decay | - | 0.9× per epoch | Multiply by 0.9 after each epoch |
| Number of epochs | - | 20-30 | More epochs with decay |
| Number of factors | k | 50-200 | 200 factors for final submission |
The Learning Rate Decay Schedule:
Koren found that decaying the learning rate was crucial for convergence:
def train_svdpp_with_decay(
model,
train_data,
user_items,
epochs: int = 30,
initial_lr: float = 0.007,
lr_decay: float = 0.9,
reg_factors: float = 0.015,
reg_biases: float = 0.005,
):
"""
Train SVD++ with Koren's exact hyperparameters and learning rate decay.
"""
current_lr = initial_lr
for epoch in range(epochs):
total_se = 0.0 # Squared error sum
# Shuffle training data each epoch (important for SGD!)
np.random.shuffle(train_data)
for user_id, item_id, rating in train_data:
items_rated = user_items[user_id]
num_rated = len(items_rated)
if num_rated == 0:
continue
# Compute implicit sum
norm_factor = 1.0 / np.sqrt(num_rated)
implicit_sum = norm_factor * sum(model.y[j] for j in items_rated)
# Prediction
pred = (model.global_mean +
model.b_u[user_id] +
model.b_i[item_id] +
np.dot(model.q[item_id], model.p[user_id] + implicit_sum))
# Clamp prediction to valid range
pred = max(1.0, min(5.0, pred))
# Error
err = rating - pred
total_se += err ** 2
# Update biases
model.b_u[user_id] += current_lr * (err - reg_biases * model.b_u[user_id])
model.b_i[item_id] += current_lr * (err - reg_biases * model.b_i[item_id])
# Cache for efficiency
user_implicit = model.p[user_id] + implicit_sum
# Update user factors
model.p[user_id] += current_lr * (err * model.q[item_id] - reg_factors * model.p[user_id])
# Update item factors
model.q[item_id] += current_lr * (err * user_implicit - reg_factors * model.q[item_id])
# Update implicit factors (expensive!)
for j in items_rated:
model.y[j] += current_lr * (err * norm_factor * model.q[item_id] - reg_factors * model.y[j])
# Compute RMSE
rmse = np.sqrt(total_se / len(train_data))
print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}, LR = {current_lr:.6f}")
# Decay learning rate
current_lr *= lr_decay
Convergence Behavior:
Typical training progression on Netflix data:
─────────────────────────────────────────────────────────────────────────
Epoch 1: RMSE = 0.9234 LR = 0.007000 (big initial improvement)
Epoch 2: RMSE = 0.9089 LR = 0.006300
Epoch 3: RMSE = 0.9012 LR = 0.005670
Epoch 5: RMSE = 0.8967 LR = 0.004592
Epoch 10: RMSE = 0.8923 LR = 0.002824 (slowing down)
Epoch 15: RMSE = 0.8911 LR = 0.001736
Epoch 20: RMSE = 0.8907 LR = 0.001067
Epoch 25: RMSE = 0.8905 LR = 0.000656 (nearly converged)
Epoch 30: RMSE = 0.8904 LR = 0.000403
Final RMSE ≈ 0.8904 (on training set)
Probe RMSE ≈ 0.8911 (on held-out validation)
Why the Gradient Updates Work:
Let's trace through one example intuitively:
Scenario: User Alice (u=42) rates movie "Inception" (i=1701) with 5 stars.
Alice has previously rated: [Matrix, Interstellar, Blade Runner] (all sci-fi)
Our model predicts: 4.2 stars
Error: e = 5.0 - 4.2 = +0.8 (we under-predicted!)
Updates (with γ=0.007):
1. b_u (Alice's bias) goes UP by 0.007 × 0.8 = +0.0056
→ "Alice tends to rate higher than we thought"
2. b_i (Inception's bias) goes UP by +0.0056
→ "Inception is better than we thought"
3. p_u (Alice's taste vector) moves TOWARD q_inception
→ "Alice likes what Inception represents"
4. q_i (Inception's attribute vector) moves TOWARD (p_alice + implicit)
→ "Inception matches users like Alice"
5. For each j in [Matrix, Interstellar, Blade Runner]:
y_j moves TOWARD q_inception (scaled by 1/√3)
→ "Users who rate Matrix/Interstellar/Blade Runner also like Inception"
The implicit feedback update (step 5) is the key insight: we're learning that the act of rating sci-fi movies predicts liking Inception, even before we look at the rating values.
timeSVD++: Adding Temporal Dynamics
timeSVD++ extends SVD++ by making parameters time-dependent. User preferences drift over time, and movie perceptions change as they age. This was described in Koren's "Collaborative Filtering with Temporal Dynamics" (KDD 2009).
Key discoveries from the Netflix data:
- Rating scale shift (early 2004): Mean ratings jumped from ~3.4 to ~3.6 stars when Netflix changed their UI
- Movie aging effect: Older movies consistently receive higher ratings
- User preference drift: Individual tastes evolve over months/years
The time-dependent prediction formula:

r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + |I_u|^(-1/2) Σ_{j∈I_u} y_j)

Where the time-dependent terms are:

b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_i(t) = b_i + b_{i,Bin(t)}
p_u(t) = p_u + α_u · dev_u(t)
dev_u(t) = sign(t − t̄_u) · |t − t̄_u|^β
class TimeSVDPlusPlus(nn.Module):
"""
timeSVD++ (Koren, KDD 2009) - The full Netflix Prize winning model.
Extends SVD++ with temporal dynamics:
- User biases drift over time
- Item biases change as movies age
- User preferences evolve
A timeSVD++ model with 10 factors outperforms standard SVD with 200 factors!
Time-dependent prediction:
r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + |I_u|^(-1/2) Σ y_j)
Where:
b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
b_i(t) = b_i + b_{i,Bin(t)}
p_u(t) = p_u + α_u · dev_u(t)
"""
def __init__(
self,
num_users: int,
num_items: int,
num_factors: int = 50,
num_time_bins: int = 30, # e.g., 30 bins for monthly granularity
global_mean: float = 0.0,
beta: float = 0.4, # Power for dev function (Koren used 0.4)
):
super().__init__()
self.num_factors = num_factors
self.num_time_bins = num_time_bins
self.beta = beta
self.global_mean = nn.Parameter(torch.tensor(global_mean), requires_grad=False)
# Static biases and factors (same as SVD++)
self.user_bias = nn.Embedding(num_users, 1)
self.item_bias = nn.Embedding(num_items, 1)
self.user_factors = nn.Embedding(num_users, num_factors)
self.item_factors = nn.Embedding(num_items, num_factors)
self.implicit_item_factors = nn.Embedding(num_items, num_factors)
# TEMPORAL COMPONENTS - THE timeSVD++ INNOVATION
# α_u: User-specific drift rate (some users' tastes change faster)
self.user_alpha = nn.Embedding(num_users, 1)
# b_{u,Bin(t)}: User bias per time bin
self.user_bias_time = nn.Embedding(num_users * num_time_bins, 1)
# b_{i,Bin(t)}: Item bias per time bin (movies perceived differently over time)
self.item_bias_time = nn.Embedding(num_items * num_time_bins, 1)
# User factor drift (p_u changes over time)
self.user_factors_alpha = nn.Embedding(num_users, num_factors)
# Store user mean times (computed from training data)
self.register_buffer('user_mean_time', torch.zeros(num_users))
self._init_weights()
def _init_weights(self):
nn.init.zeros_(self.user_bias.weight)
nn.init.zeros_(self.item_bias.weight)
nn.init.zeros_(self.user_alpha.weight)
nn.init.zeros_(self.user_bias_time.weight)
nn.init.zeros_(self.item_bias_time.weight)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
nn.init.normal_(self.implicit_item_factors.weight, std=0.01)
nn.init.normal_(self.user_factors_alpha.weight, std=0.01)
def compute_dev(self, user_ids: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
"""
Compute dev_u(t) = sign(t - t̄_u) · |t - t̄_u|^β
This captures how far a rating is from the user's "typical" time.
Users who rate early vs late in their Netflix membership differ.
"""
mean_times = self.user_mean_time[user_ids]
diff = times - mean_times
sign = torch.sign(diff)
dev = sign * (torch.abs(diff) ** self.beta)
return dev
def forward(
self,
user_ids: torch.Tensor,
item_ids: torch.Tensor,
times: torch.Tensor, # Normalized time values (0 to num_time_bins-1)
time_bins: torch.Tensor, # Discretized time bin indices
user_item_lists: list[list[int]],
) -> torch.Tensor:
"""
Predict ratings with temporal dynamics.
Args:
user_ids: (batch_size,) user indices
item_ids: (batch_size,) item indices
times: (batch_size,) continuous time values
time_bins: (batch_size,) discretized time bin indices
user_item_lists: For each user, list of items they've rated
Returns:
predictions: (batch_size,) predicted ratings
"""
batch_size = user_ids.shape[0]
# === STATIC COMPONENTS (same as SVD++) ===
b_u_static = self.user_bias(user_ids).squeeze(-1)
b_i_static = self.item_bias(item_ids).squeeze(-1)
p_u_static = self.user_factors(user_ids)
q_i = self.item_factors(item_ids)
# === TEMPORAL BIAS COMPONENTS ===
# dev_u(t) for this batch
dev = self.compute_dev(user_ids, times) # (B,)
# α_u · dev_u(t) - user-specific drift
alpha_u = self.user_alpha(user_ids).squeeze(-1) # (B,)
user_drift = alpha_u * dev # (B,)
# b_{u,Bin(t)} - user bias for this time bin
user_time_indices = user_ids * self.num_time_bins + time_bins
b_u_time = self.user_bias_time(user_time_indices).squeeze(-1) # (B,)
# b_{i,Bin(t)} - item bias for this time bin
item_time_indices = item_ids * self.num_time_bins + time_bins
b_i_time = self.item_bias_time(item_time_indices).squeeze(-1) # (B,)
# Full time-dependent biases:
# b_u(t) = b_u + α_u · dev_u(t) + b_{u,Bin(t)}
# b_i(t) = b_i + b_{i,Bin(t)}
b_u_t = b_u_static + user_drift + b_u_time
b_i_t = b_i_static + b_i_time
# === TEMPORAL USER FACTORS ===
# p_u(t) = p_u + α_u · dev_u(t)
p_u_alpha = self.user_factors_alpha(user_ids) # (B, k)
p_u_t = p_u_static + p_u_alpha * dev.unsqueeze(-1) # (B, k)
# === IMPLICIT FEEDBACK (same as SVD++) ===
implicit_sum = torch.zeros(batch_size, self.num_factors, device=user_ids.device)
for idx, items_rated in enumerate(user_item_lists):
if len(items_rated) > 0:
items_tensor = torch.tensor(items_rated, device=user_ids.device)
y_j = self.implicit_item_factors(items_tensor)
normalization = 1.0 / np.sqrt(len(items_rated))
implicit_sum[idx] = y_j.sum(dim=0) * normalization
# === FINAL PREDICTION ===
# r̂_ui(t) = μ + b_u(t) + b_i(t) + q_i^T (p_u(t) + implicit)
user_representation = p_u_t + implicit_sum
interaction = (q_i * user_representation).sum(dim=-1)
prediction = self.global_mean + b_u_t + b_i_t + interaction
return prediction
def precompute_user_mean_times(
train_data: list[tuple], # (user_id, item_id, rating, time)
num_users: int,
) -> torch.Tensor:
"""
Compute t̄_u (mean rating time) for each user.
Required for the dev_u(t) function in timeSVD++.
"""
user_times = defaultdict(list)
for user_id, _, _, time in train_data:
user_times[user_id].append(time)
mean_times = torch.zeros(num_users)
for user_id, times in user_times.items():
mean_times[user_id] = np.mean(times)
return mean_times
# Example usage:
"""
# Training data now includes timestamps
train_data = [
(user_id, item_id, rating, timestamp),
...
]
# Convert timestamps to time bins (e.g., weeks or months)
def timestamp_to_bin(ts, min_ts, max_ts, num_bins=30):
normalized = (ts - min_ts) / (max_ts - min_ts)
return int(normalized * (num_bins - 1))
# Precompute user mean times
user_mean_times = precompute_user_mean_times(train_data, num_users)
# Initialize model
model = TimeSVDPlusPlus(
num_users=480189, # Netflix had 480K users
num_items=17770, # Netflix had 17.7K movies
num_factors=50,
num_time_bins=30,
global_mean=3.6,
)
model.user_mean_time = user_mean_times
# Training loop similar to SVD++ but with time information
"""
Performance comparison on Netflix:
| Model | Factors | RMSE | Notes |
|---|---|---|---|
| SVD (baseline) | 200 | 0.8914 | No implicit, no time |
| SVD++ | 200 | 0.8911 | With implicit feedback |
| timeSVD++ | 10 | < 0.8914 | 10 factors beats 200! |
| timeSVD++ | 200 | 0.8762 | Best single model |
The key insight: temporal dynamics are more important than model capacity. A small model that understands time beats a large model that doesn't.
BPR: Bayesian Personalized Ranking
Most real-world recommendation data is implicit: clicks, views, purchases, time spent—not explicit 1-5 star ratings. This fundamentally changes the problem. With explicit ratings, you predict "user U would rate item I 4.2 stars." With implicit feedback, you only know what users did interact with, not what they didn't like.
The key insight of BPR:
Don't predict absolute scores. Instead, learn to rank items correctly. If user U clicked on item I but not item J, then I should rank higher than J for user U. This pairwise comparison is the core of BPR.
The training approach:
- Sample a user
- Sample a positive item (user interacted with it)
- Sample a negative item (user didn't interact with it)
- Train the model so that score(u, i) > score(u, j)

The BPR loss function:

L_BPR = −Σ_{(u,i,j)} ln σ(x̂_ui − x̂_uj) + λ‖Θ‖²

Where σ is the sigmoid function and x̂_ui is the model's score for user u on item i. This loss pushes the model to score positive items higher than negative items, with a soft margin determined by the sigmoid.
Why BPR works better for implicit feedback:
- Point-wise losses fail: MSE on 0/1 labels treats all non-interactions as "dislike." But a user might love an item they just haven't seen yet.
- Pairwise comparison is natural: We're confident the user prefers clicked items over random ones. We're not confident they hate random ones.
- Ranking is what matters: In production, we show the top-k items. Getting the ranking right matters more than exact scores.
Negative sampling strategies:
The choice of negative samples dramatically affects learning:
- Uniform random: Simple but wastes training on easy negatives
- Popularity-based: Sample popular items as negatives (harder negatives)
- In-batch negatives: Use other users' positives as negatives (efficient)
- Hard negative mining: Find items the model incorrectly ranks high
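To make the strategies concrete, here is a minimal popularity-based sampler; the count^0.75 smoothing is a common heuristic borrowed from word2vec, and the names are illustrative:

```python
import numpy as np

def sample_negative(user_pos: set, item_counts: np.ndarray,
                    rng: np.random.Generator) -> int:
    """Draw a negative item with probability proportional to count^0.75,
    rejecting items the user has already interacted with."""
    probs = item_counts.astype(float) ** 0.75
    probs /= probs.sum()
    while True:
        item = int(rng.choice(len(item_counts), p=probs))
        if item not in user_pos:
            return item

rng = np.random.default_rng(0)
counts = np.array([100, 50, 10, 1])   # interactions per item
neg = sample_negative({0}, counts, rng)
```

Popular items make harder negatives: the model cannot rank them low by popularity alone, so it must learn genuine user-specific preferences.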
class BPR(nn.Module):
"""
Bayesian Personalized Ranking.
Optimizes: user prefers positive item over negative item.
Key insight: We don't need absolute scores, just relative ranking.
"""
def __init__(self, num_users: int, num_items: int, num_factors: int = 50):
super().__init__()
self.user_factors = nn.Embedding(num_users, num_factors)
self.item_factors = nn.Embedding(num_items, num_factors)
nn.init.normal_(self.user_factors.weight, std=0.01)
nn.init.normal_(self.item_factors.weight, std=0.01)
def forward(self, user_ids, pos_item_ids, neg_item_ids):
"""
Compute BPR loss for triplets.
Args:
user_ids: (batch_size,)
pos_item_ids: (batch_size,) - items user interacted with
neg_item_ids: (batch_size,) - sampled negative items
"""
user_emb = self.user_factors(user_ids)
pos_emb = self.item_factors(pos_item_ids)
neg_emb = self.item_factors(neg_item_ids)
# Scores
pos_score = (user_emb * pos_emb).sum(dim=1)
neg_score = (user_emb * neg_emb).sum(dim=1)
# BPR loss: -log(sigmoid(pos - neg))
loss = -torch.log(torch.sigmoid(pos_score - neg_score) + 1e-8)
return loss.mean()
def recommend(self, user_id: int, n: int = 10) -> list[int]:
"""Get top-n items for user."""
user_emb = self.user_factors.weight[user_id] # (k,)
all_scores = self.item_factors.weight @ user_emb # (num_items,)
top_items = torch.topk(all_scores, n).indices.tolist()
return top_items
def train_bpr(model, train_interactions, epochs=20, lr=0.01):
"""
Train BPR with negative sampling.
Args:
train_interactions: List of (user_id, item_id) positive pairs
"""
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
num_items = model.item_factors.num_embeddings
# Build user->items dict for negative sampling
user_items = {}
for user, item in train_interactions:
if user not in user_items:
user_items[user] = set()
user_items[user].add(item)
for epoch in range(epochs):
total_loss = 0
np.random.shuffle(train_interactions)
for user, pos_item in train_interactions:
# Sample negative item (not in user's history)
neg_item = np.random.randint(0, num_items)
while neg_item in user_items[user]:
neg_item = np.random.randint(0, num_items)
user_t = torch.tensor([user])
pos_t = torch.tensor([pos_item])
neg_t = torch.tensor([neg_item])
loss = model(user_t, pos_t, neg_t)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch}: Loss = {total_loss / len(train_interactions):.4f}")
Matrix Factorization Legacy:
- Netflix Prize winner used an ensemble of 107 models, many MF-based
- Fundamental insight: Embeddings capture latent preferences
- This idea persists in all modern deep learning RecSys
Part III: Deep Learning for Recommendations
The Limitations of Matrix Factorization
Matrix factorization is linear: prediction = dot product of embeddings. It cannot capture complex, non-linear interactions between user preferences and item attributes.
MF limitation:
If user likes [action, comedy] and item is [action, romance],
MF computes: action*action + comedy*romance = positive (maybe wrong)
We need non-linear combinations:
"User likes action AND comedy together" (not just sum of factors)
Neural Collaborative Filtering (NCF)
Matrix factorization uses a simple dot product to combine user and item embeddings. But a dot product is a linear operation—it can only model linear interactions between latent factors. What if the true user-item relationship is nonlinear?
NCF (WWW 2017) answers this by replacing the dot product with a neural network. The key insight: let the network learn an arbitrary interaction function between user and item representations.
The architecture:
- Embedding layer: Map user ID → user embedding, item ID → item embedding (same as MF)
- Interaction layers: Instead of dot product, concatenate embeddings and pass through MLPs
- Output layer: Predict probability of interaction
Two interaction approaches combined:
NCF actually combines two sub-networks:
- GMF (Generalized Matrix Factorization): element-wise product of embeddings (like MF, but with learnable output weights)
- MLP: concatenate embeddings and pass through hidden layers

The final prediction combines both:

ŷ_ui = σ(h^T [φ_GMF ; φ_MLP])

where φ_GMF and φ_MLP are the two branches' outputs, [;] denotes concatenation, and h is a learned output layer.
Why this matters:
- Nonlinear interactions: A user might like action movies AND comedies, but hate action-comedies. Linear models can't capture this. Neural networks can.
- Automatic feature learning: MLPs learn what combinations of latent factors matter, instead of assuming simple dot product.
- Flexible capacity: Add more layers or neurons to increase model capacity.
The trade-off:
NCF is more expressive but also more prone to overfitting. Matrix factorization has an elegant closed-form interpretation; NCF is a black box. In practice, the gains from NCF over well-tuned MF are often modest (5-10%), and MF remains competitive for many applications.
class NCF(nn.Module):
"""
Neural Collaborative Filtering.
Replace dot product with learnable MLP.
Architecture:
- GMF branch: Generalized Matrix Factorization (element-wise product)
- MLP branch: Multi-layer perceptron for non-linear interactions
- Combine: Concatenate and predict
"""
def __init__(
self,
num_users: int,
num_items: int,
gmf_dim: int = 32,
mlp_dims: list[int] = [64, 32, 16],
):
super().__init__()
# GMF embeddings
self.gmf_user = nn.Embedding(num_users, gmf_dim)
self.gmf_item = nn.Embedding(num_items, gmf_dim)
# MLP embeddings (separate from GMF)
self.mlp_user = nn.Embedding(num_users, mlp_dims[0] // 2)
self.mlp_item = nn.Embedding(num_items, mlp_dims[0] // 2)
# MLP layers
layers = []
for i in range(len(mlp_dims) - 1):
layers.append(nn.Linear(mlp_dims[i], mlp_dims[i + 1]))
layers.append(nn.ReLU())
self.mlp = nn.Sequential(*layers)
# Final prediction layer
self.predict = nn.Linear(gmf_dim + mlp_dims[-1], 1)
self._init_weights()
def _init_weights(self):
for m in self.modules():
if isinstance(m, nn.Embedding):
nn.init.normal_(m.weight, std=0.01)
elif isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
nn.init.zeros_(m.bias)
def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
"""
Args:
user_ids: (batch_size,)
item_ids: (batch_size,)
Returns:
predictions: (batch_size,)
"""
# GMF branch: element-wise product
gmf_user_emb = self.gmf_user(user_ids)
gmf_item_emb = self.gmf_item(item_ids)
gmf_out = gmf_user_emb * gmf_item_emb # (B, gmf_dim)
# MLP branch: concatenate and transform
mlp_user_emb = self.mlp_user(user_ids)
mlp_item_emb = self.mlp_item(item_ids)
mlp_input = torch.cat([mlp_user_emb, mlp_item_emb], dim=1)
mlp_out = self.mlp(mlp_input) # (B, mlp_dims[-1])
# Combine GMF and MLP
combined = torch.cat([gmf_out, mlp_out], dim=1)
# Predict
pred = self.predict(combined).squeeze(-1)  # squeeze(-1) keeps batch dim even for B=1
return torch.sigmoid(pred) # For binary (click/no-click)
NCF Controversy: A 2020 RecSys paper showed that with proper hyperparameters, simple dot products can match NCF. The lesson: Don't overcomplicate unless you have evidence non-linearity helps.
Wide & Deep Learning
Google's Wide & Deep (2016) paper introduced one of the most influential architectures in industrial recommender systems. It powers Google Play app recommendations and has been adopted across the industry.
The core insight: Memorization vs Generalization
Recommendation systems face a fundamental tension:
-
Memorization: Learn specific patterns from historical data. "Users who installed Netflix also install Hulu." This captures rare but valuable co-occurrences.
-
Generalization: Learn broad patterns that apply to new situations. "Users who like streaming apps might like any new streaming app." This handles new items and sparse data.
Linear models (like logistic regression) are great at memorization through feature crosses, but require manual feature engineering. Deep neural networks automatically learn features and generalize well, but can over-generalize and miss important specific patterns.
Wide & Deep combines both:
┌─────────────────────────────────────────┐
│ Wide & Deep │
│ │
│ Wide (Linear) Deep (DNN) │
│ ↓ ↓ │
│ Memorization Generalization │
│ ↓ ↓ │
│ └───────┬───────┘ │
│ ↓ │
│ Combined │
│ Prediction │
└─────────────────────────────────────────┘
The Wide component:

A linear model on raw and cross-product features:

y_wide = w^T [x, φ(x)] + b

Where the φ_k(x) are cross-product transformations like:

- AND(installed_app=netflix, impression_app=hulu)
- AND(user_country=US, app_language=english)

These must be manually engineered based on domain knowledge.
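A cross-product feature is just a logical AND over field conditions; a minimal encoder might look like this (field names illustrative):

```python
def cross_feature(example: dict, conditions: dict) -> int:
    """Cross-product feature: 1 iff every (field, value) condition holds."""
    return int(all(example.get(field) == value
                   for field, value in conditions.items()))

ex = {"installed_app": "netflix", "impression_app": "hulu", "user_country": "US"}
fired = cross_feature(ex, {"installed_app": "netflix", "impression_app": "hulu"})
missed = cross_feature(ex, {"installed_app": "spotify"})
```

In production, each such cross becomes one sparse input dimension for the wide linear model, so the model can memorize that exact co-occurrence.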
The Deep component:
A standard feedforward network on dense embeddings, where each hidden layer computes a^(l+1) = f(W^(l) a^(l) + b^(l)) with activation f (typically ReLU).
Categorical features are embedded; continuous features are normalized. The network automatically learns feature interactions.
The combined prediction:

P(Y = 1 | x) = σ(w_wide^T [x, φ(x)] + w_deep^T a^(l_f) + b)

Both components are trained jointly end-to-end.
Why this works in practice:
- Best of both worlds: Wide captures specific user-item affinities; deep generalizes to new combinations
- Handles cold start: New items with features but no history can get recommendations through the deep component
- Production-friendly: The wide component can use any feature engineering you already have
class WideAndDeep(nn.Module):
"""
Wide & Deep Learning for Recommendations.
Wide component: Linear model for memorization
- Captures specific patterns: "users who bought X also bought Y"
Deep component: DNN for generalization
- Captures general patterns: "users who like action movies"
"""
def __init__(
self,
num_wide_features: int,
num_deep_features: int,
deep_dims: list[int] = [256, 128, 64],
):
super().__init__()
# Wide: Linear model on cross-product features
self.wide = nn.Linear(num_wide_features, 1)
# Deep: MLP on dense features
layers = [nn.Linear(num_deep_features, deep_dims[0]), nn.ReLU()]
for i in range(len(deep_dims) - 1):
layers.append(nn.Linear(deep_dims[i], deep_dims[i + 1]))
layers.append(nn.ReLU())
layers.append(nn.Linear(deep_dims[-1], 1))
self.deep = nn.Sequential(*layers)
def forward(self, wide_features: torch.Tensor, deep_features: torch.Tensor):
"""
Args:
wide_features: (B, num_wide_features) - sparse cross-product features
deep_features: (B, num_deep_features) - dense embeddings
"""
wide_out = self.wide(wide_features)
deep_out = self.deep(deep_features)
return torch.sigmoid(wide_out + deep_out)
Wide & Deep at Google:
- Used for Google Play app recommendations
- Wide: Manually engineered cross-features (e.g., app × installed_apps)
- Deep: Embeddings for categorical features
Part IV: Feature Interaction Models
DeepFM: Automating Feature Crosses
Wide & Deep's limitation: you must manually engineer the cross features for the wide component. This requires domain expertise and experimentation. What if we could automatically learn all useful feature interactions?
DeepFM (IJCAI 2017) solves this by replacing Wide & Deep's manual crosses with Factorization Machines (FM), which automatically model all pairwise feature interactions.
Understanding Factorization Machines:
FM models all second-order feature interactions:

ŷ_FM(x) = w₀ + Σᵢ wᵢ xᵢ + Σᵢ Σ_{j>i} ⟨vᵢ, vⱼ⟩ xᵢ xⱼ

Where:
- wᵢ are first-order weights (like linear regression)
- vᵢ is the latent vector for feature i
- ⟨vᵢ, vⱼ⟩ is the interaction strength between features i and j

The key insight: instead of learning interaction weights w_ij directly (intractable for sparse data), learn latent vectors and compute interactions via dot products. This makes sparse interactions learnable.
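One detail worth knowing (from Rendle's original FM paper): the pairwise sum looks O(n²) but reduces to O(kn) via the identity Σ_{i<j} ⟨vᵢ, vⱼ⟩xᵢxⱼ = ½ Σ_f [(Σᵢ v_{if} xᵢ)² − Σᵢ v_{if}² xᵢ²]. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
V = rng.normal(size=(n, k))   # latent vectors v_i
x = rng.normal(size=n)        # feature values

# Naive O(n^2 k) pairwise sum
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Linear-time identity: 0.5 * (sum-of-squares trick per latent dimension)
vx = V * x[:, None]                                       # (n, k)
fast = 0.5 * ((vx.sum(axis=0) ** 2 - (vx ** 2).sum(axis=0)).sum())

assert np.isclose(naive, fast)
```

This is the form DeepFM implements in practice; without it, FM over millions of sparse features would be infeasible.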
The DeepFM architecture:
┌────────────────────────────────────────┐
│ DeepFM │
│ │
│ ┌─────────┐ ┌─────────────┐ │
│ │ FM │ │ DNN │ │
│ │ (auto │ │ (high-order │ │
│ │ 2nd │ │ interactions│ │
│ │ order) │ │ ) │ │
│ └────┬────┘ └──────┬──────┘ │
│ │ │ │
│ └──────────┬──────────┘ │
│ ↓ │
│ Final Prediction │
└────────────────────────────────────────┘
FM component captures all 2nd-order interactions automatically:
- user_age × item_category
- user_country × item_language
- hour_of_day × item_type
- ... all pairs, without manual feature engineering
DNN component captures higher-order interactions through deep layers.
Why DeepFM beats Wide & Deep:
- No manual feature engineering: FM automatically learns which pairs interact
- Shared embeddings: FM and DNN share the same embedding layer, reducing parameters and improving learning
- End-to-end learning: All interactions are learned jointly
The math:

$$\hat{y} = \sigma(y_{FM} + y_{DNN})$$

Where $y_{FM}$ is the FM output and $y_{DNN}$ is the deep network output.
class DeepFM(nn.Module):
"""
DeepFM: FM + DNN for automatic feature interactions.
FM component: 2nd-order feature interactions (efficient)
DNN component: Higher-order feature interactions
"""
def __init__(
self,
field_dims: list[int], # Number of unique values per field
embedding_dim: int = 16,
mlp_dims: list[int] = [256, 128, 64],
):
super().__init__()
self.num_fields = len(field_dims)
self.embedding_dim = embedding_dim
# Embedding tables for each field
self.embeddings = nn.ModuleList([
nn.Embedding(dim, embedding_dim)
for dim in field_dims
])
# First-order weights
self.first_order = nn.ModuleList([
nn.Embedding(dim, 1)
for dim in field_dims
])
# Global bias
self.bias = nn.Parameter(torch.zeros(1))
# DNN
dnn_input_dim = self.num_fields * embedding_dim
layers = [nn.Linear(dnn_input_dim, mlp_dims[0]), nn.ReLU(), nn.Dropout(0.2)]
for i in range(len(mlp_dims) - 1):
layers.extend([
nn.Linear(mlp_dims[i], mlp_dims[i + 1]),
nn.ReLU(),
nn.Dropout(0.2)
])
layers.append(nn.Linear(mlp_dims[-1], 1))
self.dnn = nn.Sequential(*layers)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: (B, num_fields) - field indices
Returns:
predictions: (B,)
"""
# Get embeddings for each field
emb_list = [
self.embeddings[i](x[:, i])
for i in range(self.num_fields)
]
emb_stack = torch.stack(emb_list, dim=1) # (B, num_fields, emb_dim)
# First-order term
first_order = sum(
    self.first_order[i](x[:, i]).squeeze(-1)  # (B,)
    for i in range(self.num_fields)
)
# FM second-order: (sum of embeddings)^2 - sum of (embeddings^2)
# This efficiently computes all pairwise interactions
sum_of_emb = emb_stack.sum(dim=1) # (B, emb_dim)
sum_of_square = (emb_stack ** 2).sum(dim=1) # (B, emb_dim)
square_of_sum = sum_of_emb ** 2 # (B, emb_dim)
fm_second_order = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
# DNN
dnn_input = emb_stack.view(x.size(0), -1) # (B, num_fields * emb_dim)
dnn_out = self.dnn(dnn_input).squeeze(-1)
# Combine
output = self.bias + first_order + fm_second_order + dnn_out
return torch.sigmoid(output)
DCN: Deep & Cross Network
DeepFM automates 2nd-order feature interactions. But what about 3rd-order? 4th-order? Higher-order interactions can be crucial: "young male users in California who browse on mobile in the evening" is a 5-way interaction.
DCN (KDD 2017) introduces the Cross Network, which explicitly models feature crosses of bounded degree at each layer.
The key innovation: Cross layers
A cross layer computes:

$$x_{l+1} = x_0 \, (x_l^\top w_l) + b_l + x_l$$

Where:
- $x_0$ is the original input (always included!)
- $x_l$ is the output of layer $l$
- $x_l^\top w_l$ is a scalar, so the first term rescales $x_0$ element-wise
- The trailing $+\,x_l$ is a residual connection

Why this captures high-order interactions:
- Layer 1 captures 2nd-order interactions: $x_0(x_0^\top w_0)$ contains all pairs $x_{0,i}\,x_{0,j}$
- Layer 2 captures 3rd-order: $x_0(x_1^\top w_1)$, where $x_1$ already contains all pairs
- Layer $l$ captures up to $(l{+}1)$-order interactions

The beauty: each layer adds only $2d$ parameters, $w_l$ and $b_l$ (where $d$ is the input dimension), not $O(d^2)$. This is polynomial complexity in interaction order, not exponential.
DCN architecture:
┌────────────────────────────────────────┐
│ DCN │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Cross │ │ Deep │ │
│ │ Network │ │ Network │ │
│ │ (explicit│ │ (implicit │ │
│ │ crosses) │ │ crosses) │ │
│ └────┬─────┘ └──────┬───────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ↓ │
│ Concatenate & MLP │
│ ↓ │
│ Prediction │
└────────────────────────────────────────┘
Cross Network vs Deep Network:
| Aspect | Cross Network | Deep Network |
|---|---|---|
| Interactions | Explicit, bounded-degree | Implicit, learned |
| Parameters | Linear in depth | Quadratic in width |
| Interpretability | Can analyze which crosses matter | Black box |
| Expressiveness | All crosses up to degree L | Arbitrary nonlinear |
DCN-v2 (2020 improvement):
The original DCN uses a weight vector $w_l$ per layer. DCN-v2 uses a full matrix $W_l$ (in practice factored as a low-rank product $U_l V_l^\top$), making each cross layer a low-rank approximation of all pairwise interactions. This significantly improves performance.
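A minimal sketch of a DCN-v2-style cross layer with a low-rank factored weight matrix, written in NumPy for clarity (dimensions, rank, and initialization are illustrative):

```python
import numpy as np

def cross_layer_v2(x0, xl, U, V, b):
    """DCN-v2-style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl, W = U @ V.T."""
    return x0 * (U @ (V.T @ xl) + b) + xl  # elementwise product with x0

rng = np.random.default_rng(0)
d, r = 8, 2  # input dim, rank (r << d)
x0 = rng.normal(size=d)
U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
b = np.zeros(d)

x = x0
for _ in range(3):  # three stacked cross layers, each reusing x0
    x = cross_layer_v2(x0, x, U, V, b)

assert x.shape == (d,)
```

Each layer here costs $2dr + d$ parameters instead of the $d^2 + d$ of a full matrix, which is the low-rank trade-off the paper exploits.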
class CrossNetwork(nn.Module):
"""
Cross Network: Explicit bounded-degree feature crossing.
Each layer computes:
x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l
This captures l-th order feature interactions at layer l.
"""
def __init__(self, input_dim: int, num_layers: int = 3):
super().__init__()
self.num_layers = num_layers
self.weights = nn.ParameterList([
nn.Parameter(torch.randn(input_dim))
for _ in range(num_layers)
])
self.biases = nn.ParameterList([
nn.Parameter(torch.zeros(input_dim))
for _ in range(num_layers)
])
def forward(self, x0: torch.Tensor) -> torch.Tensor:
"""
Args:
x0: (B, input_dim) - initial features
Returns:
output: (B, input_dim) - crossed features
"""
x = x0
for i in range(self.num_layers):
# x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l
xw = (x * self.weights[i]).sum(dim=1, keepdim=True)
x = x0 * xw + self.biases[i] + x
return x
class DCN(nn.Module):
"""Deep & Cross Network."""
def __init__(
self,
input_dim: int,
cross_layers: int = 3,
deep_dims: list[int] = [256, 128, 64],
):
super().__init__()
self.cross_net = CrossNetwork(input_dim, cross_layers)
# Deep network
layers = [nn.Linear(input_dim, deep_dims[0]), nn.ReLU()]
for i in range(len(deep_dims) - 1):
layers.extend([nn.Linear(deep_dims[i], deep_dims[i + 1]), nn.ReLU()])
self.deep_net = nn.Sequential(*layers)
# Combine cross and deep
self.output = nn.Linear(input_dim + deep_dims[-1], 1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
cross_out = self.cross_net(x)
deep_out = self.deep_net(x)
combined = torch.cat([cross_out, deep_out], dim=1)
return torch.sigmoid(self.output(combined))
Part V: Two-Tower Models for Retrieval
The Retrieval Problem
At scale (millions of items), you can't score every item for every user. The solution: Two-tower models for efficient retrieval.
┌─────────────────────────────────────────────────────────────────────────┐
│ TWO-TOWER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHALLENGE: │
│ ─────────── │
│ Score all 10M items for each user? │
│ At 100M users × 10M items = 10^15 computations = IMPOSSIBLE │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: DECOUPLE USER AND ITEM ENCODING │
│ ─────────────────────────────────────────── │
│ │
│ USER TOWER ITEM TOWER │
│ ─────────── ─────────── │
│ User features → Item features → │
│ Neural network → Neural network → │
│ User embedding → Item embedding → │
│ ↓ ↓ │
│ └───────── DOT PRODUCT ─┘ │
│ ↓ │
│ Score │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY INSIGHT: │
│ ───────────── │
│ Item embeddings are INDEPENDENT of users │
│ → Pre-compute ALL item embeddings offline │
│ → Store in vector index (FAISS, ScaNN, HNSW) │
│ → At serving: Compute user embedding, find nearest items │
│ → Complexity: O(log N) instead of O(N) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
YouTube DNN
YouTube's Deep Neural Network (RecSys 2016) was foundational—perhaps the most influential industrial recommendation paper. It revealed how YouTube serves billions of recommendations daily with sub-100ms latency.
The core problem YouTube faced:
YouTube has over 800 million videos and 2 billion logged-in users monthly. Scoring every video for every user is computationally impossible (10^15+ operations). The solution: a two-stage funnel architecture.
Stage 1: CANDIDATE GENERATION (what we call "retrieval")
─────────────────────────────────────────────────────────
800M videos → Neural network → ~1000 candidates
Goal: High recall (don't miss relevant videos)
Latency budget: ~10ms
Stage 2: RANKING (what we call "scoring")
─────────────────────────────────────────────────────────
1000 candidates → Complex model → ~50 ranked videos
Goal: High precision (order correctly)
Latency budget: ~50ms
The candidate generation model (what this code implements):
The key insight is treating recommendation as an extreme multiclass classification problem. Given the user's context (watch history, search history, demographics), predict which video they'll watch next—out of millions of classes.
Input features for the user tower:
- Watch history: The user's last N watched videos. Instead of using a complex sequence model, YouTube simply averages the embeddings of watched videos. This works surprisingly well and is fast.
- Search history: Similarly averaged embeddings of search queries (tokenized and embedded).
- Demographics: Age bucket, gender, geographic region.
- Context features: Time since last watch (crucial for freshness), device type.
The "example age" trick:
YouTube discovered that neural networks learn to predict popular videos because they appear more often in training data. To counteract this freshness bias, they include the video's age (time since upload) as a feature during training. At serving time, they set this to zero—telling the model "assume all videos were just uploaded"—which dramatically improved recommendation freshness.
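A minimal sketch of the trick, with hypothetical timestamps and a made-up helper name:

```python
SECONDS_PER_DAY = 86_400

def example_age(now_ts: float, upload_ts: float, serving: bool = False) -> float:
    """Video age in days as a training feature; forced to zero at serving time."""
    if serving:
        return 0.0  # "pretend every video was just uploaded"
    return max(0.0, (now_ts - upload_ts) / SECONDS_PER_DAY)

# During training, the model sees the true age and can explain away
# popularity that is merely due to a video being older.
assert example_age(1_000 * SECONDS_PER_DAY, 990 * SECONDS_PER_DAY) == 10.0
# At serving time, the feature is zeroed out.
assert example_age(1.0, 0.0, serving=True) == 0.0
```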
Training with negative sampling:
With millions of classes, computing the full softmax is impossible. YouTube uses sampled softmax: for each positive example, sample a few thousand negative classes from the video corpus (weighted by popularity). The learned embedding space places similar users and items close together.
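The popularity-weighted negative sampling step can be sketched in plain Python (the corpus and play counts below are made up; production systems sample thousands of negatives per positive, not five):

```python
import random

# Hypothetical video corpus with play counts serving as popularity weights.
corpus = {"v1": 500, "v2": 300, "v3": 150, "v4": 50}

def sample_negatives(positive: str, n: int, rng: random.Random) -> list[str]:
    """Sample n negative classes, weighted by popularity, excluding the positive."""
    candidates = [v for v in corpus if v != positive]
    weights = [corpus[v] for v in candidates]
    return rng.choices(candidates, weights=weights, k=n)

rng = random.Random(42)
negs = sample_negatives("v1", 5, rng)
assert len(negs) == 5 and "v1" not in negs
```

The softmax is then computed only over the positive class plus these sampled negatives, with a correction term for the sampling probabilities.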
Why average pooling works:
Averaging embeddings loses order information but captures "interest topics." A user who watched [sports, sports, cooking] ends up in a similar embedding region to another sports+cooking fan. For coarse-grained retrieval, this is sufficient—ranking handles fine-grained preferences.
class YouTubeDNN(nn.Module):
"""
YouTube's Deep Neural Network for Recommendations.
User tower:
- Watch history (average of video embeddings)
- Search history
- Demographics
- Context (time, device)
Item tower:
- Video embeddings
"""
def __init__(
self,
num_videos: int,
embedding_dim: int = 256,
hidden_dims: list[int] = [1024, 512, 256],
):
super().__init__()
# Video embeddings (shared between history and candidates)
self.video_embedding = nn.Embedding(num_videos, embedding_dim)
# User tower
user_input_dim = embedding_dim * 2 + 32 # History + search + demographics
layers = [nn.Linear(user_input_dim, hidden_dims[0]), nn.ReLU()]
for i in range(len(hidden_dims) - 1):
layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.ReLU()])
self.user_tower = nn.Sequential(*layers)
# Demographics embedding
self.age_embedding = nn.Embedding(10, 16) # Age buckets
self.gender_embedding = nn.Embedding(3, 16)
def encode_user(
self,
watch_history: torch.Tensor, # (B, max_history)
watch_mask: torch.Tensor, # (B, max_history) - 1 for valid, 0 for padding
search_history: torch.Tensor, # (B, max_search)
search_mask: torch.Tensor,
age: torch.Tensor, # (B,)
gender: torch.Tensor, # (B,)
) -> torch.Tensor:
"""Encode user into embedding."""
# Average watch history embeddings
watch_emb = self.video_embedding(watch_history) # (B, H, D)
watch_emb = (watch_emb * watch_mask.unsqueeze(-1)).sum(dim=1)
watch_emb = watch_emb / watch_mask.sum(dim=1, keepdim=True).clamp(min=1)
# Average search history (simplified - could use separate embeddings)
search_emb = self.video_embedding(search_history)
search_emb = (search_emb * search_mask.unsqueeze(-1)).sum(dim=1)
search_emb = search_emb / search_mask.sum(dim=1, keepdim=True).clamp(min=1)
# Demographics
demo_emb = torch.cat([
self.age_embedding(age),
self.gender_embedding(gender)
], dim=1)
# Combine and transform
user_features = torch.cat([watch_emb, search_emb, demo_emb], dim=1)
user_embedding = self.user_tower(user_features)
return user_embedding
def encode_items(self, item_ids: torch.Tensor) -> torch.Tensor:
"""Encode items into embeddings."""
return self.video_embedding(item_ids)
def forward(self, user_embedding: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
"""Score user-item pairs."""
item_emb = self.encode_items(item_ids)
scores = (user_embedding.unsqueeze(1) * item_emb).sum(dim=-1)
return scores
DSSM: Deep Structured Semantic Models
DSSM (Microsoft, 2013) pioneered the two-tower approach for search and remains influential today. Understanding DSSM is essential because modern retrieval systems (including embedding-based search and RAG) are direct descendants of this architecture.
The problem DSSM solves:
Traditional search systems rely on lexical matching: the query "NY hotels" finds documents containing those exact words. But what about "New York accommodations" or "Manhattan lodging"? Users and documents can express the same concept with completely different words—the vocabulary mismatch problem.
DSSM's solution: Semantic matching in embedding space
Instead of matching words, learn to map queries and documents into a shared semantic space where similar meanings have similar vectors, regardless of exact wording:
Query: "NY hotels" Document: "Manhattan lodging options"
↓ ↓
Query Tower Doc Tower
↓ ↓
[0.8, 0.3, ...] ←cosine→ [0.75, 0.35, ...]
↓
High similarity!
The architecture in detail:
- Word hashing (character trigrams): Raw text is first converted to a bag of character trigrams. "hotel" → {"#ho", "hot", "ote", "tel", "el#"}. This reduces vocabulary from millions of words to ~30K trigrams, handles misspellings, and generalizes to unseen words.
- Query tower: Trigram features → Dense layer (30K→300) → Tanh → Dense (300→300) → Tanh → Dense (300→128). The final 128-dim vector is the query embedding.
- Document tower: Same architecture as the query tower (but separate parameters). Documents are also converted to 128-dim embeddings.
- Similarity scoring: Cosine similarity between the normalized query and document embeddings.
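The word-hashing step is simple enough to sketch in plain Python:

```python
def char_trigrams(word: str) -> set[str]:
    """DSSM word hashing: boundary-marked character trigrams."""
    marked = f"#{word}#"
    return {marked[i:i + 3] for i in range(len(marked) - 2)}

assert char_trigrams("hotel") == {"#ho", "hot", "ote", "tel", "el#"}
# A misspelling still shares most trigrams with the correct word,
# which is why word hashing is robust to typos:
assert len(char_trigrams("hotel") & char_trigrams("hotle")) >= 2
```

Each word becomes a sparse multi-hot vector over the ~30K trigram vocabulary, which is what the query and document towers consume.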
Why Tanh activation?
DSSM uses Tanh (not ReLU) throughout. The bounded output [-1, 1] helps with cosine similarity computation and provides more stable gradients. Later work showed ReLU can work too, but Tanh remains common in embedding models.
Training with contrastive loss:
For each query $Q$, we have one positive document $D^+$ (clicked) and several negative documents $D^-$ (not clicked). The loss maximizes:

$$P(D^+ \mid Q) = \frac{\exp(\gamma \cos(Q, D^+))}{\sum_{D' \in \{D^+\} \cup \{D^-\}} \exp(\gamma \cos(Q, D'))}$$

Where $\gamma$ is a temperature parameter (typically ~10). This softmax-style loss pushes the query embedding closer to positive documents and farther from negative documents.
Negative sampling strategies:
- Random negatives: Sample from the document corpus. Easy but may include false negatives.
- In-batch negatives: Use other queries' positive documents as negatives. Efficient and provides harder negatives.
- Hard negatives: Mine documents that are similar but not relevant (e.g., from the same topic but not answering the query).
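In-batch negatives can be sketched with plain Python: each query's positive document doubles as a negative for every other query, so a batch of B pairs yields a B×B similarity matrix trained with softmax cross-entropy on the diagonal (the similarity numbers below are toy values):

```python
import math

def in_batch_softmax_loss(sim: list[list[float]], temperature: float = 10.0) -> float:
    """Mean -log softmax of the diagonal of a (query x doc) similarity matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [temperature * s for s in row]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / len(sim)

# Toy cosines: diagonal = matched pairs, off-diagonal = in-batch negatives.
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.1, 0.4, 0.7]]
loss = in_batch_softmax_loss(sim)
well = in_batch_softmax_loss([[0.99, -0.9], [-0.9, 0.99]])
assert well < loss  # better-separated embeddings -> lower loss
```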
Why DSSM is still relevant:
Modern sentence transformers, dense retrievers (DPR, Contriever), and embedding APIs (OpenAI, Cohere) all use the same fundamental architecture: encode query and document with separate towers, score via similarity, train with contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DSSM(nn.Module):
"""
Deep Structured Semantic Model.
Originally for query-document matching, adapted for recommendations.
"""
def __init__(
self,
query_input_dim: int,
doc_input_dim: int,
hidden_dims: list[int] = [300, 300, 128],
):
super().__init__()
# Query tower
q_layers = [nn.Linear(query_input_dim, hidden_dims[0]), nn.Tanh()]
for i in range(len(hidden_dims) - 1):
q_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
self.query_tower = nn.Sequential(*q_layers)
# Document tower
d_layers = [nn.Linear(doc_input_dim, hidden_dims[0]), nn.Tanh()]
for i in range(len(hidden_dims) - 1):
d_layers.extend([nn.Linear(hidden_dims[i], hidden_dims[i + 1]), nn.Tanh()])
self.doc_tower = nn.Sequential(*d_layers)
def forward(self, query_features: torch.Tensor, doc_features: torch.Tensor) -> torch.Tensor:
"""
Args:
query_features: (B, query_dim)
doc_features: (B, num_docs, doc_dim)
Returns:
scores: (B, num_docs) - cosine similarity scores
"""
query_emb = self.query_tower(query_features) # (B, hidden)
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = self.doc_tower(doc_features) # (B, num_docs, hidden)
doc_emb = F.normalize(doc_emb, dim=-1)
# Cosine similarity
scores = (query_emb.unsqueeze(1) * doc_emb).sum(dim=-1)
return scores
Serving Two-Tower Models
Training a two-tower model is only half the battle. The real challenge is serving it at scale—delivering personalized recommendations to millions of users in milliseconds. This section covers the production infrastructure that makes it possible.
The serving challenge:
At serving time, given a user, you need to find the most similar items from potentially billions of candidates. Even with decoupled towers, a naive linear scan is too slow:
10M items × 256 dimensions × 1 float op = 2.56 billion operations per user
At 10K QPS = 25.6 trillion operations per second = IMPOSSIBLE
The solution: Approximate Nearest Neighbor (ANN) search
Instead of exact nearest neighbor search (O(N)), we use approximate methods (O(log N) or O(1)) that find "close enough" neighbors with high probability. The key insight: we can trade a small amount of accuracy for massive speedup.
Popular ANN indexes:
- FAISS (Facebook AI Similarity Search):
  - Most popular library for dense vector search
  - Multiple index types: Flat (exact), IVF (inverted file), HNSW (graph-based), PQ (product quantization)
  - GPU support for batch processing
  - Used by: Meta, Spotify, Pinterest
- ScaNN (Google):
  - Anisotropic vector quantization—learns quantization optimized for MIPS (maximum inner product search)
  - 2x faster than HNSW at same recall
  - Used internally at Google/YouTube
- HNSW (Hierarchical Navigable Small World):
  - Graph-based index with hierarchical structure
  - Excellent recall-latency tradeoff
  - Available in: Faiss, Milvus, Qdrant, Weaviate
Index building workflow:
OFFLINE (daily/weekly):
───────────────────────────────────────────────────────────────
1. Export all item IDs from catalog
2. Run item tower inference on all items → item embeddings
3. Normalize embeddings (for cosine similarity)
4. Build ANN index from embeddings
5. Deploy index to serving infrastructure
ONLINE (per request):
───────────────────────────────────────────────────────────────
1. Receive user request
2. Fetch user features (watch history, demographics, etc.)
3. Run user tower inference → user embedding (typically ~5-10ms)
4. Query ANN index for top-K items → candidate set (typically ~1-5ms)
5. Pass candidates to ranking model
Embedding normalization for cosine similarity:
FAISS's most efficient indexes use inner product (dot product) similarity. To get cosine similarity, we normalize all vectors to unit length:

$$\hat{v} = \frac{v}{\lVert v \rVert_2}$$

After normalization, inner product equals cosine similarity.
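A quick sanity check of that equivalence in plain Python (toy vectors):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

u, v = [3.0, 4.0], [1.0, 2.0]
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
inner_product = dot(normalize(u), normalize(v))
assert abs(cosine - inner_product) < 1e-12
```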
Production considerations:
- Index sharding: For billions of items, shard the index across multiple machines. Route queries to all shards, merge results.
- Index updates: Item embeddings change as you retrain. Options:
  - Full rebuild (safest, but expensive)
  - Incremental updates (add new items, but index quality degrades over time)
  - Hybrid: Periodic full rebuilds + incremental updates
- Filtering: Often you need to filter candidates (e.g., exclude already-watched items, region restrictions). Options:
  - Pre-filter: Build separate indexes per filter dimension (expensive)
  - Post-filter: Retrieve more candidates, filter afterward (wasteful)
  - In-query filter: Some indexes (Milvus, Qdrant) support filtered search
- Caching: User embeddings can be cached for repeat visitors (typically with short TTL as preferences change).
import faiss
import numpy as np
class TwoTowerServing:
"""
Serving infrastructure for two-tower models.
"""
def __init__(self, model: YouTubeDNN, num_items: int, embedding_dim: int):
self.model = model
self.embedding_dim = embedding_dim
# Pre-compute all item embeddings
print("Computing item embeddings...")
item_ids = torch.arange(num_items)
with torch.no_grad():
self.item_embeddings = model.encode_items(item_ids).numpy()
# Build FAISS index for fast retrieval
print("Building FAISS index...")
self.index = faiss.IndexFlatIP(embedding_dim) # Inner product
faiss.normalize_L2(self.item_embeddings) # For cosine similarity
self.index.add(self.item_embeddings)
def retrieve(self, user_embedding: np.ndarray, k: int = 100) -> tuple[np.ndarray, np.ndarray]:
"""
Retrieve top-k items for user.
Args:
user_embedding: (embedding_dim,)
k: Number of items to retrieve
Returns:
item_ids: (k,)
scores: (k,)
"""
# Normalize user embedding
user_embedding = user_embedding / np.linalg.norm(user_embedding)
user_embedding = user_embedding.reshape(1, -1)
# Search
scores, item_ids = self.index.search(user_embedding, k)
return item_ids[0], scores[0]
def recommend(self, user_features: dict, k: int = 100) -> list[int]:
"""Full recommendation pipeline."""
# Encode user
with torch.no_grad():
user_emb = self.model.encode_user(**user_features).numpy()[0]
# Retrieve candidates
item_ids, scores = self.retrieve(user_emb, k)
return item_ids.tolist()
Part VI: Sequential Models
The models we've seen so far treat user-item interactions as a bag—they don't consider the order. But order matters! A user who watched [Inception, Interstellar, The Martian] likely wants more sci-fi, while [The Martian, Cast Away, Into the Wild] suggests survival films. Sequential models capture these temporal dynamics.
GRU4Rec: RNNs for Session-Based Recommendation
GRU4Rec (ICLR 2016 Workshop) was the first successful application of deep learning to session-based recommendation, fundamentally changing how we model user behavior.
The session-based recommendation problem:
In many scenarios, you don't have long-term user profiles:
- Anonymous users (no login)
- New users with no history (cold start)
- E-commerce sessions where intent changes rapidly
Instead, you only have the current session: a sequence of items the user interacted with in the past few minutes/hours. The task: predict the next item they'll interact with.
Session: [Product A] → [Product B] → [Product C] → ???
↑
Predict next item
Why RNNs for sequences?
Recurrent Neural Networks (RNNs) are designed for sequential data. They maintain a hidden state that evolves as they process each element:
Input: Item₁ Item₂ Item₃ Item₄
↓ ↓ ↓ ↓
[GRU] → [GRU] → [GRU] → [GRU]
↓ ↓ ↓ ↓
Hidden: h₁ → h₂ → h₃ → h₄
↓
Prediction
Each hidden state encodes the "summary" of the session so far. The final hidden state captures the user's current intent.
Why GRU instead of LSTM?
GRU4Rec uses Gated Recurrent Units (GRU) rather than LSTMs:
- Fewer parameters: GRU has 2 gates vs LSTM's 3, meaning faster training
- Similar performance: For recommendation sequences (typically short), GRU performs comparably to LSTM
- Simpler: Easier to implement and debug
The GRU update equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \quad \text{(update gate: how much to update)}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \quad \text{(reset gate: how much to forget)}$$
$$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})) \quad \text{(candidate hidden state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(final hidden state)}$$
Training with session-parallel mini-batches:
A key contribution of GRU4Rec is how they handle batching. Sessions have variable lengths, so naive batching is inefficient. Their solution:
- Sort sessions by length
- Create a sliding window across sessions
- When a session ends, replace it with a new one and reset its hidden state
This allows efficient GPU utilization while maintaining session independence.
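The bookkeeping can be sketched in plain Python (sessions are assumed to have at least two items; padding id 0 marks exhausted slots):

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset_flags) in GRU4Rec's session-parallel style.

    Each of the batch_size "slots" walks through one session; when a session
    ends, the slot is refilled with the next session and flagged so its
    hidden state gets reset. Exhausted slots emit padding id 0.
    """
    assert len(sessions) >= batch_size
    slot_sess = list(range(batch_size))  # which session each slot holds
    slot_pos = [0] * batch_size          # position within that session
    next_sess = batch_size
    finished = [False] * batch_size
    while not all(finished):
        inputs, targets, resets = [], [], []
        for i in range(batch_size):
            if finished[i]:
                inputs.append(0); targets.append(0); resets.append(True)
                continue
            s, p = slot_sess[i], slot_pos[i]
            inputs.append(sessions[s][p])
            targets.append(sessions[s][p + 1])
            resets.append(p == 0)  # reset hidden state at session start
            slot_pos[i] += 1
            if slot_pos[i] + 1 >= len(sessions[s]):  # session exhausted
                if next_sess < len(sessions):
                    slot_sess[i], slot_pos[i] = next_sess, 0
                    next_sess += 1
                else:
                    finished[i] = True
        yield inputs, targets, resets

batches = list(session_parallel_batches([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 2))
assert batches[0] == ([1, 4], [2, 5], [True, True])
assert len(batches) == 4
```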
BPR loss for recommendation:
GRU4Rec uses BPR (Bayesian Personalized Ranking) loss rather than cross-entropy:
$$L_{BPR} = -\frac{1}{N_S} \sum_{j=1}^{N_S} \log \sigma(\hat{r}_i - \hat{r}_j)$$

Where $\hat{r}_i$ is the score of the actual next item $i$ and $\hat{r}_j$ is the score of a negative sample $j$. This ranking loss directly optimizes for correct ordering.
TOP1 loss (GRU4Rec's innovation):
The paper also introduces TOP1 loss, which regularizes scores:

$$L_{TOP1} = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma(\hat{r}_j - \hat{r}_i) + \sigma(\hat{r}_j^2)$$

The second term penalizes large negative-sample scores, improving stability.
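Both losses are easy to sketch in plain Python, for one positive score and a list of negative-sample scores:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(r_pos: float, r_negs: list[float]) -> float:
    """BPR: -mean log sigmoid(r_pos - r_neg) over negative samples."""
    return -sum(math.log(sigmoid(r_pos - r_j)) for r_j in r_negs) / len(r_negs)

def top1_loss(r_pos: float, r_negs: list[float]) -> float:
    """TOP1: mean of sigmoid(r_neg - r_pos) + sigmoid(r_neg^2) over negatives."""
    return sum(sigmoid(r_j - r_pos) + sigmoid(r_j ** 2) for r_j in r_negs) / len(r_negs)

# Ranking the positive item higher lowers both losses.
assert bpr_loss(3.0, [-1.0, 0.0]) < bpr_loss(0.0, [-1.0, 0.0])
assert top1_loss(3.0, [-1.0, 0.0]) < top1_loss(0.0, [-1.0, 0.0])
```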
Practical considerations:
- Sequence length: Most value comes from recent items. 20-50 items is typical.
- Item embeddings: Shared between input and output (parameter tying).
- Dropout: Applied to inputs and between layers. Critical for preventing overfitting.
- Hidden size: 100-500 dimensions typically. Larger helps for diverse catalogs.
class GRU4Rec(nn.Module):
"""
GRU-based Session Recommendation.
Models user's current session as a sequence.
"""
def __init__(
self,
num_items: int,
embedding_dim: int = 100,
hidden_dim: int = 100,
num_layers: int = 1,
dropout: float = 0.2,
):
super().__init__()
self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)
self.gru = nn.GRU(
input_size=embedding_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
)
self.output = nn.Linear(hidden_dim, num_items)
def forward(self, session: torch.Tensor) -> torch.Tensor:
"""
Args:
session: (B, seq_len) - item IDs in session
Returns:
scores: (B, num_items) - scores for next item
"""
# Embed items
emb = self.item_embedding(session) # (B, L, D)
# GRU
output, _ = self.gru(emb) # (B, L, H)
# Use last hidden state to predict next item
last_hidden = output[:, -1, :] # (B, H)
scores = self.output(last_hidden) # (B, num_items)
return scores
def predict_next(self, session: torch.Tensor, k: int = 20) -> torch.Tensor:
"""Get top-k next items."""
scores = self.forward(session)
top_k = torch.topk(scores, k, dim=-1)
return top_k.indices
Caser: CNNs for Sequential Patterns
Caser (WSDM 2018) challenged the assumption that RNNs are the best architecture for sequences. Instead, Caser uses Convolutional Neural Networks to capture sequential patterns—and often outperforms RNN-based methods.
The key insight: Local patterns matter
Consider a shopping sequence: [Laptop] → [Laptop Bag] → [Mouse] → [Keyboard]. The patterns here are mostly local—consecutive items are related. You don't need to remember what happened 50 items ago; the last few items determine the next purchase.
CNNs excel at capturing local patterns through sliding windows (convolutions). This makes them surprisingly effective for short-term sequential recommendation.
Two types of patterns Caser captures:
Caser uses two distinct types of convolutions to capture different pattern types:
HORIZONTAL CONVOLUTIONS (union-level patterns):
──────────────────────────────────────────────────────────────
Capture which COMBINATIONS of recent items predict the next item.
Filter slides HORIZONTALLY across the embedding matrix.
Example: Filter height = 2 captures bigrams
[Laptop, Bag] → likely to buy accessories
Embedding matrix (L items × D dimensions):
┌─────────────────────┐
│ Laptop [0.8 0.2] │ ← Filter slides
│ Bag [0.7 0.3] │ horizontally
│ Mouse [0.6 0.4] │
│ Keyboard [0.5 0.5] │
└─────────────────────┘
VERTICAL CONVOLUTIONS (point-level patterns):
──────────────────────────────────────────────────────────────
Capture how INDIVIDUAL items influence the next item.
Filter slides VERTICALLY, aggregating information across features.
Example: "Laptop" (regardless of what followed) → predict accessories
This captures "skip patterns" where a single item signals intent.
Filter slides vertically across ALL positions:
┌─────────────────────┐
│ Laptop [0.8 0.2] │
│ Bag [0.7 0.3] │ ← Same filter applied
│ Mouse [0.6 0.4] │ to each row
│ Keyboard [0.5 0.5] │
└─────────────────────┘
Architecture breakdown:
- Input: User's last L items (e.g., L=5), embedded into D dimensions → (L × D) matrix
- Horizontal convolutions: Multiple filters with different heights (1, 2, 3, ..., L)
  - Height 1: Unigrams (single item patterns)
  - Height 2: Bigrams (consecutive pairs)
  - Height L: The entire sequence
  - Each filter produces a 1D output, then max-pooled
- Vertical convolutions: Filters with height L (full sequence) and width 1
  - Slides across embedding dimensions
  - Captures per-dimension patterns across all items
- User embedding: Added to capture long-term preferences (not just session)
- Output: Concatenate [horizontal features, vertical features, user embedding] → MLP → item scores
Why CNNs work for sequences:
- Parallelization: All convolutions run in parallel (unlike RNNs which are sequential)
- Local pattern capture: Explicit modeling of n-gram patterns
- Position sensitivity: Unlike bag-of-words, convolutions respect order
- Faster training: 2-3x faster than GRU4Rec in practice
The math behind horizontal convolutions:

For a filter $F \in \mathbb{R}^{h \times d}$ of height $h$ and width $d$ (the embedding dimension), applied to the embedding matrix $E \in \mathbb{R}^{L \times d}$:

$$c_i = \phi\big(E_{i:i+h-1} \odot F\big), \quad i = 1, \dots, L - h + 1$$

Then max-pool over all positions:

$$c = \max\{c_1, \dots, c_{L-h+1}\}$$

Where $\phi$ is the activation function and $\odot$ denotes the sum of element-wise products (a sliding inner product).
When to choose Caser over GRU4Rec:
- Short sequences: Caser shines when recent history (last 5-10 items) dominates
- Clear local patterns: E-commerce, music playlists with strong sequential dependencies
- Need for speed: 2-3x faster training, similar inference latency
- Explicit n-grams: When you want to interpret which item combinations matter
import torch
import torch.nn as nn
import torch.nn.functional as F
class Caser(nn.Module):
"""
Convolutional Sequence Embedding for Sequential Recommendation.
Uses CNNs to capture point-level and union-level patterns.
"""
def __init__(
self,
num_items: int,
num_users: int,
embedding_dim: int = 50,
max_seq_len: int = 5,
num_horizontal_filters: int = 16,
num_vertical_filters: int = 4,
):
super().__init__()
self.embedding_dim = embedding_dim
self.max_seq_len = max_seq_len
# Embeddings
self.item_embedding = nn.Embedding(num_items, embedding_dim, padding_idx=0)
self.user_embedding = nn.Embedding(num_users, embedding_dim)
# Horizontal convolutions (capture union-level patterns)
# Different filter heights for different n-gram sizes
self.horizontal_convs = nn.ModuleList([
nn.Conv2d(1, num_horizontal_filters, (i + 1, embedding_dim))
for i in range(max_seq_len)
])
# Vertical convolution (capture point-level patterns)
self.vertical_conv = nn.Conv2d(1, num_vertical_filters, (max_seq_len, 1))
# Output
horizontal_out = num_horizontal_filters * max_seq_len
vertical_out = num_vertical_filters * embedding_dim
self.fc = nn.Linear(horizontal_out + vertical_out + embedding_dim, num_items)
def forward(self, user: torch.Tensor, sequence: torch.Tensor) -> torch.Tensor:
"""
Args:
user: (B,) user IDs
sequence: (B, max_seq_len) item sequence
Returns:
scores: (B, num_items)
"""
B = user.size(0)
# Embed sequence
seq_emb = self.item_embedding(sequence) # (B, L, D)
seq_emb = seq_emb.unsqueeze(1) # (B, 1, L, D) - add channel dim
# Horizontal convolutions
h_outs = []
for conv in self.horizontal_convs:
h = F.relu(conv(seq_emb)) # (B, filters, L-k+1, 1)
h = h.squeeze(-1) # (B, filters, L-k+1)
h = F.max_pool1d(h, h.size(-1)) # (B, filters, 1)
h_outs.append(h.squeeze(-1))
horizontal = torch.cat(h_outs, dim=1) # (B, filters * L)
# Vertical convolution
v = F.relu(self.vertical_conv(seq_emb)) # (B, filters, 1, D)
vertical = v.view(B, -1) # (B, filters * D)
# User embedding
user_emb = self.user_embedding(user) # (B, D)
# Combine
combined = torch.cat([horizontal, vertical, user_emb], dim=1)
scores = self.fc(combined)
return scores
The Transition to Attention
GRU4Rec and Caser showed that sequence matters. But they had limitations:
- GRU: Sequential processing (slow), vanishing gradients
- Caser: Fixed receptive field, limited long-range modeling
This set the stage for attention-based models (SASRec, BERT4Rec) covered in the next post.
Part VII: Production Systems (Pre-Transformer Era)
How Tech Giants Built Their RecSys
Understanding how major companies evolved their recommendation systems provides crucial context. Each faced unique scale challenges and made different architectural choices.
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION RECSYS EVOLUTION BY COMPANY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NETFLIX (2006-2017) │
│ ─────────────────── │
│ 2006: Netflix Prize launches (MF era begins) │
│ 2009: BellKor's Pragmatic Chaos wins with 107-model ensemble │
│ 2012: Moves to hybrid: MF + content features + context │
│ 2016: Deep learning experiments begin │
│ Key lesson: Ensembles beat single models at scale │
│ │
│ AMAZON (2003-2017) │
│ ───────────────────── │
│ 2003: Item-based CF paper (foundational) │
│    2013: Hybrid CF + content for "Customers who bought..."              │
│    2016: DSSTNE (Deep Scalable Sparse Tensor Network Engine)            │
│ Key lesson: Item-item similarity scales better than user-user │
│ │
│ YOUTUBE (2010-2018) │
│ ─────────────────── │
│ 2010: Video recommendations via click-through rates │
│ 2016: Deep Neural Networks paper (two-tower revolution) │
│ 2018: Adds watch time prediction, not just CTR │
│ Key lesson: Optimize for engagement, not just clicks │
│ │
│ SPOTIFY (2014-2018) │
│ ────────────────────── │
│ 2014: Discover Weekly launches (CF + audio features) │
│ 2016: Adds NLP on playlists ("chill vibes" → similar tracks) │
│ 2018: BaRT (Bandits for Recommendations) for exploration │
│ Key lesson: Audio content features enable cold-start │
│ │
│ LINKEDIN (2012-2018) │
│ ───────────────────── │
│ 2012: People You May Know (graph-based CF) │
│ 2015: GLMix (Generalized Linear Mixed Models) │
│ 2017: Photon-ML for feature engineering │
│ Key lesson: Professional graph structure is gold │
│ │
│ PINTEREST (2015-2018) │
│ ─────────────────────── │
│ 2015: Related Pins via visual similarity (CNN features) │
│ 2017: PinSage (graph neural networks at scale) │
│ 2018: Multi-task learning (engagement + saves) │
│ Key lesson: Visual features dominate in image-first platforms │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Netflix: The Prize That Changed Everything
The Netflix Prize ($1M, 2006-2009) was a watershed moment for recommendation systems—it brought collaborative filtering from an academic curiosity to a mainstream machine learning problem, attracting thousands of teams worldwide and spawning techniques still used today.
The challenge:
Netflix released 100 million movie ratings (1-5 stars) from 480,000 users on 18,000 movies. The goal: beat their in-house recommender (Cinematch, a basic neighborhood CF system) by 10% in RMSE. The target seemed nearly unreachable at the time: Cinematch had been stuck at 0.9514 RMSE for years.
Why the Prize mattered:
- Open data at scale: For the first time, researchers had access to real industrial-scale recommendation data
- Clear metric: RMSE provided an objective benchmark that everyone optimized for
- Competition drove innovation: Teams shared insights, combined approaches, and pushed boundaries
- Industry attention: The $1M prize attracted researchers who had never worked on recommendations
The evolution of winning approaches:
Year 1 (2007): Single-model innovations
────────────────────────────────────────
- SVD with bias terms
- Temporal dynamics (preferences change over time)
- Best single model: ~0.88 RMSE
Year 2 (2008): Feature engineering
────────────────────────────────────────
- Implicit feedback (what users DIDN'T rate matters)
- Neighborhood models with shrinkage
- Best: ~0.87 RMSE
Year 3 (2009): Ensemble everything
────────────────────────────────────────
- 100+ model ensembles
- Stacking and blending
- Final winner: 0.8567 RMSE (10.09% improvement)
Key technical innovations from the Prize:
- SVD++ (implicit feedback): Users who rated many horror movies are horror fans, even if individual ratings vary. SVD++ added "implicit" factors based on which items a user interacted with—not just how they rated.
- Temporal dynamics: User preferences drift over time. A user who loved rom-coms in 2005 might prefer thrillers in 2008. Time-aware models captured this.
- Regularization: L2 regularization (weight decay) on latent factors was crucial. Without it, models memorized training data.
- Neighborhood + latent factor hybrids: The best results combined global latent factor models (matrix factorization) with local neighborhood models (kNN).
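The first of these, SVD++, has a compact prediction rule (Koren, KDD 2008). A minimal numpy sketch, with toy randomly initialized factors standing in for learned ones:

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y, interacted):
    """SVD++ prediction:
    r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^{-1/2} * sum_{j in N(u)} y_j)
    where N(u) is the set of items the user interacted with and y_j are
    the implicit item factors - interaction counts, not rating values."""
    implicit = y[interacted].sum(axis=0) / np.sqrt(len(interacted))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

# Toy factors (in practice all of these are learned by SGD)
rng = np.random.default_rng(0)
k = 8
y = rng.normal(scale=0.1, size=(100, k))   # implicit factors for 100 items
p_u = rng.normal(scale=0.1, size=k)        # user's explicit factors
q_i = rng.normal(scale=0.1, size=k)        # target item's factors
r_hat = svdpp_predict(3.6, 0.2, -0.1, p_u, q_i, y, [3, 17, 42])
```

The implicit term shifts the user vector toward the items they touched, so two users with identical ratings but different browsing histories get different predictions.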
Why Netflix didn't deploy the winning solution:
The winning 107-model ensemble was too complex for production. Netflix's actual deployed system used a simpler blend of ~3-5 models. The Prize's real value was the techniques and talent it surfaced—many participants joined Netflix.
The lesson for practitioners:
Single-model improvements matter, but ensembles almost always win competitions. However, production systems favor simplicity. The Prize demonstrated both truths simultaneously.
# Simplified version of BellKor's winning approach
class NetflixPrizeEnsemble:
"""
Key insight from Netflix Prize: Blend many models.
The winning solution combined 107 models.
"""
def __init__(self):
self.models = [
# Matrix factorization variants
SVDModel(factors=100),
SVDModel(factors=200),
SVDPlusPlusModel(factors=100), # SVD with implicit feedback
# Neighborhood models
ItemKNN(k=20),
ItemKNN(k=50),
# Restricted Boltzmann Machines
RBMModel(hidden_units=100),
# Time-aware models
TimeSVDModel(factors=100), # Captures preference drift
]
# Learned blending weights
self.weights = None
def fit(self, train_data, val_data):
"""Train all models and learn blending weights."""
# Train each model
for model in self.models:
model.fit(train_data)
# Get predictions on validation set
val_predictions = np.array([
model.predict(val_data)
for model in self.models
]).T # (num_samples, num_models)
# Learn optimal weights via linear regression
self.weights = self._learn_weights(val_predictions, val_data.ratings)
def predict(self, user_item_pairs):
"""Blended prediction."""
predictions = np.array([
model.predict(user_item_pairs)
for model in self.models
]).T
return predictions @ self.weights
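The `_learn_weights` step above is left abstract. A minimal ridge-regression version could look like this (the function name is ours; the winning teams used fancier blending, including gradient boosted trees):

```python
import numpy as np

def learn_blend_weights(val_predictions, val_ratings, l2=1e-3):
    """Ridge-regression blend: solve (P^T P + l2*I) w = P^T r for the
    weights that best combine per-model predictions on validation data.
    val_predictions: (num_samples, num_models)."""
    P, r = val_predictions, val_ratings
    k = P.shape[1]
    return np.linalg.solve(P.T @ P + l2 * np.eye(k), P.T @ r)

# Toy check: two models whose average equals the true ratings
P = np.array([[4.0, 2.0], [3.0, 5.0], [1.0, 3.0]])
r = P.mean(axis=1)
w = learn_blend_weights(P, r)
```

On this toy data the learned weights come out near [0.5, 0.5], i.e. the blend recovers the averaging that generated the targets.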
Netflix lessons:
- SVD++ with implicit feedback was crucial (views, not just ratings)
- Time-aware models captured preference drift
- Blending provided 10%+ improvement over single models
- Real deployment was much simpler than the prize-winning solution
YouTube: Two-Tower at Scale
YouTube's 2016 paper defined how to do deep learning recommendations at billion-scale. This paper is arguably the most influential industrial recommendation paper ever written—it established patterns that every major tech company now follows.
The scale challenge YouTube faced:
- 800+ million videos in the catalog
- 2+ billion logged-in users monthly
- 500+ hours of video uploaded every minute
- Latency budget: <200ms total for the entire recommendation flow
Scoring every video for every user is computationally impossible: 800M videos × 2B users ≈ 1.6 × 10^18 user-video pairs.
The solution: Two-stage architecture
YouTube's key architectural insight was splitting the problem into two stages, each optimized for different goals:
┌─────────────────────────────────────────────────────────────────────────┐
│ YOUTUBE'S TWO-STAGE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: CANDIDATE GENERATION │
│ ───────────────────────────── │
│ Input: User context (history, demographics, time) │
│ Output: ~1000 candidate videos (from 800M) │
│ Model: Deep Neural Network (two-tower) │
│ Goal: HIGH RECALL - don't miss relevant videos │
│ Budget: ~10ms │
│ │
│ Features (intentionally sparse for speed): │
│ - Watch history (averaged embeddings of last 50 videos) │
│ - Search history (averaged query embeddings) │
│ - Age, gender, geographic region │
│ - Time since last watch (freshness signal) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STAGE 2: RANKING │
│ ──────────── │
│ Input: ~1000 candidates + rich features │
│ Output: ~50 ranked videos for display │
│ Model: Feature-rich neural network │
│ Goal: HIGH PRECISION - order candidates correctly │
│ Budget: ~100ms │
│ │
│ Features (much richer, expensive to compute): │
│ - Video metadata (title, description, category, duration) │
│ - Video quality signals (watch time, likes, comments) │
│ - User-video interaction history (has user seen similar?) │
│ - Contextual (time of day, device, page location) │
│ - Freshness (video age, creator's recent performance) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why this architecture works:
- Candidate generation can be pre-computed. Video embeddings don't change per-user, so they're computed offline and indexed. Only user embedding computation is online (~1ms).
- Ranking uses expensive features that would be impossible to compute for all videos, but feasible for 1000 candidates.
- Different objectives: Candidate generation optimizes for recall (any potentially relevant video), ranking optimizes for engagement (watch time).
The "watch time" insight:
YouTube's ranking model predicts expected watch time, not click probability. This was crucial:
- Clicks are gameable: Clickbait titles/thumbnails generate clicks but disappoint users
- Watch time = actual engagement: If users watch the video, it was genuinely interesting
- Better business alignment: YouTube's ad revenue correlates with watch time, not clicks
Training the ranking model:
The model is trained with weighted logistic regression over watch time. For each impression:
- Positive example: User watched the video. Weight = watch time in seconds.
- Negative example: User didn't click. Weight = 1.
This gives longer-watched videos more influence on the model.
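The weighting scheme can be sketched as a weighted cross-entropy (the paper further observes that under this weighting the learned odds approximate expected watch time; the function below is our illustration, not YouTube's code):

```python
import numpy as np

def weighted_logistic_loss(logits, clicked, watch_time):
    """Weighted cross-entropy as in YouTube's ranking setup: clicked
    impressions are weighted by watch time in seconds, unclicked ones
    get unit weight, so long watches pull harder on the model."""
    p = 1.0 / (1.0 + np.exp(-logits))
    weights = np.where(clicked == 1, watch_time, 1.0)
    ce = -np.where(clicked == 1, np.log(p), np.log(1.0 - p))
    return float(np.sum(weights * ce) / np.sum(weights))

clicked = np.array([1, 1, 0, 0])
watch_time = np.array([120.0, 5.0, 0.0, 0.0])
# A model that scores the long watch highest gets a lower loss than one
# that gets the order backwards
good = weighted_logistic_loss(np.array([3.0, 1.0, -3.0, -2.0]), clicked, watch_time)
bad = weighted_logistic_loss(np.array([-3.0, -1.0, 3.0, 2.0]), clicked, watch_time)
```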
The business rules layer:
After ranking, YouTube applies business rules:
- Diversity: Don't show 10 videos from the same channel
- Freshness: Mix in recent uploads
- Fairness: Ensure smaller creators get exposure
- Brand safety: Filter inappropriate content
class YouTubeProductionSystem:
"""
YouTube's production recommendation architecture (2016).
Key insight: Split into RETRIEVAL (fast, broad) and RANKING (slow, precise).
"""
def __init__(self):
# Candidate generation: Two-tower DNN
self.candidate_generator = YouTubeDNN(
num_videos=1_000_000_000, # Billion videos
embedding_dim=256,
)
# Ranking: Feature-rich model
self.ranker = YouTubeRanker(
video_features=['duration', 'freshness', 'channel_ctr'],
user_features=['watch_history', 'search_history', 'demographics'],
context_features=['time_of_day', 'device', 'page'],
)
# ANN index for retrieval
self.ann_index = None
def serve(self, user_context: dict, num_results: int = 20) -> list:
"""Full serving pipeline."""
        # Stage 1: Candidate Generation (~10ms budget)
# Retrieve ~1000 candidates from billions of videos
user_embedding = self.candidate_generator.encode_user(user_context)
candidates = self.ann_index.search(user_embedding, k=1000)
# Stage 2: Ranking (100ms budget)
# Score candidates with rich features
video_features = self.fetch_features(candidates)
scores = self.ranker.score(user_context, candidates, video_features)
# Stage 3: Business Rules (10ms budget)
# Diversity, freshness, creator fairness
final_list = self.apply_business_rules(candidates, scores)
return final_list[:num_results]
class YouTubeRanker(nn.Module):
"""
YouTube's ranking model optimizes for WATCH TIME, not clicks.
Key insight: Clicks are easy to game. Watch time = real engagement.
"""
def __init__(self, video_features, user_features, context_features):
super().__init__()
# Feature processing
self.video_encoder = FeatureEncoder(video_features)
self.user_encoder = FeatureEncoder(user_features)
self.context_encoder = FeatureEncoder(context_features)
# Prediction heads
combined_dim = 512
self.watch_time_head = nn.Sequential(
nn.Linear(combined_dim, 256),
nn.ReLU(),
nn.Linear(256, 1),
)
def forward(self, user_features, video_features, context_features):
"""Predict expected watch time."""
user_emb = self.user_encoder(user_features)
video_emb = self.video_encoder(video_features)
context_emb = self.context_encoder(context_features)
combined = torch.cat([user_emb, video_emb, context_emb], dim=-1)
# Predict watch time (not binary click)
watch_time = self.watch_time_head(combined)
return watch_time
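The `ann_index.search` call in the serving pipeline above stands in for an approximate nearest neighbor library (Faiss, ScaNN, or similar). Over a small catalog, the exact version it approximates is plain maximum inner product search; a sketch:

```python
import numpy as np

def topk_inner_product(user_emb, video_embs, k):
    """Exact maximum inner product search: what ann_index.search
    approximates. O(num_videos * dim), fine for a demo, hopeless for
    hundreds of millions of videos - hence ANN indexes."""
    scores = video_embs @ user_emb
    idx = np.argpartition(-scores, k)[:k]    # top-k indices, unordered
    return idx[np.argsort(-scores[idx])]     # order the top-k by score

rng = np.random.default_rng(1)
videos = rng.normal(size=(10_000, 64))       # precomputed video embeddings
user = rng.normal(size=64)                   # online user embedding
candidates = topk_inner_product(user, videos, k=100)
```

This split (offline item embeddings, online user embedding, fast index lookup) is the serving pattern that makes two-tower retrieval practical.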
YouTube lessons:
- Two-stage architecture is essential at scale
- Optimize for watch time, not clicks (better engagement metric)
- Fresh content needs special handling (cold-start)
- The "example age" feature handles freshness: the model sees each training example's age as an input, which is zeroed out at serving time, so it learns viewers' preference for fresh content without going stale
Amazon: Item-Based CF at Scale
Amazon's 2003 paper "Amazon.com Recommendations: Item-to-Item Collaborative Filtering" is one of the most influential recommendation papers ever published. It established item-based collaborative filtering as the go-to approach for e-commerce and shaped how we think about scalable recommendations.
The problem with user-based CF at Amazon's scale:
Before 2003, most recommendation systems used user-based collaborative filtering: "Find users similar to you, recommend what they liked." This approach hit a wall at Amazon's scale:
User-Based CF Scalability Problem:
─────────────────────────────────────────────────────────────────────────
Users: ~50 million (in 2003, now 300M+)
Items: ~10 million products
To recommend for user U:
1. Compare U to ALL 50M users → 50M similarity calculations
2. For each similar user, iterate their purchases → expensive
This must happen in REAL-TIME (<100ms). Impossible.
Amazon's insight: Items are more stable than users
Item-based CF flips the approach: instead of "users like you liked X," it's "items similar to X are Y, Z."
Item-Based CF:
─────────────────────────────────────────────────────────────────────────
Pre-compute: For each item, find similar items (based on co-purchase)
This is OFFLINE. Can take hours. Run nightly.
Online: Given user's purchase history [A, B, C]
Look up pre-computed similar items for A, B, C
Aggregate and rank
Total: O(purchase_history_size × k) = O(100 × 50) = 5000 ops
5000 ops vs 50,000,000 ops. That's the difference.
Why items are more stable:
- User preferences drift: A user might love sci-fi this year, thrillers next year
- Item relationships are stable: "Laptops are bought with laptop bags" doesn't change
- New users appear constantly: 1000s of new users per minute. Recomputing user-user similarity is expensive.
- Items change slowly: New products launch, but most catalog is stable
Computing item similarity:
The key question: How do we define "similar items"? Amazon uses purchase co-occurrence:
This is essentially cosine similarity over the user-item purchase matrix, viewed from the item perspective.
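That cosine can be computed in one vectorized pass over a binary purchase matrix; the sqrt-popularity normalization in the denominator is also the dampening that keeps blockbuster items from matching everything (a sketch, not Amazon's implementation):

```python
import numpy as np

def item_cosine_similarity(R):
    """Item-item cosine over a binary user-item purchase matrix R
    (users x items). co[i, j] counts users who bought both i and j;
    dividing by sqrt(pop_i * pop_j) dampens popular-item bias."""
    co = R.T @ R                                  # co-purchase counts
    pop = np.sqrt(np.diag(co))                    # sqrt of item popularity
    sim = co / np.outer(pop, pop).clip(min=1e-12)
    np.fill_diagonal(sim, 0.0)                    # an item isn't its own neighbor
    return sim

R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
S = item_cosine_similarity(R)
```

At real scale this runs as a sparse batch job; the dense matmul here is only for illustration.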
The "Customers who bought this also bought" pattern:
This is item-based CF in action:
- User views product X
- Look up pre-computed similar items for X
- Display top-k as "Customers who bought this also bought..."
This simple pattern drives a significant portion of Amazon's revenue.
Implementation considerations:
- Sparsity handling: Most users buy few items. Use co-occurrence counts, not Pearson correlation.
- Popular item bias: Very popular items co-occur with everything. Dampen by dividing by item popularity.
- Temporal filtering: Recent purchases matter more. Decay older co-occurrences.
- Category constraints: "Laptop bag" is similar to "Laptop," but maybe don't recommend both.
class AmazonItemCF:
"""
Amazon's item-to-item collaborative filtering.
Key insight: Item similarity is more stable than user similarity.
"Customers who bought X also bought Y"
"""
def __init__(self):
self.item_similarity = {} # item_id -> [(similar_item, score)]
def compute_similarities_offline(self, purchase_matrix):
"""
Pre-compute all item-item similarities.
Run daily/weekly as batch job.
"""
num_items = purchase_matrix.shape[1]
for i in range(num_items):
# Users who bought item i
users_i = set(np.where(purchase_matrix[:, i] > 0)[0])
similarities = []
for j in range(num_items):
if i == j:
continue
# Users who bought item j
users_j = set(np.where(purchase_matrix[:, j] > 0)[0])
# Jaccard similarity (or cosine)
intersection = len(users_i & users_j)
union = len(users_i | users_j)
if union > 0:
sim = intersection / union
if sim > 0.01: # Threshold for storage
similarities.append((j, sim))
# Keep top-k similar items
similarities.sort(key=lambda x: x[1], reverse=True)
self.item_similarity[i] = similarities[:100]
def recommend(self, user_purchase_history: list, n: int = 10) -> list:
"""
Real-time recommendation based on user's purchases.
"""
# Aggregate similar items from purchase history
candidate_scores = {}
for purchased_item in user_purchase_history:
if purchased_item not in self.item_similarity:
continue
for similar_item, similarity in self.item_similarity[purchased_item]:
if similar_item in user_purchase_history:
continue # Don't recommend already purchased
if similar_item not in candidate_scores:
candidate_scores[similar_item] = 0
candidate_scores[similar_item] += similarity
# Sort by score
ranked = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
return [item_id for item_id, _ in ranked[:n]]
Amazon lessons:
- Pre-computation makes real-time serving feasible
- Purchase signals are stronger than views/clicks
- "Frequently bought together" = item-item in same basket
- Simple methods at scale beat complex methods that don't scale
Spotify: Audio Features for Cold-Start
Spotify's recommendation system is a masterclass in hybrid approaches—combining collaborative filtering with content-based features to solve one of the hardest problems in music recommendation: the cold-start problem for new songs.
The music cold-start problem:
Music platforms face a brutal cold-start challenge:
- 40,000+ new tracks uploaded daily to Spotify
- New songs have zero listening data
- Artists depend on algorithmic discovery for exposure
- Users expect fresh music in playlists like Discover Weekly
Traditional CF fails here: how do you recommend a song that no one has listened to yet?
Spotify's solution: Multiple signal sources
Spotify doesn't rely on a single model—they combine multiple signals, each addressing different scenarios:
┌─────────────────────────────────────────────────────────────────────────┐
│ SPOTIFY'S MULTI-SIGNAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SIGNAL 1: COLLABORATIVE FILTERING │
│ ─────────────────────────────────── │
│ When it works: Songs with listening history │
│ Method: Matrix factorization on user-track listens │
│ Strength: Captures taste clusters (indie rock fans, K-pop fans) │
│ │
│ SIGNAL 2: AUDIO CONTENT FEATURES │
│ ───────────────────────────────── │
│ When it works: ANY song (even brand new uploads) │
│ Method: CNN on mel-spectrograms → audio embedding │
│ Features: Tempo, energy, danceability, acousticness, valence │
│ Strength: Enables cold-start recommendations │
│ │
│ SIGNAL 3: PLAYLIST CO-OCCURRENCE (NLP) │
│ ───────────────────────────────────────── │
│ When it works: Songs that appear in user playlists │
│ Method: Word2Vec on playlists (songs = "words", playlists = "docs") │
│ Strength: Captures context ("chill vibes" playlists = similar mood) │
│ │
│ SIGNAL 4: EDITORIAL KNOWLEDGE │
│ ───────────────────────────────── │
│ When it works: Curated playlists, genre classifications │
│ Method: Human-labeled tags, genre graphs │
│ Strength: Handles edge cases, ensures quality │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Audio content features: The cold-start savior
Spotify's audio analysis pipeline extracts features directly from the audio waveform:
- Low-level features: Raw spectrograms, MFCCs (mel-frequency cepstral coefficients)
- Mid-level features: Beat, tempo, key, mode, time signature
- High-level features: Genre probability distribution, mood, energy
These are computed at upload time—before anyone listens. A new song immediately gets an embedding based on how it sounds, enabling recommendations like: "This new track sounds like songs you've loved before."
Playlist-based Word2Vec:
Spotify treats playlists as "documents" and songs as "words." Using Word2Vec:
- Songs that frequently appear in the same playlists get similar embeddings
- A song in many "workout" playlists ends up near other workout songs
- Captures semantic relationships that pure audio can't (e.g., "party anthems" might span genres)
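Skip-gram Word2Vec implicitly factorizes a shifted PMI matrix of co-occurrence counts (Levy & Goldberg, 2014), so as a self-contained stand-in we can build track embeddings from playlist co-occurrences with PPMI plus a truncated eigendecomposition. The playlists and track names below are toy data:

```python
import numpy as np
from itertools import combinations

playlists = [  # toy "documents": each playlist is a bag of tracks
    ["trackA", "trackB", "trackC"],
    ["trackA", "trackB"],
    ["trackD", "trackE"],
    ["trackD", "trackE", "trackC"],
]
tracks = sorted({t for p in playlists for t in p})
idx = {t: i for i, t in enumerate(tracks)}

# Symmetric co-occurrence counts within each playlist
C = np.zeros((len(tracks), len(tracks)))
for p in playlists:
    for a, b in combinations(p, 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Positive PMI, then a truncated eigendecomposition -> dense embeddings
total = C.sum()
row = C.sum(axis=1, keepdims=True)
pmi = np.log(C * total / (row @ row.T) + 1e-12)
ppmi = np.maximum(pmi, 0.0)
vals, vecs = np.linalg.eigh(ppmi)            # eigenvalues in ascending order
top = np.argsort(vals)[::-1][:2]             # keep the two largest
emb = vecs[:, top] * np.sqrt(vals[top])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
```

Tracks that always share playlists (trackA, trackB) end up with near-identical embeddings, while tracks that never co-occur (trackA, trackD) land far apart.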
Discover Weekly: The flagship product
Discover Weekly (30 personalized songs every Monday) combines all signals:
- Find your taste clusters: Use CF to identify what genres/artists you like
- Explore similar users' discoveries: What did similar users recently discover?
- Audio similarity expansion: Find audio-similar tracks to your favorites
- Freshness filter: Exclude songs you've already heard
- Diversity injection: Ensure variety across the 30 tracks
The result: ~40 million users listen to Discover Weekly, with >5 billion track streams.
Bandit exploration for new music:
Spotify uses multi-armed bandits to balance exploitation (recommend what you'll like) vs exploration (surface new tracks to gather data). New tracks with uncertain quality get some exposure to "test" them, then exploitation takes over based on observed engagement.
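The explore/exploit mechanic can be sketched with Beta-Bernoulli Thompson sampling; this is a generic bandit illustration, not Spotify's actual BaRT internals:

```python
import numpy as np

class TrackBandit:
    """Thompson sampling over tracks. Each track keeps a
    Beta(wins + 1, losses + 1) posterior over its engagement rate; we
    play the track whose sampled rate is highest, so uncertain new
    tracks get tested and proven ones get exploited."""

    def __init__(self, num_tracks, seed=0):
        self.wins = np.zeros(num_tracks)
        self.losses = np.zeros(num_tracks)
        self.rng = np.random.default_rng(seed)

    def pick(self):
        samples = self.rng.beta(self.wins + 1, self.losses + 1)
        return int(np.argmax(samples))

    def update(self, track, engaged):
        if engaged:
            self.wins[track] += 1
        else:
            self.losses[track] += 1

# Simulate: track 2 has the highest true engagement rate; exposure
# gradually concentrates on it while the others still get sampled
true_rates = [0.1, 0.2, 0.6]
bandit = TrackBandit(3, seed=42)
world = np.random.default_rng(7)
for _ in range(2000):
    t = bandit.pick()
    bandit.update(t, world.random() < true_rates[t])
plays = bandit.wins + bandit.losses
```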
class SpotifyHybridRecSys:
"""
Spotify's hybrid approach: CF + Audio Content.
Key insight: New songs have no listening data,
but we can analyze the audio itself.
"""
def __init__(self):
# Collaborative filtering component
self.cf_model = MatrixFactorization(num_factors=100)
# Audio analysis component (CNN on spectrograms)
self.audio_model = AudioCNN()
# Playlist-based NLP
self.playlist_nlp = PlaylistWord2Vec()
def get_track_embedding(self, track_id: str, track_metadata: dict) -> np.ndarray:
"""
Hybrid embedding combining multiple signals.
"""
embeddings = []
# CF embedding (if track has enough listens)
if self.has_cf_embedding(track_id):
cf_emb = self.cf_model.get_item_embedding(track_id)
embeddings.append(cf_emb)
# Audio embedding (always available)
audio_features = self.audio_model.extract_features(
track_metadata['audio_file']
)
embeddings.append(audio_features)
# Playlist co-occurrence embedding
if self.has_playlist_data(track_id):
playlist_emb = self.playlist_nlp.get_track_embedding(track_id)
embeddings.append(playlist_emb)
# Weighted combination (learned weights)
return self.combine_embeddings(embeddings)
def discover_weekly(self, user_id: str) -> list:
"""
Generate Discover Weekly playlist.
Mix of CF recommendations + audio-similar discoveries.
"""
# CF: Songs similar users liked
cf_candidates = self.cf_model.recommend(user_id, n=100)
# Audio: Songs audio-similar to user's favorites
user_favorites = self.get_user_top_tracks(user_id)
audio_candidates = []
for fav in user_favorites[:10]:
similar = self.audio_model.find_similar(fav, n=20)
audio_candidates.extend(similar)
# Blend and diversify
final_playlist = self.blend_and_diversify(
cf_candidates,
audio_candidates,
target_length=30,
)
return final_playlist
Spotify lessons:
- Audio features (tempo, energy, danceability) enable cold-start
- Playlist co-occurrence = implicit collaborative signal
- Discover Weekly balances exploitation (CF) + exploration (audio similarity)
- Release Radar uses freshness + artist affinity
Pinterest: Visual Similarity at Scale
Pinterest is an image-first platform—users come to discover visual ideas for home decor, fashion, recipes, and more. This makes Pinterest unique: visual features aren't just "nice to have," they're the primary signal for recommendation quality.
Why visual features dominate on Pinterest:
On Pinterest, two pins might have:
- Identical metadata: "living room decor"
- Completely different visual appeal: modern minimalist vs. bohemian maximalist
A user saving minimalist pins wants more minimalist—not just "living room decor." Visual similarity captures this nuance that metadata cannot.
The scale challenge:
Pinterest has:
- 300+ billion saved pins
- 5+ billion boards
- 500+ million monthly users
- Billions of images to process
Computing visual similarity for billions of images is computationally massive.
Pinterest's multi-modal approach:
Pinterest combines several signals, with visual features as the foundation:
┌─────────────────────────────────────────────────────────────────────────┐
│ PINTEREST'S MULTI-MODAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SIGNAL 1: VISUAL FEATURES (CNN-based) │
│ ─────────────────────────────────────── │
│ Method: Pre-trained ResNet/VGG → 4096-dim embedding │
│ Use case: Visual similarity, style matching │
│ Example: "Show more pins that LOOK like this" │
│ │
│ SIGNAL 2: GRAPH FEATURES (PinSage) │
│ ───────────────────────────────────── │
│ Method: Graph Neural Network on pin-board-user graph │
│ Use case: Contextual similarity, taste clustering │
│ Example: "Pins saved to similar boards" │
│ │
│ SIGNAL 3: TEXT FEATURES │
│ ──────────────────────── │
│ Method: BERT embeddings on pin descriptions │
│ Use case: Semantic search, query understanding │
│ Example: "modern kitchen ideas" │
│ │
│ SIGNAL 4: ENGAGEMENT FEATURES │
│ ────────────────────────────── │
│ Method: Historical CTR, save rate, closeup rate │
│ Use case: Quality scoring, ranking │
│ Example: Filter low-engagement pins │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PinSage: Graph Neural Networks at Billion Scale
PinSage (KDD 2018) was a breakthrough—the first GNN deployed at billion scale. Traditional GNNs aggregate information from every neighbor of every node, which is intractable for high-degree nodes in a web-scale graph. PinSage's innovations:
- Random walk sampling: Instead of all neighbors, sample important neighbors via random walks
- Importance pooling: Weight neighbors by visit frequency in random walks
- Producer-consumer architecture: Separate graph traversal from neural network computation
- Mini-batch training: Train on subgraphs, not the full graph
The result: GNN embeddings for 3 billion pins, computed in hours.
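The random-walk sampling step can be sketched in plain Python (function and graph names are ours, and the real system runs walks at a very different scale):

```python
import random
from collections import Counter

def important_neighbors(graph, start, num_walks=200, walk_len=3,
                        restart_p=0.5, top_k=3, seed=0):
    """PinSage-style neighbor sampling: run short random walks with
    restart from `start`; the most-visited nodes become its "important
    neighbors", weighted by normalized visit frequency (the counts that
    importance pooling uses)."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            if rng.random() < restart_p:
                node = start              # teleport back to the seed
            else:
                node = rng.choice(graph[node])
                visits[node] += 1
    visits.pop(start, None)               # the seed is not its own neighbor
    total = sum(visits.values())
    return [(n, c / total) for n, c in visits.most_common(top_k)]

# Toy bipartite pin-board graph as adjacency lists
graph = {
    "pin1": ["boardA", "boardB"], "pin2": ["boardA"],
    "pin3": ["boardB", "boardC"],
    "boardA": ["pin1", "pin2"], "boardB": ["pin1", "pin3"],
    "boardC": ["pin3"],
}
neighbors = important_neighbors(graph, "pin1")
```

Sampling a fixed number of walk-weighted neighbors per node is what turns the O(all neighbors) aggregation into a constant-cost operation.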
The graph structure:
Pinterest's graph is bipartite:
- Pins (images)
- Boards (collections of pins)
Two pins are "connected" if they appear on the same board. PinSage learns embeddings where pins saved to similar boards end up close in embedding space—even if they look visually different.
Visual search: "More like this"
Pinterest's visual search lets users:
- Tap on any object in an image
- See visually similar pins
This requires:
- Object detection (identify what the user tapped)
- Feature extraction (embed the detected object)
- ANN search (find similar embeddings)
The system processes millions of visual searches per day.
Diversification: Avoiding the "same image" problem
Visual similarity can be too precise—showing 20 nearly identical images is useless. Pinterest uses Maximal Marginal Relevance (MMR) to balance relevance and diversity:

MMR = argmax_{d ∈ C \ S} [ λ · sim(d, q) − (1 − λ) · max_{s ∈ S} sim(d, s) ]

Where:
- C = candidate pins
- S = already selected pins
- q = query/seed pin
- λ = relevance/diversity tradeoff
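Greedy MMR selection takes only a few lines; the sketch below works on precomputed similarity scores (names are illustrative):

```python
import numpy as np

def mmr_select(sim_to_query, sim_matrix, k, lam=0.7):
    """Greedy MMR: at each step pick the candidate maximizing
    lam * sim(d, q) - (1 - lam) * max over already-selected sim(d, s)."""
    n = len(sim_to_query)
    selected, remaining = [], set(range(n))
    while remaining and len(selected) < k:
        def mmr_score(d):
            redundancy = max((sim_matrix[d][s] for s in selected), default=0.0)
            return lam * sim_to_query[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Pins 0 and 1 are near-duplicates; MMR should not rank them back-to-back
sim_q = np.array([0.95, 0.94, 0.80])
sim_m = np.array([[1.0, 0.99, 0.1],
                  [0.99, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
order = mmr_select(sim_q, sim_m, k=3)
```

With λ = 0.7 the toy example returns [0, 2, 1]: after picking pin 0, its near-duplicate pin 1 is penalized and the visually different pin 2 jumps ahead.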
class PinterestVisualRecSys:
"""
Pinterest's visual recommendation system.
Key insight: In an image platform, visual similarity
is often more important than collaborative signals.
"""
def __init__(self):
# Visual feature extractor (pre-trained CNN)
self.visual_encoder = torchvision.models.resnet50(pretrained=True)
self.visual_encoder = nn.Sequential(*list(self.visual_encoder.children())[:-1])
# Graph neural network for pin-board relationships
self.pinsage = PinSage(
embedding_dim=256,
num_layers=3,
)
def get_pin_embedding(self, pin_id: str, pin_image: np.ndarray) -> np.ndarray:
"""
Combine visual and graph embeddings.
"""
# Visual features from image
with torch.no_grad():
image_tensor = self.preprocess(pin_image)
visual_emb = self.visual_encoder(image_tensor).squeeze()
# Graph features from pin-board-user graph
graph_emb = self.pinsage.get_embedding(pin_id)
# Concatenate
return np.concatenate([visual_emb, graph_emb])
def related_pins(self, pin_id: str, pin_image: np.ndarray, n: int = 20) -> list:
"""
Find visually and contextually similar pins.
"""
query_emb = self.get_pin_embedding(pin_id, pin_image)
# ANN search
similar_pins = self.ann_index.search(query_emb, k=n * 2)
# Diversify (don't show 20 nearly identical images)
diversified = self.maximal_marginal_relevance(similar_pins, n)
return diversified
class PinSage(nn.Module):
"""
PinSage: Graph Neural Network for Pinterest.
Paper: https://arxiv.org/abs/1806.01973
Key innovation: Random walk sampling for scalable GNN.
"""
def __init__(self, embedding_dim: int, num_layers: int):
super().__init__()
self.layers = nn.ModuleList([
PinSageConv(embedding_dim)
for _ in range(num_layers)
])
def forward(self, pin_features, neighbor_features, neighbor_weights):
"""
Aggregate neighbor information.
Unlike standard GNN: Uses importance-sampled neighbors
from random walks, not all neighbors.
"""
h = pin_features
for layer in self.layers:
# Aggregate from sampled neighbors
neighbor_agg = self.aggregate(neighbor_features, neighbor_weights)
# Combine with self
h = layer(h, neighbor_agg)
return h
Pinterest lessons:
- Visual features from CNNs are first-class signals
- PinSage enabled GNNs at billion scale via sampling
- Diversification is crucial (avoid "more of the same")
- Board context provides implicit interest clustering
ByteDance/TikTok: Real-Time Learning
TikTok's recommendation system is legendary—often cited as the most engaging recommendation algorithm ever built. A new user gets eerily personalized content within minutes. The secret? Real-time learning that updates models as users interact, rather than waiting for daily batch training.
Why real-time learning matters for TikTok:
TikTok's content cycle is extremely fast:
- Videos go viral in hours, not days
- Trends emerge and die within a single day
- User preferences shift based on current mood
- New creators need exposure immediately
A model trained yesterday is already stale. Traditional batch training (daily/weekly) can't keep up.
The speed advantage:
Traditional Recommendation Pipeline:
───────────────────────────────────────────────────────────────────────────
Day 1: User likes cat videos
Day 2: Batch training incorporates yesterday's data
Day 3: Model deployed with updated weights
Day 4: User finally sees more cat videos
→ 3-day lag between signal and adaptation
TikTok's Real-Time Pipeline:
───────────────────────────────────────────────────────────────────────────
Second 0: User likes cat video
Second 1: Event streamed to Kafka
Second 2: Online model update
Second 60: Model synced to serving
Second 61: User sees more cat videos
→ ~1 minute lag
The Monolith architecture (2022 paper):
ByteDance published their system in "Monolith: Real Time Recommendation System With Collisionless Embedding Table". Key innovations:
- Collisionless embedding tables: Traditional embedding tables use hashing (item_id % table_size), which causes collisions. Monolith uses expressive IDs that grow dynamically—no collisions, better personalization.
- Online training from streaming data: Events (watches, likes, shares) stream to Kafka. Training workers consume events and update model parameters in real-time.
- Parameter server architecture: Model parameters live in a distributed parameter server. Training pushes gradients, serving pulls latest weights.
- Time-window features: Features like "user engagement in last 5 minutes" are computed on streaming data, not batch.
The For You Page (FYP) algorithm:
TikTok's FYP combines:
- Candidate generation: Multiple retrievers (embedding similarity, trending, following, etc.) generate ~1000 candidates
- Ranking model: A multi-task neural network predicts:
  - Watch time probability
  - Like probability
  - Share probability
  - Comment probability
  - Follow probability
- Exploration: New content gets "exploration traffic" to gather engagement signals. Low-engagement content gets filtered quickly.
- Business rules: Creator fairness, content diversity, safety filters
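A common way to collapse several task heads into one ranking score is a weighted combination of their predicted probabilities; the function and weights below are purely illustrative, not TikTok's actual formula:

```python
def fyp_rank_score(preds, weights=None):
    """Weighted sum over multi-task head outputs. Rarer, higher-intent
    actions (share, follow) typically get larger weights than the
    easy ones; these numbers are made up for illustration."""
    weights = weights or {"watch": 1.0, "like": 0.5, "share": 0.8,
                          "comment": 0.4, "follow": 1.2}
    return sum(weights[h] * preds[h] for h in weights)

video_a = {"watch": 0.9, "like": 0.2, "share": 0.05, "comment": 0.1, "follow": 0.01}
video_b = {"watch": 0.4, "like": 0.1, "share": 0.02, "comment": 0.05, "follow": 0.0}
```

Here video_a, with stronger predicted engagement on every head, outranks video_b.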
Why TikTok "gets you" so fast:
The real-time system enables extremely fast adaptation:
- Swipe patterns: Which videos you watch 5+ seconds vs. skip immediately
- Engagement signals: Likes, comments, shares, follows
- Implicit signals: Rewatch, slow-scroll, tap to unmute
Even 5-10 interactions give the model strong signal about your current interests.
The addiction machine critique:
TikTok's effectiveness has drawn criticism—the algorithm is so good at engaging users that it can be addictive. The same real-time learning that enables personalization also enables behavioral manipulation. This is a real ethical consideration for anyone building recommendation systems.
```python
class ByteDanceRealTimeLearning:
    """
    ByteDance's approach: update models in real-time as users interact.

    Traditional: train daily/weekly -> deploy
    ByteDance:   train continuously -> sync every minute
    """
    def __init__(self):
        # Streaming data pipeline
        self.kafka_consumer = KafkaConsumer('user_actions')
        # Online training
        self.model = RealTimeModel()
        self.parameter_server = ParameterServer()
        # Feature store (low-latency)
        self.feature_store = RedisFeatureStore()

    def online_training_loop(self):
        """Continuous training from streaming events."""
        for event in self.kafka_consumer:
            # Event: {user_id, video_id, action, timestamp}
            # Create training example
            user_features = self.feature_store.get(event['user_id'])
            video_features = self.feature_store.get(event['video_id'])
            label = 1 if event['action'] == 'watch_complete' else 0

            # Online update
            loss = self.model.train_step(user_features, video_features, label)

            # Sync to parameter server (batched)
            if self.should_sync():
                self.parameter_server.push_gradients(self.model.get_gradients())

    def serve(self, user_id: str, candidate_videos: list) -> list:
        """Serve with the latest model parameters."""
        # Pull latest parameters (sync every ~1 minute)
        self.model.load_from_parameter_server(self.parameter_server)

        # Score candidates
        scores = self.model.predict(user_id, candidate_videos)
        return sorted(zip(candidate_videos, scores), key=lambda x: x[1], reverse=True)
```
ByteDance/TikTok lessons:
- Real-time learning captures trending content immediately
- Collision-less embeddings handle billion-scale item catalogs
- Watch time prediction > click prediction
- Short video = fast feedback loop = faster model iteration
LinkedIn: Professional Graph Structure
LinkedIn's recommendation challenges are fundamentally different from entertainment platforms. On Netflix or TikTok, the goal is engagement. On LinkedIn, recommendations must be professionally relevant—a bad connection recommendation isn't just annoying, it's embarrassing. "Do I know this person? Should I know them? Will they think I'm spam?"
What makes professional recommendations different:
┌─────────────────────────────────────────────────────────────────────────┐
│ ENTERTAINMENT vs. PROFESSIONAL RECOMMENDATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ENTERTAINMENT (Netflix, Spotify, TikTok) │
│ ───────────────────────────────────────── │
│ Goal: Maximize engagement │
│ Mistake cost: User skips, no big deal │
│ Signal: Implicit (watch time, skips) │
│ Graph: Weak (users don't know each other) │
│ │
│ PROFESSIONAL (LinkedIn) │
│ ──────────────────────── │
│ Goal: Maximize professional value │
│ Mistake cost: Embarrassment, spam perception │
│ Signal: Explicit (connections are deliberate) │
│ Graph: Strong (users KNOW each other in real life) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The professional graph is gold:
LinkedIn's most valuable asset is the connection graph. It encodes real-world professional relationships:
- Same company: Colleagues who work together
- Same school: Alumni who share educational background
- Same industry: Professionals in the same field
- Career trajectory: Job transitions that predict future moves
This graph structure is far more predictive than any content-based signal.
People You May Know (PYMK):
PYMK is LinkedIn's most important recommendation surface—it drives network growth. The algorithm combines multiple signals:
- Mutual connections: The single most predictive feature. If you and X share 5+ mutual connections, you likely know each other.
- Shared affiliations:
  - Same current company (colleagues)
  - Same past company (former colleagues)
  - Same school (classmates, alumni)
- Profile similarity:
  - Same job title/function
  - Same industry
  - Same location
- Graph structure:
  - 2nd degree connections (friends of friends)
  - Clustering coefficient (are your friends friends with each other?)
The "triangle closing" pattern:
The strongest PYMK signal is closing triangles. If A→B and A→C, then B→C is likely:
```
      A
     / \
    B   C
     \?/
```

A knows B (connection)
A knows C (connection)
B probably knows C (recommend!)
This simple heuristic drives a huge portion of PYMK recommendations.
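A minimal sketch of triangle closing via common-neighbor counting—an illustration of the heuristic itself, not LinkedIn's production code:

```python
def triangle_closing_candidates(graph: dict, user: str) -> dict:
    """Score 2nd-degree candidates by how many triangles connecting
    them would close (i.e., mutual-connection count).

    graph maps each user to their set of direct connections.
    """
    friends = graph.get(user, set())
    counts = {}
    for friend in friends:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in friends:
                # Each path user -> friend -> fof is one open triangle
                counts[fof] = counts.get(fof, 0) + 1
    return counts  # candidate -> number of mutual connections

graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}
print(triangle_closing_candidates(graph, "A"))  # {'D': 2}
```

D is reachable from A through both B and C, so recommending the A–D connection would close two triangles at once—exactly the pattern the diagram above describes.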
GLMix: Generalized Linear Mixed Models
LinkedIn developed GLMix (2015) to personalize at scale. The insight: some effects are global (industry trends), some are per-user (individual preferences), some are per-item (company popularity).
GLMix combines:
- Fixed effects: Global model parameters (same for all users)
- Random effects: Per-user and per-item parameters
This allows personalization without overfitting on sparse data.
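A toy sketch of the GLMix decomposition. For simplicity it assumes the fixed and random effects share one feature vector; the real model applies each effect to its own feature set, and the numbers below are invented for illustration:

```python
import numpy as np

def glmix_score(x, global_w, user_w, item_w):
    """GLMix-style score: fixed (global) coefficients plus per-user
    and per-item random-effect coefficients, passed through a sigmoid."""
    logit = x @ (global_w + user_w + item_w)
    return 1.0 / (1.0 + np.exp(-logit))

x = np.array([1.0, 0.5])          # e.g., [bias, skill-match strength]
global_w = np.array([0.2, 1.0])   # fixed effect: everyone values skill match
user_w = np.array([0.1, 0.3])     # this user values skill match more than average
item_w = np.array([-0.4, 0.0])    # this job draws fewer clicks than average

p = glmix_score(x, global_w, user_w, item_w)
print(round(p, 3))  # 0.634
```

The regularization story is the key: random effects for a user with little data shrink toward zero, so sparse users fall back to the global model instead of overfitting.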
Job recommendations: A different problem
Job recommendations have unique constraints:
- Jobs expire: Unlike songs, jobs disappear when filled
- Location matters: Users won't relocate for every job
- Seniority matching: Entry-level vs. senior roles
- Career trajectory: Recommend realistic next steps
LinkedIn's job recommendation combines:
- Skills matching (job requirements vs. user skills)
- Career path modeling (what jobs do similar people take next?)
- Company affinity (has user engaged with this company?)
- Freshness (prefer recently posted jobs)
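The four signals above can be sketched as one scoring function. All feature names and weights here are assumptions for illustration, not LinkedIn's actual model:

```python
from datetime import datetime

def job_score(user: dict, job: dict, now: datetime) -> float:
    """Illustrative job-recommendation score combining skills match,
    seniority fit, company affinity, and freshness."""
    # Skills: fraction of required skills the user has
    required = job["required_skills"]
    skills_match = len(user["skills"] & required) / max(len(required), 1)

    # Seniority: heavily penalize mismatched levels
    seniority_fit = 1.0 if user["seniority"] == job["seniority"] else 0.3

    # Affinity: has the user engaged with this company before?
    company_affinity = 1.0 if job["company"] in user["engaged_companies"] else 0.0

    # Freshness: linear decay to zero over 30 days (jobs expire)
    days_old = (now - job["posted"]).days
    freshness = max(0.0, 1.0 - days_old / 30.0)

    return (0.5 * skills_match + 0.2 * seniority_fit
            + 0.1 * company_affinity + 0.2 * freshness)

user = {"skills": {"python", "ml"}, "seniority": "senior",
        "engaged_companies": {"Acme"}}
job = {"required_skills": {"python", "ml", "sql"}, "seniority": "senior",
       "company": "Acme", "posted": datetime(2025, 1, 1)}

print(round(job_score(user, job, datetime(2025, 1, 6)), 3))  # 0.8
```

Note how freshness is a first-class term rather than a tiebreaker—unlike songs or movies, a stale job posting has near-zero value.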
```python
class LinkedInPYMK:
    """
    LinkedIn's People You May Know (PYMK).

    Key insight: professional connections follow patterns
    (same company, same school, same industry).
    """
    def __init__(self):
        self.graph = ProfessionalGraph()
        self.embedding_model = GraphEmbedding()

    def people_you_may_know(self, user_id: str, n: int = 20) -> list:
        """Recommend connections based on graph structure."""
        candidates = []

        # 2nd degree connections (friends of friends)
        friends = self.graph.get_connections(user_id)
        for friend in friends:
            for fof in self.graph.get_connections(friend):
                if fof != user_id and fof not in friends:
                    candidates.append(fof)

        # Score by multiple signals
        scored = []
        for candidate in set(candidates):
            score = self.compute_pymk_score(user_id, candidate)
            scored.append((candidate, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [c for c, _ in scored[:n]]

    def compute_pymk_score(self, user_id: str, candidate_id: str) -> float:
        """Multi-factor scoring for PYMK."""
        score = 0.0

        # Mutual connections (strongest signal)
        mutual = self.graph.mutual_connections(user_id, candidate_id)
        score += len(mutual) * 10

        # Same current company
        if self.same_company(user_id, candidate_id):
            score += 50

        # Same school
        if self.same_school(user_id, candidate_id):
            score += 30

        # Same industry
        if self.same_industry(user_id, candidate_id):
            score += 20

        # Profile similarity (skills, experience)
        score += self.profile_similarity(user_id, candidate_id) * 10

        return score
```
LinkedIn lessons:
- Graph structure is a powerful signal for professional networks
- Multi-factor models (GLMix) handle heterogeneous features
- Mutual connections are the strongest predictor
- Different recommendation contexts need different models (jobs vs. people vs. content)
Part VIII: Evaluation Metrics
Understanding how to evaluate recommendation systems is as important as building them. Different metrics capture different aspects of quality, and the right choice depends on your use case.
Ranking Metrics
Most recommendation systems produce ranked lists. These metrics evaluate ranking quality:
Precision@K: What fraction of the top-K recommendations were relevant?
Recall@K: What fraction of all relevant items appeared in top K?
Hit Rate@K (HR@K): Did at least one relevant item appear in top K? Binary version of Recall.
NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality, giving higher scores when relevant items appear earlier. DCG@K = Σ_{i=1..K} rel_i / log2(i + 1), and NDCG@K = DCG@K / IDCG@K, where IDCG is the DCG of the ideal (perfect) ranking.
MRR (Mean Reciprocal Rank): Average of reciprocal ranks of first relevant item.
MAP (Mean Average Precision): Average of precision at each relevant item's position.
```python
import numpy as np
from typing import List, Set

def precision_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Precision@K: fraction of top-K that are relevant."""
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / k

def recall_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Recall@K: fraction of relevant items in top K."""
    if len(relevant) == 0:
        return 0.0
    recommended_k = recommended[:k]
    hits = len(set(recommended_k) & relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """NDCG@K: normalized discounted cumulative gain."""
    recommended_k = recommended[:k]

    # DCG
    dcg = 0.0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because positions are 1-indexed

    # IDCG (ideal: all relevant items at the top)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0

def mrr(recommended_lists: List[List[int]], relevant_sets: List[Set[int]]) -> float:
    """Mean Reciprocal Rank across users."""
    reciprocal_ranks = []
    for recommended, relevant in zip(recommended_lists, relevant_sets):
        for rank, item in enumerate(recommended, 1):
            if item in relevant:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)
    return np.mean(reciprocal_ranks)

def ap_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Average Precision@K, normalized by min(|relevant|, k).
    MAP@K is this value averaged across users."""
    recommended_k = recommended[:k]
    score = 0.0
    hits = 0
    for i, item in enumerate(recommended_k):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    denom = min(len(relevant), k)
    return score / denom if denom > 0 else 0.0

# Example usage
recommended = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # Top 10 recommendations
relevant = {3, 7, 15, 22, 25}                      # Ground-truth relevant items

print(f"Precision@10: {precision_at_k(recommended, relevant, 10):.3f}")  # 0.300
print(f"Recall@10: {recall_at_k(recommended, relevant, 10):.3f}")        # 0.600
print(f"NDCG@10: {ndcg_at_k(recommended, relevant, 10):.3f}")            # 0.467
print(f"AP@10: {ap_at_k(recommended, relevant, 10):.3f}")                # 0.275
```
Beyond Accuracy: Diversity and Coverage
Accuracy metrics don't capture everything. A system that recommends only popular items might have high precision but provide poor user experience.
Coverage: What fraction of items ever get recommended?
Diversity (Intra-List): How different are items within a recommendation list?
Novelty: Are we recommending items users haven't seen before?
Serendipity: Are we recommending surprising but relevant items?
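Coverage and intra-list diversity are straightforward to compute. A sketch, defining diversity as one minus the mean pairwise cosine similarity of item embeddings (one common choice among several):

```python
import numpy as np

def catalog_coverage(all_recommendations: list, catalog_size: int) -> float:
    """Fraction of the catalog appearing in at least one rec list."""
    recommended = set()
    for rec_list in all_recommendations:
        recommended.update(rec_list)
    return len(recommended) / catalog_size

def intra_list_diversity(item_embeddings) -> float:
    """1 - mean pairwise cosine similarity within one rec list."""
    E = np.asarray(item_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarity matrix
    n = len(E)
    # Mean of off-diagonal entries (exclude each item's self-similarity of 1)
    mean_sim = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - mean_sim

recs = [[1, 2, 3], [2, 3, 4], [1, 5, 9]]
print(catalog_coverage(recs, catalog_size=10))  # 0.6 -> items {1,2,3,4,5,9}

# Orthogonal embeddings -> maximally diverse list
print(intra_list_diversity([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

Tracking these alongside NDCG exposes the popularity-bias failure mode: a model can gain precision while coverage collapses onto a few head items.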
Online vs Offline Evaluation
Offline evaluation uses historical data to measure predictive accuracy. It's fast and cheap but has limitations:
- Only measures prediction of past behavior
- Can't capture exploration value
- Misses business impact
Online evaluation (A/B testing) measures actual user behavior:
- Click-through rate (CTR)
- Conversion rate
- Watch time / session duration
- Return visits
- Revenue per user
The offline-online gap: Models that win offline often don't win online. Selection bias in historical data, feedback loops, and missing features all contribute.
Best practice: Use offline metrics for rapid iteration during development, but always validate with A/B tests before production deployment.
Part IX: Practical Comparison
When to Use What
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Item-based CF | Small catalogs, explainability | Simple, interpretable | Doesn't scale |
| Matrix Factorization | General-purpose CF at scale | Efficient, proven | Linear only; cold start needs content features |
| BPR | Implicit feedback ranking | Optimizes ranking directly | Needs negative sampling |
| NCF | When non-linearity helps | Flexible | Not always better than MF |
| Wide & Deep | Apps with rich features | Memorization + generalization | Needs feature engineering |
| DeepFM | CTR prediction | Automatic feature crosses | Compute intensive |
| Two-Tower | Large-scale retrieval | Fast serving (ANN) | Decoupled = less expressive |
| GRU4Rec | Session-based | Captures order | Slow, short-term focus |
| Transformers | Long sequences, SOTA | Best accuracy | Expensive |
Production Recommendations
- Start simple: Matrix factorization or item-based CF is often 80% of the value
- Two-tower for scale: When you have millions of items, you need ANN retrieval
- Add complexity gradually: NCF → DeepFM → Transformers, with A/B tests at each step
- Consider latency: More complex ≠ better if you can't serve it fast enough
Conclusion
Recommendation systems have evolved dramatically over three decades:
- 1990s-2000s: Collaborative filtering and content-based methods
- 2006-2015: Matrix factorization and the Netflix Prize revolution
- 2016-2018: Deep learning enters with NCF, Wide & Deep, and sequential models
- 2018+: Transformers take over (see next post)
Each era built on the last. Matrix factorization introduced embeddings. Deep learning added non-linearity. Transformers enabled direct modeling of long-range dependencies.
Understanding this evolution isn't just historical—it's practical. Many production systems still use "older" techniques where they excel. The best recommendation engineers know when to use matrix factorization, when to add deep learning, and when to go all-in on transformers.