Generative AI for Recommendation Systems: LLMs Meet Personalization
A comprehensive guide to LLM-powered recommendation systems. From feature augmentation to conversational agents, understand how generative AI is transforming personalization.
The Convergence of LLMs and RecSys
At RecSys 2025 in Prague, one trend dominated: Large Language Models and recommendation systems are converging. This isn't hype—it's a fundamental shift in how we think about personalization.
Traditional recommendation systems excel at collaborative filtering: finding patterns in user-item interactions. But they struggle with:
- Cold start: New users and items have no interaction history
- Explainability: Why was this recommended?
- Natural interaction: Users want to converse, not just click
- Semantic understanding: "I want something like that movie but more uplifting"
LLMs address all of these. They understand language, reason about preferences, and generate explanations. The question is no longer "should we use LLMs for recommendations?" but "how?"
┌─────────────────────────────────────────────────────────────────────────┐
│ TRADITIONAL RECSYS vs LLM-ENHANCED RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RECSYS: │
│ ──────────────────── │
│ │
│ User → [Interaction History] → Collaborative Filtering → Items │
│ │
│ Strengths: │
│ + Fast inference (embeddings + ANN) │
│ + Captures behavioral patterns │
│ + Well-understood, mature │
│ │
│ Weaknesses: │
│ - Cold start for new users/items │
│ - No natural language understanding │
│ - Black box (limited explainability) │
│ - Static (can't reason about context) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LLM-ENHANCED RECSYS: │
│ │
│ User → [Natural Language + History] → LLM Reasoning → Items │
│ │
│ Strengths: │
│ + Handles cold start via content understanding │
│ + Natural conversational interface │
│ + Explainable ("I recommended this because...") │
│ + Can reason about complex preferences │
│ + Zero/few-shot adaptation to new domains │
│ │
│ Weaknesses: │
│ - Higher latency and cost │
│ - Less precise on behavioral patterns │
│ - Hallucination risks │
│ - Harder to A/B test and control │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE WINNING APPROACH: HYBRID │
│ │
│ LLMs augment traditional systems, not replace them │
│ • LLM for understanding, reasoning, explanation │
│ • Traditional models for fast retrieval, behavioral patterns │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2025 State of the Art: A comprehensive survey analyzing 50+ studies identifies three fundamental paradigms: Recommender-oriented (LLMs enhance recommendation mechanisms), Interaction-oriented (conversational recommendations), and Simulation-oriented (multi-agent systems modeling user-item dynamics).
Part I: The LLM-RecSys Taxonomy
Three Paradigms for LLM Integration
Research has converged on three primary ways to integrate LLMs with recommendation systems:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM-RECSYS INTEGRATION PARADIGMS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECOMMENDER-ORIENTED (Enhance the Model) │
│ ───────────────────────────────────────────── │
│ │
│ LLM augments or replaces traditional recommendation components │
│ │
│ Approaches: │
│ • Knowledge Enhancement: LLM generates item descriptions, features │
│ • Interaction Enhancement: LLM enriches user-item signals │
│ • Model Enhancement: LLM as scorer, ranker, or full recommender │
│ │
│ Example: LLMRec, CoLLM, P5 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. INTERACTION-ORIENTED (Conversational) │
│ ────────────────────────────────────────── │
│ │
│ LLM enables natural language interaction for recommendations │
│ │
│ Approaches: │
│ • Conversational recommendation systems (CRS) │
│ • Explainable recommendations via dialogue │
│ • Preference elicitation through conversation │
│ │
│ Example: Chat-REC, RecLLM, InteRecAgent │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. SIMULATION-ORIENTED (Multi-Agent) │
│ ─────────────────────────────────────── │
│ │
│ LLM-powered agents simulate users, items, and system dynamics │
│ │
│ Approaches: │
│ • User simulation for training/evaluation │
│ • Item agents for dynamic pricing/availability │
│ • Ecosystem simulation for policy testing │
│ │
│ Example: RecAgent, Agent4Rec, CRAVE │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Operational Distinctions
Within these paradigms, systems differ in how the LLM operates:
Model-Centric LLMRec: The LLM is fine-tuned or prompt-engineered to directly produce recommendations. Items are mapped to tokens, and the LLM generates item sequences.
Hybrid LLMRec: The LLM augments traditional models—generating features, enhancing embeddings, or providing semantic signals that feed into collaborative filtering.
Agentic LLMRec: The LLM acts as an autonomous agent, using tools (search, database queries, APIs) to gather information and make recommendations through multi-step reasoning.
Part II: Knowledge Enhancement
LLMs as Feature Generators
The simplest integration: use LLMs to generate rich features for items and users. This approach is low-risk and immediately valuable—you're not replacing your recommendation system, just making it smarter with better features.
Why LLM-generated features are powerful:
Traditional item features are either:
- Structured metadata: Category, brand, price. Limited and requires manual curation.
- Embeddings: Dense vectors from models trained on similar items. Good but opaque.
LLMs can generate semantic features that capture nuances humans understand but traditional systems miss:
Traditional features for "Patagonia Fleece Jacket":
─────────────────────────────────────────────────────────────────────────
category: "outerwear"
brand: "patagonia"
price: 150
color: "blue"
LLM-generated features:
─────────────────────────────────────────────────────────────────────────
target_audience: "environmentally-conscious outdoor enthusiasts, 25-45"
use_cases: ["hiking", "casual everyday wear", "light camping"]
emotional_appeal: "rugged reliability, environmental responsibility"
style: "casual athletic, works with jeans or hiking pants"
similar_buyers_also_like: ["hiking boots", "wool base layers", "camping gear"]
The LLM-generated features enable recommendations that traditional systems can't make: "Users who care about sustainability might also like these eco-friendly products."
When to use LLM feature generation:
- Cold-start items: New products with no user interaction data
- Long-tail items: Products with sparse interaction history
- Cross-category recommendations: Understanding that camping gear buyers might want sustainable products
- Explanation generation: Why did we recommend this?
Cost considerations:
LLM calls are expensive. Don't call them per-request. Instead:
- Batch processing: Generate features for all items offline
- Caching: Store generated features in your feature store
- Selective enrichment: Only use LLMs for items where traditional features are insufficient
import json

from anthropic import Anthropic

client = Anthropic()

def parse_json(text: str) -> dict:
    """
    Pull the JSON object out of an LLM response, tolerating surrounding prose.
    Shared helper used by the examples throughout this guide.
    """
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start:end + 1])
def generate_item_features(item: dict) -> dict:
"""
Use LLM to generate rich semantic features for items.
These features can augment traditional embeddings.
"""
prompt = f"""Analyze this product and extract structured features:
Product: {item['title']}
Category: {item['category']}
Description: {item['description']}
Price: ${item['price']}
Extract:
1. Target audience (demographics, interests)
2. Use cases (when/why someone would buy this)
3. Key attributes (quality level, style, features)
4. Emotional appeal (what feelings it evokes)
5. Similar products (what else might interest this buyer)
Format as JSON."""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def generate_user_profile(user_history: list[dict]) -> dict:
"""
Generate semantic user profile from interaction history.
"""
history_text = "\n".join([
f"- {item['title']} ({item['category']}) - {item['action']}"
for item in user_history[-20:] # Recent history
])
prompt = f"""Based on this user's recent activity, create a preference profile:
Recent Activity:
{history_text}
Extract:
1. Primary interests (top 3 categories/themes)
2. Price sensitivity (budget, mid-range, premium)
3. Style preferences (if discernible)
4. Likely needs (what problems they're solving)
5. Recommendation strategy (what to show next)
Format as JSON."""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
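To respect the cost guidance above, feature generation should run as an offline batch job with results cached in a feature store. Here is a minimal sketch, assuming a `feature_store` object with `get`/`set` methods (an illustrative interface, not a specific product) and the `generate_item_features` function from above:

import hashlib
import json

def enrich_catalog_offline(items: list[dict], feature_store) -> None:
    """
    Batch-enrich items with LLM features, skipping anything already cached.
    Assumes feature_store exposes get(key) and set(key, value).
    """
    for item in items:
        # Key on item content so catalog edits trigger re-generation
        content_hash = hashlib.sha256(
            json.dumps(item, sort_keys=True).encode()
        ).hexdigest()
        cache_key = f"llm_features:{item['id']}:{content_hash}"

        if feature_store.get(cache_key) is not None:
            continue  # Already enriched; no LLM call needed

        features = generate_item_features(item)  # One offline LLM call per item
        feature_store.set(cache_key, features)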
LLMRec: Graph Augmentation with LLMs
LLMRec (WSDM 2024) uses LLMs to augment the user-item interaction graph. This is a clever approach: instead of replacing your graph-based recommender, use LLMs to add missing edges to the graph.
The sparsity problem in recommendation graphs:
User-item interaction graphs are extremely sparse. A typical user interacts with <0.01% of items. This sparsity hurts recommendations:
- Users with few interactions get poor recommendations (cold start)
- Items with few interactions are never recommended (popularity bias)
- Implicit similarities aren't captured (if no user bought both A and B, no edge exists)
LLMRec's insight: LLMs can infer missing edges
LLMs understand semantic relationships that aren't in the interaction data:
User History: [Python Book, Machine Learning Course, GPU]
─────────────────────────────────────────────────────────────────────────
What the graph knows:
User → Python Book (purchased)
User → ML Course (enrolled)
User → GPU (purchased)
What LLM can infer (new edges to add):
Python Book ↔ ML Course (both for learning ML)
GPU ↔ ML Course (GPU needed for ML training)
User → "Data Science Tools" (implicit interest cluster)
Three types of augmentation LLMRec performs:
1. User profile augmentation: Generate a textual profile from the interaction history, embed it as a new node connected to the user
2. Item relationship augmentation: Ask the LLM to identify semantically related items, add edges between them
3. Interaction reasoning: For each user-item pair, generate an explanation of why the interaction happened, and use the explanation embedding to enrich the edge
Why this works better than just using LLM embeddings:
- Preserves graph structure: GNN-based recommenders rely on graph topology. Adding edges improves message passing.
- Cheaper than inference-time LLM: Augmentation is done once offline. Inference uses fast GNN.
- Combines strengths: LLM semantic understanding + GNN collaborative filtering
Implementation pattern:
OFFLINE PIPELINE:
1. For each item: LLM generates "related items" → add item-item edges
2. For each user: LLM generates "interest summary" → add user profile node
3. Retrain GNN on augmented graph
ONLINE INFERENCE:
Same as before—fast GNN inference, no LLM calls
class LLMRecAugmenter:
"""
LLMRec-style graph augmentation.
Uses LLM to generate synthetic interactions and enrich item features.
"""
def __init__(self, llm_client, item_catalog: dict):
self.llm = llm_client
self.items = item_catalog
def augment_item_graph(self, item_id: str) -> list[tuple[str, float]]:
"""
Generate synthetic item-item relationships via LLM reasoning.
"""
item = self.items[item_id]
prompt = f"""Given this item:
Title: {item['title']}
Category: {item['category']}
Description: {item['description'][:200]}
List 5 items that would strongly appeal to the same customer.
For each, explain why and rate confidence (0-1).
Format:
1. [Item type/category] - [Reason] - [Confidence]"""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
# Parse and match to actual catalog items
synthetic_edges = self._match_to_catalog(response.content[0].text)
return synthetic_edges
def generate_user_augmentation(
self,
user_history: list[str],
num_synthetic: int = 5
) -> list[str]:
"""
Generate synthetic interactions for sparse users.
Helps with cold start.
"""
history_items = [self.items[i] for i in user_history if i in self.items]
prompt = f"""A user has interacted with these items:
{self._format_items(history_items)}
Based on these preferences, what other items would they likely enjoy?
List {num_synthetic} item types/categories with confidence scores.
Focus on items that reveal underlying preferences (not obvious similar items)."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
synthetic_items = self._match_to_catalog(response.content[0].text)
return synthetic_items
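A sketch of how the augmenter slots into the offline pipeline described above. The `graph` and `gnn` objects are assumed placeholders for your graph store and GNN trainer, not a specific library:

def run_offline_augmentation(augmenter, graph, user_ids, item_ids):
    """
    Offline LLMRec-style pipeline: augment the graph once, then retrain.
    Online inference stays pure GNN, with no LLM calls.
    """
    # 1. Add LLM-inferred item-item edges
    for item_id in item_ids:
        for related_id, confidence in augmenter.augment_item_graph(item_id):
            graph.add_edge(item_id, related_id, weight=confidence)

    # 2. Add synthetic interactions for sparse users only
    for user_id in user_ids:
        history = graph.get_user_history(user_id)
        if len(history) < 5:  # Cold/sparse users benefit most
            for item_id in augmenter.generate_user_augmentation(history):
                graph.add_edge(user_id, item_id, weight=0.5, synthetic=True)

    # 3. Retrain the GNN on the augmented graph
    gnn.train(graph)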
Part III: LLMs as Recommenders
Direct Recommendation via Prompting
The most direct approach: ask the LLM to recommend items. This sounds simple, but doing it well requires understanding LLM limitations and designing around them.
Why direct prompting is appealing:
- Zero training: No ML infrastructure needed. Just prompt.
- Rich reasoning: LLM can explain why each recommendation fits.
- Context awareness: Can incorporate real-time context ("I'm shopping for a gift for my mom").
- Language understanding: Handles natural language queries that keyword search can't.
Why direct prompting is dangerous:
The LLM doesn't actually know your catalog. It hallucinates:
User: "Recommend running shoes under $100"
─────────────────────────────────────────────────────────────────────────
LLM response (without grounding):
"I recommend the Nike Air Zoom Pegasus 38..."
Problems:
❌ That shoe might cost $130 in your store
❌ You might not carry Nike at all
❌ The "Pegasus 38" might be discontinued
❌ LLM might invent products that don't exist
The solution: Retrieval-Augmented Recommendation
Never let the LLM recommend from its imagination. Always:
- Use traditional retrieval to get candidate items from YOUR catalog
- Provide those candidates in the prompt
- Ask LLM to rank/select from the provided candidates only
Correct pattern:
─────────────────────────────────────────────────────────────────────────
1. Embedding search: "running shoes" → 100 candidates from your catalog
2. Filter: price < $100 → 40 candidates
3. Prompt LLM: "From these 40 shoes, which 5 best match [user history]?"
4. LLM returns IDs from your candidate list (can't hallucinate)
Why candidate pre-filtering is essential:
LLMs can't efficiently process millions of items. Context windows are limited: even a 200K-token window holds only a few thousand short product descriptions. Pre-filter to 50-200 candidates using fast traditional methods, then use the LLM for intelligent ranking.
When to use direct LLM recommendation:
- Conversational commerce: User is chatting, asking questions
- Complex queries: "Something for a dinner party with vegetarians"
- Explanation-heavy: When users want to know WHY this recommendation
- Low-volume, high-value: B2B sales, luxury goods where personalization matters
When NOT to use:
- High-volume feeds: Homepage recommendations (too slow, too expensive)
- Latency-sensitive: Search results where 100ms matters
- Simple queries: "Show me popular laptops" (traditional RecSys is faster/cheaper)
class LLMRecommender:
"""
LLM as direct recommender via in-context learning.
"""
def __init__(self, llm_client, item_catalog: list[dict]):
self.llm = llm_client
self.items = item_catalog
self.item_index = {item['id']: item for item in item_catalog}
def recommend(
self,
user_history: list[str],
context: str = None,
num_recommendations: int = 10,
) -> list[dict]:
"""
Generate recommendations via LLM reasoning.
"""
# Format user history
history_text = self._format_history(user_history)
# Format candidate items (subset for efficiency)
candidates = self._get_candidates(user_history, n=100)
candidates_text = self._format_candidates(candidates)
prompt = f"""You are a recommendation system. Based on the user's history,
recommend items they would enjoy.
## User History (most recent first):
{history_text}
{f"## Current Context: {context}" if context else ""}
## Available Items:
{candidates_text}
## Task:
Select the {num_recommendations} best items for this user.
For each, explain why it matches their preferences.
Format:
1. [Item ID] - [Title] - [Reason]
2. ..."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
recommendations = self._parse_recommendations(response.content[0].text)
return recommendations
    def _get_candidates(self, user_history: list[str], n: int) -> list[dict]:
        """
        Pre-filter candidates using traditional retrieval;
        the LLM can't efficiently search millions of items.
        Minimal sketch assuming each catalog item carries an 'embedding' field.
        """
        import numpy as np
        history_embs = [
            self.item_index[i]["embedding"] for i in user_history if i in self.item_index
        ]
        if not history_embs:
            # Cold start: fall back to popularity
            return sorted(
                self.items, key=lambda x: x.get("popularity", 0), reverse=True
            )[:n]
        profile = np.mean(history_embs, axis=0)
        seen = set(user_history)
        candidates = [item for item in self.items if item["id"] not in seen]
        candidates.sort(
            key=lambda item: float(np.dot(profile, item["embedding"])), reverse=True
        )
        return candidates[:n]
P5: Pretrain, Personalized Prompt, and Predict
P5 frames multiple recommendation tasks as text generation:
class P5Recommender:
"""
P5-style unified recommendation via text generation.
All tasks formulated as sequence-to-sequence.
"""
# Task templates
TEMPLATES = {
"sequential": (
"User {user_id} has purchased {history}. "
"What will they purchase next?"
),
"rating": (
"How will user {user_id} rate {item}? "
"User's previous ratings: {history}"
),
"explanation": (
"User {user_id} purchased {item}. "
"Explain why based on their history: {history}"
),
"search": (
"User {user_id} searched for '{query}'. "
"Given their history {history}, recommend items."
),
}
def __init__(self, model_name: str = "google/flan-t5-xl"):
from transformers import T5ForConditionalGeneration, T5Tokenizer
self.tokenizer = T5Tokenizer.from_pretrained(model_name)
self.model = T5ForConditionalGeneration.from_pretrained(model_name)
def recommend_next(self, user_id: str, history: list[str]) -> str:
"""Sequential recommendation via text generation."""
prompt = self.TEMPLATES["sequential"].format(
user_id=user_id,
history=", ".join(history[-10:])
)
inputs = self.tokenizer(prompt, return_tensors="pt")
outputs = self.model.generate(**inputs, max_length=50)
return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
def explain_recommendation(
self,
user_id: str,
item: str,
history: list[str]
) -> str:
"""Generate explanation for a recommendation."""
prompt = self.TEMPLATES["explanation"].format(
user_id=user_id,
item=item,
history=", ".join(history[-10:])
)
inputs = self.tokenizer(prompt, return_tensors="pt")
outputs = self.model.generate(**inputs, max_length=200)
return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
CoLLM: Collaborative Embeddings in LLMs
CoLLM (TKDE 2025) integrates collaborative filtering embeddings directly into the LLM:
┌─────────────────────────────────────────────────────────────────────────┐
│ CoLLM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL LLM RECOMMENDATION: │
│ ───────────────────────────────── │
│ │
│ "User liked: iPhone, MacBook, AirPods" → LLM → "Recommend: iPad" │
│ │
│ Problem: LLM only sees text, not collaborative signals │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CoLLM APPROACH: │
│ ──────────────── │
│ │
│ 1. Train collaborative filtering model (e.g., matrix factorization) │
│ → User embeddings U, Item embeddings V │
│ │
│ 2. Map CF embeddings to LLM token space │
│ CF embedding → Projection → "Soft tokens" in LLM vocabulary │
│ │
│ 3. Inject soft tokens into LLM prompt │
│ "[USER_EMB] liked: iPhone, MacBook. Recommend: [ITEM_EMB]?" │
│ │
│ Benefits: │
│ + LLM sees collaborative signals (who else liked these items) │
│ + Combines semantic understanding with behavioral patterns │
│ + Can be fine-tuned end-to-end │
│ │
└─────────────────────────────────────────────────────────────────────────┘
import torch
import torch.nn as nn
class CoLLM(nn.Module):
"""
Collaborative LLM: Inject CF embeddings into LLM.
"""
def __init__(
self,
llm_model, # Pre-trained LLM
cf_user_embeddings: torch.Tensor, # (num_users, cf_dim)
cf_item_embeddings: torch.Tensor, # (num_items, cf_dim)
llm_dim: int = 4096,
cf_dim: int = 64,
):
super().__init__()
self.llm = llm_model
# Store CF embeddings
self.user_cf = nn.Embedding.from_pretrained(cf_user_embeddings, freeze=False)
self.item_cf = nn.Embedding.from_pretrained(cf_item_embeddings, freeze=False)
# Project CF embeddings to LLM hidden dimension
self.user_proj = nn.Sequential(
nn.Linear(cf_dim, llm_dim),
nn.LayerNorm(llm_dim),
)
self.item_proj = nn.Sequential(
nn.Linear(cf_dim, llm_dim),
nn.LayerNorm(llm_dim),
)
def forward(
self,
input_ids: torch.Tensor,
user_ids: torch.Tensor,
item_ids: torch.Tensor = None,
):
"""
Forward pass with collaborative embedding injection.
"""
# Get LLM input embeddings
input_embeds = self.llm.get_input_embeddings()(input_ids)
# Get collaborative embeddings
user_cf_emb = self.user_proj(self.user_cf(user_ids)) # (B, llm_dim)
user_cf_emb = user_cf_emb.unsqueeze(1) # (B, 1, llm_dim)
# Prepend user collaborative embedding as soft token
input_embeds = torch.cat([user_cf_emb, input_embeds], dim=1)
# If item_ids provided (for scoring), append item embedding
if item_ids is not None:
item_cf_emb = self.item_proj(self.item_cf(item_ids))
item_cf_emb = item_cf_emb.unsqueeze(1)
input_embeds = torch.cat([input_embeds, item_cf_emb], dim=1)
# Forward through LLM
outputs = self.llm(inputs_embeds=input_embeds)
return outputs
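A sketch of how CoLLM might be wired up, assuming a HuggingFace causal LM and CF embeddings exported from a pre-trained matrix factorization model. The file names and IDs below are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical inputs: CF embeddings exported from a trained MF model
user_emb = torch.load("mf_user_embeddings.pt")   # (num_users, 64)
item_emb = torch.load("mf_item_embeddings.pt")   # (num_items, 64)

base = "meta-llama/Llama-2-7b-hf"                # any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
llm = AutoModelForCausalLM.from_pretrained(base)

model = CoLLM(llm, user_emb, item_emb, llm_dim=llm.config.hidden_size, cf_dim=64)

# Score one (user, item) pair: the text is tokenized normally, while the
# projected CF embeddings enter as soft tokens around the prompt
prompt = "Will this user enjoy this item? Answer yes or no:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model(input_ids=input_ids,
                user_ids=torch.tensor([42]),
                item_ids=torch.tensor([1337]))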
Part IV: Conversational Recommendation
Chat-REC: Interactive LLM Recommendations
Chat-REC enables multi-turn conversational recommendations:
class ChatREC:
"""
Conversational Recommendation System using LLM.
Supports multi-turn dialogue for preference elicitation.
"""
def __init__(self, llm_client, retriever, item_catalog):
self.llm = llm_client
self.retriever = retriever # Traditional RecSys for candidates
self.items = item_catalog
def chat(
self,
user_message: str,
conversation_history: list[dict],
user_profile: dict,
) -> dict:
"""
Process user message and generate response with recommendations.
"""
# Classify intent
intent = self._classify_intent(user_message, conversation_history)
if intent == "ask_recommendation":
return self._handle_recommendation_request(
user_message, conversation_history, user_profile
)
elif intent == "provide_feedback":
return self._handle_feedback(
user_message, conversation_history, user_profile
)
elif intent == "ask_explanation":
return self._handle_explanation_request(
user_message, conversation_history
)
elif intent == "refine_preferences":
return self._handle_preference_refinement(
user_message, conversation_history, user_profile
)
else:
return self._handle_general_query(
user_message, conversation_history
)
def _classify_intent(
self,
message: str,
history: list[dict]
) -> str:
"""Classify user intent for routing."""
prompt = f"""Classify the user's intent in this conversation:
Conversation:
{self._format_history(history[-5:])}
User: {message}
Intent categories:
- ask_recommendation: User wants item suggestions
- provide_feedback: User gives opinion on suggested items
- ask_explanation: User wants to know why something was recommended
- refine_preferences: User clarifying or updating preferences
- general: Other queries
Reply with just the intent category."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=20,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text.strip().lower()
def _handle_recommendation_request(
self,
message: str,
history: list[dict],
profile: dict,
) -> dict:
"""Generate recommendations based on conversation."""
# Extract preferences from conversation
preferences = self._extract_preferences(message, history)
# Get candidates via traditional retrieval
candidates = self.retriever.retrieve(
user_profile=profile,
preferences=preferences,
n=50
)
# LLM selects and explains best matches
prompt = f"""Based on this conversation, recommend items:
Conversation:
{self._format_history(history[-5:])}
User's current request: {message}
Extracted preferences: {preferences}
Available items:
{self._format_items(candidates[:20])}
Select the 5 best items and explain why each matches the user's needs.
Be conversational and helpful."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=800,
messages=[{"role": "user", "content": prompt}]
)
recommendations = self._parse_recommendations(response.content[0].text)
return {
"response": response.content[0].text,
"recommendations": recommendations,
"intent": "ask_recommendation",
}
def _extract_preferences(
self,
message: str,
history: list[dict]
) -> dict:
"""Extract structured preferences from conversation."""
prompt = f"""Extract user preferences from this conversation:
Conversation:
{self._format_history(history)}
Current message: {message}
Extract:
- Category/type preferences
- Price range
- Specific features wanted
- Features to avoid
- Style/aesthetic preferences
- Use case/occasion
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
Proactive Preference Elicitation
The best conversational systems don't just respond—they proactively ask questions to understand preferences:
class ProactiveRecommender:
"""
Proactively elicits preferences through strategic questions.
"""
def __init__(self, llm_client, item_catalog):
self.llm = llm_client
self.items = item_catalog
def generate_clarifying_question(
self,
user_query: str,
known_preferences: dict,
candidate_items: list[dict],
) -> str:
"""
Generate a clarifying question to narrow down recommendations.
"""
# Identify dimensions with high variance in candidates
differentiating_dims = self._find_differentiating_dimensions(
candidate_items, known_preferences
)
prompt = f"""The user asked: "{user_query}"
We know these preferences: {known_preferences}
We have {len(candidate_items)} potential matches, varying mainly in:
{differentiating_dims}
Generate ONE clarifying question that would most help narrow down
the recommendations. Make it natural and conversational.
Don't ask about preferences we already know.
Focus on the most impactful differentiator."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
def should_ask_question(
self,
candidates: list[dict],
confidence_threshold: float = 0.7
) -> bool:
"""
Decide whether to ask a clarifying question or recommend.
"""
# If top candidates are very similar, we're confident
# If they're diverse, we should clarify
diversity = self._compute_diversity(candidates[:10])
return diversity > confidence_threshold
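The `_compute_diversity` helper above is left abstract; one minimal realization is the mean pairwise cosine distance over the top candidates. A method-level sketch that could be added to ProactiveRecommender, assuming each item dict carries an 'embedding' field:

    def _compute_diversity(self, items: list[dict]) -> float:
        """
        Mean pairwise cosine distance among candidate embeddings.
        Low diversity -> candidates are near-duplicates, safe to recommend;
        high diversity -> worth asking a clarifying question first.
        """
        import numpy as np
        if len(items) < 2:
            return 0.0
        embs = np.array([item["embedding"] for item in items], dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sims = embs @ embs.T
        n = len(items)
        mean_sim = (sims.sum() - n) / (n * (n - 1))  # Average off-diagonal similarity
        return float(1.0 - mean_sim)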
Part V: Agentic Recommendations
LLM Agents for Recommendations
The most sophisticated approach: LLMs as autonomous agents that use tools to gather information and make decisions.
from typing import Callable
class RecommendationAgent:
"""
LLM-powered recommendation agent with tool use.
"""
def __init__(self, llm_client, tools: dict[str, Callable]):
self.llm = llm_client
self.tools = tools
def recommend(
self,
user_request: str,
user_context: dict,
max_steps: int = 10,
) -> dict:
"""
Multi-step recommendation via agent reasoning.
"""
messages = [{
"role": "user",
"content": f"""You are a recommendation agent. Help the user find what they need.
User request: {user_request}
User context:
- Previous purchases: {user_context.get('purchase_history', [])}
- Browsing history: {user_context.get('browsing_history', [])}
- Preferences: {user_context.get('preferences', {})}
Available tools:
- search_catalog(query): Search items by text query
- get_item_details(item_id): Get detailed information about an item
- get_similar_items(item_id): Find items similar to a given item
- get_user_history(user_id): Get user's full interaction history
- get_trending_items(category): Get trending items in a category
- check_availability(item_id): Check stock and delivery options
Think step by step. Use tools to gather information, then make recommendations."""
}]
for step in range(max_steps):
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=messages,
tools=self._format_tools(),
)
# Check if agent wants to use a tool
if response.stop_reason == "tool_use":
tool_use = response.content[-1]
tool_name = tool_use.name
tool_input = tool_use.input
# Execute tool
tool_result = self.tools[tool_name](**tool_input)
# Add to conversation
messages.append({"role": "assistant", "content": response.content})
messages.append({
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_use.id,
"content": str(tool_result)
}]
})
else:
# Agent is done, return final response
return {
"response": response.content[0].text,
"steps": step + 1,
"messages": messages,
}
return {"response": "Max steps reached", "steps": max_steps}
def _format_tools(self) -> list[dict]:
"""Format tools for Claude API."""
return [
{
"name": "search_catalog",
"description": "Search the product catalog by text query",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "get_item_details",
"description": "Get detailed information about a specific item",
"input_schema": {
"type": "object",
"properties": {
"item_id": {"type": "string", "description": "Item ID"}
},
"required": ["item_id"]
}
},
# ... more tools
]
Multi-Agent Recommendation Systems
RecAgent and Agent4Rec use multiple specialized agents:
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-AGENT RECOMMENDATION ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ ORCHESTRATOR │ │
│ │ AGENT │ │
│ └────────┬────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ RETRIEVAL │ │ RANKING │ │ EXPLANATION │ │
│ │ AGENT │ │ AGENT │ │ AGENT │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │ │ │ │
│ - Search catalog - Score relevance - Generate reasons │
│ - Filter by rules - Apply preferences - Answer questions │
│ - Get candidates - Re-rank results - Justify choices │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMMUNICATION FLOW: │
│ │
│ 1. User: "I need running shoes for marathon training" │
│ │
│ 2. Orchestrator → Retrieval: "Search for marathon running shoes" │
│ Retrieval → Orchestrator: [100 candidate shoes] │
│ │
│ 3. Orchestrator → Ranking: "Rank for marathon training" │
│ Ranking → Orchestrator: [Top 10 ranked shoes] │
│ │
│ 4. Orchestrator → Explanation: "Explain top 3 picks" │
│ Explanation → Orchestrator: [Detailed explanations] │
│ │
│ 5. Orchestrator → User: Final recommendations with explanations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
class MultiAgentRecommender:
"""
Multi-agent system for recommendations.
Specialized agents for different tasks.
"""
def __init__(self, llm_client, item_catalog, user_db):
self.llm = llm_client
self.items = item_catalog
self.users = user_db
# Specialized agents
self.agents = {
"retrieval": RetrievalAgent(llm_client, item_catalog),
"ranking": RankingAgent(llm_client),
"explanation": ExplanationAgent(llm_client),
"personalization": PersonalizationAgent(llm_client, user_db),
}
async def recommend(
self,
user_id: str,
query: str,
) -> dict:
"""
Coordinate agents to generate recommendations.
"""
# Step 1: Understand user context
user_profile = await self.agents["personalization"].get_profile(user_id)
        # Step 2: Retrieve candidates (returns candidates plus query analysis)
        retrieval = await self.agents["retrieval"].retrieve(
            query=query,
            user_preferences=user_profile["preferences"],
            n=100
        )
        # Step 3: Rank candidates
        ranked = await self.agents["ranking"].rank(
            candidates=retrieval["candidates"],
            user_profile=user_profile,
            query=query,
        )
        # Step 4: Generate explanations
        explained = await self.agents["explanation"].explain(
            items=ranked[:10],
            user_profile=user_profile,
            query=query,
        )
        return {
            "recommendations": explained,
            "query_understanding": retrieval["query_analysis"],
            "personalization": user_profile["summary"],
        }
class RetrievalAgent:
"""Agent specialized in candidate retrieval."""
def __init__(self, llm_client, item_catalog):
self.llm = llm_client
self.items = item_catalog
self.vector_store = self._build_vector_store(item_catalog)
async def retrieve(
self,
query: str,
user_preferences: dict,
n: int = 100
) -> dict:
"""
Retrieve candidates using multiple strategies.
"""
# LLM analyzes query
query_analysis = await self._analyze_query(query)
# Multiple retrieval strategies
semantic_results = self.vector_store.search(query, k=n)
category_results = self._category_filter(query_analysis["categories"])
attribute_results = self._attribute_filter(query_analysis["attributes"])
# LLM merges and deduplicates
merged = await self._merge_results(
semantic_results,
category_results,
attribute_results,
user_preferences,
)
return {
"candidates": merged[:n],
"query_analysis": query_analysis,
}
class ExplanationAgent:
"""Agent specialized in generating explanations."""
def __init__(self, llm_client):
self.llm = llm_client
async def explain(
self,
items: list[dict],
user_profile: dict,
query: str,
) -> list[dict]:
"""
Generate personalized explanations for recommendations.
"""
explained_items = []
for item in items:
explanation = await self._generate_explanation(
item, user_profile, query
)
explained_items.append({
**item,
"explanation": explanation["short"],
"detailed_explanation": explanation["detailed"],
"match_reasons": explanation["reasons"],
})
return explained_items
async def _generate_explanation(
self,
item: dict,
user_profile: dict,
query: str,
) -> dict:
"""Generate explanation for single item."""
prompt = f"""Explain why this item is recommended:
Item: {item['title']}
Category: {item['category']}
Features: {item['features']}
Price: ${item['price']}
User query: {query}
User preferences: {user_profile['preferences']}
User history themes: {user_profile['themes']}
Generate:
1. Short explanation (1 sentence)
2. Detailed explanation (2-3 sentences)
3. List of specific match reasons
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
Part VI: User Simulation for Evaluation
Synthetic Users via LLMs
LLMs can simulate user behavior for testing and evaluation:
import json
import random

class LLMUserSimulator:
"""
Simulate user behavior for recommendation evaluation.
"""
def __init__(self, llm_client):
self.llm = llm_client
def create_persona(self, persona_description: str) -> dict:
"""Create a detailed user persona."""
prompt = f"""Create a detailed user persona for recommendation testing:
Description: {persona_description}
Generate:
1. Demographics (age, location, occupation)
2. Interests and hobbies
3. Shopping preferences (price sensitivity, brand loyalty)
4. Past purchase patterns
5. Decision-making style
6. Common objections/concerns
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def simulate_response(
self,
persona: dict,
recommendations: list[dict],
context: str = None,
) -> dict:
"""
Simulate how this persona would respond to recommendations.
"""
prompt = f"""You are simulating this user persona:
{json.dumps(persona, indent=2)}
They received these recommendations:
{self._format_recommendations(recommendations)}
{f"Context: {context}" if context else ""}
Simulate their response:
1. Which items would they click on? Why?
2. Which would they ignore? Why?
3. What would they say about the recommendations?
4. Would they convert (purchase)? Which item?
5. What's missing that they would want?
Be consistent with the persona's characteristics.
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def generate_interaction_trajectory(
self,
persona: dict,
item_catalog: list[dict],
num_interactions: int = 20,
) -> list[dict]:
"""
Generate a realistic interaction sequence for a persona.
Useful for creating synthetic training data.
"""
trajectory = []
browsing_context = []
for i in range(num_interactions):
prompt = f"""User persona:
{json.dumps(persona, indent=2)}
Previous interactions in this session:
{self._format_trajectory(trajectory[-5:])}
Available items (sample):
{self._format_items(random.sample(item_catalog, 20))}
What would this user do next?
- Browse a category?
- Search for something?
- Click on an item?
- Add to cart?
- Purchase?
- Leave?
Consider: time in session, previous actions, persona preferences.
Format: {{"action": "...", "item_id": "...", "reason": "..."}}"""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=150,
messages=[{"role": "user", "content": prompt}]
)
action = parse_json(response.content[0].text)
trajectory.append(action)
if action["action"] == "leave":
break
return trajectory
CRAVE: Collaborative Verbalized Experience
CRAVE (Best Paper at GenAIRecP 2025) uses agent experiences to improve recommendations:
from datetime import datetime

class CRAVESystem:
"""
CRAVE: Collaborative Verbalized Experience for Recommendations.
Agents learn from each other's experiences.
"""
def __init__(self, llm_client):
self.llm = llm_client
self.experience_bank = [] # Stored experiences
def collect_experience(
self,
user_query: str,
recommendations: list[dict],
user_feedback: dict,
agent_reasoning: str,
):
"""
Store verbalized experience from an interaction.
"""
# Verbalize the experience
experience = self._verbalize_experience(
user_query, recommendations, user_feedback, agent_reasoning
)
self.experience_bank.append(experience)
def _verbalize_experience(
self,
query: str,
recommendations: list[dict],
feedback: dict,
reasoning: str,
) -> dict:
"""Convert interaction to verbalized experience."""
prompt = f"""Summarize this recommendation interaction as a learning experience:
User query: {query}
Agent reasoning: {reasoning}
Recommendations made:
{self._format_recommendations(recommendations)}
User feedback:
- Clicked: {feedback.get('clicked', [])}
- Purchased: {feedback.get('purchased', [])}
- Dismissed: {feedback.get('dismissed', [])}
- Comments: {feedback.get('comments', '')}
Create a verbalized experience that captures:
1. What worked well
2. What could be improved
3. Key insight for similar future queries
Format as a concise lesson (2-3 sentences)."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=200,
messages=[{"role": "user", "content": prompt}]
)
return {
"query_type": self._classify_query(query),
"lesson": response.content[0].text,
"success_rate": len(feedback.get('purchased', [])) / len(recommendations),
"timestamp": datetime.now().isoformat(),
}
def retrieve_relevant_experiences(
self,
current_query: str,
n: int = 5
) -> list[dict]:
"""
Find experiences relevant to current query.
"""
# Embed and search (simplified)
query_type = self._classify_query(current_query)
relevant = [
exp for exp in self.experience_bank
if exp["query_type"] == query_type
]
# Sort by success rate and recency
relevant.sort(
key=lambda x: (x["success_rate"], x["timestamp"]),
reverse=True
)
return relevant[:n]
def recommend_with_experience(
self,
query: str,
candidates: list[dict],
user_profile: dict,
) -> list[dict]:
"""
Make recommendations informed by past experiences.
"""
experiences = self.retrieve_relevant_experiences(query)
prompt = f"""Make recommendations based on query and past learnings.
User query: {query}
User profile: {user_profile}
Lessons from similar past queries:
{self._format_experiences(experiences)}
Candidate items:
{self._format_items(candidates[:20])}
Apply the lessons learned to select and rank the best items.
Explain how past experiences informed your choices."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=800,
messages=[{"role": "user", "content": prompt}]
)
return self._parse_recommendations(response.content[0].text)
Part VII: Production Considerations
Latency and Cost Management
LLMs are expensive and slow compared to traditional RecSys:
┌─────────────────────────────────────────────────────────────────────────┐
│ LATENCY & COST COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RECSYS: │
│ • Embedding lookup: ~1ms │
│ • ANN retrieval: ~5ms │
│ • Ranking model: ~10ms │
│ • Total: ~20ms │
│ • Cost: ~$0.0001 per request │
│ │
│ LLM-BASED RECSYS: │
│ • LLM API call: 500-2000ms │
│ • Multiple calls (agent): 2000-10000ms │
│ • Total: 1-10 seconds │
│ • Cost: $0.01-0.10 per request │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MITIGATION STRATEGIES: │
│ │
│ 1. HYBRID ARCHITECTURE │
│ Traditional model for fast retrieval + LLM for explanation │
│ LLM only for complex queries or high-value users │
│ │
│ 2. CACHING │
│ Cache LLM responses for similar queries │
│ Pre-compute explanations for popular items │
│ Semantic caching (similar queries → cached response) │
│ │
│ 3. SMALLER MODELS │
│ Use Haiku/small models for simple tasks │
│ Reserve large models for complex reasoning │
│ │
│ 4. ASYNC PROCESSING │
│ Show fast traditional recs immediately │
│ Enhance with LLM explanations async │
│ │
└─────────────────────────────────────────────────────────────────────────┘
class HybridRecommender:
"""
Hybrid system: fast traditional + smart LLM.
"""
def __init__(
self,
traditional_model,
llm_client,
cache,
llm_threshold: float = 0.7, # When to use LLM
):
self.traditional = traditional_model
self.llm = llm_client
self.cache = cache
self.llm_threshold = llm_threshold
async def recommend(
self,
user_id: str,
query: str = None,
context: dict = None,
) -> dict:
"""
Recommend with intelligent LLM usage.
"""
# Always start with fast traditional recommendations
traditional_recs = self.traditional.recommend(user_id, n=20)
# Decide if LLM is needed
needs_llm = self._should_use_llm(query, context)
if not needs_llm:
return {
"recommendations": traditional_recs,
"explanations": None,
"method": "traditional",
}
# Check cache first
cache_key = self._make_cache_key(user_id, query, traditional_recs)
cached = await self.cache.get(cache_key)
if cached:
return {**cached, "method": "cached_llm"}
# Use LLM to enhance/re-rank
enhanced = await self._llm_enhance(
traditional_recs, query, context
)
# Cache result
await self.cache.set(cache_key, enhanced, ttl=3600)
return {**enhanced, "method": "llm"}
def _should_use_llm(self, query: str, context: dict) -> bool:
"""Decide if LLM adds value for this request."""
# Use LLM for:
# - Natural language queries
# - Complex multi-criteria requests
# - Explanation requests
# - High-value user segments
if query and len(query.split()) > 3:
return True
if context and context.get("wants_explanation"):
return True
if context and context.get("user_tier") == "premium":
return True
return False
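The semantic-caching idea from the mitigation table deserves a sketch of its own: instead of exact key matching, a cache hit is defined by embedding similarity, so paraphrased queries reuse the same LLM response. A minimal synchronous version (the `embed` callable is an assumed sentence-embedding function returning unit-normalized vectors; the threshold is illustrative, and the async/TTL plumbing used by HybridRecommender is omitted):

import numpy as np

class SemanticCache:
    """
    Cache keyed by query embedding: a sufficiently similar past query
    counts as a hit, so "cheap trail runners" can reuse the response
    cached for "affordable trail running shoes".
    """
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # str -> np.ndarray (unit-normalized)
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, dict]] = []

    def get(self, query: str) -> dict | None:
        q = self.embed(query)
        for emb, value in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return value
        return None

    def set(self, query: str, value: dict) -> None:
        self.entries.append((self.embed(query), value))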
Evaluation Challenges
LLM recommendations are harder to evaluate:
import numpy as np

class LLMRecEvaluator:
    """
    Evaluation metrics for LLM-based recommendations.
    """
    def __init__(self, judge_llm, num_items: int):
        self.judge_llm = judge_llm   # LLM used to grade explanations
        self.num_items = num_items   # Catalog size, for coverage
def evaluate_offline(
self,
model,
test_data: list[dict],
) -> dict:
"""Standard offline evaluation."""
metrics = {
"hr@10": [],
"ndcg@10": [],
"coverage": set(),
"diversity": [],
}
for sample in test_data:
recs = model.recommend(
user_id=sample["user_id"],
history=sample["history"],
)
rec_ids = [r["id"] for r in recs[:10]]
# Hit rate
hit = sample["target"] in rec_ids
metrics["hr@10"].append(int(hit))
# NDCG
if hit:
rank = rec_ids.index(sample["target"])
ndcg = 1 / np.log2(rank + 2)
else:
ndcg = 0
metrics["ndcg@10"].append(ndcg)
# Coverage
metrics["coverage"].update(rec_ids)
# Diversity (intra-list)
diversity = self._compute_diversity(recs[:10])
metrics["diversity"].append(diversity)
return {
"hr@10": np.mean(metrics["hr@10"]),
"ndcg@10": np.mean(metrics["ndcg@10"]),
"coverage": len(metrics["coverage"]) / self.num_items,
"diversity": np.mean(metrics["diversity"]),
}
def evaluate_explanations(
self,
explanations: list[str],
items: list[dict],
user_profiles: list[dict],
) -> dict:
"""
Evaluate explanation quality.
"""
# Use LLM to judge explanation quality
scores = []
for exp, item, profile in zip(explanations, items, user_profiles):
prompt = f"""Rate this recommendation explanation:
Item: {item['title']}
User profile: {profile['summary']}
Explanation: {exp}
Rate 1-5 on:
1. Relevance: Does it address why this item fits the user?
2. Specificity: Does it mention specific features/preferences?
3. Accuracy: Is the reasoning sound?
4. Helpfulness: Would this help the user decide?
Format: {{"relevance": X, "specificity": X, "accuracy": X, "helpfulness": X}}"""
response = self.judge_llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
scores.append(parse_json(response.content[0].text))
return {
"relevance": np.mean([s["relevance"] for s in scores]),
"specificity": np.mean([s["specificity"] for s in scores]),
"accuracy": np.mean([s["accuracy"] for s in scores]),
"helpfulness": np.mean([s["helpfulness"] for s in scores]),
}
Part VIII: Future Directions
Emerging Research Areas
┌─────────────────────────────────────────────────────────────────────────┐
│ FUTURE OF LLM + RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MULTIMODAL RECOMMENDATIONS │
│ ────────────────────────────── │
│ • Image + text + behavior signals │
│ • "Find me something like this photo but cheaper" │
│ • Video understanding for content recommendations │
│ │
│ 2. REAL-TIME PERSONALIZATION │
│ ───────────────────────────── │
│ • LLMs that update beliefs within conversation │
│ • Streaming recommendations that adapt instantly │
│ • Edge-deployed small LLMs for latency │
│ │
│ 3. PRIVACY-PRESERVING LLM RECS │
│ ─────────────────────────────── │
│ • On-device processing of preferences │
│ • Federated learning for collaborative signals │
│ • Differential privacy for LLM fine-tuning │
│ │
│ 4. AUTONOMOUS SHOPPING AGENTS │
│ ─────────────────────────────── │
│ • Agents that browse, compare, and purchase │
│ • Multi-platform optimization │
│ • Negotiation and deal-finding │
│ │
│ 5. GENERATIVE ITEM CREATION │
│ ───────────────────────────── │
│ • "Generate a product that would appeal to users like X" │
│ • Personalized content generation │
│ • Dynamic bundle creation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part IX: LLM RecSys in Production (2024-2025)
Industry Deployments
Major tech companies have moved beyond research to deploy LLM-powered recommendations at scale:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM RECSYS IN PRODUCTION (2024-2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NETFLIX │
│ ───────── │
│ • UniCoRn: Unified contextual ranker for search + recommendations │
│ • FM-Intent: Predicts user intent AND next item simultaneously │
│ • Trace: Meta-optimization of rec pipelines with LLM agents │
│ • Conversational RS: Context-aware preference understanding │
│ │
│ SPOTIFY │
│ ───────── │
│ • Semantic IDs: Discretized embeddings added to LLaMA vocabulary │
│ • Domain-aware LLMs: Fine-tuned on catalog entities │
│ • Unified model: Combined search + recommendation retrieval │
│ • Use cases: Playlist sequencing, podcast recs, explanations │
│ │
│ AMAZON │
│ ───────── │
│ • Semantic IDs for product retrieval │
│ • 30% recall increase in beauty category │
│ • LLM-powered product descriptions and comparisons │
│ │
│ MICROSOFT │
│ ─────────── │
│ • RecAI: Open-source LLM4Rec research platform │
│ • InteRecAgent: LLMs + traditional RecSys integration │
│ • Copilot Shopping: Conversational commerce │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Netflix: From Static Models to LLM-Powered Personalization
Netflix has been at the forefront of LLM adoption for recommendations. Key insights from the Netflix PRS 2025 Workshop:
# Netflix's approach: unified model consolidation (illustrative sketch;
# UnifiedRanker and ContextEncoder are placeholders, not public Netflix code)
import torch.nn as nn

class NetflixUniCoRn:
"""
UniCoRn: Unified Contextual Ranker
Serves both search and recommendations with a single model.
"""
    def __init__(self, hidden_dim: int = 512):
# Single transformer model for multiple tasks
self.unified_model = UnifiedRanker()
# Task-specific heads
self.search_head = nn.Linear(hidden_dim, 1)
self.rec_head = nn.Linear(hidden_dim, 1)
# Context encoder (handles diverse signals)
self.context_encoder = ContextEncoder()
def rank(
self,
user_context: dict,
candidates: list[dict],
task: str, # "search" or "recommend"
) -> list[float]:
"""
Unified ranking for search and recommendations.
Key insight: Same user signals, same item features,
just different task heads.
"""
# Encode context (same for both tasks)
context_emb = self.context_encoder(user_context)
# Encode candidates
candidate_embs = self.item_encoder(candidates)
# Cross-attention
hidden = self.unified_model(context_emb, candidate_embs)
# Task-specific scoring
if task == "search":
scores = self.search_head(hidden)
else:
scores = self.rec_head(hidden)
return scores
# FM-Intent: Predict intent and item together
class FMIntent:
"""
Netflix's intent-aware recommendation.
Predicts WHAT user wants to do and WHICH item simultaneously.
"""
def predict(self, user_state: dict) -> tuple[str, list[dict]]:
"""
Returns:
intent: "browse", "search", "continue_watching", etc.
items: Recommended items for that intent
"""
# Joint prediction of intent and items
# Not sequential (intent → items) but parallel
pass
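FM-Intent's internals aren't public; as a rough sketch, "joint prediction" can be read as a shared user representation feeding two heads trained together rather than intent being predicted first and items second. All names and dimensions below are illustrative:

import torch
import torch.nn as nn

class JointIntentItemHead(nn.Module):
    """
    Illustrative joint head: one shared user representation, two losses
    optimized together (intent and next item in parallel, not in sequence).
    """
    def __init__(self, hidden_dim: int, num_intents: int, num_items: int):
        super().__init__()
        self.intent_head = nn.Linear(hidden_dim, num_intents)
        self.item_head = nn.Linear(hidden_dim, num_items)

    def forward(self, user_repr: torch.Tensor):
        return self.intent_head(user_repr), self.item_head(user_repr)

    def loss(self, user_repr, intent_labels, item_labels, alpha: float = 0.5):
        intent_logits, item_logits = self(user_repr)
        return (alpha * nn.functional.cross_entropy(intent_logits, intent_labels)
                + (1 - alpha) * nn.functional.cross_entropy(item_logits, item_labels))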
Netflix key learnings:
- Model consolidation: Fewer specialized models, more unified architectures
- LLMs for meta-optimization: Trace uses LLM agents to optimize recommendation pipelines
- Periodic fine-tuning + RAG: Keeps models fresh without constant retraining
Spotify: Domain-Aware LLMs with Semantic IDs
Spotify's approach makes LLMs "domain-aware" by grounding them in catalog knowledge:
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpotifyDomainLLM:
"""
Spotify's approach: Add catalog knowledge to LLM vocabulary.
"""
def __init__(self, base_llm: str = "llama-3-8b"):
self.llm = AutoModelForCausalLM.from_pretrained(base_llm)
self.tokenizer = AutoTokenizer.from_pretrained(base_llm)
# Semantic tokenization of catalog entities
self.semantic_tokenizer = SemanticTokenizer()
def add_catalog_to_vocabulary(self, catalog: list[dict]):
"""
Convert catalog entities to semantic IDs and add to vocabulary.
Process:
1. Encode entities (artists, tracks, podcasts) with embeddings
2. Discretize embeddings via LSH into "semantic tokens"
3. Add semantic tokens to LLM vocabulary
4. Fine-tune LLM on recommendation tasks
"""
for entity in catalog:
# Get embedding from content encoder
embedding = self.content_encoder(entity)
# Discretize to semantic ID (e.g., 4-8 tokens)
semantic_id = self.semantic_tokenizer.encode(embedding)
# Add to vocabulary with special prefix
token_str = f"<{entity['type']}:{semantic_id}>"
self.tokenizer.add_tokens([token_str])
# Resize model embeddings
self.llm.resize_token_embeddings(len(self.tokenizer))
def recommend_with_instructions(
self,
user_history: list[str], # Semantic IDs of past interactions
instruction: str, # e.g., "Create an upbeat workout playlist"
) -> list[str]:
"""
Generate recommendations that follow user instructions.
Unique capability: Steerable recommendations via natural language.
"""
prompt = f"""User's listening history:
{' '.join(user_history[-20:])}
Instruction: {instruction}
Generate a sequence of recommended tracks:"""
output = self.llm.generate(prompt, max_tokens=100)
return self._parse_semantic_ids(output)
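The "discretize via LSH" step can be sketched with random hyperplanes: each sign bit of a projection becomes part of the code, so nearby embeddings collide into the same semantic tokens. A toy illustration under that assumption, not Spotify's actual tokenizer:

import numpy as np

class SemanticTokenizer:
    """
    Toy LSH tokenizer: random hyperplanes turn an embedding into a short
    code; similar embeddings share codes, so semantic IDs group entities.
    """
    def __init__(self, emb_dim: int, bits_per_token: int = 8,
                 num_tokens: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One hyperplane per bit, grouped into num_tokens chunks
        self.planes = rng.normal(size=(num_tokens, bits_per_token, emb_dim))

    def encode(self, embedding: np.ndarray) -> str:
        tokens = []
        for chunk in self.planes:
            bits = (chunk @ embedding > 0).astype(int)
            tokens.append(str(int("".join(map(str, bits)), 2)))
        return "-".join(tokens)  # e.g. "142-7-201-33"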
Spotify use cases enabled:
- Playlist sequencing with coherent flow
- Cold-start video recommendations
- Personalized podcast discovery
- Natural language recommendation explanations
- Unified search + recommendation
Key Frameworks and Tools
InteRecAgent (Microsoft, TOIS 2025)
InteRecAgent bridges LLMs and traditional recommenders:
class InteRecAgent:
"""
InteRecAgent: LLM as brain, RecSys as tools.
Paper: https://dl.acm.org/doi/10.1145/3731446
"""
def __init__(self, llm_client, rec_tools: dict):
self.llm = llm_client
# Traditional RecSys models as tools
self.tools = {
"collaborative_filter": rec_tools["cf_model"],
"content_based": rec_tools["content_model"],
"popularity": rec_tools["popularity_model"],
"search": rec_tools["search_index"],
}
# Memory for conversation state
self.memory = ConversationMemory()
# Task planner
self.planner = TaskPlanner(llm_client)
async def interact(self, user_message: str, user_id: str) -> str:
"""
Interactive recommendation through conversation.
LLM decides which tools to use and how to combine results.
"""
# Plan tasks based on user message
tasks = await self.planner.plan(user_message, self.memory)
results = {}
for task in tasks:
if task.type == "get_recommendations":
results["recs"] = self.tools["collaborative_filter"].recommend(
user_id, n=task.params.get("n", 10)
)
elif task.type == "search":
results["search"] = self.tools["search"].search(
task.params["query"]
)
elif task.type == "explain":
results["explanation"] = await self._generate_explanation(
results.get("recs", [])
)
# Synthesize response
response = await self._synthesize_response(results, user_message)
# Update memory
self.memory.add(user_message, response)
return response
InteRecAgent benefits:
- Traditional RecSys handles behavioral patterns efficiently
- LLM handles natural language understanding and explanation
- Modular: Can upgrade either component independently
TALLRec (RecSys 2023)
TALLRec provides a tuning framework for aligning LLMs with recommendations:
# TALLRec: Two-stage tuning for recommendation LLMs
from transformers import AutoModelForCausalLM, AutoTokenizer

class TALLRecTrainer:
"""
TALLRec tuning framework.
Stage 1: Instruction tuning (general capability)
Stage 2: Recommendation tuning (domain-specific)
"""
def __init__(self, base_model: str = "llama-7b"):
self.model = AutoModelForCausalLM.from_pretrained(base_model)
self.tokenizer = AutoTokenizer.from_pretrained(base_model)
def stage1_instruction_tuning(self, instruction_data: list[dict]):
"""
Stage 1: General instruction following.
Uses Stanford Alpaca or similar data.
"""
# Standard instruction tuning
for example in instruction_data:
prompt = f"Instruction: {example['instruction']}\nResponse:"
target = example['response']
# Train with cross-entropy loss
pass
def stage2_recommendation_tuning(self, rec_data: list[dict]):
"""
Stage 2: Recommendation-specific tuning.
Teaches the model to recommend items.
"""
# Recommendation-specific prompts
for example in rec_data:
prompt = f"""User has interacted with: {example['history']}
Based on this history, recommend the next item."""
target = example['next_item']
# Train with cross-entropy loss
pass
def create_rec_prompt(self, history: list[str], task: str) -> str:
"""Create recommendation prompt in TALLRec format."""
templates = {
"sequential": "Given the user's history: {history}, predict the next item.",
"rating": "How would this user rate {item}? History: {history}",
"explanation": "Explain why {item} is recommended given: {history}",
}
return templates[task].format(history=", ".join(history))
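Both stages above reduce to the same causal-LM training step. A minimal sketch of that step (TALLRec itself trains LoRA adapters rather than full weights, which this sketch omits):

import torch

def training_step(model, tokenizer, prompt: str, target: str, optimizer) -> float:
    """
    One cross-entropy step: loss is computed only on the target tokens,
    with prompt tokens masked out via the -100 label convention.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # Ignore loss on the prompt

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()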
MSRBench: Evaluating LVLMs for Recommendations
MSRBench (ACM Web Conference 2025) provides the first comprehensive benchmark for Large Vision-Language Models in multimodal sequential recommendation:
class MSRBenchEvaluator:
"""
MSRBench: Benchmark for LVLMs in recommendation.
Tests GPT-4V, GPT-4o, Claude-3-Opus on next-item prediction.
"""
# Integration strategies tested
STRATEGIES = [
"lvlm_as_recommender", # Direct recommendation
"lvlm_as_item_enhancer", # Generate item descriptions
"lvlm_as_reranker", # Rerank traditional candidates
"hybrid_enhance_rerank", # Combination
]
    def evaluate(
        self,
        model: str,  # "gpt-4-vision", "gpt-4o", "claude-3-opus"
        users: list[dict],
        dataset: str = "amazon_review_plus",
    ) -> dict:
        """
        Evaluate an LVLM on next-item prediction with images,
        across every integration strategy.
        """
        results = {}
        for strategy in self.STRATEGIES:
            if strategy == "lvlm_as_reranker":
                # Best performing strategy:
                # traditional model retrieves, LVLM reranks
                reranked = [
                    self.lvlm_rerank(
                        model, user, self.traditional_model.retrieve(user, k=100)
                    )
                    for user in users
                ]
                results[strategy] = self.compute_metrics(reranked)
        return results
def lvlm_rerank(
self,
model: str,
user_context: dict,
candidates: list[dict],
) -> list[dict]:
"""
Use LVLM to rerank candidates based on images + text.
"""
prompt = f"""Given this user's recent purchases:
{self._format_history_with_images(user_context['history'])}
Rank these candidate items by relevance:
{self._format_candidates_with_images(candidates)}
Return ranked item IDs."""
response = self.call_lvlm(model, prompt)
return self._parse_ranking(response)
MSRBench key findings:
- Using LVLMs as rerankers is the most effective strategy
- GPT-4o consistently outperforms GPT-4V and Claude-3-Opus
- Computational cost remains a barrier to real-time adoption
- Multimodal context significantly improves cold-start performance
RecSys 2025 Best Paper Insights
The RecSys 2025 Best Paper focused on conformal risk control for mitigating unwanted recommendations—a key concern as LLMs generate more creative outputs.
Key 2025 research themes:
- Fine-tuning + RAG combination: Keeps models fresh without constant retraining
- LLM agents for pipeline optimization: Meta-level improvements
- Multimodal integration: Images, video, audio in recommendations
- Scalability solutions: Efficient LLM serving for real-time recommendations
Part X: Prompt Engineering for Recommendations
Why Prompting Matters for RecSys
The quality of LLM-powered recommendations depends heavily on how you structure prompts. Unlike traditional ML where the model architecture determines capability, LLMs can perform radically different tasks based on prompt design. A well-crafted prompt can mean the difference between generic suggestions and personalized, actionable recommendations.
The fundamental insight: The same LLM with different prompts produces vastly different recommendation quality. Prompts determine what user context the LLM considers, how it reasons about preferences, and whether outputs are reliable enough for production use.
Five key dimensions of recommendation prompts:
- Context Framing: How you present user history and preferences. Recency, relevance, and diversity of context all matter. Dumping the entire history is counterproductive; selective context yields better results.
- Task Specification: What exactly you want the LLM to do. "Recommend items" is vague. "Select 5 items under $100 that match their casual style preferences" is actionable.
- Output Structure: The format for reliable parsing. Free-text responses are hard to use programmatically. JSON arrays of item IDs integrate cleanly with downstream systems.
- Reasoning Guidance: Whether to encourage chain-of-thought. For complex recommendations, asking the LLM to first analyze preferences, then match candidates, improves quality and provides explainability.
- Constraints: Guardrails on what can and cannot be recommended. In-stock items only, price limits, excluded categories, and valid item ID lists prevent hallucination.
Core Prompt Patterns
Pattern 1: Direct Recommendation
The simplest pattern: provide context, request recommendations, specify format. Best for fast recommendations when you have a good candidate set. Structure the prompt with clear sections: user history (most recent first), candidate items (with IDs, titles, categories, prices), the task (select exactly N items), and output format (JSON array of item IDs).
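A minimal sketch of this pattern, assuming hypothetical item dictionaries with id, title, category, and price fields:

# Pattern 1 sketch: direct recommendation prompt (hypothetical data model)
def build_direct_prompt(history: list[dict], candidates: list[dict], n: int = 5) -> str:
    """Assemble user context, candidates, task, and output format."""
    history_lines = "\n".join(
        f"- {h['title']} ({h['category']})" for h in history  # most recent first
    )
    candidate_lines = "\n".join(
        f"- id={c['id']} | {c['title']} | {c['category']} | ${c['price']}"
        for c in candidates
    )
    return f"""User's recent interactions (most recent first):
{history_lines}

Candidate items:
{candidate_lines}

Task: Select exactly {n} items from the candidates above that best match
the user's preferences.

Output format: a JSON array of item IDs, e.g. ["id1", "id2"]."""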
Pattern 2: Chain-of-Thought Recommendation
Encourage explicit reasoning for better recommendations and explainability. Structure the prompt to guide step-by-step analysis: first identify patterns in user history (categories, price range, brands, time patterns), then understand current intent (browsing vs buying, new interest vs continuing pattern), then match candidates (explain fit for each), and finally provide ranked recommendations with confidence scores.
This pattern is more expensive (more tokens) but produces higher-quality recommendations for complex queries and provides reasoning that can be shown to users or used for debugging.
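One way to encode that guidance is a single template string; the four steps below mirror the structure just described, and the {history} and {candidates} placeholders are illustrative:

# Pattern 2 sketch: chain-of-thought recommendation prompt
# ({history} and {candidates} are filled in by the caller)
COT_TEMPLATE = """Analyze this user step by step, then recommend items.

User history (most recent first):
{history}

Candidate items:
{candidates}

Step 1 - Preference patterns: identify categories, price range, brands,
and time patterns in the history.
Step 2 - Current intent: is the user browsing or buying? Pursuing a new
interest or continuing an existing pattern?
Step 3 - Candidate matching: for each candidate, explain how well it fits
the patterns and the intent.
Step 4 - Final answer: a ranked list of item IDs, each with a confidence
score between 0 and 1."""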
Pattern 3: Persona-Based Prompting
Assign the LLM a specific expert persona for domain-specific recommendations. A fashion recommendation prompt might begin: "You are a personal stylist with 15 years of experience at luxury fashion houses. You understand body types, color theory, occasion dressing, and current trends."
Different domains benefit from different personas—a sommelier for wine, a tech reviewer for electronics, a literary curator for books. The persona shapes the recommendation style, vocabulary, and what factors the LLM emphasizes.
Pattern 4: Few-Shot Learning
Show examples of good recommendations to guide the model's output style. Include 2-3 examples showing: user history summary, user query, recommended item, and explanation. Then present the current task in the same format. This is particularly effective for maintaining consistent tone and reasoning depth across your recommendation system.
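A sketch of the few-shot structure; the two examples are invented placeholders standing in for curated ones:

# Pattern 4 sketch: few-shot prompt (the examples are invented placeholders)
FEW_SHOT_PROMPT = """You recommend products with a one-sentence explanation.

Example 1:
History: trail-running shoes, hydration vest
Query: "something for longer runs"
Recommendation: energy gel variety pack
Why: complements the user's existing endurance-running gear.

Example 2:
History: cast-iron skillet, chef's knife
Query: "level up my cooking"
Recommendation: instant-read thermometer
Why: a natural next tool for precision home cooking.

Now the real task, in the same format:
History: {history}
Query: {query}
Recommendation:"""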
Optimization Techniques
Dynamic Context Selection: Not all user history is equally relevant. For a query about running shoes, recent athletic wear purchases matter more than a book bought last year. Select context based on recency, relevance to current query (via embedding similarity or keyword matching), and diversity (include variety of categories to capture full preference profile). Typically 10-20 carefully selected interactions outperform hundreds of undifferentiated history items.
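A minimal sketch of this selection logic, assuming a caller-supplied embed function that maps text to a unit-normalized vector:

# Sketch: pick the most relevant slice of history for the current query
import numpy as np

def select_context(
    query: str,
    history: list[dict],   # each interaction: {"title": str, "timestamp": float}
    embed,                 # assumed: maps text -> unit-normalized np.ndarray
    k: int = 15,
    recency_weight: float = 0.3,
) -> list[dict]:
    """Score interactions by similarity to the query, with a recency boost."""
    q = embed(query)
    newest_first = sorted(history, key=lambda h: h["timestamp"], reverse=True)
    scored = []
    for rank, item in enumerate(newest_first):
        relevance = float(np.dot(q, embed(item["title"])))
        recency = 1.0 / (1 + rank)  # newest interaction gets the largest boost
        scored.append((relevance + recency_weight * recency, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]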
Output Constraints and Validation: The most critical technique for production systems. Constrain the LLM to ONLY recommend from a provided list of valid item IDs. Specify constraints explicitly: maximum price, allowed categories, excluded brands, in-stock only. After receiving the response, always validate that returned IDs exist in your catalog—never trust LLM output without verification.
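A sketch of that validation step, assuming the LLM was asked for a JSON array of item IDs:

# Sketch: validate LLM output against the catalog before serving it
import json

def validate_recommendations(
    raw_response: str,
    valid_ids: set[str],
    n_expected: int,
) -> list[str]:
    """Parse the response and keep only IDs that exist in the catalog."""
    try:
        ids = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # caller falls back to a traditional recommender
    if not isinstance(ids, list):
        return []
    verified = [i for i in ids if isinstance(i, str) and i in valid_ids]
    return verified[:n_expected]  # drop hallucinated and excess IDs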
Temperature for Diversity: Lower temperature (0.3) produces focused, consistent recommendations—good for "more like this" scenarios. Higher temperature (0.9-1.0) produces more creative, unexpected suggestions—good for discovery. For most use cases, balanced temperature (0.6-0.7) provides a mix of safe bets and discoveries.
Multi-Sample Aggregation: For discovery-focused recommendations, generate multiple recommendation sets with high temperature and aggregate. Items appearing in multiple samples are more robust recommendations. Items appearing in only one sample are more exploratory.
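A sketch of the aggregation logic, assuming a generate_recs callable that returns one sampled list of item IDs per call:

# Sketch: aggregate multiple high-temperature samples
from collections import Counter

def aggregate_samples(generate_recs, n_samples: int = 5) -> tuple[list[str], list[str]]:
    """Split sampled items into robust (repeated) and exploratory (one-off)."""
    counts = Counter()
    for _ in range(n_samples):
        counts.update(set(generate_recs()))  # dedupe within each sample
    robust = [item for item, c in counts.most_common() if c >= 2]
    exploratory = [item for item, c in counts.items() if c == 1]
    return robust, exploratory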
Versioned Prompt Templates
Production systems need tested, versioned prompt templates for different scenarios:
- Quick suggestions: Fast, low-token prompts for homepage recommendations. Temperature 0.5, max 100 tokens.
- Detailed recommendation: Full context, chain-of-thought, explanations. Temperature 0.7, max 1000 tokens.
- Cold start: For new users with no history. Focus on stated interests and popular items. Temperature 0.6.
- Explanation only: Generate explanations for recommendations made by traditional models. Temperature 0.5, max 150 tokens.
Version your templates, track which versions are in production, and A/B test changes. Prompt engineering is iterative—small wording changes can significantly impact recommendation quality.
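One way to organize this is a small registry keyed by scenario and version; the names, template strings, and parameter values below are illustrative:

# Sketch: versioned prompt templates keyed by (scenario, version)
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str
    temperature: float
    max_tokens: int

TEMPLATES = {
    ("quick_suggestions", "v3"): PromptTemplate(
        name="quick_suggestions", version="v3",
        template="Recommend {n} items for a user who recently viewed: "
                 "{history}. Return a JSON array of item IDs.",
        temperature=0.5, max_tokens=100,
    ),
    ("detailed", "v2"): PromptTemplate(
        name="detailed", version="v2",
        template="Analyze this user's history step by step, then recommend "
                 "{n} items with explanations. History: {history}. "
                 "Candidates: {candidates}.",
        temperature=0.7, max_tokens=1000,
    ),
}

def get_template(name: str, version: str) -> PromptTemplate:
    """Look up a pinned version; production config decides which is live."""
    return TEMPLATES[(name, version)]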
Common Mistakes to Avoid
┌─────────────────────────────────────────────────────────────────────────┐
│ COMMON PROMPTING MISTAKES IN RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. TOO MUCH CONTEXT │
│ ───────────────────── │
│ ✗ "Here's the user's entire 3-year history..." │
│ ✓ Select 10-20 most relevant recent interactions │
│ │
│ 2. VAGUE INSTRUCTIONS │
│ ─────────────────────── │
│ ✗ "Recommend some good items" │
│ ✓ "Recommend 5 items matching their style preferences, under $100" │
│ │
│ 3. NO OUTPUT FORMAT │
│ ───────────────────── │
│ ✗ "Give me your recommendations" │
│ ✓ "Return a JSON array of item IDs: [\"id1\", \"id2\", ...]" │
│ │
│ 4. ALLOWING HALLUCINATION │
│ ─────────────────────────── │
│ ✗ "Recommend items for this user" │
│ ✓ "Recommend ONLY from this list: [item_id_1, item_id_2, ...]" │
│ │
│ 5. IGNORING CONSTRAINTS │
│ ───────────────────────── │
│ ✗ Generic recommendations regardless of availability │
│ ✓ Specify: in_stock, max_price, excluded_categories │
│ │
│ 6. ONE-SIZE-FITS-ALL │
│ ────────────────────── │
│ ✗ Same prompt for all recommendation scenarios │
│ ✓ Different templates for: quick, detailed, cold_start, explanation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Related Articles
Recommendation Systems: From Collaborative Filtering to Deep Learning
A comprehensive journey through recommendation system architectures. From the Netflix Prize and matrix factorization to neural collaborative filtering and two-tower models—understand the foundations before the transformer revolution.
Transformers for Recommendation Systems: From SASRec to HSTU
A comprehensive deep dive into transformer-based recommendation systems. From the fundamentals of sequential recommendation to Meta's trillion-parameter HSTU, understand how attention mechanisms revolutionized personalization.
Building Agentic AI Systems: A Complete Implementation Guide
A comprehensive guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
LLM-Powered Search for E-Commerce: Beyond NER and Elasticsearch
A deep dive into building intelligent e-commerce search systems that understand natural language, leverage metadata effectively, and support multi-turn conversations—moving beyond classical NER + Elasticsearch approaches.
Structured Outputs and Tool Use: Patterns for Reliable AI Applications
Master structured output generation and tool use patterns—JSON mode, schema enforcement, Instructor library, function calling best practices, error handling, and production patterns for reliable AI applications.
Embedding Models & Strategies: Choosing and Optimizing Embeddings for AI Applications
Comprehensive guide to embedding models for RAG, search, and AI applications. Comparison of text-embedding-3, BGE, E5, Cohere Embed v4, and Voyage with guidance on fine-tuning, dimensionality, multimodal embeddings, and production optimization.
Advanced Prompt Engineering: From Basic to Production-Grade
Master the techniques that separate amateur prompts from production systems—chain-of-thought, structured outputs, model-specific optimization, and prompt architecture.