Generative AI for Recommendation Systems: LLMs Meet Personalization
A comprehensive guide to LLM-powered recommendation systems. From feature augmentation to conversational agents, understand how generative AI is transforming personalization.
The Convergence of LLMs and RecSys
At RecSys 2025 in Prague, one trend dominated: Large Language Models and recommendation systems are converging. This isn't hype—it's a fundamental shift in how we think about personalization.
Traditional recommendation systems excel at collaborative filtering: finding patterns in user-item interactions. But they struggle with:
- Cold start: New users and items have no interaction history
- Explainability: Why was this recommended?
- Natural interaction: Users want to converse, not just click
- Semantic understanding: "I want something like that movie but more uplifting"
LLMs address all of these. They understand language, reason about preferences, and generate explanations. The question is no longer "should we use LLMs for recommendations?" but "how?"
┌─────────────────────────────────────────────────────────────────────────┐
│ TRADITIONAL RECSYS vs LLM-ENHANCED RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RECSYS: │
│ ──────────────────── │
│ │
│ User → [Interaction History] → Collaborative Filtering → Items │
│ │
│ Strengths: │
│ + Fast inference (embeddings + ANN) │
│ + Captures behavioral patterns │
│ + Well-understood, mature │
│ │
│ Weaknesses: │
│ - Cold start for new users/items │
│ - No natural language understanding │
│ - Black box (limited explainability) │
│ - Static (can't reason about context) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LLM-ENHANCED RECSYS: │
│ │
│ User → [Natural Language + History] → LLM Reasoning → Items │
│ │
│ Strengths: │
│ + Handles cold start via content understanding │
│ + Natural conversational interface │
│ + Explainable ("I recommended this because...") │
│ + Can reason about complex preferences │
│ + Zero/few-shot adaptation to new domains │
│ │
│ Weaknesses: │
│ - Higher latency and cost │
│ - Less precise on behavioral patterns │
│ - Hallucination risks │
│ - Harder to A/B test and control │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE WINNING APPROACH: HYBRID │
│ │
│ LLMs augment traditional systems, not replace them │
│ • LLM for understanding, reasoning, explanation │
│ • Traditional models for fast retrieval, behavioral patterns │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2025 State of the Art: A comprehensive survey analyzing 50+ studies identifies three fundamental paradigms: Recommender-oriented (LLMs enhance recommendation mechanisms), Interaction-oriented (conversational recommendations), and Simulation-oriented (multi-agent systems modeling user-item dynamics).
Part I: The LLM-RecSys Taxonomy
Three Paradigms for LLM Integration
Research has converged on three primary ways to integrate LLMs with recommendation systems:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM-RECSYS INTEGRATION PARADIGMS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECOMMENDER-ORIENTED (Enhance the Model) │
│ ───────────────────────────────────────────── │
│ │
│ LLM augments or replaces traditional recommendation components │
│ │
│ Approaches: │
│ • Knowledge Enhancement: LLM generates item descriptions, features │
│ • Interaction Enhancement: LLM enriches user-item signals │
│ • Model Enhancement: LLM as scorer, ranker, or full recommender │
│ │
│ Example: LLMRec, CoLLM, P5 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. INTERACTION-ORIENTED (Conversational) │
│ ────────────────────────────────────────── │
│ │
│ LLM enables natural language interaction for recommendations │
│ │
│ Approaches: │
│ • Conversational recommendation systems (CRS) │
│ • Explainable recommendations via dialogue │
│ • Preference elicitation through conversation │
│ │
│ Example: Chat-REC, RecLLM, InteRecAgent │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. SIMULATION-ORIENTED (Multi-Agent) │
│ ─────────────────────────────────────── │
│ │
│ LLM-powered agents simulate users, items, and system dynamics │
│ │
│ Approaches: │
│ • User simulation for training/evaluation │
│ • Item agents for dynamic pricing/availability │
│ • Ecosystem simulation for policy testing │
│ │
│ Example: RecAgent, Agent4Rec, CRAVE │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Operational Distinctions
Within these paradigms, systems differ in how the LLM operates:
Model-Centric LLMRec: The LLM is fine-tuned or prompt-engineered to directly produce recommendations. Items are mapped to tokens, and the LLM generates item sequences.
Hybrid LLMRec: The LLM augments traditional models—generating features, enhancing embeddings, or providing semantic signals that feed into collaborative filtering.
Agentic LLMRec: The LLM acts as an autonomous agent, using tools (search, database queries, APIs) to gather information and make recommendations through multi-step reasoning.
Part II: Knowledge Enhancement
LLMs as Feature Generators
The simplest integration: use LLMs to generate rich features for items and users. This approach is low-risk and immediately valuable—you're not replacing your recommendation system, just making it smarter with better features.
Why LLM-generated features are powerful:
Traditional item features are either:
- Structured metadata: Category, brand, price. Limited and requires manual curation.
- Embeddings: Dense vectors from models trained on similar items. Good but opaque.
LLMs can generate semantic features that capture nuances humans understand but traditional systems miss:
Traditional features for "Patagonia Fleece Jacket":
─────────────────────────────────────────────────────────────────────────
category: "outerwear"
brand: "patagonia"
price: 150
color: "blue"
LLM-generated features:
─────────────────────────────────────────────────────────────────────────
target_audience: "environmentally-conscious outdoor enthusiasts, 25-45"
use_cases: ["hiking", "casual everyday wear", "light camping"]
emotional_appeal: "rugged reliability, environmental responsibility"
style: "casual athletic, works with jeans or hiking pants"
similar_buyers_also_like: ["hiking boots", "wool base layers", "camping gear"]
The LLM-generated features enable recommendations that traditional systems can't make: "Users who care about sustainability might also like these eco-friendly products."
When to use LLM feature generation:
- Cold-start items: New products with no user interaction data
- Long-tail items: Products with sparse interaction history
- Cross-category recommendations: Understanding that camping gear buyers might want sustainable products
- Explanation generation: Why did we recommend this?
Cost considerations:
LLM calls are expensive. Don't call them per-request. Instead:
- Batch processing: Generate features for all items offline
- Caching: Store generated features in your feature store
- Selective enrichment: Only use LLMs for items where traditional features are insufficient
import json

from anthropic import Anthropic

client = Anthropic()

def parse_json(text: str) -> dict:
    """
    Pull the JSON object out of an LLM response, tolerating surrounding prose.
    Shared helper used by the examples throughout this guide.
    """
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start:end + 1])
def generate_item_features(item: dict) -> dict:
"""
Use LLM to generate rich semantic features for items.
These features can augment traditional embeddings.
"""
prompt = f"""Analyze this product and extract structured features:
Product: {item['title']}
Category: {item['category']}
Description: {item['description']}
Price: ${item['price']}
Extract:
1. Target audience (demographics, interests)
2. Use cases (when/why someone would buy this)
3. Key attributes (quality level, style, features)
4. Emotional appeal (what feelings it evokes)
5. Similar products (what else might interest this buyer)
Format as JSON."""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def generate_user_profile(user_history: list[dict]) -> dict:
"""
Generate semantic user profile from interaction history.
"""
history_text = "\n".join([
f"- {item['title']} ({item['category']}) - {item['action']}"
for item in user_history[-20:] # Recent history
])
prompt = f"""Based on this user's recent activity, create a preference profile:
Recent Activity:
{history_text}
Extract:
1. Primary interests (top 3 categories/themes)
2. Price sensitivity (budget, mid-range, premium)
3. Style preferences (if discernible)
4. Likely needs (what problems they're solving)
5. Recommendation strategy (what to show next)
Format as JSON."""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
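To respect the cost guidance above, feature generation should run as an offline batch job with results cached in a feature store. Here is a minimal sketch, assuming a `feature_store` object with `get`/`set` methods (an illustrative interface, not a specific product) and the `generate_item_features` function from above:

import hashlib
import json

def enrich_catalog_offline(items: list[dict], feature_store) -> None:
    """
    Batch-enrich items with LLM features, skipping anything already cached.
    Assumes feature_store exposes get(key) and set(key, value).
    """
    for item in items:
        # Key on item content so catalog edits trigger re-generation
        content_hash = hashlib.sha256(
            json.dumps(item, sort_keys=True).encode()
        ).hexdigest()
        cache_key = f"llm_features:{item['id']}:{content_hash}"

        if feature_store.get(cache_key) is not None:
            continue  # Already enriched; no LLM call needed

        features = generate_item_features(item)  # One offline LLM call per item
        feature_store.set(cache_key, features)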
LLMRec: Graph Augmentation with LLMs
LLMRec (WSDM 2024) uses LLMs to augment the user-item interaction graph. This is a clever approach: instead of replacing your graph-based recommender, use LLMs to add missing edges to the graph.
The sparsity problem in recommendation graphs:
User-item interaction graphs are extremely sparse. A typical user interacts with <0.01% of items. This sparsity hurts recommendations:
- Users with few interactions get poor recommendations (cold start)
- Items with few interactions are never recommended (popularity bias)
- Implicit similarities aren't captured (if no user bought both A and B, no edge exists)
LLMRec's insight: LLMs can infer missing edges
LLMs understand semantic relationships that aren't in the interaction data:
User History: [Python Book, Machine Learning Course, GPU]
─────────────────────────────────────────────────────────────────────────
What the graph knows:
User → Python Book (purchased)
User → ML Course (enrolled)
User → GPU (purchased)
What LLM can infer (new edges to add):
Python Book ↔ ML Course (both for learning ML)
GPU ↔ ML Course (GPU needed for ML training)
User → "Data Science Tools" (implicit interest cluster)
Three types of augmentation LLMRec performs:
1. User profile augmentation: Generate a textual profile from the interaction history, embed it as a new node connected to the user
2. Item relationship augmentation: Ask the LLM to identify semantically related items, add edges between them
3. Interaction reasoning: For each user-item pair, generate an explanation of why the interaction happened, and use the explanation embedding to enrich the edge
Why this works better than just using LLM embeddings:
- Preserves graph structure: GNN-based recommenders rely on graph topology. Adding edges improves message passing.
- Cheaper than inference-time LLM: Augmentation is done once offline. Inference uses fast GNN.
- Combines strengths: LLM semantic understanding + GNN collaborative filtering
Implementation pattern:
OFFLINE PIPELINE:
1. For each item: LLM generates "related items" → add item-item edges
2. For each user: LLM generates "interest summary" → add user profile node
3. Retrain GNN on augmented graph
ONLINE INFERENCE:
Same as before—fast GNN inference, no LLM calls
class LLMRecAugmenter:
"""
LLMRec-style graph augmentation.
Uses LLM to generate synthetic interactions and enrich item features.
"""
def __init__(self, llm_client, item_catalog: dict):
self.llm = llm_client
self.items = item_catalog
def augment_item_graph(self, item_id: str) -> list[tuple[str, float]]:
"""
Generate synthetic item-item relationships via LLM reasoning.
"""
item = self.items[item_id]
prompt = f"""Given this item:
Title: {item['title']}
Category: {item['category']}
Description: {item['description'][:200]}
List 5 items that would strongly appeal to the same customer.
For each, explain why and rate confidence (0-1).
Format:
1. [Item type/category] - [Reason] - [Confidence]"""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
# Parse and match to actual catalog items
synthetic_edges = self._match_to_catalog(response.content[0].text)
return synthetic_edges
def generate_user_augmentation(
self,
user_history: list[str],
num_synthetic: int = 5
) -> list[str]:
"""
Generate synthetic interactions for sparse users.
Helps with cold start.
"""
history_items = [self.items[i] for i in user_history if i in self.items]
prompt = f"""A user has interacted with these items:
{self._format_items(history_items)}
Based on these preferences, what other items would they likely enjoy?
List {num_synthetic} item types/categories with confidence scores.
Focus on items that reveal underlying preferences (not obvious similar items)."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
synthetic_items = self._match_to_catalog(response.content[0].text)
return synthetic_items
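A sketch of how the augmenter slots into the offline pipeline described above. The `graph` and `gnn` objects are assumed placeholders for your graph store and GNN trainer, not a specific library:

def run_offline_augmentation(augmenter, graph, user_ids, item_ids):
    """
    Offline LLMRec-style pipeline: augment the graph once, then retrain.
    Online inference stays pure GNN, with no LLM calls.
    """
    # 1. Add LLM-inferred item-item edges
    for item_id in item_ids:
        for related_id, confidence in augmenter.augment_item_graph(item_id):
            graph.add_edge(item_id, related_id, weight=confidence)

    # 2. Add synthetic interactions for sparse users only
    for user_id in user_ids:
        history = graph.get_user_history(user_id)
        if len(history) < 5:  # Cold/sparse users benefit most
            for item_id in augmenter.generate_user_augmentation(history):
                graph.add_edge(user_id, item_id, weight=0.5, synthetic=True)

    # 3. Retrain the GNN on the augmented graph
    gnn.train(graph)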
Part III: LLMs as Recommenders
Direct Recommendation via Prompting
The most direct approach: ask the LLM to recommend items. This sounds simple, but doing it well requires understanding LLM limitations and designing around them.
Why direct prompting is appealing:
- Zero training: No ML infrastructure needed. Just prompt.
- Rich reasoning: LLM can explain why each recommendation fits.
- Context awareness: Can incorporate real-time context ("I'm shopping for a gift for my mom").
- Language understanding: Handles natural language queries that keyword search can't.
Why direct prompting is dangerous:
The LLM doesn't actually know your catalog. It hallucinates:
User: "Recommend running shoes under $100"
─────────────────────────────────────────────────────────────────────────
LLM response (without grounding):
"I recommend the Nike Air Zoom Pegasus 38..."
Problems:
❌ That shoe might cost $130 in your store
❌ You might not carry Nike at all
❌ The "Pegasus 38" might be discontinued
❌ LLM might invent products that don't exist
The solution: Retrieval-Augmented Recommendation
Never let the LLM recommend from its imagination. Always:
- Use traditional retrieval to get candidate items from YOUR catalog
- Provide those candidates in the prompt
- Ask LLM to rank/select from the provided candidates only
Correct pattern:
─────────────────────────────────────────────────────────────────────────
1. Embedding search: "running shoes" → 100 candidates from your catalog
2. Filter: price < $100 → 40 candidates
3. Prompt LLM: "From these 40 shoes, which 5 best match [user history]?"
4. LLM returns IDs from your candidate list (can't hallucinate)
Why candidate pre-filtering is essential:
LLMs can't efficiently process millions of items. Context windows are limited: even a 200K-token window holds only a few thousand short product descriptions. Pre-filter to 50-200 candidates using fast traditional methods, then use the LLM for intelligent ranking.
When to use direct LLM recommendation:
- Conversational commerce: User is chatting, asking questions
- Complex queries: "Something for a dinner party with vegetarians"
- Explanation-heavy: When users want to know WHY this recommendation
- Low-volume, high-value: B2B sales, luxury goods where personalization matters
When NOT to use:
- High-volume feeds: Homepage recommendations (too slow, too expensive)
- Latency-sensitive: Search results where 100ms matters
- Simple queries: "Show me popular laptops" (traditional RecSys is faster/cheaper)
class LLMRecommender:
"""
LLM as direct recommender via in-context learning.
"""
def __init__(self, llm_client, item_catalog: list[dict]):
self.llm = llm_client
self.items = item_catalog
self.item_index = {item['id']: item for item in item_catalog}
def recommend(
self,
user_history: list[str],
context: str = None,
num_recommendations: int = 10,
) -> list[dict]:
"""
Generate recommendations via LLM reasoning.
"""
# Format user history
history_text = self._format_history(user_history)
# Format candidate items (subset for efficiency)
candidates = self._get_candidates(user_history, n=100)
candidates_text = self._format_candidates(candidates)
prompt = f"""You are a recommendation system. Based on the user's history,
recommend items they would enjoy.
## User History (most recent first):
{history_text}
{f"## Current Context: {context}" if context else ""}
## Available Items:
{candidates_text}
## Task:
Select the {num_recommendations} best items for this user.
For each, explain why it matches their preferences.
Format:
1. [Item ID] - [Title] - [Reason]
2. ..."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
recommendations = self._parse_recommendations(response.content[0].text)
return recommendations
    def _get_candidates(self, user_history: list[str], n: int) -> list[dict]:
        """
        Pre-filter candidates using traditional retrieval;
        the LLM can't efficiently search millions of items.
        Minimal sketch assuming each catalog item carries an 'embedding' field.
        """
        import numpy as np
        history_embs = [
            self.item_index[i]["embedding"] for i in user_history if i in self.item_index
        ]
        if not history_embs:
            # Cold start: fall back to popularity
            return sorted(
                self.items, key=lambda x: x.get("popularity", 0), reverse=True
            )[:n]
        profile = np.mean(history_embs, axis=0)
        seen = set(user_history)
        candidates = [item for item in self.items if item["id"] not in seen]
        candidates.sort(
            key=lambda item: float(np.dot(profile, item["embedding"])), reverse=True
        )
        return candidates[:n]
P5: Pretrain, Personalized Prompt, and Predict
P5 frames multiple recommendation tasks as text generation:
class P5Recommender:
"""
P5-style unified recommendation via text generation.
All tasks formulated as sequence-to-sequence.
"""
# Task templates
TEMPLATES = {
"sequential": (
"User {user_id} has purchased {history}. "
"What will they purchase next?"
),
"rating": (
"How will user {user_id} rate {item}? "
"User's previous ratings: {history}"
),
"explanation": (
"User {user_id} purchased {item}. "
"Explain why based on their history: {history}"
),
"search": (
"User {user_id} searched for '{query}'. "
"Given their history {history}, recommend items."
),
}
def __init__(self, model_name: str = "google/flan-t5-xl"):
from transformers import T5ForConditionalGeneration, T5Tokenizer
self.tokenizer = T5Tokenizer.from_pretrained(model_name)
self.model = T5ForConditionalGeneration.from_pretrained(model_name)
def recommend_next(self, user_id: str, history: list[str]) -> str:
"""Sequential recommendation via text generation."""
prompt = self.TEMPLATES["sequential"].format(
user_id=user_id,
history=", ".join(history[-10:])
)
inputs = self.tokenizer(prompt, return_tensors="pt")
outputs = self.model.generate(**inputs, max_length=50)
return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
def explain_recommendation(
self,
user_id: str,
item: str,
history: list[str]
) -> str:
"""Generate explanation for a recommendation."""
prompt = self.TEMPLATES["explanation"].format(
user_id=user_id,
item=item,
history=", ".join(history[-10:])
)
inputs = self.tokenizer(prompt, return_tensors="pt")
outputs = self.model.generate(**inputs, max_length=200)
return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
CoLLM: Collaborative Embeddings in LLMs
CoLLM (TKDE 2025) integrates collaborative filtering embeddings directly into the LLM:
┌─────────────────────────────────────────────────────────────────────────┐
│ CoLLM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL LLM RECOMMENDATION: │
│ ───────────────────────────────── │
│ │
│ "User liked: iPhone, MacBook, AirPods" → LLM → "Recommend: iPad" │
│ │
│ Problem: LLM only sees text, not collaborative signals │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CoLLM APPROACH: │
│ ──────────────── │
│ │
│ 1. Train collaborative filtering model (e.g., matrix factorization) │
│ → User embeddings U, Item embeddings V │
│ │
│ 2. Map CF embeddings to LLM token space │
│ CF embedding → Projection → "Soft tokens" in LLM vocabulary │
│ │
│ 3. Inject soft tokens into LLM prompt │
│ "[USER_EMB] liked: iPhone, MacBook. Recommend: [ITEM_EMB]?" │
│ │
│ Benefits: │
│ + LLM sees collaborative signals (who else liked these items) │
│ + Combines semantic understanding with behavioral patterns │
│ + Can be fine-tuned end-to-end │
│ │
└─────────────────────────────────────────────────────────────────────────┘
import torch
import torch.nn as nn
class CoLLM(nn.Module):
"""
Collaborative LLM: Inject CF embeddings into LLM.
"""
def __init__(
self,
llm_model, # Pre-trained LLM
cf_user_embeddings: torch.Tensor, # (num_users, cf_dim)
cf_item_embeddings: torch.Tensor, # (num_items, cf_dim)
llm_dim: int = 4096,
cf_dim: int = 64,
):
super().__init__()
self.llm = llm_model
# Store CF embeddings
self.user_cf = nn.Embedding.from_pretrained(cf_user_embeddings, freeze=False)
self.item_cf = nn.Embedding.from_pretrained(cf_item_embeddings, freeze=False)
# Project CF embeddings to LLM hidden dimension
self.user_proj = nn.Sequential(
nn.Linear(cf_dim, llm_dim),
nn.LayerNorm(llm_dim),
)
self.item_proj = nn.Sequential(
nn.Linear(cf_dim, llm_dim),
nn.LayerNorm(llm_dim),
)
def forward(
self,
input_ids: torch.Tensor,
user_ids: torch.Tensor,
item_ids: torch.Tensor = None,
):
"""
Forward pass with collaborative embedding injection.
"""
# Get LLM input embeddings
input_embeds = self.llm.get_input_embeddings()(input_ids)
# Get collaborative embeddings
user_cf_emb = self.user_proj(self.user_cf(user_ids)) # (B, llm_dim)
user_cf_emb = user_cf_emb.unsqueeze(1) # (B, 1, llm_dim)
# Prepend user collaborative embedding as soft token
input_embeds = torch.cat([user_cf_emb, input_embeds], dim=1)
# If item_ids provided (for scoring), append item embedding
if item_ids is not None:
item_cf_emb = self.item_proj(self.item_cf(item_ids))
item_cf_emb = item_cf_emb.unsqueeze(1)
input_embeds = torch.cat([input_embeds, item_cf_emb], dim=1)
# Forward through LLM
outputs = self.llm(inputs_embeds=input_embeds)
return outputs
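A sketch of how CoLLM might be wired up, assuming a HuggingFace causal LM and CF embeddings exported from a pre-trained matrix factorization model. The file names and IDs below are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical inputs: CF embeddings exported from a trained MF model
user_emb = torch.load("mf_user_embeddings.pt")   # (num_users, 64)
item_emb = torch.load("mf_item_embeddings.pt")   # (num_items, 64)

base = "meta-llama/Llama-2-7b-hf"                # any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
llm = AutoModelForCausalLM.from_pretrained(base)

model = CoLLM(llm, user_emb, item_emb, llm_dim=llm.config.hidden_size, cf_dim=64)

# Score one (user, item) pair: the text is tokenized normally, while the
# projected CF embeddings enter as soft tokens around the prompt
prompt = "Will this user enjoy this item? Answer yes or no:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model(input_ids=input_ids,
                user_ids=torch.tensor([42]),
                item_ids=torch.tensor([1337]))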
Part IV: Conversational Recommendation
Chat-REC: Interactive LLM Recommendations
Chat-REC enables multi-turn conversational recommendations:
class ChatREC:
"""
Conversational Recommendation System using LLM.
Supports multi-turn dialogue for preference elicitation.
"""
def __init__(self, llm_client, retriever, item_catalog):
self.llm = llm_client
self.retriever = retriever # Traditional RecSys for candidates
self.items = item_catalog
def chat(
self,
user_message: str,
conversation_history: list[dict],
user_profile: dict,
) -> dict:
"""
Process user message and generate response with recommendations.
"""
# Classify intent
intent = self._classify_intent(user_message, conversation_history)
if intent == "ask_recommendation":
return self._handle_recommendation_request(
user_message, conversation_history, user_profile
)
elif intent == "provide_feedback":
return self._handle_feedback(
user_message, conversation_history, user_profile
)
elif intent == "ask_explanation":
return self._handle_explanation_request(
user_message, conversation_history
)
elif intent == "refine_preferences":
return self._handle_preference_refinement(
user_message, conversation_history, user_profile
)
else:
return self._handle_general_query(
user_message, conversation_history
)
def _classify_intent(
self,
message: str,
history: list[dict]
) -> str:
"""Classify user intent for routing."""
prompt = f"""Classify the user's intent in this conversation:
Conversation:
{self._format_history(history[-5:])}
User: {message}
Intent categories:
- ask_recommendation: User wants item suggestions
- provide_feedback: User gives opinion on suggested items
- ask_explanation: User wants to know why something was recommended
- refine_preferences: User clarifying or updating preferences
- general: Other queries
Reply with just the intent category."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=20,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text.strip().lower()
def _handle_recommendation_request(
self,
message: str,
history: list[dict],
profile: dict,
) -> dict:
"""Generate recommendations based on conversation."""
# Extract preferences from conversation
preferences = self._extract_preferences(message, history)
# Get candidates via traditional retrieval
candidates = self.retriever.retrieve(
user_profile=profile,
preferences=preferences,
n=50
)
# LLM selects and explains best matches
prompt = f"""Based on this conversation, recommend items:
Conversation:
{self._format_history(history[-5:])}
User's current request: {message}
Extracted preferences: {preferences}
Available items:
{self._format_items(candidates[:20])}
Select the 5 best items and explain why each matches the user's needs.
Be conversational and helpful."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=800,
messages=[{"role": "user", "content": prompt}]
)
recommendations = self._parse_recommendations(response.content[0].text)
return {
"response": response.content[0].text,
"recommendations": recommendations,
"intent": "ask_recommendation",
}
def _extract_preferences(
self,
message: str,
history: list[dict]
) -> dict:
"""Extract structured preferences from conversation."""
prompt = f"""Extract user preferences from this conversation:
Conversation:
{self._format_history(history)}
Current message: {message}
Extract:
- Category/type preferences
- Price range
- Specific features wanted
- Features to avoid
- Style/aesthetic preferences
- Use case/occasion
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
Proactive Preference Elicitation
The best conversational systems don't just respond—they proactively ask questions to understand preferences:
class ProactiveRecommender:
"""
Proactively elicits preferences through strategic questions.
"""
def __init__(self, llm_client, item_catalog):
self.llm = llm_client
self.items = item_catalog
def generate_clarifying_question(
self,
user_query: str,
known_preferences: dict,
candidate_items: list[dict],
) -> str:
"""
Generate a clarifying question to narrow down recommendations.
"""
# Identify dimensions with high variance in candidates
differentiating_dims = self._find_differentiating_dimensions(
candidate_items, known_preferences
)
prompt = f"""The user asked: "{user_query}"
We know these preferences: {known_preferences}
We have {len(candidate_items)} potential matches, varying mainly in:
{differentiating_dims}
Generate ONE clarifying question that would most help narrow down
the recommendations. Make it natural and conversational.
Don't ask about preferences we already know.
Focus on the most impactful differentiator."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
def should_ask_question(
self,
candidates: list[dict],
confidence_threshold: float = 0.7
) -> bool:
"""
Decide whether to ask a clarifying question or recommend.
"""
# If top candidates are very similar, we're confident
# If they're diverse, we should clarify
diversity = self._compute_diversity(candidates[:10])
return diversity > confidence_threshold
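The `_compute_diversity` helper above is left abstract; one minimal realization is the mean pairwise cosine distance over the top candidates. A method-level sketch that could be added to ProactiveRecommender, assuming each item dict carries an 'embedding' field:

    def _compute_diversity(self, items: list[dict]) -> float:
        """
        Mean pairwise cosine distance among candidate embeddings.
        Low diversity -> candidates are near-duplicates, safe to recommend;
        high diversity -> worth asking a clarifying question first.
        """
        import numpy as np
        if len(items) < 2:
            return 0.0
        embs = np.array([item["embedding"] for item in items], dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sims = embs @ embs.T
        n = len(items)
        mean_sim = (sims.sum() - n) / (n * (n - 1))  # Average off-diagonal similarity
        return float(1.0 - mean_sim)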
Part V: Agentic Recommendations
LLM Agents for Recommendations
The most sophisticated approach: LLMs as autonomous agents that use tools to gather information and make decisions.
from typing import Callable
class RecommendationAgent:
"""
LLM-powered recommendation agent with tool use.
"""
def __init__(self, llm_client, tools: dict[str, Callable]):
self.llm = llm_client
self.tools = tools
def recommend(
self,
user_request: str,
user_context: dict,
max_steps: int = 10,
) -> dict:
"""
Multi-step recommendation via agent reasoning.
"""
messages = [{
"role": "user",
"content": f"""You are a recommendation agent. Help the user find what they need.
User request: {user_request}
User context:
- Previous purchases: {user_context.get('purchase_history', [])}
- Browsing history: {user_context.get('browsing_history', [])}
- Preferences: {user_context.get('preferences', {})}
Available tools:
- search_catalog(query): Search items by text query
- get_item_details(item_id): Get detailed information about an item
- get_similar_items(item_id): Find items similar to a given item
- get_user_history(user_id): Get user's full interaction history
- get_trending_items(category): Get trending items in a category
- check_availability(item_id): Check stock and delivery options
Think step by step. Use tools to gather information, then make recommendations."""
}]
for step in range(max_steps):
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
messages=messages,
tools=self._format_tools(),
)
# Check if agent wants to use a tool
if response.stop_reason == "tool_use":
tool_use = response.content[-1]
tool_name = tool_use.name
tool_input = tool_use.input
# Execute tool
tool_result = self.tools[tool_name](**tool_input)
# Add to conversation
messages.append({"role": "assistant", "content": response.content})
messages.append({
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_use.id,
"content": str(tool_result)
}]
})
else:
# Agent is done, return final response
return {
"response": response.content[0].text,
"steps": step + 1,
"messages": messages,
}
return {"response": "Max steps reached", "steps": max_steps}
def _format_tools(self) -> list[dict]:
"""Format tools for Claude API."""
return [
{
"name": "search_catalog",
"description": "Search the product catalog by text query",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "get_item_details",
"description": "Get detailed information about a specific item",
"input_schema": {
"type": "object",
"properties": {
"item_id": {"type": "string", "description": "Item ID"}
},
"required": ["item_id"]
}
},
# ... more tools
]
Multi-Agent Recommendation Systems
RecAgent and Agent4Rec use multiple specialized agents:
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTI-AGENT RECOMMENDATION ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ ORCHESTRATOR │ │
│ │ AGENT │ │
│ └────────┬────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ RETRIEVAL │ │ RANKING │ │ EXPLANATION │ │
│ │ AGENT │ │ AGENT │ │ AGENT │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │ │ │ │
│ - Search catalog - Score relevance - Generate reasons │
│ - Filter by rules - Apply preferences - Answer questions │
│ - Get candidates - Re-rank results - Justify choices │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMMUNICATION FLOW: │
│ │
│ 1. User: "I need running shoes for marathon training" │
│ │
│ 2. Orchestrator → Retrieval: "Search for marathon running shoes" │
│ Retrieval → Orchestrator: [100 candidate shoes] │
│ │
│ 3. Orchestrator → Ranking: "Rank for marathon training" │
│ Ranking → Orchestrator: [Top 10 ranked shoes] │
│ │
│ 4. Orchestrator → Explanation: "Explain top 3 picks" │
│ Explanation → Orchestrator: [Detailed explanations] │
│ │
│ 5. Orchestrator → User: Final recommendations with explanations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
class MultiAgentRecommender:
"""
Multi-agent system for recommendations.
Specialized agents for different tasks.
"""
def __init__(self, llm_client, item_catalog, user_db):
self.llm = llm_client
self.items = item_catalog
self.users = user_db
# Specialized agents
self.agents = {
"retrieval": RetrievalAgent(llm_client, item_catalog),
"ranking": RankingAgent(llm_client),
"explanation": ExplanationAgent(llm_client),
"personalization": PersonalizationAgent(llm_client, user_db),
}
async def recommend(
self,
user_id: str,
query: str,
) -> dict:
"""
Coordinate agents to generate recommendations.
"""
# Step 1: Understand user context
user_profile = await self.agents["personalization"].get_profile(user_id)
        # Step 2: Retrieve candidates (returns candidates plus query analysis)
        retrieval = await self.agents["retrieval"].retrieve(
            query=query,
            user_preferences=user_profile["preferences"],
            n=100
        )
        # Step 3: Rank candidates
        ranked = await self.agents["ranking"].rank(
            candidates=retrieval["candidates"],
            user_profile=user_profile,
            query=query,
        )
        # Step 4: Generate explanations
        explained = await self.agents["explanation"].explain(
            items=ranked[:10],
            user_profile=user_profile,
            query=query,
        )
        return {
            "recommendations": explained,
            "query_understanding": retrieval["query_analysis"],
            "personalization": user_profile["summary"],
        }
class RetrievalAgent:
"""Agent specialized in candidate retrieval."""
def __init__(self, llm_client, item_catalog):
self.llm = llm_client
self.items = item_catalog
self.vector_store = self._build_vector_store(item_catalog)
async def retrieve(
self,
query: str,
user_preferences: dict,
n: int = 100
) -> dict:
"""
Retrieve candidates using multiple strategies.
"""
# LLM analyzes query
query_analysis = await self._analyze_query(query)
# Multiple retrieval strategies
semantic_results = self.vector_store.search(query, k=n)
category_results = self._category_filter(query_analysis["categories"])
attribute_results = self._attribute_filter(query_analysis["attributes"])
# LLM merges and deduplicates
merged = await self._merge_results(
semantic_results,
category_results,
attribute_results,
user_preferences,
)
return {
"candidates": merged[:n],
"query_analysis": query_analysis,
}
class ExplanationAgent:
"""Agent specialized in generating explanations."""
def __init__(self, llm_client):
self.llm = llm_client
async def explain(
self,
items: list[dict],
user_profile: dict,
query: str,
) -> list[dict]:
"""
Generate personalized explanations for recommendations.
"""
explained_items = []
for item in items:
explanation = await self._generate_explanation(
item, user_profile, query
)
explained_items.append({
**item,
"explanation": explanation["short"],
"detailed_explanation": explanation["detailed"],
"match_reasons": explanation["reasons"],
})
return explained_items
async def _generate_explanation(
self,
item: dict,
user_profile: dict,
query: str,
) -> dict:
"""Generate explanation for single item."""
prompt = f"""Explain why this item is recommended:
Item: {item['title']}
Category: {item['category']}
Features: {item['features']}
Price: ${item['price']}
User query: {query}
User preferences: {user_profile['preferences']}
User history themes: {user_profile['themes']}
Generate:
1. Short explanation (1 sentence)
2. Detailed explanation (2-3 sentences)
3. List of specific match reasons
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=300,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
Part VI: User Simulation for Evaluation
Synthetic Users via LLMs
LLMs can simulate user behavior for testing and evaluation:
import json
import random

class LLMUserSimulator:
"""
Simulate user behavior for recommendation evaluation.
"""
def __init__(self, llm_client):
self.llm = llm_client
def create_persona(self, persona_description: str) -> dict:
"""Create a detailed user persona."""
prompt = f"""Create a detailed user persona for recommendation testing:
Description: {persona_description}
Generate:
1. Demographics (age, location, occupation)
2. Interests and hobbies
3. Shopping preferences (price sensitivity, brand loyalty)
4. Past purchase patterns
5. Decision-making style
6. Common objections/concerns
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def simulate_response(
self,
persona: dict,
recommendations: list[dict],
context: str = None,
) -> dict:
"""
Simulate how this persona would respond to recommendations.
"""
prompt = f"""You are simulating this user persona:
{json.dumps(persona, indent=2)}
They received these recommendations:
{self._format_recommendations(recommendations)}
{f"Context: {context}" if context else ""}
Simulate their response:
1. Which items would they click on? Why?
2. Which would they ignore? Why?
3. What would they say about the recommendations?
4. Would they convert (purchase)? Which item?
5. What's missing that they would want?
Be consistent with the persona's characteristics.
Format as JSON."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return parse_json(response.content[0].text)
def generate_interaction_trajectory(
self,
persona: dict,
item_catalog: list[dict],
num_interactions: int = 20,
) -> list[dict]:
"""
Generate a realistic interaction sequence for a persona.
Useful for creating synthetic training data.
"""
trajectory = []
browsing_context = []
for i in range(num_interactions):
prompt = f"""User persona:
{json.dumps(persona, indent=2)}
Previous interactions in this session:
{self._format_trajectory(trajectory[-5:])}
Available items (sample):
{self._format_items(random.sample(item_catalog, 20))}
What would this user do next?
- Browse a category?
- Search for something?
- Click on an item?
- Add to cart?
- Purchase?
- Leave?
Consider: time in session, previous actions, persona preferences.
Format: {{"action": "...", "item_id": "...", "reason": "..."}}"""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=150,
messages=[{"role": "user", "content": prompt}]
)
action = parse_json(response.content[0].text)
trajectory.append(action)
if action["action"] == "leave":
break
return trajectory
CRAVE: Collaborative Verbalized Experience
CRAVE (Best Paper at GenAIRecP 2025) uses agent experiences to improve recommendations:
from datetime import datetime

class CRAVESystem:
"""
CRAVE: Collaborative Verbalized Experience for Recommendations.
Agents learn from each other's experiences.
"""
def __init__(self, llm_client):
self.llm = llm_client
self.experience_bank = [] # Stored experiences
def collect_experience(
self,
user_query: str,
recommendations: list[dict],
user_feedback: dict,
agent_reasoning: str,
):
"""
Store verbalized experience from an interaction.
"""
# Verbalize the experience
experience = self._verbalize_experience(
user_query, recommendations, user_feedback, agent_reasoning
)
self.experience_bank.append(experience)
def _verbalize_experience(
self,
query: str,
recommendations: list[dict],
feedback: dict,
reasoning: str,
) -> dict:
"""Convert interaction to verbalized experience."""
prompt = f"""Summarize this recommendation interaction as a learning experience:
User query: {query}
Agent reasoning: {reasoning}
Recommendations made:
{self._format_recommendations(recommendations)}
User feedback:
- Clicked: {feedback.get('clicked', [])}
- Purchased: {feedback.get('purchased', [])}
- Dismissed: {feedback.get('dismissed', [])}
- Comments: {feedback.get('comments', '')}
Create a verbalized experience that captures:
1. What worked well
2. What could be improved
3. Key insight for similar future queries
Format as a concise lesson (2-3 sentences)."""
response = self.llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=200,
messages=[{"role": "user", "content": prompt}]
)
return {
"query_type": self._classify_query(query),
"lesson": response.content[0].text,
"success_rate": len(feedback.get('purchased', [])) / len(recommendations),
"timestamp": datetime.now().isoformat(),
}
def retrieve_relevant_experiences(
self,
current_query: str,
n: int = 5
) -> list[dict]:
"""
Find experiences relevant to current query.
"""
# Embed and search (simplified)
query_type = self._classify_query(current_query)
relevant = [
exp for exp in self.experience_bank
if exp["query_type"] == query_type
]
# Sort by success rate and recency
relevant.sort(
key=lambda x: (x["success_rate"], x["timestamp"]),
reverse=True
)
return relevant[:n]
def recommend_with_experience(
self,
query: str,
candidates: list[dict],
user_profile: dict,
) -> list[dict]:
"""
Make recommendations informed by past experiences.
"""
experiences = self.retrieve_relevant_experiences(query)
prompt = f"""Make recommendations based on query and past learnings.
User query: {query}
User profile: {user_profile}
Lessons from similar past queries:
{self._format_experiences(experiences)}
Candidate items:
{self._format_items(candidates[:20])}
Apply the lessons learned to select and rank the best items.
Explain how past experiences informed your choices."""
response = self.llm.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=800,
messages=[{"role": "user", "content": prompt}]
)
return self._parse_recommendations(response.content[0].text)
Part VII: Production Considerations
Latency and Cost Management
LLMs are expensive and slow compared to traditional RecSys:
┌─────────────────────────────────────────────────────────────────────────┐
│ LATENCY & COST COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RECSYS: │
│ • Embedding lookup: ~1ms │
│ • ANN retrieval: ~5ms │
│ • Ranking model: ~10ms │
│ • Total: ~20ms │
│ • Cost: ~$0.0001 per request │
│ │
│ LLM-BASED RECSYS: │
│ • LLM API call: 500-2000ms │
│ • Multiple calls (agent): 2000-10000ms │
│ • Total: 1-10 seconds │
│ • Cost: $0.01-0.10 per request │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MITIGATION STRATEGIES: │
│ │
│ 1. HYBRID ARCHITECTURE │
│ Traditional model for fast retrieval + LLM for explanation │
│ LLM only for complex queries or high-value users │
│ │
│ 2. CACHING │
│ Cache LLM responses for similar queries │
│ Pre-compute explanations for popular items │
│ Semantic caching (similar queries → cached response) │
│ │
│ 3. SMALLER MODELS │
│ Use Haiku/small models for simple tasks │
│ Reserve large models for complex reasoning │
│ │
│ 4. ASYNC PROCESSING │
│ Show fast traditional recs immediately │
│ Enhance with LLM explanations async │
│ │
└─────────────────────────────────────────────────────────────────────────┘
class HybridRecommender:
"""
Hybrid system: fast traditional + smart LLM.
"""
def __init__(
self,
traditional_model,
llm_client,
cache,
llm_threshold: float = 0.7, # When to use LLM
):
self.traditional = traditional_model
self.llm = llm_client
self.cache = cache
self.llm_threshold = llm_threshold
async def recommend(
self,
user_id: str,
query: str = None,
context: dict = None,
) -> dict:
"""
Recommend with intelligent LLM usage.
"""
# Always start with fast traditional recommendations
traditional_recs = self.traditional.recommend(user_id, n=20)
# Decide if LLM is needed
needs_llm = self._should_use_llm(query, context)
if not needs_llm:
return {
"recommendations": traditional_recs,
"explanations": None,
"method": "traditional",
}
# Check cache first
cache_key = self._make_cache_key(user_id, query, traditional_recs)
cached = await self.cache.get(cache_key)
if cached:
return {**cached, "method": "cached_llm"}
# Use LLM to enhance/re-rank
enhanced = await self._llm_enhance(
traditional_recs, query, context
)
# Cache result
await self.cache.set(cache_key, enhanced, ttl=3600)
return {**enhanced, "method": "llm"}
def _should_use_llm(self, query: str, context: dict) -> bool:
"""Decide if LLM adds value for this request."""
# Use LLM for:
# - Natural language queries
# - Complex multi-criteria requests
# - Explanation requests
# - High-value user segments
if query and len(query.split()) > 3:
return True
if context and context.get("wants_explanation"):
return True
if context and context.get("user_tier") == "premium":
return True
return False
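The semantic-caching idea from the mitigation table deserves a sketch of its own: instead of exact key matching, a cache hit is defined by embedding similarity, so paraphrased queries reuse the same LLM response. A minimal synchronous version (the `embed` callable is an assumed sentence-embedding function returning unit-normalized vectors; the threshold is illustrative, and the async/TTL plumbing used by HybridRecommender is omitted):

import numpy as np

class SemanticCache:
    """
    Cache keyed by query embedding: a sufficiently similar past query
    counts as a hit, so "cheap trail runners" can reuse the response
    cached for "affordable trail running shoes".
    """
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # str -> np.ndarray (unit-normalized)
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, dict]] = []

    def get(self, query: str) -> dict | None:
        q = self.embed(query)
        for emb, value in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return value
        return None

    def set(self, query: str, value: dict) -> None:
        self.entries.append((self.embed(query), value))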
Evaluation Challenges
LLM recommendations are harder to evaluate:
import numpy as np

class LLMRecEvaluator:
    """
    Evaluation metrics for LLM-based recommendations.
    """
    def __init__(self, judge_llm, num_items: int):
        self.judge_llm = judge_llm   # LLM used to grade explanations
        self.num_items = num_items   # Catalog size, for coverage
def evaluate_offline(
self,
model,
test_data: list[dict],
) -> dict:
"""Standard offline evaluation."""
metrics = {
"hr@10": [],
"ndcg@10": [],
"coverage": set(),
"diversity": [],
}
for sample in test_data:
recs = model.recommend(
user_id=sample["user_id"],
history=sample["history"],
)
rec_ids = [r["id"] for r in recs[:10]]
# Hit rate
hit = sample["target"] in rec_ids
metrics["hr@10"].append(int(hit))
# NDCG
if hit:
rank = rec_ids.index(sample["target"])
ndcg = 1 / np.log2(rank + 2)
else:
ndcg = 0
metrics["ndcg@10"].append(ndcg)
# Coverage
metrics["coverage"].update(rec_ids)
# Diversity (intra-list)
diversity = self._compute_diversity(recs[:10])
metrics["diversity"].append(diversity)
return {
"hr@10": np.mean(metrics["hr@10"]),
"ndcg@10": np.mean(metrics["ndcg@10"]),
"coverage": len(metrics["coverage"]) / self.num_items,
"diversity": np.mean(metrics["diversity"]),
}
def evaluate_explanations(
self,
explanations: list[str],
items: list[dict],
user_profiles: list[dict],
) -> dict:
"""
Evaluate explanation quality.
"""
# Use LLM to judge explanation quality
scores = []
for exp, item, profile in zip(explanations, items, user_profiles):
prompt = f"""Rate this recommendation explanation:
Item: {item['title']}
User profile: {profile['summary']}
Explanation: {exp}
Rate 1-5 on:
1. Relevance: Does it address why this item fits the user?
2. Specificity: Does it mention specific features/preferences?
3. Accuracy: Is the reasoning sound?
4. Helpfulness: Would this help the user decide?
Format: {{"relevance": X, "specificity": X, "accuracy": X, "helpfulness": X}}"""
response = self.judge_llm.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
scores.append(parse_json(response.content[0].text))
return {
"relevance": np.mean([s["relevance"] for s in scores]),
"specificity": np.mean([s["specificity"] for s in scores]),
"accuracy": np.mean([s["accuracy"] for s in scores]),
"helpfulness": np.mean([s["helpfulness"] for s in scores]),
}
Part VIII: Future Directions
Emerging Research Areas
┌─────────────────────────────────────────────────────────────────────────┐
│ FUTURE OF LLM + RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MULTIMODAL RECOMMENDATIONS │
│ ────────────────────────────── │
│ • Image + text + behavior signals │
│ • "Find me something like this photo but cheaper" │
│ • Video understanding for content recommendations │
│ │
│ 2. REAL-TIME PERSONALIZATION │
│ ───────────────────────────── │
│ • LLMs that update beliefs within conversation │
│ • Streaming recommendations that adapt instantly │
│ • Edge-deployed small LLMs for latency │
│ │
│ 3. PRIVACY-PRESERVING LLM RECS │
│ ─────────────────────────────── │
│ • On-device processing of preferences │
│ • Federated learning for collaborative signals │
│ • Differential privacy for LLM fine-tuning │
│ │
│ 4. AUTONOMOUS SHOPPING AGENTS │
│ ─────────────────────────────── │
│ • Agents that browse, compare, and purchase │
│ • Multi-platform optimization │
│ • Negotiation and deal-finding │
│ │
│ 5. GENERATIVE ITEM CREATION │
│ ───────────────────────────── │
│ • "Generate a product that would appeal to users like X" │
│ • Personalized content generation │
│ • Dynamic bundle creation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part IX: LLM RecSys in Production (2024-2025)
Industry Deployments
Major tech companies have moved beyond research to deploy LLM-powered recommendations at scale:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM RECSYS IN PRODUCTION (2024-2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NETFLIX │
│ ───────── │
│ • UniCoRn: Unified contextual ranker for search + recommendations │
│ • FM-Intent: Predicts user intent AND next item simultaneously │
│ • Trace: Meta-optimization of rec pipelines with LLM agents │
│ • Conversational RS: Context-aware preference understanding │
│ │
│ SPOTIFY │
│ ───────── │
│ • Semantic IDs: Discretized embeddings added to LLaMA vocabulary │
│ • Domain-aware LLMs: Fine-tuned on catalog entities │
│ • Unified model: Combined search + recommendation retrieval │
│ • Use cases: Playlist sequencing, podcast recs, explanations │
│ │
│ AMAZON │
│ ───────── │
│ • Semantic IDs for product retrieval │
│ • 30% recall increase in beauty category │
│ • LLM-powered product descriptions and comparisons │
│ │
│ MICROSOFT │
│ ─────────── │
│ • RecAI: Open-source LLM4Rec research platform │
│ • InteRecAgent: LLMs + traditional RecSys integration │
│ • Copilot Shopping: Conversational commerce │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Netflix: From Static Models to LLM-Powered Personalization
Netflix has been at the forefront of LLM adoption for recommendations. Key insights from the Netflix PRS 2025 Workshop:
# Netflix's approach: unified model consolidation (illustrative sketch;
# UnifiedRanker and ContextEncoder are placeholders, not public Netflix code)
import torch.nn as nn

class NetflixUniCoRn:
"""
UniCoRn: Unified Contextual Ranker
Serves both search and recommendations with a single model.
"""
    def __init__(self, hidden_dim: int = 512):
# Single transformer model for multiple tasks
self.unified_model = UnifiedRanker()
# Task-specific heads
self.search_head = nn.Linear(hidden_dim, 1)
self.rec_head = nn.Linear(hidden_dim, 1)
# Context encoder (handles diverse signals)
self.context_encoder = ContextEncoder()
def rank(
self,
user_context: dict,
candidates: list[dict],
task: str, # "search" or "recommend"
) -> list[float]:
"""
Unified ranking for search and recommendations.
Key insight: Same user signals, same item features,
just different task heads.
"""
# Encode context (same for both tasks)
context_emb = self.context_encoder(user_context)
# Encode candidates
candidate_embs = self.item_encoder(candidates)
# Cross-attention
hidden = self.unified_model(context_emb, candidate_embs)
# Task-specific scoring
if task == "search":
scores = self.search_head(hidden)
else:
scores = self.rec_head(hidden)
return scores
# FM-Intent: Predict intent and item together
class FMIntent:
"""
Netflix's intent-aware recommendation.
Predicts WHAT user wants to do and WHICH item simultaneously.
"""
def predict(self, user_state: dict) -> tuple[str, list[dict]]:
"""
Returns:
intent: "browse", "search", "continue_watching", etc.
items: Recommended items for that intent
"""
# Joint prediction of intent and items
# Not sequential (intent → items) but parallel
pass
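FM-Intent's internals aren't public; as a rough sketch, "joint prediction" can be read as a shared user representation feeding two heads trained together rather than intent being predicted first and items second. All names and dimensions below are illustrative:

import torch
import torch.nn as nn

class JointIntentItemHead(nn.Module):
    """
    Illustrative joint head: one shared user representation, two losses
    optimized together (intent and next item in parallel, not in sequence).
    """
    def __init__(self, hidden_dim: int, num_intents: int, num_items: int):
        super().__init__()
        self.intent_head = nn.Linear(hidden_dim, num_intents)
        self.item_head = nn.Linear(hidden_dim, num_items)

    def forward(self, user_repr: torch.Tensor):
        return self.intent_head(user_repr), self.item_head(user_repr)

    def loss(self, user_repr, intent_labels, item_labels, alpha: float = 0.5):
        intent_logits, item_logits = self(user_repr)
        return (alpha * nn.functional.cross_entropy(intent_logits, intent_labels)
                + (1 - alpha) * nn.functional.cross_entropy(item_logits, item_labels))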
Netflix key learnings:
- Model consolidation: Fewer specialized models, more unified architectures
- LLMs for meta-optimization: Trace uses LLM agents to optimize recommendation pipelines
- Periodic fine-tuning + RAG: Keeps models fresh without constant retraining
Spotify: Domain-Aware LLMs with Semantic IDs
Spotify's approach makes LLMs "domain-aware" by grounding them in catalog knowledge:
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpotifyDomainLLM:
"""
Spotify's approach: Add catalog knowledge to LLM vocabulary.
"""
def __init__(self, base_llm: str = "llama-3-8b"):
self.llm = AutoModelForCausalLM.from_pretrained(base_llm)
self.tokenizer = AutoTokenizer.from_pretrained(base_llm)
# Semantic tokenization of catalog entities
self.semantic_tokenizer = SemanticTokenizer()
def add_catalog_to_vocabulary(self, catalog: list[dict]):
"""
Convert catalog entities to semantic IDs and add to vocabulary.
Process:
1. Encode entities (artists, tracks, podcasts) with embeddings
2. Discretize embeddings via LSH into "semantic tokens"
3. Add semantic tokens to LLM vocabulary
4. Fine-tune LLM on recommendation tasks
"""
for entity in catalog:
# Get embedding from content encoder
embedding = self.content_encoder(entity)
# Discretize to semantic ID (e.g., 4-8 tokens)
semantic_id = self.semantic_tokenizer.encode(embedding)
# Add to vocabulary with special prefix
token_str = f"<{entity['type']}:{semantic_id}>"
self.tokenizer.add_tokens([token_str])
# Resize model embeddings
self.llm.resize_token_embeddings(len(self.tokenizer))
def recommend_with_instructions(
self,
user_history: list[str], # Semantic IDs of past interactions
instruction: str, # e.g., "Create an upbeat workout playlist"
) -> list[str]:
"""
Generate recommendations that follow user instructions.
Unique capability: Steerable recommendations via natural language.
"""
prompt = f"""User's listening history:
{' '.join(user_history[-20:])}
Instruction: {instruction}
Generate a sequence of recommended tracks:"""
output = self.llm.generate(prompt, max_tokens=100)
return self._parse_semantic_ids(output)
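The "discretize via LSH" step can be sketched with random hyperplanes: each sign bit of a projection becomes part of the code, so nearby embeddings collide into the same semantic tokens. A toy illustration under that assumption, not Spotify's actual tokenizer:

import numpy as np

class SemanticTokenizer:
    """
    Toy LSH tokenizer: random hyperplanes turn an embedding into a short
    code; similar embeddings share codes, so semantic IDs group entities.
    """
    def __init__(self, emb_dim: int, bits_per_token: int = 8,
                 num_tokens: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One hyperplane per bit, grouped into num_tokens chunks
        self.planes = rng.normal(size=(num_tokens, bits_per_token, emb_dim))

    def encode(self, embedding: np.ndarray) -> str:
        tokens = []
        for chunk in self.planes:
            bits = (chunk @ embedding > 0).astype(int)
            tokens.append(str(int("".join(map(str, bits)), 2)))
        return "-".join(tokens)  # e.g. "142-7-201-33"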
Spotify use cases enabled:
- Playlist sequencing with coherent flow
- Cold-start video recommendations
- Personalized podcast discovery
- Natural language recommendation explanations
- Unified search + recommendation
Key Frameworks and Tools
InteRecAgent (Microsoft, TOIS 2025)
InteRecAgent bridges LLMs and traditional recommenders:
class InteRecAgent:
"""
InteRecAgent: LLM as brain, RecSys as tools.
Paper: https://dl.acm.org/doi/10.1145/3731446
"""
def __init__(self, llm_client, rec_tools: dict):
self.llm = llm_client
# Traditional RecSys models as tools
self.tools = {
"collaborative_filter": rec_tools["cf_model"],
"content_based": rec_tools["content_model"],
"popularity": rec_tools["popularity_model"],
"search": rec_tools["search_index"],
}
# Memory for conversation state
self.memory = ConversationMemory()
# Task planner
self.planner = TaskPlanner(llm_client)
async def interact(self, user_message: str, user_id: str) -> str:
"""
Interactive recommendation through conversation.
LLM decides which tools to use and how to combine results.
"""
# Plan tasks based on user message
tasks = await self.planner.plan(user_message, self.memory)
results = {}
for task in tasks:
if task.type == "get_recommendations":
results["recs"] = self.tools["collaborative_filter"].recommend(
user_id, n=task.params.get("n", 10)
)
elif task.type == "search":
results["search"] = self.tools["search"].search(
task.params["query"]
)
elif task.type == "explain":
results["explanation"] = await self._generate_explanation(
results.get("recs", [])
)
# Synthesize response
response = await self._synthesize_response(results, user_message)
# Update memory
self.memory.add(user_message, response)
return response
InteRecAgent benefits:
- Traditional RecSys handles behavioral patterns efficiently
- LLM handles natural language understanding and explanation
- Modular: Can upgrade either component independently
TALLRec (RecSys 2023)
TALLRec provides a tuning framework for aligning LLMs with recommendations:
# TALLRec: Two-stage tuning for recommendation LLMs
from transformers import AutoModelForCausalLM, AutoTokenizer

class TALLRecTrainer:
"""
TALLRec tuning framework.
Stage 1: Instruction tuning (general capability)
Stage 2: Recommendation tuning (domain-specific)
"""
def __init__(self, base_model: str = "llama-7b"):
self.model = AutoModelForCausalLM.from_pretrained(base_model)
self.tokenizer = AutoTokenizer.from_pretrained(base_model)
def stage1_instruction_tuning(self, instruction_data: list[dict]):
"""
Stage 1: General instruction following.
Uses Stanford Alpaca or similar data.
"""
# Standard instruction tuning
for example in instruction_data:
prompt = f"Instruction: {example['instruction']}\nResponse:"
target = example['response']
# Train with cross-entropy loss
pass
def stage2_recommendation_tuning(self, rec_data: list[dict]):
"""
Stage 2: Recommendation-specific tuning.
Teaches the model to recommend items.
"""
# Recommendation-specific prompts
for example in rec_data:
prompt = f"""User has interacted with: {example['history']}
Based on this history, recommend the next item."""
target = example['next_item']
# Train with cross-entropy loss
pass
def create_rec_prompt(self, history: list[str], task: str) -> str:
"""Create recommendation prompt in TALLRec format."""
templates = {
"sequential": "Given the user's history: {history}, predict the next item.",
"rating": "How would this user rate {item}? History: {history}",
"explanation": "Explain why {item} is recommended given: {history}",
}
return templates[task].format(history=", ".join(history))
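Both stages above reduce to the same causal-LM training step. A minimal sketch of that step (TALLRec itself trains LoRA adapters rather than full weights, which this sketch omits):

import torch

def training_step(model, tokenizer, prompt: str, target: str, optimizer) -> float:
    """
    One cross-entropy step: loss is computed only on the target tokens,
    with prompt tokens masked out via the -100 label convention.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # Ignore loss on the prompt

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()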
MSRBench: Evaluating LVLMs for Recommendations
MSRBench (ACM Web Conference 2025) provides the first comprehensive benchmark for Large Vision-Language Models in multimodal sequential recommendation:
class MSRBenchEvaluator:
"""
MSRBench: Benchmark for LVLMs in recommendation.
Tests GPT-4V, GPT-4o, Claude-3-Opus on next-item prediction.
"""
# Integration strategies tested
STRATEGIES = [
"lvlm_as_recommender", # Direct recommendation
"lvlm_as_item_enhancer", # Generate item descriptions
"lvlm_as_reranker", # Rerank traditional candidates
"hybrid_enhance_rerank", # Combination
]
    def evaluate(
        self,
        model: str,  # "gpt-4-vision", "gpt-4o", "claude-3-opus"
        users: list[dict],
        dataset: str = "amazon_review_plus",
    ) -> dict:
        """
        Evaluate an LVLM on next-item prediction with images,
        across every integration strategy.
        """
        results = {}
        for strategy in self.STRATEGIES:
            if strategy == "lvlm_as_reranker":
                # Best performing strategy:
                # traditional model retrieves, LVLM reranks
                reranked = [
                    self.lvlm_rerank(
                        model, user, self.traditional_model.retrieve(user, k=100)
                    )
                    for user in users
                ]
                results[strategy] = self.compute_metrics(reranked)
        return results
def lvlm_rerank(
self,
model: str,
user_context: dict,
candidates: list[dict],
) -> list[dict]:
"""
Use LVLM to rerank candidates based on images + text.
"""
prompt = f"""Given this user's recent purchases:
{self._format_history_with_images(user_context['history'])}
Rank these candidate items by relevance:
{self._format_candidates_with_images(candidates)}
Return ranked item IDs."""
response = self.call_lvlm(model, prompt)
return self._parse_ranking(response)
MSRBench key findings:
- Using LVLMs as rerankers is the most effective strategy
- GPT-4o consistently outperforms GPT-4V and Claude-3-Opus
- Computational cost remains a barrier to real-time adoption
- Multimodal context significantly improves cold-start performance
RecSys 2025 Best Paper Insights
The RecSys 2025 Best Paper focused on conformal risk control for mitigating unwanted recommendations—a key concern as LLMs generate more creative outputs.
Key 2025 research themes:
- Fine-tuning + RAG combination: Keeps models fresh without constant retraining
- LLM agents for pipeline optimization: Meta-level improvements
- Multimodal integration: Images, video, audio in recommendations
- Scalability solutions: Efficient LLM serving for real-time recommendations
Part X: Prompt Engineering for Recommendations
Why Prompting Matters for RecSys
The quality of LLM-powered recommendations depends heavily on how you structure prompts. Unlike traditional ML where the model architecture determines capability, LLMs can perform radically different tasks based on prompt design. A well-crafted prompt can mean the difference between generic suggestions and personalized, actionable recommendations.
The fundamental insight: The same LLM with different prompts produces vastly different recommendation quality. Prompts determine what user context the LLM considers, how it reasons about preferences, and whether outputs are reliable enough for production use.
Five key dimensions of recommendation prompts:
- Context Framing: How you present user history and preferences. Recency, relevance, and diversity of context all matter. Dumping the entire history is counterproductive; selective context yields better results.
- Task Specification: What exactly you want the LLM to do. "Recommend items" is vague. "Select 5 items under $100 that match their casual style preferences" is actionable.
- Output Structure: The format for reliable parsing. Free-text responses are hard to use programmatically. JSON arrays of item IDs integrate cleanly with downstream systems.
- Reasoning Guidance: Whether to encourage chain-of-thought. For complex recommendations, asking the LLM to first analyze preferences, then match candidates, improves quality and provides explainability.
- Constraints: Guardrails on what can and cannot be recommended. In-stock items only, price limits, excluded categories, and valid item ID lists prevent hallucination.
Core Prompt Patterns
Pattern 1: Direct Recommendation
The simplest pattern: provide context, request recommendations, specify format. Best for fast recommendations when you have a good candidate set. Structure the prompt with clear sections: user history (most recent first), candidate items (with IDs, titles, categories, prices), the task (select exactly N items), and output format (JSON array of item IDs).
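A minimal sketch of this pattern, assuming hypothetical item dictionaries with id, title, category, and price fields:

# Pattern 1 sketch: direct recommendation prompt (hypothetical data model)
def build_direct_prompt(history: list[dict], candidates: list[dict], n: int = 5) -> str:
    """Assemble user context, candidates, task, and output format."""
    history_lines = "\n".join(
        f"- {h['title']} ({h['category']})" for h in history  # most recent first
    )
    candidate_lines = "\n".join(
        f"- id={c['id']} | {c['title']} | {c['category']} | ${c['price']}"
        for c in candidates
    )
    return f"""User's recent interactions (most recent first):
{history_lines}

Candidate items:
{candidate_lines}

Task: Select exactly {n} items from the candidates above that best match
the user's preferences.

Output format: a JSON array of item IDs, e.g. ["id1", "id2"]."""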
Pattern 2: Chain-of-Thought Recommendation
Encourage explicit reasoning for better recommendations and explainability. Structure the prompt to guide step-by-step analysis: first identify patterns in user history (categories, price range, brands, time patterns), then understand current intent (browsing vs buying, new interest vs continuing pattern), then match candidates (explain fit for each), and finally provide ranked recommendations with confidence scores.
This pattern is more expensive (more tokens) but produces higher-quality recommendations for complex queries and provides reasoning that can be shown to users or used for debugging.
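One way to encode that guidance is a single template string; the four steps below mirror the structure just described, and the {history} and {candidates} placeholders are illustrative:

# Pattern 2 sketch: chain-of-thought recommendation prompt
# ({history} and {candidates} are filled in by the caller)
COT_TEMPLATE = """Analyze this user step by step, then recommend items.

User history (most recent first):
{history}

Candidate items:
{candidates}

Step 1 - Preference patterns: identify categories, price range, brands,
and time patterns in the history.
Step 2 - Current intent: is the user browsing or buying? Pursuing a new
interest or continuing an existing pattern?
Step 3 - Candidate matching: for each candidate, explain how well it fits
the patterns and the intent.
Step 4 - Final answer: a ranked list of item IDs, each with a confidence
score between 0 and 1."""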
Pattern 3: Persona-Based Prompting
Assign the LLM a specific expert persona for domain-specific recommendations. A fashion recommendation prompt might begin: "You are a personal stylist with 15 years of experience at luxury fashion houses. You understand body types, color theory, occasion dressing, and current trends."
Different domains benefit from different personas—a sommelier for wine, a tech reviewer for electronics, a literary curator for books. The persona shapes the recommendation style, vocabulary, and what factors the LLM emphasizes.
Pattern 4: Few-Shot Learning
Show examples of good recommendations to guide the model's output style. Include 2-3 examples showing: user history summary, user query, recommended item, and explanation. Then present the current task in the same format. This is particularly effective for maintaining consistent tone and reasoning depth across your recommendation system.
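A sketch of the few-shot structure; the two examples are invented placeholders standing in for curated ones:

# Pattern 4 sketch: few-shot prompt (the examples are invented placeholders)
FEW_SHOT_PROMPT = """You recommend products with a one-sentence explanation.

Example 1:
History: trail-running shoes, hydration vest
Query: "something for longer runs"
Recommendation: energy gel variety pack
Why: complements the user's existing endurance-running gear.

Example 2:
History: cast-iron skillet, chef's knife
Query: "level up my cooking"
Recommendation: instant-read thermometer
Why: a natural next tool for precision home cooking.

Now the real task, in the same format:
History: {history}
Query: {query}
Recommendation:"""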
Optimization Techniques
Dynamic Context Selection: Not all user history is equally relevant. For a query about running shoes, recent athletic wear purchases matter more than a book bought last year. Select context based on recency, relevance to current query (via embedding similarity or keyword matching), and diversity (include variety of categories to capture full preference profile). Typically 10-20 carefully selected interactions outperform hundreds of undifferentiated history items.
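A minimal sketch of this selection logic, assuming a caller-supplied embed function that maps text to a unit-normalized vector:

# Sketch: pick the most relevant slice of history for the current query
import numpy as np

def select_context(
    query: str,
    history: list[dict],   # each interaction: {"title": str, "timestamp": float}
    embed,                 # assumed: maps text -> unit-normalized np.ndarray
    k: int = 15,
    recency_weight: float = 0.3,
) -> list[dict]:
    """Score interactions by similarity to the query, with a recency boost."""
    q = embed(query)
    newest_first = sorted(history, key=lambda h: h["timestamp"], reverse=True)
    scored = []
    for rank, item in enumerate(newest_first):
        relevance = float(np.dot(q, embed(item["title"])))
        recency = 1.0 / (1 + rank)  # newest interaction gets the largest boost
        scored.append((relevance + recency_weight * recency, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]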
Output Constraints and Validation: The most critical technique for production systems. Constrain the LLM to ONLY recommend from a provided list of valid item IDs. Specify constraints explicitly: maximum price, allowed categories, excluded brands, in-stock only. After receiving the response, always validate that returned IDs exist in your catalog—never trust LLM output without verification.
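A sketch of that validation step, assuming the LLM was asked for a JSON array of item IDs:

# Sketch: validate LLM output against the catalog before serving it
import json

def validate_recommendations(
    raw_response: str,
    valid_ids: set[str],
    n_expected: int,
) -> list[str]:
    """Parse the response and keep only IDs that exist in the catalog."""
    try:
        ids = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # caller falls back to a traditional recommender
    if not isinstance(ids, list):
        return []
    verified = [i for i in ids if isinstance(i, str) and i in valid_ids]
    return verified[:n_expected]  # drop hallucinated and excess IDs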
Temperature for Diversity: Lower temperature (0.3) produces focused, consistent recommendations—good for "more like this" scenarios. Higher temperature (0.9-1.0) produces more creative, unexpected suggestions—good for discovery. For most use cases, balanced temperature (0.6-0.7) provides a mix of safe bets and discoveries.
Multi-Sample Aggregation: For discovery-focused recommendations, generate multiple recommendation sets with high temperature and aggregate. Items appearing in multiple samples are more robust recommendations. Items appearing in only one sample are more exploratory.
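A sketch of the aggregation logic, assuming a generate_recs callable that returns one sampled list of item IDs per call:

# Sketch: aggregate multiple high-temperature samples
from collections import Counter

def aggregate_samples(generate_recs, n_samples: int = 5) -> tuple[list[str], list[str]]:
    """Split sampled items into robust (repeated) and exploratory (one-off)."""
    counts = Counter()
    for _ in range(n_samples):
        counts.update(set(generate_recs()))  # dedupe within each sample
    robust = [item for item, c in counts.most_common() if c >= 2]
    exploratory = [item for item, c in counts.items() if c == 1]
    return robust, exploratory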
Versioned Prompt Templates
Production systems need tested, versioned prompt templates for different scenarios:
- Quick suggestions: Fast, low-token prompts for homepage recommendations. Temperature 0.5, max 100 tokens.
- Detailed recommendation: Full context, chain-of-thought, explanations. Temperature 0.7, max 1000 tokens.
- Cold start: For new users with no history. Focus on stated interests and popular items. Temperature 0.6.
- Explanation only: Generate explanations for recommendations made by traditional models. Temperature 0.5, max 150 tokens.
Version your templates, track which versions are in production, and A/B test changes. Prompt engineering is iterative—small wording changes can significantly impact recommendation quality.
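One way to organize this is a small registry keyed by scenario and version; the names, template strings, and parameter values below are illustrative:

# Sketch: versioned prompt templates keyed by (scenario, version)
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str
    temperature: float
    max_tokens: int

TEMPLATES = {
    ("quick_suggestions", "v3"): PromptTemplate(
        name="quick_suggestions", version="v3",
        template="Recommend {n} items for a user who recently viewed: "
                 "{history}. Return a JSON array of item IDs.",
        temperature=0.5, max_tokens=100,
    ),
    ("detailed", "v2"): PromptTemplate(
        name="detailed", version="v2",
        template="Analyze this user's history step by step, then recommend "
                 "{n} items with explanations. History: {history}. "
                 "Candidates: {candidates}.",
        temperature=0.7, max_tokens=1000,
    ),
}

def get_template(name: str, version: str) -> PromptTemplate:
    """Look up a pinned version; production config decides which is live."""
    return TEMPLATES[(name, version)]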
Common Mistakes to Avoid
┌─────────────────────────────────────────────────────────────────────────┐
│ COMMON PROMPTING MISTAKES IN RECSYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. TOO MUCH CONTEXT │
│ ───────────────────── │
│ ✗ "Here's the user's entire 3-year history..." │
│ ✓ Select 10-20 most relevant recent interactions │
│ │
│ 2. VAGUE INSTRUCTIONS │
│ ─────────────────────── │
│ ✗ "Recommend some good items" │
│ ✓ "Recommend 5 items matching their style preferences, under $100" │
│ │
│ 3. NO OUTPUT FORMAT │
│ ───────────────────── │
│ ✗ "Give me your recommendations" │
│ ✓ "Return a JSON array of item IDs: [\"id1\", \"id2\", ...]" │
│ │
│ 4. ALLOWING HALLUCINATION │
│ ─────────────────────────── │
│ ✗ "Recommend items for this user" │
│ ✓ "Recommend ONLY from this list: [item_id_1, item_id_2, ...]" │
│ │
│ 5. IGNORING CONSTRAINTS │
│ ───────────────────────── │
│ ✗ Generic recommendations regardless of availability │
│ ✓ Specify: in_stock, max_price, excluded_categories │
│ │
│ 6. ONE-SIZE-FITS-ALL │
│ ────────────────────── │
│ ✗ Same prompt for all recommendation scenarios │
│ ✓ Different templates for: quick, detailed, cold_start, explanation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Related Articles
Recommendation Systems: From Collaborative Filtering to Deep Learning
A comprehensive journey through recommendation system architectures. From the Netflix Prize and matrix factorization to neural collaborative filtering and two-tower models—understand the foundations before the transformer revolution.
Transformers for Recommendation Systems: From SASRec to HSTU
A comprehensive deep dive into transformer-based recommendation systems. From the fundamentals of sequential recommendation to Meta's trillion-parameter HSTU, understand how attention mechanisms revolutionized personalization.
Building Agentic AI Systems: A Complete Implementation Guide
A comprehensive guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
LLM-Powered Search for E-Commerce: Beyond NER and Elasticsearch
A deep dive into building intelligent e-commerce search systems that understand natural language, leverage metadata effectively, and support multi-turn conversations—moving beyond classical NER + Elasticsearch approaches.
Structured Outputs and Tool Use: Patterns for Reliable AI Applications
Master structured output generation and tool use patterns—JSON mode, schema enforcement, Instructor library, function calling best practices, error handling, and production patterns for reliable AI applications.
Embedding Models & Strategies: Choosing and Optimizing Embeddings for AI Applications
Comprehensive guide to embedding models for RAG, search, and AI applications. Comparison of text-embedding-3, BGE, E5, Cohere Embed v4, and Voyage with guidance on fine-tuning, dimensionality, multimodal embeddings, and production optimization.
Advanced Prompt Engineering: From Basic to Production-Grade
Master the techniques that separate amateur prompts from production systems—chain-of-thought, structured outputs, model-specific optimization, and prompt architecture.