LLM Personalization: Building AI That Adapts to Individual Users
A clear walkthrough of personalizing Large Language Models. From memory architectures to preference learning, learn how to build AI systems that truly adapt to individual users, and what challenges remain.
The Personalization Imperative
LLMs are fundamentally stateless. Each conversation starts fresh—no memory of past interactions, no understanding of who you are. This "conversational amnesia" is the single biggest barrier to truly useful AI assistants.
Consider what happens today: you explain your coding style to ChatGPT. Next session, you explain it again. You tell Claude about your project architecture. Tomorrow, you start over. This isn't just inconvenient—it's a fundamental limitation that prevents AI from becoming genuinely helpful over time.
The shift from generic AI to personalized AI represents the next major evolution in how we interact with language models. By 2026, personalization has moved from research curiosity to production necessity.
┌─────────────────────────────────────────────────────────────────────────┐
│ GENERIC AI vs PERSONALIZED AI │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERIC AI (Stateless): │
│ ──────────────────────── │
│ │
│ User → [Query] → LLM → [Generic Response] │
│ │
│ • Same response for everyone │
│ • No memory across sessions │
│ • Requires re-explaining context │
│ • Cannot learn from interactions │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PERSONALIZED AI (Adaptive): │
│ ─────────────────────────── │
│ │
│ User → [Query + Memory + Profile] → LLM → [Tailored Response] │
│ │
│ • Adapts to individual preferences │
│ • Maintains context across sessions │
│ • Learns and improves over time │
│ • Anticipates needs proactively │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE BUSINESS CASE: │
│ │
│ • 40-70% higher user retention with remembered preferences │
│ • Fewer clarification questions needed │
│ • 3.7x ROI on GenAI investments (McKinsey 2025) │
│ • 80% of enterprises increasing AI investment through 2026 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2025-2026 State of the Art: Two comprehensive arXiv surveys, one analyzing progress in personalized LLMs and another formalizing the foundations of personalization, have established the field's theoretical framework. Meanwhile, every major AI provider has shipped memory features: ChatGPT, Claude, and Gemini all now remember users across sessions.
Part I: The Taxonomy of LLM Personalization
Three Technical Approaches
Research has converged on three primary methods for personalizing LLMs, each with distinct tradeoffs:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM PERSONALIZATION APPROACHES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. PROMPTING FOR PERSONALIZATION │
│ ────────────────────────────────── │
│ │
│ Inject user context into system prompts │
│ │
│ Methods: │
│ • User profile injection ("You're helping a senior Python developer") │
│ • Conversation history summarization │
│ • Retrieved preferences from memory store │
│ • Dynamic context assembly │
│ │
│ Pros: No training required, flexible, immediate │
│ Cons: Context window limits, prompt engineering complexity │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. FINE-TUNING FOR PERSONALIZATION │
│ ──────────────────────────────────── │
│ │
│ Adapt model weights to individual users │
│ │
│ Methods: │
│ • LoRA adapters per user or user cluster │
│ • Federated fine-tuning across devices │
│ • Continual learning from interactions │
│ • Personalized reward models │
│ │
│ Pros: Deep behavioral adaptation, persistent │
│ Cons: Expensive, catastrophic forgetting, storage at scale │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. ALIGNMENT FOR PERSONALIZATION │
│ ────────────────────────────────── │
│ │
│ Learn user preferences through feedback │
│ │
│ Methods: │
│ • Personalized RLHF (P-RLHF) │
│ • Direct Preference Optimization with user signal │
│ • Implicit preference extraction │
│ • User-specific reward models │
│ │
│ Pros: Captures nuanced preferences, principled approach │
│ Cons: Requires feedback data, alignment tax │
│ │
└─────────────────────────────────────────────────────────────────────────┘
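The prompting approach above amounts to a context-assembly step before each request. A minimal sketch, assuming an invented profile schema and helper name (this is not any specific provider's API):

```python
# Sketch of prompt-based personalization: assemble a system prompt from
# a user profile plus preferences retrieved from a memory store.
# The profile fields and function name are illustrative assumptions.

def build_personalized_system_prompt(profile: dict, memories: list[str]) -> str:
    """Assemble a system prompt from profile facts and retrieved memories."""
    lines = ["You are a helpful assistant. Adapt to this user:"]
    for key, value in profile.items():
        lines.append(f"- {key}: {value}")
    if memories:
        lines.append("Relevant facts from past sessions:")
        lines.extend(f"- {m}" for m in memories)
    return "\n".join(lines)

prompt = build_personalized_system_prompt(
    {"role": "senior Python developer", "style": "concise, code-first"},
    ["Prefers type hints", "Works on a FastAPI codebase"],
)
```

The resulting string is simply prepended as the system message; no training is involved, which is exactly why this approach is flexible but bounded by the context window.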
The Personalization Granularity Spectrum
Not all personalization is equal. Systems operate at different levels:
| Granularity | Description | Example |
|---|---|---|
| Population | Same model for everyone | Base GPT-4 |
| Segment | Models per user group | Enterprise vs Consumer Claude |
| Cohort | Models per user cluster | Power users vs casual users |
| Individual | Per-user adaptation | ChatGPT with your memory |
| Contextual | Per-session adaptation | Different style for work vs personal |
Production systems typically combine levels—population-level base model, segment-level fine-tuning, individual-level prompt personalization.
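The layered combination can be made concrete as a small resolution step per request. A sketch under invented names (the adapter IDs and field names are illustrative, not a real system's config):

```python
# Sketch of combining granularity levels: one shared base model,
# a segment-level adapter, and individual/contextual prompt layers.
# All identifiers here are illustrative assumptions.

SEGMENT_ADAPTERS = {"enterprise": "lora-enterprise-v2", "consumer": "lora-consumer-v2"}

def resolve_personalization(user: dict) -> dict:
    """Pick the segment adapter, then layer individual and session context."""
    return {
        "base_model": "base-llm",                         # population level
        "adapter": SEGMENT_ADAPTERS[user["segment"]],     # segment level
        "profile_prompt": user.get("profile", ""),        # individual level
        "session_style": user.get("context", "default"),  # contextual level
    }

config = resolve_personalization(
    {"segment": "enterprise", "profile": "prefers formal tone", "context": "work"}
)
```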
Part II: Memory Architectures in Production
The Cognitive Science Foundation
Effective LLM memory mirrors human cognition. The dual-memory model from cognitive psychology—distinguishing episodic memory (personal experiences) from semantic memory (general knowledge)—maps directly to LLM personalization:
┌─────────────────────────────────────────────────────────────────────────┐
│ COGNITIVE MEMORY MODEL FOR LLMs │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EPISODIC MEMORY SEMANTIC MEMORY │
│ ─────────────────── ───────────────── │
│ │
│ "What happened" "What I know about you" │
│ │
│ • Specific conversations • User preferences │
│ • Past interactions • Beliefs and opinions │
│ • Session histories • Working style │
│ • Tool usage patterns • Domain expertise │
│ │
│ Implementation: Implementation: │
│ • Conversation logs • Extracted facts │
│ • RAG over chat history • User profile store │
│ • Recency-based retrieval • Fine-tuned adapters │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE PRIME FRAMEWORK (EMNLP 2025): │
│ │
│ Integrates both memory types with "personalized thinking" │
│ │
│ User Input → Episodic Retrieval → Semantic Context → Personalized │
│ (recent history) (user beliefs) Thinking → │
│ Response │
│ │
│ Key insight: Generic chain-of-thought can HURT personalization. │
│ Model needs to reason through user's lens, not generic lens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PRIME (arXiv:2507.04607) from University of Michigan demonstrates that combining episodic and semantic memory with personalized reasoning significantly outperforms single-memory approaches. The key finding: LLMs must learn to think in a personalized way, not just retrieve personalized context.
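The dual-memory flow can be sketched in a few lines: retrieve relevant episodes, attach semantic facts, then instruct the model to reason through the user's lens. Retrieval here is naive keyword overlap standing in for embedding search, and all helper names are illustrative, not PRIME's actual implementation:

```python
# Toy sketch of the episodic + semantic + personalized-thinking pipeline.
# A production system would use embedding-based retrieval, not word overlap.

def retrieve_episodes(query: str, episodes: list[str], k: int = 2) -> list[str]:
    """Rank past conversation snippets by word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(episodes, key=lambda e: -len(words & set(e.lower().split())))
    return scored[:k]

def build_personalized_reasoning_prompt(query, episodes, semantic_facts):
    """Combine both memory types and ask for user-lens reasoning."""
    relevant = retrieve_episodes(query, episodes)
    return (
        f"Known about this user: {'; '.join(semantic_facts)}\n"
        f"Recent relevant history: {'; '.join(relevant)}\n"
        f"Reason from THIS user's perspective, then answer: {query}"
    )
```

The final instruction line is the point: per the PRIME finding, the model is steered to reason in a personalized way rather than merely being handed personalized context.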
ChatGPT's Memory Architecture
OpenAI's approach, reverse-engineered by security researchers, reveals a four-layer system:
┌─────────────────────────────────────────────────────────────────────────┐
│ CHATGPT MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: USER PROFILE MEMORY │
│ ───────────────────────────────── │
│ Persistent facts: name, preferences, demographics │
│ Storage: Explicit "saved memories" + extracted facts │
│ Priority: HIGHEST (always in context) │
│ │
│ Layer 2: CONVERSATION HISTORY │
│ ───────────────────────────────── │
│ Complete logs of past interactions │
│ As of April 2025: References ALL past chats (paid users) │
│ Free users: "Lightweight" recent-only memory │
│ │
│ Layer 3: EXTRACTED KNOWLEDGE │
│ ───────────────────────────────── │
│ Unstructured → Structured transformation │
│ Patterns, preferences, recurring topics │
│ Pre-computed summaries for efficiency │
│ │
│ Layer 4: ACTIVE CONTEXT │
│ ───────────────────────────────── │
│ Currently relevant memories for this session │
│ Dynamic selection based on query │
│ When limit hit: Recent messages trimmed, profile stays │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY DESIGN CHOICE: │
│ Long-term personalization > Short-term context │
│ Profile preserved even when conversation truncated │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The architecture is simpler than many expected—no complex vector databases or semantic search. OpenAI pre-computes summaries and injects them directly. This trades sophistication for reliability and latency.
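The "profile stays, recent messages trimmed" design choice can be sketched as a budget-filling step. Word counts stand in for token counts, and the function is an illustration of the priority rule, not OpenAI's code:

```python
# Sketch of priority-aware context assembly: the profile is always
# included; recent messages fill the remaining budget, newest first.
# Word-count "tokens" are a simplifying assumption.

def assemble_context(profile: str, messages: list[str], budget: int) -> list[str]:
    """Profile first, then as many recent messages as fit the budget."""
    used = len(profile.split())
    kept: list[str] = []
    for msg in reversed(messages):          # walk newest to oldest
        cost = len(msg.split())
        if used + cost > budget:
            break                           # older messages get trimmed
        kept.append(msg)
        used += cost
    return [profile] + list(reversed(kept))  # restore chronological order

ctx = assemble_context(
    "User: senior dev, prefers brevity",
    ["old long message about setup details here", "short follow up", "latest question"],
    budget=12,
)
```

When the budget is hit, only the oldest messages drop out; the profile survives every truncation, which is the long-term-over-short-term tradeoff described above.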
Claude's Memory System
Anthropic took a different approach with Claude Memory, favoring transparency over complexity:
File-Based Architecture: Memory stored in plain Markdown files (CLAUDE.md) rather than opaque databases. Users can read, edit, and understand exactly what Claude remembers.
Hierarchical Structure:
- Enterprise-level policies
- Team-level standards
- Project-level context
- User-level preferences
Client-Side Operations: Memory tool operates locally—Anthropic doesn't store your memories on their servers. The agent makes tool calls, your application executes them.
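The client-side pattern looks roughly like this: the model emits a memory tool call, and the host application applies it to a local Markdown file. The tool-call schema below is invented for illustration and is not Anthropic's actual API shape:

```python
# Sketch of a host-side handler for model-issued memory operations
# against a local Markdown file. The "command"/"text" schema is an
# illustrative assumption, not the real tool interface.
import tempfile
from pathlib import Path

def apply_memory_tool_call(call: dict, memory_file: Path) -> str:
    """Execute a model-issued memory operation locally."""
    if call["command"] == "read":
        return memory_file.read_text() if memory_file.exists() else ""
    if call["command"] == "append":
        with memory_file.open("a") as f:
            f.write(f"- {call['text']}\n")
        return "ok"
    raise ValueError(f"unknown command: {call['command']}")

# Demo against a throwaway file:
memory_path = Path(tempfile.mkdtemp()) / "CLAUDE.md"
apply_memory_tool_call({"command": "append", "text": "User prefers tabs"}, memory_path)
contents = apply_memory_tool_call({"command": "read"}, memory_path)
```

Because the file lives on the client, the user can open `CLAUDE.md` in any editor to audit or correct what the model has stored.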
Claude Opus 4.5 Memory Advances (November 2025): The latest Claude models dramatically improved memory capabilities:
- Endless Chat: When conversations hit context limits, the model automatically compresses memory without disrupting flow or notifying users
- Cross-File Context: Better leveraging memory to maintain consistency across multiple files
- Multi-Agent Coordination: Opus can manage multiple Haiku-powered sub-agents, maintaining context across the orchestration
- Proactive Memory Files: When given file access, Opus 4.5 creates and maintains memory files autonomously—in one test, it played Pokémon Red by generating its own navigation guide mid-game
As Anthropic's head of product noted: "Context windows are not going to be sufficient by themselves. Knowing the right details to remember is really important."
Limitation: The file-based approach can introduce a "fading memory" phenomenon. As CLAUDE.md grows large and monolithic, the model struggles to pinpoint relevant information in the massive context block. Best practice: keep memory files minimal, store project-specific knowledge in separate documentation.
Gemini's Personal Context
Google's approach emphasizes automatic learning from interactions:
- Automatic Memory: Gemini learns preferences without explicit "remember this" commands
- Personal Context Setting: On by default, remembers key details from past chats
- Privacy Controls: Temporary chats (72-hour auto-delete) for sensitive conversations
- Cross-Device: Memory persists across phone, web, and smart home devices
As of 2026, Gemini is becoming Google's default AI layer across all surfaces, replacing Google Assistant with persistent personalization.
Part III: User Modeling and Preference Learning
Extracting User Preferences
The challenge: users rarely state preferences explicitly. They reveal them implicitly through behavior, word choice, and interaction patterns.
┌─────────────────────────────────────────────────────────────────────────┐
│ PREFERENCE EXTRACTION METHODS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ EXPLICIT SIGNALS IMPLICIT SIGNALS │
│ ────────────────── ───────────────── │
│ │
│ • "Remember that I prefer..." • Response length preferences │
│ • Settings and configurations • Vocabulary and formality level │
│ • Direct feedback (thumbs up/down) • Topics they engage with │
│ • Custom instructions • Questions they ask │
│ • Profile information • Editing patterns │
│ │
│ Pros: Clear, reliable Pros: Abundant, natural │
│ Cons: Users rarely provide Cons: Noisy, requires inference │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXTRACTION TECHNIQUES: │
│ │
│ 1. Direct LLM Inference │
│ "Based on this conversation, what are the user's preferences?" │
│ │
│ 2. Structured Classification │
│ BERT/encoder models for multi-label preference extraction │
│ │
│ 3. Embedding Clustering │
│ Group similar users, infer preferences from cluster │
│ │
│ 4. Behavior Pattern Analysis │
│ Track: response ratings, regeneration requests, edit patterns │
│ │
└─────────────────────────────────────────────────────────────────────────┘
POPI Framework (arXiv:2510.17881) demonstrates that natural language preference summaries are more effective than embeddings for personalization. The key insight: interpretable text profiles outperform dense vectors because they can be reasoned about.
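Behavior pattern analysis (technique 4 above) can be sketched as simple event aggregation feeding a text profile. The event names and thresholds below are illustrative assumptions:

```python
# Sketch of implicit-signal extraction: turn raw interaction events
# into coarse preference flags that can feed a natural-language profile.
from collections import Counter

def infer_implicit_preferences(events: list[dict]) -> dict:
    """Aggregate interaction events into coarse preference signals."""
    counts = Counter(e["type"] for e in events)
    shortened = sum(1 for e in events
                    if e["type"] == "edit" and e.get("made_shorter"))
    return {
        "dislikes_verbosity": shortened >= 2,           # repeatedly trims replies
        "unstable_quality": counts["regenerate"] >= 3,  # often regenerates
        "engaged": counts["thumbs_up"] > counts["thumbs_down"],
    }

events = [
    {"type": "edit", "made_shorter": True},
    {"type": "edit", "made_shorter": True},
    {"type": "thumbs_up"},
]
signals = infer_implicit_preferences(events)
```

In the POPI spirit, these flags would then be rendered as a readable sentence ("user prefers shorter responses") rather than stored as an opaque vector.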
Difference-Aware User Modeling
A breakthrough in 2025: modeling what makes users different from each other, not just their absolute preferences.
The Problem: Prior methods model users in isolation. But personalization is fundamentally comparative—what distinguishes this user from others?
DRP Framework (arXiv:2511.15389) introduces:
- Selective User Comparison: Cluster users, identify meaningful comparisons
- Structured Difference Extraction: Compare along dimensions (writing style, emotional style, semantic style)
- System-2 Reasoning: Slow, deliberate reasoning about user differences
# Conceptual example: Difference-aware preference extraction
def extract_user_differences(target_user: User, comparison_users: list[User]) -> dict:
    """
    Extract what makes target_user unique compared to similar users.
    This captures personalization-relevant differences.
    """
    # Cluster to find meaningful comparisons
    cluster = find_user_cluster(target_user)
    comparison_set = sample_from_cluster(cluster, k=5)

    differences = {
        "writing_style": compare_dimension(
            target_user, comparison_set, dimension="writing"
        ),
        "emotional_tone": compare_dimension(
            target_user, comparison_set, dimension="emotion"
        ),
        "topic_preferences": compare_dimension(
            target_user, comparison_set, dimension="topics"
        ),
        "interaction_patterns": compare_dimension(
            target_user, comparison_set, dimension="behavior"
        )
    }
    return differences

def personalize_response(query: str, user: User, differences: dict) -> str:
    """
    Generate a response considering the user's distinguishing characteristics.
    """
    prompt = f"""
    User Query: {query}

    This user differs from typical users in these ways:
    - Writing style: {differences['writing_style']}
    - Emotional preference: {differences['emotional_tone']}
    - Topic interests: {differences['topic_preferences']}

    Generate a response tailored to these specific characteristics.
    """
    return llm.generate(prompt)
Personalized RLHF (P-RLHF)
Standard RLHF assumes homogeneous preferences—everyone wants the same thing. This bakes in biases toward majority viewpoints.
P-RLHF addresses this by learning user-specific reward models:
┌─────────────────────────────────────────────────────────────────────────┐
│ STANDARD RLHF vs PERSONALIZED RLHF │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD RLHF: │
│ ───────────────── │
│ │
│ Feedback from → Single Reward → One Model → Same Output │
│ All Users Model for All for Everyone │
│ │
│ Problem: Optimizes for "average" preference │
│ Result: Biased toward majority demographic │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PERSONALIZED RLHF: │
│ │
│ User A Feedback → User A Reward Model → Personalized │
│ User B Feedback → User B Reward Model → for Each │
│ User C Feedback → User C Reward Model → User │
│ │
│ Solution: Lightweight user models capture individual preferences │
│ Scales: User embeddings + shared base = efficient personalization │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PLUS Framework (2025) achieves 11-77% improvement in reward model accuracy by learning text summaries of user preferences that condition the reward model. Users don't need to articulate preferences—the system learns from interactions.
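The user-conditioned reward idea can be illustrated with a toy scorer: a shared feature extractor whose output is weighted by per-user preference weights. The features and weights below are stand-ins for a learned model, not the PLUS or P-RLHF implementation:

```python
# Toy sketch of a personalized reward model: shared response features,
# user-specific weights. Real systems learn both from feedback data.

def featurize(response: str) -> dict:
    """Shared, user-independent response features."""
    return {
        "length": len(response.split()),
        "has_code": 1.0 if "```" in response else 0.0,
    }

def personalized_reward(response: str, user_weights: dict) -> float:
    """Score = sum of feature values weighted by this user's preferences."""
    feats = featurize(response)
    return sum(user_weights.get(name, 0.0) * value
               for name, value in feats.items())

# Two users can rank the same pair of responses differently:
terse_fan = {"length": -0.1, "has_code": 2.0}
detail_fan = {"length": 0.1, "has_code": 0.5}
short_code = "Use a set:\n```python\nseen = set()\n```"
long_prose = " ".join(["word"] * 30)
```

The point of the sketch is the disagreement: the same two responses get opposite rankings under the two weight vectors, which is exactly what a single averaged reward model cannot express.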
User Embeddings: The Emerging Paradigm
While text-based profiles are interpretable, user embeddings offer a more powerful approach for capturing complex behavioral patterns. Rather than describing a user in words, encode their entire interaction history into a dense vector representation.
┌─────────────────────────────────────────────────────────────────────────┐
│ USER EMBEDDINGS FOR LLM PERSONALIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TEXT-BASED PROFILES: │
│ ───────────────────── │
│ │
│ "User prefers technical depth, concise responses, │
│ interested in ML/AI, senior developer" │
│ │
│ Pros: Interpretable, editable, reasoned about │
│ Cons: Lossy, hard to capture nuance, verbose in context │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USER EMBEDDINGS: │
│ ───────────────── │
│ │
│ [0.23, -0.87, 0.45, ..., 0.12] (d-dimensional vector) │
│ │
│ Pros: Captures fine-grained patterns, compact, supports comparison │
│ Cons: Not interpretable, requires training infrastructure │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTEGRATION METHODS: │
│ │
│ 1. Soft Prompting: Prepend embedding to LLM input │
│ 2. Cross-Attention: Attend to embeddings during generation │
│ 3. Adapter Conditioning: Route through user-specific adapter │
│ │
└─────────────────────────────────────────────────────────────────────────┘
USER-LLM (Google Research, ACM Web Conference 2025) is the landmark framework for integrating user embeddings with LLMs:
Architecture:
- Transformer Encoder: Creates user embeddings from multi-modal ID-based features (items viewed, categories, ratings)
- Perceiver Compression: Distills long user histories into fixed-length 32-token embeddings
- Cross-Attention Integration: User embeddings cross-attend with intermediate LLM representations
- Alternative: Soft-Prompting: Embeddings prepended as prefix to LLM input
Results: 78.1X inference speedup vs text-prompt methods, 16.33% performance improvement on deep user understanding tasks. Unlike text prompts that degrade with sequence length, USER-LLM improves as history grows.
# Conceptual USER-LLM architecture
class UserLLM:
    """
    Google's USER-LLM: Efficient personalization via user embeddings.
    """
    def __init__(self, llm: LLM, user_encoder: TransformerEncoder):
        self.llm = llm
        self.user_encoder = user_encoder  # Pretrained on user sequences
        self.perceiver = PerceiverResampler(output_tokens=32)
        self.cross_attention = CrossAttentionLayer()

    def encode_user(self, user_history: list[dict]) -> torch.Tensor:
        """
        Encode variable-length user history into a fixed 32-token embedding.
        """
        # Each interaction -> embedding (item ID, category, rating, etc.)
        interaction_embeddings = [
            self.embed_interaction(item) for item in user_history
        ]
        # Transformer encodes the sequence
        sequence_repr = self.user_encoder(interaction_embeddings)
        # Perceiver compresses to fixed length
        user_embedding = self.perceiver(sequence_repr)  # [32, d]
        return user_embedding

    def generate(self, query: str, user_embedding: torch.Tensor) -> str:
        """
        Generate a personalized response via cross-attention to the user embedding.
        """
        # LLM processes the query, cross-attending to the user embedding
        # at intermediate layers
        text_hidden = self.llm.encode(query)
        personalized_hidden = self.cross_attention(
            query=text_hidden,
            key=user_embedding,
            value=user_embedding
        )
        return self.llm.decode(personalized_hidden)
DEP: Difference-aware Embedding Personalization (EMNLP 2025 Oral) takes embeddings further by modeling inter-user differences:
The key insight: personalization is fundamentally comparative. What makes User A different from similar User B? DEP constructs soft prompts by contrasting a user's embedding with peer embeddings, then uses a sparse autoencoder to filter task-relevant features.
Why embeddings beat text for comparison: Vector operations naturally support inter-user comparison (subtraction, similarity), while comparing text profiles requires LLM reasoning. Embeddings encode fine-grained patterns in compact form.
PERSOMA (GenAI Personalization Workshop 2024) compresses user history into soft prompt embeddings using a perceiver architecture, steering a frozen LLM toward user preferences without fine-tuning.
Real-Time Sequential Embeddings: The Albatross Approach
While USER-LLM works with historical data, Albatross (founded by ex-Amazon AI leaders, $16M raised) pioneered real-time sequential embeddings that update as users interact:
┌─────────────────────────────────────────────────────────────────────────┐
│ ALBATROSS: REAL-TIME SEQUENTIAL EMBEDDINGS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RECSYS: │
│ ───────────────────── │
│ │
│ User History (batch) → Train Model → Static Embeddings → Serve │
│ │
│ • Embeddings updated daily/weekly │
│ • Can't capture in-session intent shifts │
│ • "You looked at running shoes yesterday" ≠ current intent │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ALBATROSS REAL-TIME: │
│ ───────────────────── │
│ │
│ Live Events → Sequential Transformer → Updated Embeddings → Serve │
│ ↑ (BERT4Rec/SASRec style) │ │
│ └─────────────── milliseconds ────────────────┘ │
│ │
│ • Embeddings update 4,000+ times per second │
│ • Captures intent evolution within session │
│ • "Running shoes → yoga mats → protein powder" = fitness intent NOW │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SCALE: │
│ • 1B+ events processed monthly │
│ • Predictions in <100ms │
│ • Triple-digit engagement uplifts in pilots │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Technical Foundation: Albatross uses transformer architectures (similar to BERT4Rec and SASRec) but trained on live event streams rather than batch historical data. The key innovation is treating browsing sequences as a "language" where the order matters:
- SASRec: Left-to-right unidirectional attention, predicts next item sequentially
- BERT4Rec: Bidirectional attention via Cloze task (masked item prediction)
- Albatross: Extends these with real-time streaming updates
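The streaming idea can be illustrated with a deliberately simplified O(1) update: each new event blends its item vector into a running user vector with exponential decay, so recent actions dominate. This is a toy stand-in for a transformer over live event streams, not Albatross's actual model; the item vectors are invented:

```python
# Toy sketch of real-time sequential embeddings: per-event O(1) update
# with exponential decay. Recent events dominate the representation.

ITEM_VECTORS = {            # pretend 3-d item embeddings, for illustration
    "running_shoes": [1.0, 0.0, 0.0],
    "yoga_mat":      [0.8, 0.2, 0.0],
    "novel":         [0.0, 0.0, 1.0],
}

def update_user_vector(user_vec, item, decay=0.5):
    """Blend the new event's item vector into the running user vector."""
    item_vec = ITEM_VECTORS[item]
    return [decay * u + (1 - decay) * i for u, i in zip(user_vec, item_vec)]

vec = [0.0, 0.0, 0.0]
for event in ["novel", "running_shoes", "yoga_mat"]:
    vec = update_user_vector(vec, event)
# After the fitness-leaning session, the first (fitness-like) dimension
# dominates, even though the session started with a book.
```

This captures the "who you are right now" property: the same user vector would drift back toward the third dimension after a few book-related events.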
Cold-Start Solution: Their work on cold-start discovery was presented at RecSys 2025. For new items with no interaction history, DenseRec-style approaches learn projections from content embeddings into the ID embedding space.
Products:
- Real-Time Discovery Feed: Dynamically curates content as intent evolves
- Multimodal Search: Refines results based on evolving intent, including image input
This represents the frontier of personalization: systems that understand not just who you are, but who you are right now in this moment.
Part IV: Fine-Tuning for Personalization
LoRA Adapters for User-Specific Models
Full fine-tuning per user is impractical—you'd need billions of parameters stored per user. LoRA (Low-Rank Adaptation) makes personalization feasible:
┌─────────────────────────────────────────────────────────────────────────┐
│ LORA FOR PERSONALIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FULL FINE-TUNING: │
│ ───────────────────── │
│ │
│ Base Model (7B params) → Fine-tune ALL params → User Model (7B) │
│ │
│ Storage per user: 7B parameters = ~14GB │
│ 1M users = 14 PETABYTES │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LORA FINE-TUNING: │
│ ───────────────────── │
│ │
│ Base Model (frozen) + LoRA Adapter (0.05-1% params) → User Model │
│ │
│ Storage per user: ~35MB (rank 16 on 7B model) │
│ 1M users = 35 TERABYTES (still large but manageable) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CLUSTERED LORA (Practical): │
│ ──────────────────────────── │
│ │
│ Cluster users → Train LoRA per cluster → Mix for individuals │
│ │
│ Storage: k clusters × 35MB = feasible │
│ Personalization: Weighted mix of cluster adapters │
│ │
└─────────────────────────────────────────────────────────────────────────┘
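The "weighted mix of cluster adapters" step can be sketched with plain lists: represent each cluster adapter's effect as a delta and combine deltas with per-user weights. Real LoRA deltas are low-rank weight matrices; the vectors and cluster names here are illustrative:

```python
# Sketch of clustered-LoRA mixing: per-user weighted sum of cluster
# adapter deltas. Lists stand in for low-rank weight updates.

CLUSTER_DELTAS = {
    "power_user": [0.2, -0.1, 0.0],
    "casual":     [0.0, 0.3, 0.1],
}

def mix_adapters(user_weights: dict) -> list[float]:
    """Weighted sum of cluster adapter deltas for one user."""
    dims = len(next(iter(CLUSTER_DELTAS.values())))
    mixed = [0.0] * dims
    for cluster, weight in user_weights.items():
        for i, d in enumerate(CLUSTER_DELTAS[cluster]):
            mixed[i] += weight * d
    return mixed

# A user who behaves 70% like the power-user cluster:
delta = mix_adapters({"power_user": 0.7, "casual": 0.3})
```

Storage stays at k clusters worth of adapters, while each user still gets an individual combination via their weight vector.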
Federated Fine-Tuning
For privacy-preserving personalization, federated learning lets models learn from user data without centralizing it.
FedALT (arXiv:2503.11880) introduces a novel approach:
- Individual LoRA: Each client trains their own adapter
- Rest-of-World LoRA: Shared knowledge from other users
- Adaptive Mixer: Dynamically balances personal vs global knowledge
# Conceptual FedALT architecture
class FedALTClient:
    def __init__(self, base_model, user_data):
        self.base_model = base_model                # Frozen
        self.individual_lora = LoRAAdapter(rank=8)  # User-specific
        self.row_lora = LoRAAdapter(rank=8)         # Rest-of-world (shared)
        self.mixer = AdaptiveMixer()                # Learned weighting

    def forward(self, x):
        base_out = self.base_model(x)
        individual_out = self.individual_lora(x)
        global_out = self.row_lora(x)
        # Adaptive mixing based on input
        alpha = self.mixer(x)  # Per-input weight
        return base_out + alpha * individual_out + (1 - alpha) * global_out

    def local_update(self, batch):
        """Train individual LoRA on local data"""
        # Only update individual_lora and mixer
        # row_lora comes from server aggregation
        pass
PF2LoRA (OpenReview) adds automatic rank selection—different users need different adapter capacities based on their data distribution.
On-Device Personalization
The privacy gold standard: personalization that never leaves the device.
2025 Breakthroughs:
- Mobile GPU Fine-Tuning: First production frameworks for training on Qualcomm Adreno and ARM Mali GPUs
- XPerT: 83% reduction in on-device fine-tuning compute, 51% better data efficiency
- TZ-LLM: 90.9% reduction in time-to-first-token via secure enclave optimization
┌─────────────────────────────────────────────────────────────────────────┐
│ ON-DEVICE PERSONALIZATION STACK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ APPLICATION LAYER │
│ ───────────────────── │
│ Personal assistant, email, notes, calendar │
│ │
│ PERSONALIZATION LAYER │
│ ───────────────────────── │
│ User profile, preference store, interaction history │
│ │
│ ADAPTATION LAYER │
│ ───────────────────── │
│ On-device LoRA training, prompt caching, KV-cache persistence │
│ │
│ MODEL LAYER │
│ ─────────────── │
│ Quantized SLM (1-3B params), edge-optimized architecture │
│ │
│ HARDWARE LAYER │
│ ───────────────── │
│ Mobile NPU, GPU, secure enclave for sensitive data │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PRIVACY GUARANTEES: │
│ • User data never leaves device │
│ • Federated learning for shared improvements │
│ • Differential privacy for gradient updates │
│ • Secure enclave for model parameters │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The SLM (Small Language Model) market is projected to reach $5.45B by 2032. NVIDIA researchers argue that the future of agentic AI is small: "a federation of smaller, faster, privacy-friendly agents—running on the edge, in your browser, or even offline."
Part V: Solving Cold Start
The hardest personalization problem: what do you do with new users who have no history?
LLMs as Cold-Start Solvers
Traditional collaborative filtering fails completely for new users. LLMs offer a path forward:
┌─────────────────────────────────────────────────────────────────────────┐
│ COLD START SOLUTIONS WITH LLMs │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL CF: │
│ ───────────────── │
│ New user → No history → No similar users → NO RECOMMENDATIONS │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LLM-BASED SOLUTIONS: │
│ │
│ 1. LANGUAGE-BASED PREFERENCE ELICITATION │
│ ───────────────────────────────────── │
│ "What topics interest you?" → LLM interprets → Preferences │
│ │
│ Finding: Language-based elicitation is FASTER than item-based │
│ and achieves competitive accuracy (RecSys 2023) │
│ │
│ 2. ZERO-SHOT TRANSFER │
│ ───────────────────── │
│ LLM world knowledge → Infer preferences from minimal signals │
│ │
│ "User is a Python developer" → Infer: prefers code examples, │
│ technical depth, concise answers │
│ │
│ 3. META-LEARNING │
│ ────────────────── │
│ Train model to QUICKLY adapt from few examples │
│ │
│ Meta-learned prompt-tuning: Personalize from 5-10 interactions │
│ │
│ 4. PERSONALIZED THINKING (Training-Free) │
│ ────────────────────────────────────── │
│ PRIME shows personalized reasoning works even without history │
│ LLM reasons through user's likely perspective │
│ │
└─────────────────────────────────────────────────────────────────────────┘
LLMTreeRec (COLING 2025) achieves state-of-the-art cold-start performance by structuring items into a tree for efficient LLM retrieval.
Meta-Learning for Cold-Start (arXiv:2507.16672) proposes parameter-efficient prompt-tuning that adapts in few-shot scenarios.
Active Preference Elicitation
Rather than waiting for users to reveal preferences, actively ask:
def intelligent_preference_elicitation(user: NewUser) -> UserProfile:
    """
    Actively elicit preferences from new users using a decision-tree strategy.
    Minimize questions while maximizing information gain.
    """
    # Start with high-information questions
    questions = [
        {
            "question": "How technical should my explanations be?",
            "options": ["High-level concepts", "Some code examples", "Deep technical detail"],
            "dimension": "technical_depth"
        },
        {
            "question": "What's your primary use case?",
            "options": ["Learning", "Problem-solving", "Creative work", "Research"],
            "dimension": "intent"
        },
        {
            "question": "How long should my responses typically be?",
            "options": ["Brief and direct", "Balanced", "Comprehensive"],
            "dimension": "verbosity"
        }
    ]

    profile = {}
    for q in questions:
        # Use the LLM to determine whether the question is still informative
        # given what we've already learned
        if should_ask(q, profile):
            answer = ask_user(q)
            profile[q["dimension"]] = answer
        # Early exit if we have enough signal
        if profile_confidence(profile) > 0.8:
            break

    # Fill remaining dimensions with LLM inference
    profile = llm_infer_remaining(profile)
    return UserProfile(profile)
Part VI: Evaluation and Benchmarks
The Benchmarking Challenge
How do you measure if personalization is working? Traditional metrics (accuracy, perplexity) don't capture it.
┌─────────────────────────────────────────────────────────────────────────┐
│ PERSONALIZATION BENCHMARKS (2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PERSONAMEM (COLM 2025) │
│ ───────────────────────── │
│ • 180+ simulated user-LLM interaction histories │
│ • 60 multi-turn sessions per user │
│ • 15 personalized task scenarios │
│ • Tests: memory, tracking evolution, personalized response │
│ │
│ Key Finding: Even GPT-4.5 achieves only ~52% accuracy │
│ Models are better at recall (60-70%) than adaptation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PERSONALLLM (ICLR 2025) │
│ ───────────────────────── │
│ • 10K+ open-ended prompts │
│ • 8 high-quality responses per prompt │
│ • Simulates diverse user preferences via reward models │
│ • Tests: adaptation to individual preference patterns │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PERSONALENS (ACL Findings 2025) │
│ ───────────────────────────────── │
│ • Task-oriented assistant evaluation │
│ • Rich user profiles with preferences and history │
│ • LLM-as-Judge for personalization assessment │
│ • Tests: task success + personalization quality │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PREFEVAL (2025) │
│ ───────────────── │
│ • 3,000 user preference + query pairs │
│ • 20 topic categories │
│ • Explicit AND implicit preference forms │
│ • Tests: preference following over long contexts │
│ │
│ Key Finding: <10% accuracy at just 10 turns in zero-shot │
│ │
└─────────────────────────────────────────────────────────────────────────┘
What the Benchmarks Reveal
The research paints a sobering picture:
- Frontier models struggle: GPT-4.5, Gemini-2.0, and o1 achieve only ~50% on PersonaMem
- Static recall > Dynamic adaptation: Models remember facts but don't incorporate evolving preferences
- Long-context degrades fast: Performance drops below 10% after just 10 turns without retrieval
- RAG significantly helps: External memory systems consistently improve performance
Metrics for Production
Beyond academic benchmarks, production systems need:
| Metric | Description | Target |
|---|---|---|
| Preference Accuracy | Does output match known preferences? | >80% |
| Consistency | Same preferences → same style over time | >90% |
| Adaptation Speed | Turns to learn new preference | <5 turns |
| User Satisfaction | Direct feedback on personalization | >4.5/5 |
| Retention Impact | Users with personalization vs without | +40-70% |
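Several of these metrics can be computed directly from interaction logs. A minimal sketch, assuming each logged interaction records the style the model produced and the style the user's profile specifies — the `Interaction` fields and exact-match rule here are illustrative stand-ins for an LLM-as-Judge or a style classifier:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    response_style: str   # style detected in the model's output
    preferred_style: str  # style the user's profile specifies

def preference_accuracy(logs: list[Interaction]) -> float:
    """Fraction of responses that matched the user's known preference."""
    if not logs:
        return 0.0
    hits = sum(1 for i in logs if i.response_style == i.preferred_style)
    return hits / len(logs)

def consistency(logs: list[Interaction]) -> float:
    """Fraction of users who received one stable style across all sessions."""
    by_user: dict[str, set[str]] = {}
    for i in logs:
        by_user.setdefault(i.user_id, set()).add(i.response_style)
    if not by_user:
        return 0.0
    stable = sum(1 for styles in by_user.values() if len(styles) == 1)
    return stable / len(by_user)
```

In production the style-detection step would be the hard part; the aggregation above stays the same regardless of how the match is judged.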
Part VII: Enterprise Applications
Customer Service Personalization
LLM-powered customer service that remembers each customer:
┌─────────────────────────────────────────────────────────────────────────┐
│ PERSONALIZED CUSTOMER SERVICE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CUSTOMER CONTACT │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌──────────────────────────────────────┐ │
│ │ Channel │────▶│ PERSONALIZATION LAYER │ │
│ │ (Chat/Call)│ │ │ │
│ └─────────────┘ │ • CRM Integration (purchase history)│ │
│ │ • Past tickets and resolutions │ │
│ │ • Communication preferences │ │
│ │ • Sentiment from past interactions │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ LLM AGENT │ │
│ │ │ │
│ │ Context: "Returning customer, │ │
│ │ prefers technical detail, had │ │
│ │ shipping issue last month (resolved),│ │
│ │ VIP tier, prefers email follow-up" │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ PERSONALIZED RESPONSE │ │
│ │ │ │
│ │ "Hi [Name], I see you're a valued │ │
│ │ customer. Given the shipping issue │ │
│ │ last month, let me expedite this..."│ │
│ └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Bank of America's Erica: Handles 1B+ customer interactions annually with personalized financial tips based on individual spending patterns and goals.
Impact Statistics (2025):
- 24/7 support without proportional staffing
- 2+ hours saved daily per service representative
- Personalized responses increase CSAT by 25%
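The personalization layer in the diagram reduces to assembling a compact context string from CRM fields before the agent sees the query. A minimal sketch — the field names (`tier`, `last_issue`, and so on) are hypothetical and would map to your actual CRM schema:

```python
def build_service_context(crm: dict) -> str:
    """Assemble the personalization context injected into the agent prompt.

    Field names are illustrative; adapt them to the real CRM schema.
    """
    parts = []
    if crm.get("tier"):
        parts.append(f"{crm['tier']} tier customer")
    if crm.get("open_issues") == 0 and crm.get("last_issue"):
        parts.append(f"previous issue resolved: {crm['last_issue']}")
    if crm.get("detail_pref"):
        parts.append(f"prefers {crm['detail_pref']} explanations")
    if crm.get("followup_channel"):
        parts.append(f"prefers {crm['followup_channel']} follow-up")
    return "; ".join(parts) if parts else "no history on file"
```

Keeping this layer as plain string assembly (rather than dumping raw CRM records into the prompt) keeps token cost predictable and makes the injected context auditable.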
E-Commerce Personalization
Beyond recommendations—personalized search, product descriptions, and shopping assistance:
class PersonalizedEcommerceAgent:
    """
    E-commerce agent that personalizes the entire shopping experience.
    """

    def __init__(self, user_profile: UserProfile, llm: LLM, search_engine: SearchEngine):
        self.profile = user_profile
        self.llm = llm
        self.search_engine = search_engine  # required by personalize_search below
    def personalize_search(self, query: str) -> list[Product]:
        """
        Rerank search results based on user preferences.
        """
        # Standard search
        results = self.search_engine.search(query)

        # Personalized reranking prompt
        prompt = f"""
        User Profile:
        - Style preferences: {self.profile.style}
        - Price sensitivity: {self.profile.price_range}
        - Brand preferences: {self.profile.brands}
        - Past purchases: {self.profile.recent_purchases}

        Query: {query}

        Rerank these results for this specific user:
        {format_results(results)}

        Consider which items match their style, their price range, and the
        preferences implied by their past purchases.
        """
        return self.llm.rerank(prompt, results)
    def personalize_product_page(self, product: Product) -> str:
        """
        Generate personalized product description highlighting
        features this user cares about.
        """
        prompt = f"""
        Product: {product.name}
        Full Description: {product.description}
        Features: {product.features}

        User cares about: {self.profile.priorities}
        User's use case: {self.profile.use_case}

        Rewrite the key selling points emphasizing what matters to THIS user.
        """
        return self.llm.generate(prompt)

    def shopping_assistant(self, message: str) -> str:
        """
        Conversational shopping with personalized recommendations.
        """
        context = f"""
        You're a personal shopping assistant for {self.profile.name}.

        Their preferences:
        - Budget: {self.profile.budget}
        - Style: {self.profile.style}
        - Sizes: {self.profile.sizes}
        - Previous purchases they loved: {self.profile.favorites}
        - Items they returned: {self.profile.returns}

        Provide personalized recommendations and advice.
        """
        return self.llm.chat(context, message)
Results: Platforms with AI-powered personalization see a 40% increase in session-to-click conversion and a 25% reduction in search abandonment.
RAG for Personal Knowledge
Retrieval-Augmented Generation over personal data:
┌─────────────────────────────────────────────────────────────────────────┐
│ PERSONAL KNOWLEDGE RAG │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ USER'S PERSONAL DATA SOURCES: │
│ ───────────────────────────────── │
│ • Emails and calendar │
│ • Notes and documents │
│ • Meeting transcripts │
│ • Slack/Teams messages │
│ • Browser bookmarks │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PERSONAL VECTOR STORE │ │
│ │ │ │
│ │ Chunks embedded and indexed per user │ │
│ │ Privacy: Local-only or encrypted cloud │ │
│ │ Refresh: Incremental as new data arrives │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PERSONALIZED QUERIES │ │
│ │ │ │
│ │ "What did I discuss with Sarah about the Q3 budget?" │ │
│ │ → Retrieves: Meeting notes, email thread, Slack messages │ │
│ │ → LLM synthesizes personalized answer │ │
│ │ │ │
│ │ "What's my schedule conflict for next Tuesday?" │ │
│ │ → Retrieves: Calendar events, committed meetings │ │
│ │ → LLM identifies conflicts based on YOUR priorities │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ USE CASES: │
│ • Personal assistants (remembering your life) │
│ • Enterprise copilots (organizational memory) │
│ • Health assistants (medical history + guidelines) │
│ • Educational AI (student progress + curriculum) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
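The per-user vector store above can be sketched end to end in a few lines. The bag-of-words "embedding" here is a deliberate toy so the example is self-contained; a real system would call an embedding model and a proper vector database:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; swap in a real embedding model."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity over sparse word counts."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class PersonalStore:
    """One user's chunk index: add documents, retrieve top-k for a query."""

    def __init__(self):
        self.chunks: list[tuple[str, Counter]] = []

    def add(self, text: str):
        self.chunks.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: similarity(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The privacy properties in the diagram come from where this store lives: keep it on-device or encrypted per user, and incremental refresh is just calling `add` as new data arrives.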
Part VIII: Challenges and Limitations
The Privacy-Personalization Tradeoff
The fundamental tension: better personalization requires more data, but users increasingly demand privacy.
┌─────────────────────────────────────────────────────────────────────────┐
│ PRIVACY-PERSONALIZATION SPECTRUM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MAXIMUM PRIVACY MAXIMUM PERSONALIZATION │
│ ◄───────────────────────────────────────────────────────────────────► │
│ │
│ On-device Federated Encrypted Opt-in Full Cloud │
│ Only Learning Cloud Sharing Processing │
│ │
│ • No server • Gradients • Data • User • All data │
│ contact shared, encrypted consents on servers │
│ • Limited not data at rest per use • Best models │
│ capability • Privacy • Keys with • Partial • Privacy │
│ • Full preserved user sharing concerns │
│ control • Some • Balanced • Good • GDPR/CCPA │
│ learning results compliance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ REGULATORY LANDSCAPE (2025): │
│ │
│ • GDPR: Right to erasure conflicts with LLM training │
│ • CCPA: Disclosure requirements for AI-processed data │
│ • EU AI Act: High-risk system requirements │
│ • Emerging: State-level AI privacy laws │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Practical Approaches:
- Differential privacy: Add noise to prevent individual identification
- Federated learning: Train on data without centralizing it
- On-device processing: Never send data to servers
- Temporary processing: Process then delete (Gemini's temporary chats)
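Differential privacy is the most mechanical of these to illustrate: before an aggregate statistic leaves the device, add calibrated Laplace noise. A sketch for a count query with sensitivity 1 (one user changes the count by at most 1), using the fact that the difference of two exponential draws is a Laplace draw:

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    Adds Laplace(0, 1/epsilon) noise, which is appropriate for a query
    with sensitivity 1. The difference of two Exponential(rate=epsilon)
    samples is exactly a Laplace(0, 1/epsilon) sample.
    """
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller `epsilon` means stronger privacy and noisier counts; the noise is zero-mean, so aggregates over many users remain accurate even though any single released value is deniable.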
Filter Bubbles and Bias
Personalization can create echo chambers and reinforce biases:
- Filter bubbles: Users only see content matching existing preferences
- Polarization: Personalization amplifies existing beliefs
- Demographic bias: RLHF optimizes for majority preferences
- Representation harm: Minorities get worse personalization
Mitigations:
- Transparency: Tell users when responses are personalized
- Diversity injection: Deliberately include diverse perspectives
- Bias auditing: Regular evaluation across demographic groups
- User control: Let users adjust personalization level
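Diversity injection has a standard concrete form: a maximal-marginal-relevance (MMR) pass over the personalized ranking, trading relevance against similarity to items already shown. A sketch, with the similarity function left pluggable:

```python
from typing import Callable

def diversify(
    items: list[str],
    relevance: dict[str, float],
    similar: Callable[[str, str], float],
    lam: float = 0.7,
) -> list[str]:
    """MMR-style reorder: balance relevance against redundancy with items
    already selected, so personalization doesn't collapse into one topic."""
    remaining = list(items)
    selected: list[str] = []
    while remaining:
        def mmr(x: str) -> float:
            redundancy = max((similar(x, s) for s in selected), default=0.0)
            return lam * relevance[x] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

`lam` is the user-control knob from the list above: 1.0 is pure personalization, lower values deliberately surface material outside the user's existing preferences.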
Technical Limitations
Even with perfect data and no privacy constraints:
| Limitation | Description | Current State |
|---|---|---|
| Context window | Can't fit all history | 128K-1M tokens, still finite |
| Catastrophic forgetting | Fine-tuning loses old knowledge | Active research area |
| Consistency | Same user, different sessions, different style | ~70% consistency |
| Evaluation | Hard to measure "good" personalization | Benchmark accuracy ~50% |
| Scale | Storage/compute for millions of users | Clustering approaches help |
Part IX: Implementation Guide
Building a Personalized LLM System
A practical architecture for production personalization:
from dataclasses import dataclass
import queue

@dataclass
class UserProfile:
    user_id: str
    preferences: dict
    history_summary: str
    last_updated: str

@dataclass
class Memory:
    episodic: list[dict]  # Recent conversations
    semantic: dict        # Extracted facts/preferences

class PersonalizedLLM:
    """
    Production-ready personalized LLM system.
    Combines multiple personalization techniques.
    """

    def __init__(
        self,
        base_llm: LLM,
        user_store: UserStore,
        memory_store: MemoryStore,
        embedding_model: EmbeddingModel,
    ):
        self.llm = base_llm
        self.users = user_store
        self.memories = memory_store
        self.embedder = embedding_model
        # Consumed by a background worker; _update_memories_async enqueues here
        self.memory_update_queue: queue.Queue = queue.Queue()
    def get_personalized_response(
        self,
        user_id: str,
        query: str,
        conversation_history: list[dict],
    ) -> str:
        """
        Generate a response personalized to this specific user.
        """
        # 1. Load user profile and memory
        profile = self.users.get(user_id)
        memory = self.memories.get(user_id)

        # 2. Retrieve relevant episodic memories
        relevant_episodes = self._retrieve_relevant_episodes(
            query, memory.episodic, k=5
        )

        # 3. Build personalized system prompt
        system_prompt = self._build_system_prompt(
            profile, memory.semantic, relevant_episodes
        )

        # 4. Generate response
        response = self.llm.generate(
            system_prompt=system_prompt,
            messages=conversation_history + [{"role": "user", "content": query}],
        )

        # 5. Update memories asynchronously
        self._update_memories_async(user_id, query, response, conversation_history)

        return response
    def _build_system_prompt(
        self,
        profile: UserProfile,
        semantic_memory: dict,
        relevant_episodes: list[dict],
    ) -> str:
        """
        Construct personalized system prompt with user context.
        """
        prompt_parts = [
            "You are a helpful assistant personalized to this user.",
            "",
            "USER PROFILE:",
            f"- Name: {semantic_memory.get('name', 'Unknown')}",
            f"- Expertise level: {semantic_memory.get('expertise', 'general')}",
            f"- Communication style preference: {semantic_memory.get('style', 'balanced')}",
            f"- Key interests: {', '.join(semantic_memory.get('interests', []))}",
            "",
            "RELEVANT CONTEXT FROM PAST CONVERSATIONS:",
        ]

        for episode in relevant_episodes:
            prompt_parts.append(f"- {episode['summary']}")

        prompt_parts.extend([
            "",
            "PERSONALIZATION GUIDELINES:",
            f"- Response length: {profile.preferences.get('length', 'balanced')}",
            f"- Technical depth: {profile.preferences.get('technical_depth', 'moderate')}",
            f"- Tone: {profile.preferences.get('tone', 'professional')}",
            "",
            "Adapt your response to match these preferences while being helpful and accurate.",
        ])

        return "\n".join(prompt_parts)
    def _retrieve_relevant_episodes(
        self,
        query: str,
        episodes: list[dict],
        k: int = 5,
    ) -> list[dict]:
        """
        Retrieve most relevant past conversations using embedding similarity.
        """
        if not episodes:
            return []

        query_embedding = self.embedder.embed(query)

        scored = []
        for episode in episodes:
            similarity = cosine_similarity(query_embedding, episode['embedding'])
            scored.append((similarity, episode))

        # Sort on the score only; comparing the episode dicts on ties would raise
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ep for _, ep in scored[:k]]
    def _update_memories_async(
        self,
        user_id: str,
        query: str,
        response: str,
        history: list[dict],
    ):
        """
        Asynchronously update user memories based on this interaction.
        """
        # Extract new facts/preferences from conversation
        extraction_prompt = f"""
        Based on this conversation, extract any new information about the user:

        User: {query}
        Assistant: {response}

        Extract:
        1. Any stated preferences
        2. Expertise indicators
        3. Topics of interest
        4. Communication style signals

        Return as JSON or "nothing new" if no new information.
        """

        # Queue for async processing by a background worker
        self.memory_update_queue.put({
            "user_id": user_id,
            "query": query,
            "response": response,
            "extraction_prompt": extraction_prompt,
        })
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x ** 2 for x in a) ** 0.5
    norm_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
Deployment Considerations
| Consideration | Recommendation |
|---|---|
| Storage | User profiles: PostgreSQL. Memories: Vector DB (Pinecone, Weaviate). Episodes: Object storage. |
| Caching | Cache user profiles in Redis. Hot users get priority. |
| Latency | Profile lookup must be <50ms. Async memory updates. |
| Privacy | Encrypt PII at rest. Audit logging for data access. |
| Scale | Cluster users for LoRA adapters. 100-1000 clusters typical. |
| Fallback | If personalization fails, fall back to generic response. |
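The fallback row is worth making concrete: a personalization failure (profile store down, malformed memory record) should degrade to a generic answer, never surface as an error. A sketch assuming the `PersonalizedLLM` interface from the implementation above; the generic path is any LLM client with a `generate` method:

```python
import logging

logger = logging.getLogger(__name__)

def respond_with_fallback(personalized, generic_llm, user_id: str,
                          query: str, history: list[dict]) -> str:
    """Try the personalized path; on any failure, degrade to a generic
    response rather than propagating the error to the user."""
    try:
        return personalized.get_personalized_response(user_id, query, history)
    except Exception:
        logger.exception("personalization failed for user %s; falling back", user_id)
        return generic_llm.generate(
            system_prompt="You are a helpful assistant.",
            messages=history + [{"role": "user", "content": query}],
        )
```

Pair this with an alert on the fallback rate: a sudden spike means the memory or profile store is degraded even though users still see answers.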
The Future of Personalized AI
MCP: The Infrastructure Layer for Personalization
The Model Context Protocol (MCP), donated to the Linux Foundation in December 2025, has become the standard infrastructure for agentic AI—and personalization is a core use case.
┌─────────────────────────────────────────────────────────────────────────┐
│ MCP: THE "USB-C FOR AI" PERSONALIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ BEFORE MCP (2024): │
│ ───────────────────── │
│ │
│ Each AI tool → Custom API → Custom integration → Fragmented memory │
│ │
│ • Every app implements memory differently │
│ • No portability of user context between tools │
│ • Developers rebuild personalization for each integration │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WITH MCP (2025+): │
│ ───────────────────── │
│ │
│ AI Agent ←→ MCP Protocol ←→ [Tools, Data Sources, Memory Systems] │
│ │
│ • Standardized context exchange between agents │
│ • Persistent Agent Profile: User identity across sessions │
│ • Portable personalization between MCP-compatible tools │
│ • Rich context without custom integration per tool │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ADOPTION (January 2026): │
│ │
│ • OpenAI, Anthropic, Block → co-founded Agentic AI Foundation │
│ • Google, Microsoft, AWS, Cloudflare → supporting members │
│ • MCP = de facto standard in <12 months │
│ │
└─────────────────────────────────────────────────────────────────────────┘
MCP enables portable personalization: your preferences, context, and memory can travel with you across MCP-compatible AI tools. This is the "USB-C for AI"—a universal connector that makes personalization composable.
2026: The "Show Me the Money" Year
The trajectory is clear—2026 is when personalization must deliver ROI:
- Memory as baseline: All major AI assistants remember users (ChatGPT, Claude, Gemini—shipped)
- Proactive AI: Systems anticipate needs without prompts (ChatGPT Pulse researches based on past conversations)
- Personal models: Small models trained on your data, running locally (SLM market: $5.45B by 2032)
- MCP everywhere: Standardized context sharing between agents and tools
- Memory in its "GPT-2 era": Current memory systems are primitive—massive improvement ahead
Market Context (January 2026):
- OpenAI: 2026 revenue targets build on roughly $13B in 2025
- Anthropic: aiming well beyond its roughly $5B 2025 run rate
- Enterprise LLM market projected to reach $55.60B by 2032
- Domain-specific LLMs growing fastest—personalization is specialization
The Personal AI Stack
┌─────────────────────────────────────────────────────────────────────────┐
│ FUTURE PERSONAL AI STACK (2026+) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INTERFACE LAYER │
│ ───────────────── │
│ Voice, text, gesture, ambient sensors │
│ │
│ PERSONAL AI AGENT │
│ ───────────────────── │
│ • On-device SLM (1-3B) for instant response │
│ • Cloud LLM for complex reasoning │
│ • Personal LoRA adapters │
│ │
│ MEMORY LAYER │
│ ─────────────── │
│ • Episodic: Conversation history │
│ • Semantic: User knowledge graph │
│ • Procedural: Learned workflows and habits │
│ │
│ PERSONAL KNOWLEDGE │
│ ─────────────────── │
│ • Emails, calendar, documents │
│ • Health data, financial records │
│ • Social connections, preferences │
│ │
│ PRIVACY LAYER │
│ ─────────────── │
│ • On-device processing default │
│ • Selective, consented cloud sync │
│ • Federated learning for improvements │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The goal: AI that knows you as well as a long-time assistant—but respects your privacy, runs on your terms, and gets better every day.
Related Articles
LLM Memory Systems: From MemGPT to Long-Term Agent Memory
Understanding memory architectures for LLM agents—MemGPT's hierarchical memory, Letta's agent framework, and patterns for building agents that learn and remember across conversations.
Generative AI for Recommendation Systems: LLMs Meet Personalization
Practical guide to LLM-powered recommendation systems. From feature augmentation to conversational agents, understand how generative AI is transforming personalization.
Building Production-Ready RAG Systems: Lessons from the Field
Production-focused guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Fine-Tuning Workflows & Best Practices: A Practical Guide for LLM Customization
Field guide to fine-tuning LLMs including LoRA, QLoRA, and full fine-tuning. Covers data preparation, hyperparameter selection, evaluation strategies, common pitfalls, and 2025 tools like Unsloth, Axolotl, and LLaMA-Factory.