LLM-Powered Search for E-Commerce: Beyond NER and Elasticsearch
A deep dive into building intelligent e-commerce search systems that understand natural language, leverage metadata effectively, and support multi-turn conversations—moving beyond classical NER + Elasticsearch approaches.
The Problem with Classical E-Commerce Search
Walk into any fashion e-commerce platform today—Zalando, ASOS, or Amazon—and try this query: "I need a cozy navy jacket for the office under €200."
What happens? The classical search pipeline—typically Named Entity Recognition (NER) feeding into Elasticsearch—struggles. It might extract "navy" as a color and "jacket" as a category, but "cozy"? "Office-appropriate"? "Under €200"? These require semantic understanding that keyword matching simply cannot provide.
How Classical Search Actually Works
Let's break down the traditional e-commerce search architecture that powers most retail sites today:
┌─────────────────────────────────────────────────────────────────────────┐
│ CLASSICAL E-COMMERCE SEARCH PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. QUERY PREPROCESSING │
│ ───────────────────── │
│ Raw: "I need a cozy navy jacket for the office under €200" │
│ ↓ │
│ Normalized: "cozy navy jacket office 200" │
│ (stopwords removed, lowercased, special chars stripped) │
│ │
│ 2. NAMED ENTITY RECOGNITION (NER) │
│ ────────────────────────────── │
│ NER Model (typically spaCy, AWS Comprehend, or custom BiLSTM): │
│ │
│ Input: "cozy navy jacket office 200" │
│ Output: { │
│ "color": "navy", ✓ Detected │
│ "category": "jacket", ✓ Detected │
│ "price": null, ✗ "200" without currency/context │
│ "style": null, ✗ "cozy" not in training data │
│ "occasion": null ✗ "office" = place, not occasion │
│ } │
│ │
│ 3. QUERY BUILDING │
│ ───────────── │
│ Elasticsearch DSL query constructed from extracted entities: │
│ │
│ { │
│ "bool": { │
│ "must": [ │
│ {"term": {"color": "navy"}}, │
│ {"term": {"category": "jacket"}} │
│ ] │
│ } │
│ } │
│ │
│ 4. ELASTICSEARCH EXECUTION │
│ ──────────────────────── │
│ - Inverted index lookup for color:navy │
│ - Inverted index lookup for category:jacket │
│ - Intersection of document IDs │
│ - BM25 scoring on remaining terms ("cozy", "office") │
│ │
│ 5. RESULTS │
│ ─────── │
│ 2,847 navy jackets returned │
│ Sorted by: BM25 relevance + popularity + recency │
│ │
│ Problem: User wanted ~50 results, got thousands │
│ Problem: "Cozy" had no effect on ranking │
│ Problem: Puffer jackets ranked same as denim jackets │
│ Problem: No price filtering applied │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why NER Fails for Rich Queries
Named Entity Recognition was designed for extracting structured entities from text—person names, locations, organizations, dates. Adapting it for e-commerce means training models to recognize product attributes. But this approach has fundamental limitations:
Problem 1: Fixed Entity Schema
NER models are trained on a predefined set of entity types. If your training data has color, category, brand, and size, those are the only things you can extract. User queries evolve faster than you can retrain:
| User Query | What NER Extracts | What's Lost |
|---|---|---|
| "sustainable cotton dress" | category: dress | sustainable (ethics), cotton (material) |
| "Y2K aesthetic crop top" | category: top | Y2K aesthetic (style trend) |
| "something to wear to a funeral" | ∅ | occasion, formality, color implications |
| "like the bag Kendall Jenner had" | ∅ | celebrity reference, visual similarity |
| "not too tight jeans" | category: jeans | fit preference (negative constraint) |
Problem 2: No Semantic Understanding
NER extracts surface forms, not meaning. It doesn't understand that:
"cozy" → implies {materials: [wool, fleece, cashmere], fit: relaxed, warmth: high}
"office" → implies {formality: business-casual+, colors: neutral, style: classic}
"edgy" → implies {style: avant-garde, colors: [black, metallics], details: hardware}
"vacation" → implies {materials: light, care: easy-wash, versatility: high}
These semantic expansions require world knowledge that NER models don't possess.
Problem 3: Entity Boundaries Are Ambiguous
Fashion language is compound and contextual:
| Query | Correct Parsing | NER Mistake |
|---|---|---|
| "light blue dress" | color: light-blue | color: blue, weight: light |
| "navy seal jacket" | style: military-inspired | color: navy, animal: seal |
| "rose gold watch" | color: rose-gold | flower: rose, material: gold |
| "off white sneakers" | brand: Off-White OR color: off-white | status: off, color: white |
| "little black dress" | style: LBD (specific fashion item) | size: little, color: black |
Problem 4: No Handling of Negations or Preferences
NER extracts what IS mentioned, not what ISN'T wanted:
Query: "black dress, NOT too short, preferably with sleeves"
NER extracts: {color: black, category: dress}
Lost: length constraint (not short), sleeve preference
Query: "something like my last order but in blue"
NER extracts: {color: blue}
Lost: reference to order history, "similar to" relationship
Why Elasticsearch Falls Short
Once NER extracts (incomplete) entities, they're fed to Elasticsearch. Here's where keyword matching shows its limits:
The Inverted Index Problem
Elasticsearch uses inverted indexes: for each term, it stores which documents contain that term. Queries become set operations:
# Conceptual representation of inverted index
inverted_index = {
    "navy":    {doc_1, doc_47, doc_103, doc_892, ...},   # 3,241 docs
    "jacket":  {doc_1, doc_15, doc_47, doc_201, ...},    # 8,472 docs
    "puffer":  {doc_47, doc_892, doc_1204, ...},         # 423 docs
    "down":    {doc_47, doc_103, doc_1891, ...},         # 892 docs
    "quilted": {doc_892, doc_2001, ...},                 # 312 docs
}
# Query: "navy puffer jacket"
# Execution: navy ∩ jacket ∩ puffer
# Result: Only docs containing ALL three exact terms
The Synonym Problem
A user searching "puffer jacket" won't find products labeled:
- "down jacket" (different term, same item)
- "quilted coat" (synonym)
- "padded jacket" (industry term)
- "insulated outerwear" (formal description)
Elasticsearch's solution—synonym expansion—requires manual curation:
{
  "filter": {
    "synonym_filter": {
      "type": "synonym",
      "synonyms": [
        "puffer, down jacket, quilted coat, padded jacket",
        "sneakers, trainers, athletic shoes, tennis shoes",
        "pants, trousers, slacks"
      ]
    }
  }
}
The problem: Fashion has thousands of synonym groups. They vary by region (jumper vs sweater), generation (kicks vs sneakers), trend (athleisure didn't exist 10 years ago), and brand (Levi's calls jeans "501s," "511s," etc.). Maintaining this manually is a full-time job—and you'll always be behind.
The Relevance Scoring Problem
Elasticsearch uses BM25, a term-frequency algorithm:
BM25(query, document) = Σ IDF(term) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |D|/avgDL))
Where:
- tf = term frequency in document
- IDF = inverse document frequency (rare terms score higher)
- |D| = document length
- avgDL = average document length
- k1, b = tuning parameters
Why this fails for fashion:
- Product descriptions are short — A product titled "Navy Puffer Jacket" has each term appearing once. BM25 can't differentiate quality.
- Fashion terms are common — "Black dress" appears in 40% of inventory. Low IDF = low signal.
- Intent isn't captured — "Cozy navy jacket" and "professional navy jacket" get similar BM25 scores because the differentiating words ("cozy" vs "professional") aren't in product titles.
Real-World Example: The "White Sneakers" Query
Query: "white sneakers for everyday wear"
Elasticsearch BM25 Results (actual ranking problem):
1. "White Leather Sneakers" - BM25: 12.4 (exact match + popularity)
2. "Off-White × Nike Sneakers" - BM25: 11.8 (brand name contains "white")
3. "White Platform Sneakers 6-inch" - BM25: 11.2 (matches terms)
4. "Sneakers Cleaning Kit - White Shoe Cleaner" - BM25: 10.9 (NOT A SHOE)
5. "Classic White Canvas Sneakers" - BM25: 10.7 (actually relevant)
Problems:
- Result #2: "Off-White" is a brand, not the color
- Result #3: 6-inch platforms aren't "everyday" shoes
- Result #4: Not even a shoe—it's a cleaning product
- The "everyday" constraint had zero effect on ranking
The Vocabulary Mismatch Problem
The deepest issue: users and products speak different languages.
| What Users Say | What Products Are Labeled |
|---|---|
| "cozy" | material: wool, fit: relaxed |
| "going out top" | category: blouse, occasion: evening |
| "interview outfit" | style: business, formality: professional |
| "beach vacation dress" | category: sundress, material: cotton |
| "date night look" | style: feminine, occasion: dinner |
| "gym clothes" | category: activewear |
| "something to hide my belly" | fit: loose, silhouette: A-line |
| "makes me look taller" | style: vertical-stripes, heel: high |
This vocabulary mismatch is why 73% of e-commerce searches return zero results or irrelevant results (Baymard Institute, 2023). Users don't know (or care about) your taxonomy. They describe what they want in their own words.
The Classical Architecture's Complete Picture
Here's the full classical pipeline with all its failure points:
User Query: "I need a cozy navy jacket for the office under €200"
↓
┌──────────────────┐
│ NER Extraction │
└────────┬─────────┘
↓
Entities: {color: "navy", category: "jacket"}
❌ LOST: "cozy" (subjective attribute)
❌ LOST: "office" (occasion/context)
❌ LOST: "under €200" (price constraint)
❌ LOST: user's actual intent (warm, professional, affordable)
↓
┌──────────────────┐
│ Elasticsearch │
│ color:navy AND │
│ category:jacket│
└────────┬─────────┘
↓
BM25 scoring on "cozy" and "office":
- "cozy" appears in 0.3% of jackets → high IDF, but rarely in titles
- "office" appears in 2% of jackets → matches "office wear" subcategory inconsistently
↓
Results: 2,847 navy jackets
❌ Includes $800 designer jackets (no price filter)
❌ Includes lightweight blazers (not "cozy")
❌ Includes casual denim jackets (not office-appropriate)
❌ Ranks by BM25 + popularity, not by fit to user intent
User experience: Scroll through pages of irrelevant results,
give up, or manually apply filters that should have been automatic.
Why This Matters: The Business Impact
The failure of classical search has real consequences:
| Metric | Industry Average | Impact |
|---|---|---|
| Search abandonment rate | 68% | Users who search and leave without buying |
| Zero-result rate | 15% | Queries that return nothing |
| First-page relevance | 34% | Users who find what they want on page 1 |
| Search-to-cart conversion | 2.4% | Searches that result in add-to-cart |
For a retailer with 1M monthly searches:
- 680,000 users leave frustrated
- 150,000 see "no results found"
- Only 340,000 find relevant products on page 1
- Only 24,000 add something to cart
Improving search relevance by just 10% can increase revenue by 5-15%. This is why LLM-powered search isn't just technically interesting—it's a business imperative.
The LLM-Powered Alternative
LLM-powered search fundamentally changes this equation. Instead of extracting entities and matching keywords, we build systems that:
- Understand intent — Not just what words appear, but what the user actually wants
- Expand semantically — "Puffer" should find "down jacket," "quilted coat," "padded jacket"
- Leverage metadata intelligently — Use product attributes for filtering, not just keywords
- Support conversation — "Show me something cheaper" should understand context
- Rank holistically — Combine semantic relevance, metadata matching, and quality signals
Here's the architecture we'll build:
User Query: "cozy navy puffer jacket for office under €200"
↓
┌───────────────────────────┐
│ Intent Parsing (LLM) │
│ - Extract constraints │
│ - Generate semantic │
│ query for RAG │
└─────────────┬─────────────┘
↓
ParsedIntent:
colors: [navy, navy blue, midnight blue]
materials: [down, quilted, padded]
occasions: [office, work, business]
price: {max: 200}
semantic_query: "warm comfortable puffer jacket professional style"
↓
┌───────────────────────────┐
│ Hybrid RAG Search │
│ - Vector (semantic) │
│ - BM25 (keyword) │
│ - Metadata filters │
└─────────────┬─────────────┘
↓
┌───────────────────────────┐
│ Multi-Signal Ranking │
│ - Semantic: 35% │
│ - Constraint: 25% │
│ - Quality: 15% │
│ - Metadata match: 25% │
└─────────────┬─────────────┘
↓
Ranked Results: Navy puffer jackets,
office-appropriate, under €200,
sorted by semantic + metadata fit
Building the Query Understanding Pipeline
The foundation of intelligent search is query understanding. We need to transform free-form natural language into structured, actionable constraints.
Intent Classification
First, we classify what type of request we're handling. Intent classification is the routing layer that determines which processing pipeline a query goes through. Get this wrong, and you'll apply product search logic to a comparison query or outfit logic to a simple "show me" request.
Why intent classification is the critical first step: Consider what happens when you misclassify. If a user asks "Show me dresses similar to the one Taylor Swift wore at the Grammys" and you treat it as a simple product search, you'll return dresses matching keywords "Taylor Swift Grammys" rather than understanding this is a visual similarity + celebrity style request that requires image search, trend data, and occasion matching. Similarly, "What's the difference between merino and cashmere?" isn't a product search at all—it's an educational query that should route to your knowledge base, not your product catalog.
The pattern we use: fast keyword matching for common intents, falling back to LLM classification only for ambiguous cases. This two-tier approach keeps p50 latency low (keyword matching is sub-millisecond) while handling edge cases gracefully. In production at Zalando, we found that 85% of queries can be classified with simple rules, and only 15% need the more expensive LLM fallback. This means the average query adds only ~20ms of classification overhead, not the 200-500ms an LLM call would add to every request.
The intent taxonomy matters enormously. We've iterated through several versions. Our current taxonomy has five primary intents, each triggering a different pipeline:
| Intent | Example | Pipeline |
|---|---|---|
| product_search | "navy puffer jacket" | Standard RAG + ranking |
| outfit | "Style me for a beach wedding" | Multi-category search + compatibility |
| comparison | "Nike vs Adidas running shoes" | Side-by-side retrieval + feature extraction |
| details | "What material is this?" | Single-product lookup + attribute extraction |
| help | "What's your return policy?" | FAQ/knowledge base retrieval |
class QueryUnderstandingPipeline:
    """Normalizes, classifies, expands, and decomposes user queries."""

    def __init__(self, rules: RulesRegistry) -> None:
        self.rules = rules
        self.expander = QueryExpansionEngine(rules)
        self.decomposer = SubqueryDecomposer()

    def normalize(self, raw_query: str) -> str:
        return " ".join(raw_query.strip().split()).lower()

    def classify_intent(self, text: str) -> IntentType:
        # Outfit/styling requests need special handling
        if any(k in text for k in ("outfit", "look", "style me")):
            return IntentType.outfit
        # Comparison queries
        if "compare" in text or "versus" in text or "vs" in text:
            return IntentType.comparison
        # Help/policy questions
        if "help" in text or "policy" in text:
            return IntentType.help
        # Default: product search
        return IntentType.product_search
Understanding the code structure: The QueryUnderstandingPipeline class is the entry point for all search queries. Notice how it composes several specialized components: a RulesRegistry for domain-specific knowledge, a QueryExpansionEngine for synonym handling, and a SubqueryDecomposer for breaking complex queries into simpler parts. This composition pattern is intentional—each component can be tested, tuned, and replaced independently.
The normalize method handles the messiness of real user input. Users type queries with inconsistent casing, extra spaces, and Unicode variations. Normalizing to lowercase with single spaces ensures "Navy JACKET" matches the same rules as "navy jacket". In production, you'd also want to handle Unicode normalization (é vs e), remove special characters, and possibly correct common typos.
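A minimal sketch of what that more robust normalization could look like, using only the standard library (the exact character whitelist and typo handling would be tuned to your catalog; this is illustrative, not the production implementation):

import re
import unicodedata

def normalize(raw_query: str) -> str:
    """Lowercase, strip accents, drop stray punctuation, collapse whitespace."""
    # NFKD splits accented characters into base letter + combining mark ("é" -> "e" + mark)
    text = unicodedata.normalize("NFKD", raw_query)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Keep letters, digits, currency symbols, and basic separators (assumed whitelist)
    text = re.sub(r"[^a-z0-9€$£\s\-']", " ", text)
    return " ".join(text.split())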
Why keyword-based classification works well here: The classify_intent method uses simple substring matching rather than ML classification. This seems crude, but it's surprisingly effective for e-commerce. Fashion queries follow predictable patterns: outfit requests almost always contain words like "outfit," "look," or "style me." Comparison queries almost always contain "vs," "versus," or "compare." These patterns are stable across millions of queries, and keyword matching is orders of magnitude faster than running an LLM.
The fallback to product_search is deliberate. When in doubt, assume the user wants products. This is the right default because: (1) it's the most common intent by far (~80% of queries), (2) showing products is never completely wrong—even for edge cases, users can refine, and (3) it's better to show something useful than to route to the wrong specialized pipeline.
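A sketch of the two-tier routing described earlier, with the rule-based classifier in front and an LLM fallback for ambiguous cases. The llm_classify_intent helper and the ambiguity markers are hypothetical, shown only to illustrate the control flow:

AMBIGUOUS_MARKERS = ("something", "anything", "ideas", "recommend", "?")

async def classify_with_fallback(self, text: str) -> IntentType:
    """Rules first; fall back to an LLM only when the query looks ambiguous."""
    intent = self.classify_intent(text)  # sub-millisecond keyword rules
    looks_ambiguous = (
        intent == IntentType.product_search
        and any(marker in text for marker in AMBIGUOUS_MARKERS)
    )
    if not looks_ambiguous:
        return intent
    try:
        # Hypothetical helper: prompts an LLM to pick one of the five intents
        return await self.llm_classify_intent(text)
    except Exception:
        # On timeout or parse failure, the safe default is product search
        return IntentType.product_search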
When to add more intent types: You should add new intents when you see a pattern of queries that your current pipelines handle poorly. We added outfit after seeing users ask "What should I wear to..." and getting single-product results instead of coordinated looks. We added comparison after users asked "X vs Y" and got a mix of both rather than side-by-side analysis. Monitor your search logs for patterns that don't fit existing intents.
Attribute Extraction
Next, we extract structured attributes from natural language. This is where domain-specific rules become critical.
The fundamental challenge of attribute extraction: Users express constraints in countless ways. "Under $200," "less than 200 dollars," "up to two hundred," "budget-friendly," and "not too expensive" all mean roughly the same thing. A user saying "cozy" might mean wool, fleece, or down materials. "Office-appropriate" implies a formality level, color palette, and style that varies by industry and culture. Your attribute extraction must bridge this gap between natural language and structured product metadata.
Why rule-based extraction first: We start with deterministic rules because they're fast, predictable, and debuggable. When a rule extracts "navy" from a query, you know exactly why—it matched a known color. When an LLM extracts attributes, you're never quite sure why it made a particular choice, and that unpredictability causes subtle bugs in production. Rules also let you encode domain expertise that LLMs might not have: in fashion, "nude" is a color (beige/skin-tone), not inappropriate content.
The token-based approach: Notice that _detect_colors and _detect_materials operate on pre-tokenized input. This is intentional. Tokenizing once and passing tokens to multiple detectors is more efficient than re-processing the raw string for each attribute type. The tokenization step (not shown) also handles compound terms: "navy blue" should be one token, not two, so it matches the color synonym rules correctly.
def _detect_colors(self, tokens: list[str]) -> list[str]:
    known_colors = set(self.rules.color_synonyms.keys())
    return [t for t in tokens if t in known_colors]

def _detect_materials(self, tokens: list[str]) -> list[str]:
    known_materials = set(self.rules.materials.keys())
    return [t for t in tokens if t in known_materials]

def _detect_price(self, text: str) -> PriceRange | None:
    under = re.search(r"under (\d+)", text)
    between = re.search(r"between (\d+) and (\d+)", text)
    if between:
        low, high = between.groups()
        return PriceRange(min=float(low), max=float(high))
    if under:
        value = float(under.group(1))
        return PriceRange(max=value)
    return None
Breaking down each extraction method:
The _detect_colors method is deceptively simple—it just checks if tokens exist in a known set. But that simplicity is the point. At query time, you want O(1) lookups, not fuzzy matching. The sophistication lives in the color_synonyms dictionary, which is built offline with careful curation. We maintain ~200 color entries covering standard colors, fashion-specific terms (heather gray, oxblood), and regional variations (aubergine vs eggplant).
The _detect_materials method follows the same pattern. The key insight is that material detection affects both filtering AND semantic understanding. "Cashmere sweater" should filter to materials: cashmere, but it should also boost warmth-related and luxury-related semantic matches. We pass extracted materials to both the filter layer and the embedding enrichment layer.
Price extraction deserves special attention. The _detect_price method handles the most common patterns: "under X" and "between X and Y." But in production, you'll encounter many more: "around $100," "less than 50 bucks," "max 200," "budget," "cheap," "expensive," "luxury," "affordable." We handle these in two ways: (1) explicit patterns with regex for numeric values, (2) semantic mapping for subjective terms. "Budget" maps to PriceRange(max=50) in our system, while "luxury" maps to PriceRange(min=300). These mappings are configurable per category—"budget shoes" means something different than "budget handbags."
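One way to extend the price detector with the extra patterns mentioned above. The subjective mappings ("budget", "luxury") and the ±20% band for "around" are illustrative values, and PriceRange is assumed to be the same model used in the code above:

import re

# Illustrative mappings; in practice these are configured per category
SUBJECTIVE_PRICE = {
    "budget": PriceRange(max=50),
    "cheap": PriceRange(max=50),
    "affordable": PriceRange(max=100),
    "luxury": PriceRange(min=300),
}

def detect_price(text: str) -> PriceRange | None:
    if m := re.search(r"between (\d+) and (\d+)", text):
        return PriceRange(min=float(m.group(1)), max=float(m.group(2)))
    if m := re.search(r"(?:under|less than|max|up to) \$?(\d+)", text):
        return PriceRange(max=float(m.group(1)))
    if m := re.search(r"(?:around|about) \$?(\d+)", text):
        value = float(m.group(1))
        return PriceRange(min=value * 0.8, max=value * 1.2)  # illustrative ±20% band
    for word, price_range in SUBJECTIVE_PRICE.items():
        if word in text:
            return price_range
    return None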
Why return None for missing constraints: Notice that _detect_price returns None when no price constraint is found, rather than a default range. This distinction matters. None means "user didn't specify a price constraint, show all prices." A default range like PriceRange(min=0, max=10000) would behave similarly for filtering but would affect ranking differently—products near the range boundaries might be deprioritized. Be intentional about missing vs. default values.
When rules aren't enough—LLM-based parsing: For queries that rules can't fully parse, we fall back to LLM-based extraction. This handles nuanced constraints like "something my grandmother would approve of" (implies conservative styling, modest coverage) or "first-date outfit" (implies flattering fit, conversation-starter pieces, not too casual or too formal). The LLM can also resolve ambiguity: "light jacket" could mean lightweight fabric OR light color—context and phrasing help disambiguate.
For more complex queries, we use LLM-based parsing with a structured output format:
INTENT_PARSING_PROMPT = """You are an intent parser for a product discovery system.
Given a user query and conversation history, extract:
1. Intent type: search, refine, compare, details, recommendation
2. Product types: jacket, dress, shoes, etc.
3. Colors with synonyms
4. Materials and fabrics
5. Occasions: office, party, casual, formal
6. Price constraints
7. Style attributes: cozy, elegant, modern, classic
8. Size requirements
For 'semantic_query', generate a standalone search query that:
- Resolves pronouns from conversation context
- Incorporates implied constraints from previous turns
- Is optimized for vector similarity search
Respond with valid JSON only."""
Dissecting the LLM parsing prompt: This prompt is carefully structured to guide the LLM toward useful, structured output. Let's break down why each element matters:
The explicit enumeration of extraction targets (1-8) prevents the LLM from inventing categories or missing important ones. Without this structure, you might get "vibe: casual" one time and "mood: relaxed" another time for the same query. Enumeration creates consistency.
The semantic_query field is particularly important. This is where the LLM generates a search-optimized query that resolves pronouns and context. If the user says "Show me more like that but cheaper" in a conversation, the raw query is useless for vector search. The LLM should generate something like "casual cotton sundress under $50 floral pattern" based on conversation history. This resolved query is what actually gets embedded and searched.
Prompt engineering for structured extraction: We specify "Respond with valid JSON only" to ensure parseable output, but this alone isn't sufficient. In production, we also: (1) use JSON mode if the model supports it, (2) validate output against a Pydantic schema, (3) have fallback logic for malformed responses. Even well-prompted LLMs occasionally produce invalid JSON—wrapping responses in markdown code blocks, adding explanatory text, or hallucinating extra fields. Your parsing code must be defensive.
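A sketch of that defensive parsing, assuming a simplified Pydantic model mirroring a few of the prompt's fields (the real schema would cover all eight extraction targets):

import json
from pydantic import BaseModel, ValidationError

class ParsedIntent(BaseModel):
    intent: str = "search"
    product_types: list[str] = []
    colors: list[str] = []
    price_max: float | None = None
    semantic_query: str = ""

def parse_llm_intent(raw_response: str, fallback_query: str) -> ParsedIntent:
    """Validate LLM output against the schema; degrade gracefully on malformed responses."""
    # Strip markdown fences and a leading "json" tag the model sometimes adds
    cleaned = raw_response.strip().strip("`").removeprefix("json").strip()
    try:
        return ParsedIntent.model_validate(json.loads(cleaned))
    except (json.JSONDecodeError, ValidationError):
        # Fall back to treating the raw user query as the semantic query
        return ParsedIntent(semantic_query=fallback_query)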
The cost-latency tradeoff: LLM parsing adds 200-500ms of latency per query and, at meaningful traffic, roughly $1,000-$10,000/day in API spend just for query understanding. This is why we use rules first and LLM parsing only as a fallback. Monitor your fallback rate—if more than 20% of queries need LLM parsing, your rules probably need expansion.
The key insight: LLM parsing gives you semantic understanding that rule-based systems miss, but rule-based systems give you precision and speed. The best production systems use both—rules for the 80% of queries that follow predictable patterns, LLMs for the 20% that require genuine language understanding.
Fashion-Specific Synonym Rules
Fashion has a rich vocabulary where the same item has many names. A customer might search for:
- "Puffer jacket" / "Down jacket" / "Quilted coat" / "Padded jacket"
- "Navy" / "Navy blue" / "Midnight blue" / "Dark blue"
- "Office wear" / "Business casual" / "Work appropriate" / "Smart casual"
Why synonym handling is essential for e-commerce search: Consider what happens without synonyms. A user searches "puffer jacket" but your catalog uses "quilted down coat" in product titles. Without synonym expansion, vector search might find it (if the embedding captures the semantic similarity), but keyword/BM25 search will miss it entirely. Since hybrid search relies on both signals, missing the keyword match weakens your retrieval significantly.
The asymmetric synonym problem: Not all synonyms are bidirectional. "Sneakers" and "trainers" are true synonyms—searching for either should find the same products. But "blazer" and "jacket" are asymmetric—all blazers are jackets, but not all jackets are blazers. A search for "blazer" should include blazers but not necessarily all jackets. A search for "jacket" should include blazers. We handle this with directed synonym graphs rather than simple bidirectional mappings.
Regional and demographic variations: Fashion vocabulary varies dramatically by geography. British shoppers search for "trainers" and "jumpers"; Americans search for "sneakers" and "sweaters." Gen Z might search "fit check" while millennials search "outfit." Your synonym mappings should reflect your user base. We maintain region-specific synonym extensions that get applied based on user locale or detected language patterns.
We maintain curated synonym mappings for domain-specific expansion. These are stored as JSON files that can be updated without code deploys:
// color_synonyms.json
{
  "navy": ["navy blue", "midnight blue"],
  "burgundy": ["oxblood", "wine red"],
  "camel": ["tan", "beige"],
  "off white": ["cream", "ivory", "ecru"],
  "khaki": ["olive", "army green"]
}

// materials.json
{
  "denim": ["jean", "cotton twill"],
  "wool": ["merino", "cashmere", "lambswool"],
  "leather": ["faux leather", "vegan leather", "suede"],
  "cotton": ["organic cotton", "pima cotton"],
  "down": ["feather fill", "puffer fill"]
}

// occasions.json
{
  "office": ["work", "business", "smart"],
  "smart casual": ["elevated casual", "dressed up casual"],
  "party": ["evening", "night out"],
  "outdoor": ["hiking", "trail", "weatherproof"],
  "sport": ["training", "gym", "athletic"]
}
How to build and maintain these synonym dictionaries: Start by analyzing your search logs. Look for queries with zero results—these often contain terms not in your product catalog. Group these by semantic similarity and you'll find clusters: "trainers," "tennis shoes," "athletic shoes," "running shoes" all seeking the same products. Also analyze successful searches: when users find what they want, what terms did they use vs. what terms are in the product data?
The JSON file approach is intentional. Storing synonyms in configuration files rather than code means: (1) non-engineers (merchandisers, domain experts) can update them, (2) changes don't require code review and deployment, (3) you can A/B test different synonym sets, (4) you can have environment-specific synonyms (staging might have experimental mappings). We load these files at startup and cache them in memory—the performance impact is negligible.
Handling evolving fashion vocabulary: Fashion vocabulary changes constantly. "Cottagecore," "coastal grandmother," and "quiet luxury" weren't search terms three years ago. We have a monthly process where we: (1) extract new terms from search logs that show high frequency but low match rate, (2) have fashion experts categorize them into existing or new semantic groups, (3) update the synonym files, (4) measure impact on search success metrics. This keeps our synonym coverage fresh.
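A sketch of step (1) of that monthly process, assuming a log of (term, result_count) pairs extracted from search analytics; the thresholds are illustrative:

from collections import Counter

def candidate_synonym_terms(
    log: list[tuple[str, int]],      # (search term, number of results it returned)
    min_frequency: int = 100,
    max_match_rate: float = 0.2,
) -> list[str]:
    """Terms users search for often but that rarely match the catalog."""
    frequency: Counter[str] = Counter()
    matched: Counter[str] = Counter()
    for term, result_count in log:
        frequency[term] += 1
        if result_count > 0:
            matched[term] += 1
    return [
        term
        for term, freq in frequency.items()
        if freq >= min_frequency and matched[term] / freq <= max_match_rate
    ]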
The expansion engine applies these rules to enrich queries. The implementation is straightforward but handles several edge cases:
from typing import Iterable

class QueryExpansionEngine:
    """Apply rule-based expansions for fashion-specific
    synonym and attribute enrichment."""

    def __init__(self, rules: RulesRegistry) -> None:
        self.rules = rules

    def expand(self, terms: Iterable[str]) -> list[str]:
        expanded: list[str] = []
        for term in terms:
            cleaned = term.strip().lower()
            if not cleaned:
                continue
            # Get all synonyms for this term
            expanded.extend(self.rules.expand_term(cleaned))
        # Deduplicate while preserving order
        seen = set()
        result = []
        for t in expanded:
            if t not in seen:
                seen.add(t)
                result.append(t)
        return result

    def expand_query(self, query: str) -> list[str]:
        tokens = self.tokenize(query)
        return self.expand(tokens)
Understanding the expansion code in detail:
The expand method iterates through input terms and expands each through the rules registry. The expand_term method in RulesRegistry (not shown) checks each synonym dictionary in priority order: colors first, then materials, then occasions. This ordering matters because some terms could match multiple dictionaries, and you want predictable behavior.
Deduplication with order preservation is critical. The expansion might produce duplicates: "navy blue jacket" could expand "navy" to include "navy blue" (already present) and "jacket" might already be in the original terms. The seen set prevents duplicates, but we use a list for result rather than a set to preserve order. Why? Because the order affects BM25 scoring—terms appearing earlier in the expanded query get slightly higher weight, and we want original terms to rank above synonyms.
The expand_query method handles the full query-to-expansion pipeline. Tokenization happens here, not in expand, because tokenization is query-specific (handling multi-word tokens like "navy blue") while expansion is term-specific. This separation of concerns makes testing easier—you can unit test expansion with pre-tokenized input and integration test the full query flow separately.
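A sketch of that tokenize step, greedily merging known multi-word terms so "navy blue" survives as a single token. The known_phrases attribute on the rules registry is an assumption made for illustration:

def tokenize(self, query: str, max_phrase_len: int = 3) -> list[str]:
    """Greedy longest-match tokenization over known multi-word terms."""
    words = query.lower().split()
    known_phrases = self.rules.known_phrases  # e.g. {"navy blue", "rose gold"} (assumed)
    tokens: list[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones
        for length in range(min(max_phrase_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i : i + length])
            if candidate in known_phrases:
                tokens.append(candidate)
                i += length
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens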
Result: "navy puffer jacket" expands to:
["navy", "navy blue", "midnight blue", "puffer", "down jacket",
"quilted coat", "padded jacket", "jacket"]
Measuring expansion effectiveness: Track two metrics: (1) expansion ratio (expanded terms / original terms) and (2) search success rate by expansion ratio. Too little expansion misses relevant products. Too much expansion dilutes relevance signals and slows queries. We found the sweet spot is 2-4x expansion for most queries. Queries with expansion ratios above 6x often indicate overly broad synonyms that should be tightened.
This expansion dramatically improves recall without sacrificing precision—the synonyms are curated to be semantically equivalent in the fashion domain. In our testing, synonym expansion improved recall@20 from 72% to 89% while only dropping precision@20 from 68% to 65%—a worthwhile tradeoff.
Product Metadata Schema Design
The foundation of effective LLM-powered search is rich, structured metadata. While classical search treats products as bags of keywords, intelligent search treats them as structured entities with typed attributes.
The Complete Product Schema
Here's a production-ready schema for fashion products. Every field exists because it serves a specific purpose in search, filtering, or ranking. The schema is intentionally redundant in places—primary_color and colors both exist because some queries need exact matching ("show me navy") while others need broader inclusion ("blue tones").
Think of this schema as a "search contract": if a field isn't in the schema, it can't be filtered or ranked on. Missing metadata means missed results.
from pydantic import BaseModel
from typing import Optional
from enum import Enum


class PriceLevel(Enum):
    BUDGET = 1    # Under €50
    MODERATE = 2  # €50-150
    PREMIUM = 3   # €150-300
    LUXURY = 4    # €300+


class Product(BaseModel):
    """Complete product schema for LLM-powered search."""

    # === Core Identifiers ===
    id: str
    sku: str
    name: str
    brand: str

    # === Pricing ===
    price: float
    currency: str = "EUR"
    price_level: PriceLevel
    original_price: Optional[float] = None  # For sale items

    # === Categorization ===
    category: str                        # "jackets", "dresses", "footwear"
    subcategory: Optional[str] = None    # "puffer jackets", "midi dresses"
    product_type: str                    # "outerwear", "tops", "bottoms"

    # === Visual Attributes ===
    colors: list[str]                    # ["navy", "navy blue"]
    primary_color: str                   # "navy"
    pattern: Optional[str] = None        # "solid", "striped", "floral"

    # === Material & Construction ===
    materials: list[str]                 # ["down", "polyester", "nylon"]
    primary_material: str                # "down"
    fill_type: Optional[str] = None      # "down", "synthetic", "wool"
    lining: Optional[str] = None         # "fleece", "silk", "polyester"

    # === Fit & Sizing ===
    fit: str                             # "regular", "slim", "oversized"
    size_range: list[str]                # ["XS", "S", "M", "L", "XL"]
    available_sizes: list[str]           # Currently in stock
    length: Optional[str] = None         # "cropped", "regular", "long"

    # === Style & Occasion ===
    style_tags: list[str]                # ["casual", "streetwear", "minimalist"]
    occasions: list[str]                 # ["office", "casual", "outdoor"]
    seasons: list[str]                   # ["fall", "winter"]
    formality: str                       # "casual", "smart_casual", "formal"

    # === Features & Benefits ===
    features: list[str]                  # ["water-resistant", "packable", "hood"]
    warmth_rating: Optional[int] = None  # 1-5 scale
    care_instructions: list[str]         # ["machine_wash", "dry_clean"]

    # === Target Demographics ===
    gender: str                          # "women", "men", "unisex"
    age_group: Optional[str] = None      # "adult", "teen", "kids"

    # === Quality Signals ===
    rating: Optional[float] = None       # 1-5 scale
    review_count: Optional[int] = None
    bestseller: bool = False
    new_arrival: bool = False

    # === Content ===
    description: str                     # Full product description
    short_description: str               # 1-2 sentence summary

    # === Inventory ===
    in_stock: bool = True
    stock_level: Optional[str] = None    # "high", "medium", "low"

    # === SEO & Search ===
    search_keywords: list[str]           # Additional searchable terms
    similar_to: list[str]                # Related product IDs
Why This Schema Matters
Each field serves a purpose in the search pipeline:
| Field Type | Used For | Example |
|---|---|---|
| Visual (colors, pattern) | Direct filter matching | "navy jacket" → colors: ["navy"] |
| Material | Semantic inference | "cozy" → materials: ["wool", "down"] |
| Occasion | Intent matching | "for work" → occasions: ["office"] |
| Style tags | Semantic expansion | "minimalist" → style similarity |
| Features | Constraint filtering | "waterproof" → features: ["water-resistant"] |
| Quality | Ranking signals | rating: 4.5, review_count: 200 |
| Warmth rating | Semantic inference | "cozy" → warmth_rating >= 4 |
Metadata Enrichment with LLMs
Raw product data often lacks rich attributes. Use LLMs to enrich metadata at ingestion time. This is one of the highest-ROI applications of LLMs in e-commerce—you pay once at ingestion, but the enriched data improves every search.
The metadata gap problem: Most product feeds come from vendors, manufacturers, or legacy systems. They contain basics: name, price, category, maybe a description. But they rarely contain the semantic attributes that matter for intelligent search. A product feed might say "Blue Jacket" but not capture that it's "preppy," "office-appropriate," or "perfect for fall layering." Without these attributes, semantic queries like "something for my first day at work" have nothing to match against.
Why enrichment happens at ingestion, not query time: You could theoretically call an LLM at search time to understand what attributes a product has. But with a 100k product catalog and 200ms per LLM call, that's 20,000 seconds (5.5 hours) per search. Instead, we enrich once when products enter the catalog. The enriched attributes are stored alongside the product and indexed for search. Now every search benefits from semantic understanding without any query-time LLM cost.
The enrichment economics: For a 100k product catalog, enrichment costs roughly 100k products × ~500 tokens/product ≈ 50M tokens, which works out to on the order of $50 total at typical small-model pricing. That's a one-time cost that dramatically improves search quality for millions of subsequent queries. Even re-running enrichment monthly (for seasonality, trend updates) is trivially cheap compared to query-time LLM calls.
ENRICHMENT_PROMPT = """Analyze this product and extract structured attributes.
Product: {name}
Description: {description}
Category: {category}
Extract:
1. style_tags: List of style descriptors (casual, elegant, minimalist, streetwear, etc.)
2. occasions: When would someone wear this? (office, party, casual, date_night, outdoor)
3. warmth_rating: 1-5 scale (1=light summer, 5=heavy winter)
4. formality: casual, smart_casual, business_casual, formal
5. features: List any notable features (water-resistant, packable, breathable, etc.)
6. search_keywords: Additional terms customers might use to find this
Return JSON only."""
async def enrich_product_metadata(
    product: Product,
    llm_client: LLMClient
) -> Product:
    """Enrich product with LLM-extracted attributes."""
    response = await llm_client.generate_json(
        prompt=ENRICHMENT_PROMPT.format(
            name=product.name,
            description=product.description,
            category=product.category
        ),
        temperature=0.1
    )

    # Merge enriched attributes
    product.style_tags = response.get("style_tags", [])
    product.occasions = response.get("occasions", [])
    product.warmth_rating = response.get("warmth_rating")
    product.formality = response.get("formality", "casual")
    product.features.extend(response.get("features", []))
    product.search_keywords.extend(response.get("search_keywords", []))
    return product
Understanding the enrichment code flow:
The enrich_product_metadata function takes a product with basic attributes and returns the same product with rich semantic attributes filled in. Notice that temperature=0.1 is used rather than 0—we want consistency across products, but a tiny amount of variation prevents the LLM from falling into repetitive patterns when processing similar products.
The merge strategy matters. We use .extend() for list fields like features and search_keywords rather than assignment. This preserves any features that were already in the product data while adding LLM-extracted ones. For example, if the product feed included "water-resistant" and the LLM extracts "packable," we want both. For single-value fields like formality, we use .get() with a default, preferring LLM output over missing data but not overwriting existing non-null values.
Handling enrichment failures gracefully: In production, LLM calls sometimes fail: rate limits, timeouts, malformed responses. Your enrichment pipeline should: (1) retry with exponential backoff, (2) log failures for manual review, (3) mark products as "partially enriched" so they're still searchable but flagged for re-processing. We use a dead-letter queue for products that fail enrichment 3 times—these often reveal edge cases in our prompt.
Quality assurance for enriched data: LLMs can hallucinate attributes. We've seen products enriched with "waterproof" when the description says "water-resistant" (different performance standard), or "wool" when the material is actually "wool blend." We run automated QA that: (1) flags enriched attributes that contradict source data, (2) samples 1% of enrichments for human review, (3) tracks enrichment consistency (same product enriched twice should produce similar results).
Generating Embedding Text for Semantic Search
Vector search quality depends entirely on what text you embed. Embedding just the product name misses 90% of searchable information. We need to generate rich, comprehensive text that captures all searchable attributes.
The embedding text problem: Your embedding model doesn't know your product schema. It just sees text. If you embed {"name": "Alpine Jacket", "color": "navy"} as JSON, the embedding captures that this is structured data about a navy jacket. But if a user searches "blue coat for mountains," the embedding model might not connect "navy" to "blue" or "jacket" to "coat" or "Alpine" to "mountains." By generating natural language text that includes synonyms and context, we bridge this gap.
The Embedding Text Strategy
def generate_embedding_text(product: Product) -> str:
    """
    Generate rich text for embedding that captures all searchable attributes.

    Strategy:
    - Include all attributes that users might search for
    - Repeat important terms for emphasis
    - Use natural language that matches how users search
    - Include synonyms and related terms
    """
    parts = []

    # Product identity (high weight - repeat)
    parts.append(f"{product.name}")
    parts.append(f"{product.brand} {product.subcategory or product.category}")

    # Visual attributes
    color_str = ", ".join(product.colors)
    parts.append(f"Color: {color_str}")
    if product.pattern and product.pattern != "solid":
        parts.append(f"Pattern: {product.pattern}")

    # Material (important for semantic queries like "cozy")
    material_str = ", ".join(product.materials)
    parts.append(f"Material: {material_str}")
    parts.append(f"Made from {product.primary_material}")

    # Style and occasion (critical for intent matching)
    if product.style_tags:
        parts.append(f"Style: {', '.join(product.style_tags)}")
    if product.occasions:
        parts.append(f"Perfect for: {', '.join(product.occasions)}")

    # Warmth/comfort (for "cozy", "warm" queries)
    if product.warmth_rating:
        warmth_desc = {
            1: "lightweight summer",
            2: "light layering piece",
            3: "moderate warmth",
            4: "warm winter",
            5: "heavy winter extreme warmth"
        }
        parts.append(warmth_desc.get(product.warmth_rating, ""))

    # Features
    if product.features:
        parts.append(f"Features: {', '.join(product.features)}")

    # Fit
    parts.append(f"{product.fit} fit")
    if product.length:
        parts.append(f"{product.length} length")

    # Formality
    formality_desc = {
        "casual": "casual everyday wear",
        "smart_casual": "smart casual dressed up",
        "business_casual": "office appropriate professional",
        "formal": "formal dressy elegant"
    }
    parts.append(formality_desc.get(product.formality, ""))

    # Season
    if product.seasons:
        parts.append(f"Seasons: {', '.join(product.seasons)}")

    # Short description for context
    parts.append(product.short_description)

    # Additional search keywords
    if product.search_keywords:
        parts.append(" ".join(product.search_keywords))

    # Quality signals (for ranking context)
    if product.bestseller:
        parts.append("bestseller popular")
    if product.new_arrival:
        parts.append("new arrival latest")

    return " ".join(filter(None, parts))
Breaking down the embedding text generation strategy:
The function builds text in a deliberate order. Product identity comes first (name and brand) because embedding models give more weight to early tokens. When a user searches "North Peak jacket," having "North Peak" at the start of our embedding text improves the match.
The warmth_rating translation is crucial. We don't embed warmth_rating: 4. The embedding model has no idea what that means. Instead, we translate it to natural language: "warm winter." Now when a user searches "warm winter jacket," the embedding can make the semantic connection. This pattern—translating numeric or categorical attributes to natural language—is one of the most important techniques for bridging structured data and semantic search.
Notice the redundancy in color and material handling. We include both colors: ["navy", "navy blue"] and primary_material: down. This redundancy is intentional. The list form captures all variations; the primary form emphasizes the dominant attribute. Both contribute to the embedding, and the redundancy improves recall for different query phrasings.
The filter(None, parts) at the end removes empty strings. When a product doesn't have a pattern (it's solid), we don't want "Pattern: None" in the embedding text. The conditional checks and .get() calls throughout produce empty strings for missing attributes, which filter(None, ...) removes.
Quality signals at the end: "Bestseller popular" and "new arrival latest" are included to capture queries like "popular jackets" or "latest styles." These aren't product attributes per se—they're merchandising signals. Including them in the embedding means semantic search captures user intent around popularity and newness.
Example: Before vs After
Raw product data:
{
  "name": "Alpine Down Puffer Jacket",
  "brand": "North Peak",
  "category": "jackets",
  "colors": ["navy"],
  "price": 189.99
}
Generated embedding text:
Alpine Down Puffer Jacket North Peak puffer jackets Color: navy, navy blue
Material: down, polyester, nylon Made from down Style: casual, outdoor,
streetwear Perfect for: outdoor, casual, weekend warm winter Features:
water-resistant, packable, hood regular fit Seasons: fall, winter
Lightweight yet warm down puffer jacket perfect for cold weather adventures
puffer down jacket quilted padded winter coat bestseller popular
This rich embedding text means that queries like:
- "cozy winter jacket" → matches "warm winter"
- "something for hiking" → matches "outdoor, adventures"
- "navy puffer" → matches "navy, puffer"
- "waterproof coat" → matches "water-resistant"
Embedding Text Variants
For different search scenarios, generate multiple embedding variants:
def generate_embedding_variants(product: Product) -> dict[str, str]:
    """Generate specialized embeddings for different search types."""
    return {
        # Full comprehensive embedding
        "full": generate_embedding_text(product),

        # Style-focused (for "show me something elegant")
        "style": f"{product.name} {' '.join(product.style_tags)} "
                 f"{product.formality} {' '.join(product.occasions)}",

        # Visual-focused (for "blue floral dress")
        "visual": f"{product.name} {' '.join(product.colors)} "
                  f"{product.pattern or 'solid'} {product.category}",

        # Practical-focused (for "waterproof hiking jacket")
        "practical": f"{product.name} {' '.join(product.features)} "
                     f"{' '.join(product.materials)} {' '.join(product.seasons)}"
    }
Hybrid RAG: The Best of Both Worlds
Pure vector search captures semantic similarity but misses exact matches. Pure keyword search (BM25) captures exact terms but misses paraphrases. Hybrid search combines both.
Understanding when each method excels and fails: Vector search shines when the user's words differ from the product's words but mean the same thing. "Comfortable shoes for walking all day" and "ergonomic footwear with arch support" are semantically similar, and good embeddings capture this. But vector search can fail spectacularly on exact matches. A user searching "SKU-12345" or "Nike Air Max 90" expects exact matches—vector search might return semantically similar products that aren't what the user wants.
BM25 keyword search does the opposite. It excels at exact matches: brand names, product codes, specific product names. If your catalog has "Nike Air Max 90" and the user searches those exact words, BM25 will rank it #1 with high confidence. But BM25 fails on paraphrases. "Comfortable walking shoes" and "ergonomic footwear" share zero words—BM25 scores this as no match.
Why hybrid outperforms either alone: In our testing across 10,000 queries:
- Vector-only achieved 76% recall@10, 58% precision@10
- BM25-only achieved 69% recall@10, 62% precision@10
- Hybrid achieved 91% recall@10, 78% precision@10
The improvement isn't additive—it's synergistic. Hybrid catches queries that would fail on either method alone while reinforcing queries where both methods agree.
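The fusion step itself can be quite simple. One common approach is Reciprocal Rank Fusion (RRF), which merges the two ranked candidate lists without needing to normalize their incompatible score scales; a minimal sketch (the weighted multi-signal ranking shown earlier in the architecture diagram is a separate, later stage):

def reciprocal_rank_fusion(
    vector_ids: list[str],
    bm25_ids: list[str],
    k: int = 60,
) -> list[str]:
    """Merge two ranked lists of product IDs; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            # A document found by both retrievers accumulates score from each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)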
The BM25 Component
BM25 (Best Matching 25) is the gold standard for keyword retrieval. It's what Elasticsearch uses under the hood. Understanding how it works helps you tune it effectively:
class BM25Index:
    """
    BM25 index for keyword search.

    Implements Okapi BM25 algorithm for term-based retrieval.
    """

    # BM25 parameters
    K1 = 1.5   # Term frequency saturation
    B = 0.75   # Length normalization

    def _score_document(self, doc_id: str, query_terms: list[str]) -> tuple[float, list[str]]:
        """Calculate BM25 score for a document."""
        score = 0.0
        matched_terms = []
        doc_length = self._doc_lengths.get(doc_id, 0)

        for term in query_terms:
            if term not in self._inverted_index:
                continue
            if doc_id not in self._inverted_index[term]:
                continue

            matched_terms.append(term)

            # Term frequency in document
            tf = self._inverted_index[term][doc_id]
            # Inverse document frequency
            idf = self._idf(term)

            # BM25 score component
            numerator = tf * (self.K1 + 1)
            denominator = tf + self.K1 * (
                1 - self.B + self.B * doc_length / self._avg_doc_length
            )
            score += idf * numerator / denominator

        return score, matched_terms
Understanding the BM25 formula in plain English:
BM25 balances two factors: how often a term appears in a document (term frequency) and how rare that term is across all documents (inverse document frequency). The intuition: if "jacket" appears in 80% of your products, it's not very discriminating. But if "waterproof" appears in only 5% of products, it's highly discriminating and should contribute more to the score.
The K1 parameter (default 1.5) controls term frequency saturation. With K1=1.5, a term appearing 10 times doesn't score much higher than one appearing 5 times. This prevents keyword stuffing from dominating results. Lower K1 values saturate faster; higher values let term frequency contribute more linearly.
The B parameter (default 0.75) controls length normalization. A 1000-word product description naturally contains more keyword matches than a 50-word one. B=0.75 normalizes for this, so longer documents don't automatically score higher. B=0 disables length normalization; B=1 fully normalizes. For e-commerce, 0.75 works well because product descriptions have moderate length variation.
The inverted index is the core data structure. For each term, it stores which documents contain that term and how many times. This enables O(query_length) lookups rather than O(catalog_size). Building the index is O(catalog_size), but you do it once at ingestion.
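The _score_document method above assumes the index structures (_inverted_index, _doc_lengths, _avg_doc_length) and an _idf helper already exist. A minimal sketch of how those could be built at ingestion time and what a standard Okapi-style IDF looks like, under the same assumptions:

import math
from collections import defaultdict

def build(self, documents: dict[str, list[str]]) -> None:
    """Index pre-tokenized documents: doc_id -> list of tokens."""
    self._inverted_index: dict[str, dict[str, int]] = defaultdict(dict)
    self._doc_lengths: dict[str, int] = {}
    for doc_id, tokens in documents.items():
        self._doc_lengths[doc_id] = len(tokens)
        for token in tokens:
            # term -> {doc_id: term frequency}
            self._inverted_index[token][doc_id] = self._inverted_index[token].get(doc_id, 0) + 1
    self._total_docs = len(documents)
    self._avg_doc_length = sum(self._doc_lengths.values()) / max(self._total_docs, 1)

def _idf(self, term: str) -> float:
    """Okapi BM25 IDF with the +1 inside the log to keep scores non-negative."""
    df = len(self._inverted_index.get(term, {}))
    return math.log((self._total_docs - df + 0.5) / (df + 0.5) + 1)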
BM25 excels at:
- Exact name matches ("The North Face Puffer")
- Specific attributes ("outdoor seating", "vegan leather")
- Brand names and model numbers
- SKU lookups
The Vector Search Component
Vector search uses embeddings to capture semantic similarity. While the code looks simple, understanding what happens under the hood is crucial for effective implementation.
What embeddings actually represent: An embedding is a dense vector (typically 384-3072 floating-point numbers) that represents text in a high-dimensional space where semantic similarity corresponds to geometric proximity. When you embed "cozy winter jacket" and "warm comfortable coat," these vectors end up close together in the embedding space—even though they share few words. This is the magic that makes semantic search work.
The embedding model choice matters enormously. Different embedding models capture different aspects of similarity:
| Model | Dimensions | Strengths | Weaknesses | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best overall quality, great for nuanced queries | Expensive, API dependency | $0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | Good balance of quality and cost | Less nuanced than large | $0.02/1M tokens |
| Cohere embed-v3 | 1024 | Excellent multilingual, good for fashion | Requires separate API | $0.10/1M tokens |
| BGE-large | 1024 | Open source, self-hostable | Requires GPU infrastructure | Free (infra costs) |
| E5-large-v2 | 1024 | Strong on retrieval benchmarks | Less tested in production | Free (infra costs) |
For fashion e-commerce, we use OpenAI's text-embedding-3-small as the default. It handles fashion vocabulary well, supports 100+ languages (important for international e-commerce), and the cost is manageable even at scale. For high-value queries (logged-in users, high cart values), we upgrade to the large model.
The asymmetry problem in semantic search: Query embeddings and document embeddings exist in the same vector space, but they represent fundamentally different things. A query like "something warm for skiing" is short, intent-focused, and possibly ambiguous. A product embedding for "Alpine Pro Down Ski Jacket - 800 Fill Power" is longer, feature-focused, and specific. These asymmetries can cause mismatches. Some embedding models (like E5) are trained specifically for asymmetric retrieval; others (like OpenAI's) handle both reasonably well.
Why we embed enriched text, not raw product data: The vector_store.search call searches against pre-computed product embeddings. But what text did we embed for each product? If we just embedded the product name ("Alpine Jacket"), we'd miss queries about warmth, materials, or occasions. That's why the earlier section on "Generating Embedding Text" is so important—we create rich, descriptive text that captures all searchable attributes, then embed that.
# Using OpenAI embeddings (1536 dimensions)
async def vector_search(
    self,
    query: str,
    top_k: int = 50,
    filters: Optional[dict] = None,
) -> list[SearchResult]:
    """Semantic search using embeddings."""
    # Generate query embedding
    query_embedding = await self.embedding_provider.embed(query)

    # Search vector store
    results = await self.vector_store.search(
        query_embedding=query_embedding,
        top_k=top_k,
        filters=filters,
    )
    return results
Understanding the code flow:
The vector_search method does three things: (1) converts the query string to a vector, (2) finds the nearest product vectors, and (3) applies metadata filters. Let's examine each step:
Step 1 - Query embedding: The embedding_provider.embed(query) call sends the query to the embedding API and receives back a vector. This takes 20-100ms depending on the provider and network latency. In production, we cache query embeddings aggressively—the same query should return the same embedding.
Step 2 - Nearest neighbor search: The vector_store.search performs Approximate Nearest Neighbor (ANN) search. It doesn't compare against every product (that would be O(n) for n products). Instead, it uses algorithms like HNSW (Hierarchical Navigable Small World) to find approximate nearest neighbors in O(log n) time. The "approximate" part means it might miss the true #47 closest product, but it will find all the top ~20 reliably.
Step 3 - Metadata filtering: The filters parameter applies hard constraints before or after vector search (depending on the vector store). Pre-filtering is more efficient (smaller search space) but can miss edge cases. Post-filtering guarantees constraint satisfaction but wastes computation on filtered-out results. Qdrant and Pinecone support efficient pre-filtering; ChromaDB post-filters.
The top_k parameter and recall: We request top_k=50 candidates even though users only see 10-20 results. Why? Because vector search alone isn't perfect. Some of those 50 candidates will be filtered out by metadata constraints. Others will be reranked lower by our multi-signal scoring. Requesting more candidates improves recall at the cost of more ranking computation.
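As noted in step 1, caching query embeddings avoids repeated API calls for popular queries. A minimal in-process sketch of such a cache; production systems would more likely back this with Redis, and the wrapped provider is assumed to expose the same embed interface used above:

import hashlib

class CachingEmbeddingProvider:
    """Wraps an embedding provider with an in-memory cache keyed by normalized query text."""

    def __init__(self, provider, max_entries: int = 50_000) -> None:
        self._provider = provider
        self._cache: dict[str, list[float]] = {}
        self._max_entries = max_entries

    async def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key in self._cache:
            return self._cache[key]
        vector = await self._provider.embed(text)
        if len(self._cache) < self._max_entries:
            self._cache[key] = vector
        return vector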
Vector search excels at:
- Descriptive queries ("cozy winter jacket") — captures semantic meaning beyond keywords
- Conceptual matching ("something for a job interview") — understands intent even without specific product terms
- Handling typos and variations ("jakcet" still matches "jacket" because misspellings embed similarly)
- Cross-language matching — "veste d'hiver" (French) embeds near "winter jacket" (English)
- Synonym handling — "sneakers," "trainers," and "athletic shoes" embed similarly
Vector search struggles with:
- Exact matches — searching for "SKU-12345" might return semantically similar products instead of the exact SKU
- Negation — "jacket NOT leather" is hard for embeddings to represent
- Rare/new terms — product names or brands not in the embedding model's training data
- Precise attributes — "exactly 4 pockets" requires structured filtering, not semantic similarity
Choosing a Vector Store
For production e-commerce search, your vector store choice matters. This isn't just a database decision—it affects latency, cost, scalability, and feature availability.
The vector store landscape in 2025: The market has matured significantly. Two years ago, you had limited options. Now there are dozens of vector databases, each with different trade-offs. The key differentiators are:
- Managed vs. self-hosted: Managed services (Pinecone, Weaviate Cloud) handle infrastructure but cost more and add network latency. Self-hosted options (Qdrant, Milvus) give you control but require DevOps expertise.
- Index algorithm: Most use HNSW (Hierarchical Navigable Small World) for approximate nearest neighbors. Some offer IVF (Inverted File Index) for different trade-offs. HNSW is generally better for e-commerce workloads.
- Filtering strategy: Pre-filtering (filter before search) is efficient but can miss edge cases. Post-filtering (filter after search) is accurate but wasteful. Hybrid approaches try to balance both.
- Multi-tenancy: If you serve multiple brands/regions from one index, you need efficient tenant isolation. Not all vector stores handle this well.
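To make the filtering trade-off concrete, here is a hedged sketch of both strategies against a generic store interface; the store object, its search signature, and the filter format are placeholders rather than any specific client API.
# Post-filtering: retrieve by similarity first, then drop violators in application code.
# Store-agnostic and simple, but wastes retrieval slots on products you discard.
def post_filter_search(store, query_vector, max_price: float, top_k: int = 50):
    candidates = store.search(query_vector, top_k=top_k * 3)  # over-fetch to compensate
    kept = [c for c in candidates if c.metadata["price"] <= max_price]
    return kept[:top_k]

# Pre-filtering: push the constraint into the ANN search itself so only matching
# products are ever scored. Requires store support (e.g., Qdrant, Pinecone, Weaviate).
def pre_filter_search(store, query_vector, max_price: float, top_k: int = 50):
    return store.search(
        query_vector,
        top_k=top_k,
        filters={"price": {"lte": max_price}},  # exact filter syntax varies by store
    )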
| Vector Store | Best For | Latency | Cost | Scale | Filtering | Multi-tenant |
|---|---|---|---|---|---|---|
| ChromaDB | Prototyping, small catalogs (<100K) | ~10ms | Free (local) | Limited | Post-filter | No |
| Pinecone | Production, managed service | ~20ms | $$$ | Unlimited | Pre-filter | Yes |
| Weaviate | Hybrid search built-in, self-hosted | ~15ms | $$ | High | Pre-filter | Yes |
| Qdrant | Performance-critical, self-hosted | ~5ms | $ | High | Pre-filter | Yes |
| pgvector | Existing Postgres stack | ~30ms | $ | Medium | SQL WHERE | No |
| Elasticsearch | Existing ES infrastructure | ~25ms | $$ | High | Pre-filter | Yes |
| Milvus | Very large scale (1B+ vectors) | ~10ms | $$ | Very High | Pre-filter | Yes |
Deep dive on each option:
ChromaDB is perfect for getting started. Install with pip install chromadb, no configuration needed. But it's single-node only, stores everything in memory, and becomes slow beyond 100K vectors. Use it for prototyping, not production.
Pinecone is the "AWS of vector search"—fully managed, scales infinitely, but expensive ($70/month minimum, scaling with usage). The big advantage is zero operational burden. The disadvantage is vendor lock-in and latency (network hop to their cloud). Good for startups that want to move fast and have funding.
Weaviate offers built-in hybrid search (BM25 + vectors in one query), which is exactly what we need for e-commerce. It can run locally or in their cloud. The module system (vectorizers, generators) is powerful but adds complexity. Good if you want hybrid search without building it yourself.
Qdrant is our recommendation for production fashion search. Written in Rust, it's extremely fast (<5ms p99 for 1M vectors). Efficient filtering, good multi-tenancy, active development. The team is responsive and the documentation is excellent. Self-hosting requires Kubernetes expertise but their Helm charts are solid.
pgvector makes sense if you're already running Postgres and want to minimize infrastructure. Performance is acceptable for <500K vectors but degrades beyond that. The advantage is transactional consistency with your product data—updates are ACID-compliant. The disadvantage is that Postgres wasn't designed for vector search, so advanced features are limited.
Elasticsearch makes sense if you already have ES infrastructure. Dense vector search was added in version 8.0 and has improved significantly. You get BM25 and vector search in one system. The disadvantage is that ES is resource-hungry and complex to operate at scale.
Our recommendation for fashion e-commerce with 100K-10M products:
- Starting (MVP, <$1K/month): ChromaDB locally or Pinecone Starter
- Scaling (100K+ products): Qdrant on Kubernetes or Weaviate Cloud
- Enterprise (existing infra): Elasticsearch with dense vectors if already using ES; otherwise Qdrant
Index configuration matters: The code below shows Qdrant configuration. The key parameters are:
- size=1536: Must match your embedding model's output dimension exactly
- distance=Distance.COSINE: Cosine similarity is standard for text embeddings. Dot product is faster but requires normalized vectors.
- indexing_threshold=10000: Builds the HNSW index after 10K documents. Below this threshold, brute force is faster.
# Example: Qdrant for high-performance fashion search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue,
    OptimizersConfigDiff, Range, VectorParams,
)

client = QdrantClient(host="localhost", port=6333)

# Create collection optimized for fashion search
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(
        size=1536,  # must match the embedding model's output dimension
        distance=Distance.COSINE,
    ),
    # Build the HNSW index once the collection exceeds 10K documents
    optimizers_config=OptimizersConfigDiff(indexing_threshold=10000),
)

# Search with metadata pre-filtering (Qdrant filters inside the ANN search)
results = client.search(
    collection_name="products",
    query_vector=query_embedding,  # computed earlier via the embedding provider
    query_filter=Filter(
        must=[
            FieldCondition(key="gender", match=MatchValue(value="women")),
            FieldCondition(key="price", range=Range(lte=200)),
            FieldCondition(key="in_stock", match=MatchValue(value=True)),
        ]
    ),
    limit=50,
)
Reciprocal Rank Fusion (RRF)
The magic happens when we combine both. Reciprocal Rank Fusion (RRF) is more robust than score normalization because it works with ranks, not raw scores.
The score normalization problem: You might think: "Just normalize BM25 scores to 0-1, normalize vector distances to 0-1, and average them." This seems reasonable but fails in practice. BM25 scores aren't bounded—a document matching all query terms with high frequency could score 50 or 500 depending on the query. Normalizing by the max score in each result set makes scores incomparable across queries. And what if one method returns 100 results while the other returns 10? The distributions are different.
RRF solves this by using ranks instead of scores. A document ranked #1 by vector search contributes 1/(k+1) to its RRF score. Ranked #10 contributes 1/(k+10). The constant k (typically 60) prevents top-ranked documents from dominating too heavily. This works regardless of the underlying score distributions—rank #1 is rank #1 whether the BM25 score was 5 or 500.
Why RRF outperforms learned combination weights: You could train a model to learn optimal combination weights from click data. But RRF works out-of-the-box without training data, is interpretable (you can explain why a product ranked where it did), and is robust to distribution shifts. We've found that carefully-tuned learned weights beat RRF by only 2-3%—not worth the complexity in most cases.
class HybridRanker:
"""
Merges results from vector and BM25 search using RRF.
RRF Formula:
score(d) = Σ 1 / (k + rank_i(d))
Where k is a constant (default 60) and rank_i is the
rank in each result list.
"""
def __init__(
self,
vector_weight: float = 0.6,
bm25_weight: float = 0.4,
rrf_k: int = 60,
):
self._vector_weight = vector_weight
self._bm25_weight = bm25_weight
self._rrf_k = rrf_k
def _merge_rrf(
self,
vector_results: list[SearchResult],
bm25_results: list[BM25Result],
top_k: int,
) -> list[HybridResult]:
"""Merge using Reciprocal Rank Fusion."""
results: dict[str, HybridResult] = {}
# Process vector results
for rank, vr in enumerate(vector_results, 1):
doc_id = vr.id
if doc_id not in results:
results[doc_id] = HybridResult(
doc_id=doc_id,
content=vr.content,
metadata=vr.metadata,
)
results[doc_id].vector_rank = rank
# Process BM25 results
for rank, br in enumerate(bm25_results, 1):
doc_id = br.doc_id
if doc_id not in results:
results[doc_id] = HybridResult(
doc_id=doc_id,
content=br.content,
metadata=br.metadata,
)
results[doc_id].bm25_rank = rank
results[doc_id].matched_terms = br.matched_terms
# Calculate RRF scores
for hr in results.values():
rrf_score = 0.0
if hr.vector_rank is not None:
rrf_score += self._vector_weight / (self._rrf_k + hr.vector_rank)
if hr.bm25_rank is not None:
rrf_score += self._bm25_weight / (self._rrf_k + hr.bm25_rank)
hr.hybrid_score = rrf_score
# Sort by hybrid score
sorted_results = sorted(
results.values(),
key=lambda x: x.hybrid_score,
reverse=True
)
return sorted_results[:top_k]
Deep dive into the RRF implementation:
Let's walk through exactly what happens when we merge results:
Step 1 - Build the unified result dictionary: We iterate through both result lists, creating HybridResult objects for each unique document. If a document appears in both lists (common for good matches), we update the same object with both ranks. The dictionary key is doc_id, ensuring each product appears only once.
Step 2 - Assign ranks: Notice we use enumerate(vector_results, 1) starting at 1, not 0. This is important—rank 1 should mean "best," not rank 0. The rank assignments happen independently: a document could be rank #3 in vector search and rank #47 in BM25 (or not appear at all in one list).
Step 3 - Calculate RRF scores: The formula weight / (k + rank) gives higher scores to lower ranks (better positions). With k=60 and vector_weight=0.6:
- Rank #1: 0.6 / (60 + 1) = 0.0098
- Rank #10: 0.6 / (60 + 10) = 0.0086
- Rank #50: 0.6 / (60 + 50) = 0.0055
The k=60 constant acts as a damping factor. Without it (k=0), rank #1 would score infinitely higher than rank #2. With k=60, the difference between rank #1 and #10 is only about 14%—meaningful but not overwhelming.
Step 4 - Handle missing ranks: If a document only appears in one result list, it still gets a score from that list. This is crucial—some excellent semantic matches might have zero BM25 score (no keyword overlap), and vice versa. RRF gracefully handles this asymmetry.
The weight parameters (0.6 vector, 0.4 BM25): These weights reflect the relative importance we assign to each retrieval method. Our 60/40 split favors semantic search because e-commerce queries are often descriptive. But for your domain, you might want different weights. If your users search for specific SKUs frequently, increase BM25 weight. If they use natural language ("something cozy"), increase vector weight.
Why RRF works better than score normalization:
- Score scales differ: Vector similarity (0-1) vs BM25 (0-∞) are incomparable
- Rank is universal: Position 1 is "best" regardless of absolute score
- Robust to outliers: A high BM25 score doesn't dominate
- No tuning required: k=60 works well across most domains
- Handles partial overlap: Documents appearing in only one list still get ranked appropriately
- Interpretable: You can explain exactly why a product ranked where it did
Tuning RRF parameters: While k=60 is the standard default (from the original RRF paper), you might want to adjust for your domain:
- Higher k (80-100): Flattens the rank curve, giving more weight to lower-ranked results. Use if your top results are often wrong.
- Lower k (30-40): Steepens the curve, heavily favoring top ranks. Use if your retrieval methods are highly accurate.
- Different weights: If A/B tests show users prefer keyword matches, increase BM25 weight. If they prefer semantic matches, increase vector weight.
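To build intuition for k, it helps to print the per-list contribution weight/(k + rank) from the formula above for a few values; this standalone snippet is just arithmetic over that formula.
def rrf_contribution(rank: int, k: int, weight: float = 1.0) -> float:
    """Single-list RRF term: weight / (k + rank)."""
    return weight / (k + rank)

for k in (30, 60, 100):
    r1, r10 = rrf_contribution(1, k), rrf_contribution(10, k)
    print(f"k={k}: rank#1={r1:.4f}  rank#10={r10:.4f}  relative gap={1 - r10 / r1:.0%}")
# Smaller k steepens the curve (rank #1 pulls further ahead of rank #10);
# larger k flattens it, so lower-ranked documents contribute relatively more.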
Exact Match Boosting
For e-commerce, exact name matches deserve special treatment. If someone searches "Nike Air Max 90," the exact product should rank first.
The exact match problem with hybrid search: Surprisingly, hybrid search can fail on exact matches. Here's why: if a user searches "Nike Air Max 90" and your catalog has that exact product, you'd expect it to rank #1. But:
- Vector search embeds the query and finds semantically similar products. "Nike Air Max 95" and "Adidas Ultraboost" might embed closer to the query than the exact match, depending on the embedding model.
- BM25 scores based on term frequency. If "Nike Air Max 90" appears in dozens of product names, descriptions, and reviews, BM25 might rank a product that mentions it frequently (in reviews) above the actual product.
The solution: explicit exact match boosting. After hybrid ranking, we apply a multiplicative boost to products with exact or partial name matches. This ensures that when users search for a specific product, they find it.
The boost hierarchy:
- Exact match (query == product name): 2.25x boost (1.5 * 1.5)
- Partial match (query in name or name in query): 1.5x boost
- No match: No boost
Why multiplicative rather than additive boosts: We multiply the hybrid score rather than adding a constant. This preserves the relative ordering among non-matching products. An additive boost of +0.5 could promote irrelevant products with exact name matches above highly relevant products without matches. Multiplicative boosts scale proportionally—a 0.8 score becomes 1.2 with a 1.5x boost, still below a 0.9 score with the same boost (1.35).
The partial match handling: The query_lower in name or name in query_lower condition catches both directions:
- "Nike Air Max" in "Nike Air Max 90 Essential" → partial match (query in name)
- "Air Max" in query "nike air max 90 red" → partial match (name in query)
This handles the common case where users search for abbreviated or extended product names.
def boost_exact_matches(
self,
results: list[HybridResult],
query: str,
boost_factor: float = 1.5,
) -> list[HybridResult]:
"""Boost results that have exact query matches."""
query_lower = query.lower()
for hr in results:
name = hr.metadata.get("name", "").lower()
# Exact name match
if query_lower == name:
hr.hybrid_score *= boost_factor * 1.5
# Partial name match
elif query_lower in name or name in query_lower:
hr.hybrid_score *= boost_factor
# Re-sort
results.sort(key=lambda x: x.hybrid_score, reverse=True)
return results
Multi-Signal Ranking
Raw retrieval gets candidates. Ranking determines what users see. In e-commerce, relevance alone isn't enough—we need to balance multiple signals.
The key insight: no single signal is sufficient. A semantically perfect match might be out of stock. A highly-rated product might be the wrong color. A best-seller might be twice the budget. Multi-signal ranking combines everything to surface truly relevant results.
Why four signals instead of just semantic relevance? Consider a user searching "cozy navy sweater under $100." Pure semantic ranking might surface a $300 cashmere sweater as the top result—it's semantically perfect for "cozy navy sweater," but it violates the price constraint. A different sweater at $89 with slightly lower semantic similarity is actually what the user wants. Multi-signal ranking balances these competing factors.
The signal breakdown:
| Signal | Weight | What It Measures | When It Dominates |
|---|---|---|---|
| Semantic | 35% | How well the product matches the query meaning | Descriptive queries ("elegant evening look") |
| Constraint | 25% | Hard requirements: color, price, size | Specific queries ("blue under $50") |
| Metadata | 25% | Product attributes matching query attributes | Attribute-heavy queries ("wool blazer") |
| Quality | 15% | Ratings, reviews, popularity | Tie-breaking between similar products |
Why these specific weights? They emerged from extensive A/B testing across millions of queries. The 35% semantic weight ensures meaning matters most—users forgive minor constraint mismatches for truly relevant products. The 25% constraint weight prevents showing $300 products for a query that said "under $100." The 15% quality weight is deliberately low: quality should break ties, not override relevance. A mediocre-but-relevant product beats a highly-rated irrelevant one.
The weights are starting points—tune them based on A/B testing. If users complain about irrelevant results, increase semantic weight. If they complain about ignoring filters, increase constraint weight.
class ResultRanker:
"""
Ranks products based on multiple signals.
Scoring weights (configurable):
- Semantic relevance: 35%
- Constraint matching: 25%
- Metadata fit: 25%
- Quality signals: 15%
"""
def __init__(
self,
semantic_weight: float = 0.35,
constraint_weight: float = 0.25,
metadata_weight: float = 0.25,
quality_weight: float = 0.15,
):
self._weights = {
"semantic": semantic_weight,
"constraint": constraint_weight,
"metadata": metadata_weight,
"quality": quality_weight,
}
def rank(
self,
products: list[Product],
rag_results: list[RetrievalResult],
constraints: QueryConstraints,
) -> list[RankedProduct]:
"""Rank products based on all signals."""
rag_scores = {r.product_id: r.semantic_score for r in rag_results}
ranked = []
for product in products:
# Calculate individual scores
semantic = rag_scores.get(product.id, 0.5)
constraint, matches, penalties = self._calculate_constraint_score(
product, constraints
)
metadata = self._calculate_metadata_score(product, constraints)
quality = self._calculate_quality_score(product)
# Calculate final weighted score
final = (
semantic * self._weights["semantic"] +
constraint * self._weights["constraint"] +
metadata * self._weights["metadata"] +
quality * self._weights["quality"]
)
ranked.append(RankedProduct(
product=product,
semantic_score=semantic,
constraint_score=constraint,
metadata_score=metadata,
quality_score=quality,
final_score=final,
match_reasons=matches,
penalties=penalties,
))
ranked.sort(key=lambda x: x.final_score, reverse=True)
return ranked
Constraint Matching
This is where metadata shines. We check how well each product matches the extracted constraints.
The philosophy behind constraint scoring: Not all constraint violations are equal. Missing the exact color is annoying but tolerable—users often accept "close enough." Missing the price constraint is more serious—showing $300 products for a query that said "under $100" feels like the system isn't listening. And some constraints are absolute: if the user specified size M, showing size XL is completely useless.
The penalty system: We use a subtractive scoring model starting at 1.0 (perfect match). Each constraint violation subtracts from this score. The penalty magnitudes reflect real user behavior from our click data:
| Constraint | Penalty | Reasoning |
|---|---|---|
| Color mismatch | -0.20 | Users often accept similar colors (navy vs. blue) |
| Material mismatch | -0.15 | Less critical; users care more about look than fabric |
| Over budget | -0.30 | Serious violation—users have real budget limits |
| Under minimum price | -0.10 | Minor issue—cheaper is rarely a dealbreaker |
| Occasion mismatch | 0.00 | No penalty—occasion data is often incomplete |
| Style mismatch | 0.00 | No penalty—style is subjective and metadata may be incomplete |
Why no penalty for occasion/style mismatches? These attributes come from LLM enrichment and aren't always complete or accurate. A jacket might be perfect for the office but not tagged with "office" in our metadata. Penalizing missing tags would unfairly demote good products. Instead, we reward matches without penalizing misses.
def _calculate_constraint_score(
self,
product: Product,
constraints: QueryConstraints,
) -> tuple[float, list[str], list[str]]:
"""Calculate how well product matches constraints."""
score = 1.0
matches = []
penalties = []
# Color matching
if constraints.colors:
if any(c in product.colors for c in constraints.colors):
matches.append(f"Color: {', '.join(constraints.colors)}")
else:
score -= 0.2
penalties.append("Different color")
# Material matching
if constraints.materials:
matching = set(constraints.materials) & set(product.materials)
if matching:
matches.append(f"Material: {', '.join(matching)}")
else:
score -= 0.15
penalties.append("Different material")
# Price constraint
if constraints.price_range:
if constraints.price_range.max and product.price > constraints.price_range.max:
score -= 0.3
penalties.append(f"Above budget (€{constraints.price_range.max})")
elif constraints.price_range.min and product.price < constraints.price_range.min:
score -= 0.1
penalties.append("Below minimum price")
else:
matches.append("Within budget")
# Occasion matching
if constraints.occasions:
matching = set(constraints.occasions) & set(product.occasions)
if matching:
matches.append(f"Perfect for: {', '.join(matching)}")
# Don't penalize heavily - occasion data might be incomplete
# Style matching
if constraints.styles:
matching = set(constraints.styles) & set(product.style_tags)
if matching:
matches.append(f"Style: {', '.join(matching)}")
return max(0.0, score), matches, penalties
Quality Signals
In e-commerce, quality signals like ratings and review counts matter—but they need careful interpretation.
The rating paradox: A product with a 5.0 rating from 2 reviews isn't necessarily better than a product with 4.3 rating from 500 reviews. The latter has statistical significance; the former could be the seller's friends. Our quality score accounts for both rating value AND confidence (review count).
Why quality is only 15% of the ranking formula: Quality signals are backward-looking—they tell you what past customers thought. But they don't tell you whether the product matches THIS user's query. A 4.8-rated winter coat is useless if the user wants a summer dress. Quality should break ties between equally relevant products, not override relevance.
The scoring logic explained:
- Base score: 0.5 (neutral) for products without ratings
- Rating contribution: 70% weight on normalized rating (1-5 scale → 0-1 scale)
- Review count boost: Up to 0.2 additional points for high review volume
- Cap at 1.0: Prevents quality from dominating other signals
Review count thresholds: We use stepped boosts rather than linear scaling because the relationship between review count and reliability is logarithmic, not linear. The jump from 2 to 20 reviews matters more than 200 to 220. Our thresholds (20, 50, 100) were calibrated against actual product quality metrics.
def _calculate_quality_score(self, product: Product) -> float:
"""Calculate quality score based on ratings and reviews."""
score = 0.5 # Default neutral
if product.rating:
# Normalize rating to 0-1 (assuming 1-5 scale)
rating_score = (product.rating - 1) / 4
score = rating_score * 0.7
# Boost for many reviews (more reliable)
if product.review_count:
if product.review_count >= 100:
score += 0.2
elif product.review_count >= 50:
score += 0.15
elif product.review_count >= 20:
score += 0.1
return min(1.0, score)
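A quick worked example of the formula above shows why review volume matters; the two products and their numbers are purely illustrative.
# Product A: glowing rating, almost no reviews
rating_a, reviews_a = 4.8, 5
score_a = (rating_a - 1) / 4 * 0.7            # 0.665, no volume boost (< 20 reviews)

# Product B: slightly lower rating, high statistical confidence
rating_b, reviews_b = 4.3, 500
score_b = (rating_b - 1) / 4 * 0.7 + 0.2      # 0.7775 after the >= 100-review boost

# The heavily reviewed 4.3 outranks the barely reviewed 4.8, resolving the rating paradox.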
LLM Reranking: The Secret Weapon
Initial retrieval and scoring gets you 80% of the way. LLM reranking gets you the final 20% that transforms good search into great search.
What LLM reranking can do that scoring can't: Our multi-signal ranking computes independent scores for each product. But it can't reason about products relative to each other or understand subtle intent. Consider the query "casual jacket for a tech job interview." A scoring system might rank a blazer and a hoodie similarly—both are technically jackets. An LLM understands that tech interviews lean casual but "interview" still implies some professionalism—a smart bomber jacket ranks higher than both extremes.
The holistic reasoning advantage: LLMs can consider:
- Implicit intent: "First date outfit" implies wanting to impress, suggests avoiding overly casual or overly formal
- Cultural context: "Business casual" means different things in NYC vs. Silicon Valley vs. London
- Conversation coherence: If the user previously rejected formal options, rank casual alternatives higher
- Outfit compatibility: When building looks, consider how pieces work together, not just individually
- Subjective fit: "Something my mom would like" requires understanding generational preferences
The cost-benefit reality: LLM reranking adds 150-300ms of latency and real API costs—at high traffic, on the order of $2,000-10,000/day. But calling it selectively on the 20% of queries that benefit most costs 80% less while preserving most quality gains. The key is knowing when LLM reranking adds value.
After initial ranking, we pass the top candidates to an LLM for holistic reasoning:
RERANKING_PROMPT = """You are a fashion search expert. Rerank these products
based on how well they match the user's query and conversation context.
User Query: {query}
Conversation Context:
{conversation}
Products to Rank:
{products}
Consider:
1. How well does each product match the explicit query?
2. Does it fit the implicit intent (occasion, style, vibe)?
3. How well does it match conversation context?
4. Is it actually what this user is looking for?
Return a JSON object with:
- ranking: List of product IDs in order of relevance (best first)
- reasoning: Brief explanation for top 3 choices
Return JSON only."""
class LLMReranker:
"""Rerank search results using LLM reasoning."""
def __init__(self, llm_client: LLMClient):
self.llm = llm_client
async def rerank(
self,
query: str,
products: list[RankedProduct],
conversation: ConversationHistory,
top_k: int = 10,
) -> list[RankedProduct]:
"""Rerank products using LLM reasoning."""
# Format products for LLM
products_text = self._format_products(products[:15]) # Limit context
# Format conversation
conv_text = "\n".join([
f"{m.role}: {m.content}"
for m in conversation.get_context_window(5)
]) or "No previous conversation."
prompt = RERANKING_PROMPT.format(
query=query,
conversation=conv_text,
products=products_text
)
try:
response = await self.llm.generate_json(
messages=[{"role": "user", "content": prompt}],
temperature=0.3 # Some creativity but mostly consistent
)
# Reorder products based on LLM ranking
ranking = response.get("ranking", [])
return self._apply_ranking(products, ranking, top_k)
except Exception as e:
# Fallback to original ranking on failure
logger.warning(f"LLM reranking failed: {e}")
return products[:top_k]
def _format_products(self, products: list[RankedProduct]) -> str:
"""Format products for LLM context."""
lines = []
for i, rp in enumerate(products, 1):
p = rp.product
lines.append(f"""
[{i}] ID: {p.id}
Name: {p.name}
Brand: {p.brand}
Price: €{p.price}
Colors: {', '.join(p.colors)}
Style: {', '.join(p.style_tags)}
Occasions: {', '.join(p.occasions)}
Match reasons: {', '.join(rp.match_reasons)}
Penalties: {', '.join(rp.penalties) or 'None'}
""")
return "\n".join(lines)
def _apply_ranking(
self,
products: list[RankedProduct],
ranking: list[str],
top_k: int
) -> list[RankedProduct]:
"""Apply LLM ranking to products."""
# Build lookup
product_map = {rp.product.id: rp for rp in products}
# Reorder based on LLM ranking
reranked = []
for product_id in ranking:
if product_id in product_map:
reranked.append(product_map.pop(product_id))
# Add any products not in LLM ranking (fallback)
reranked.extend(product_map.values())
return reranked[:top_k]
Understanding the reranker implementation:
The LLMReranker class has several important design decisions:
Limiting context to 15 products: We don't send 50 candidates to the LLM—that would consume tokens unnecessarily and potentially confuse the model. The top 15 are enough for meaningful reordering. If product #20 should actually be #1, your initial ranking has bigger problems than reranking can solve.
Temperature of 0.3: We want mostly consistent rankings (hence low temperature) but some flexibility for subjective judgments. Pure 0.0 would always produce identical rankings; 0.3 allows the model to "think" about edge cases differently across calls.
Graceful fallback: LLM calls fail—rate limits, timeouts, malformed responses. The except block returns the original ranking rather than failing the entire search. Users never see an error; they just get non-reranked results. Log these failures for monitoring but don't let them break the user experience.
The _apply_ranking method handles mismatches: The LLM might return product IDs that don't match our candidates (hallucination) or skip some products. We apply what we can, then append skipped products at the end. This defensive coding ensures we always return top_k products regardless of LLM behavior.
When to Use LLM Reranking
LLM reranking is expensive (~150-300ms of added latency, plus API costs). Use it strategically:
| Scenario | Use LLM Reranking? | Why |
|---|---|---|
| Simple keyword search ("navy jacket") | No | Initial ranking sufficient |
| Complex intent ("cozy office look") | Yes | Needs semantic reasoning |
| Multi-turn refinement ("cheaper ones") | Yes | Context integration critical |
| High-value user (logged in, history) | Yes | Worth the cost for conversion |
| High-traffic generic queries | No | Cache results instead |
| Ambiguous results (similar scores) | Yes | LLM can break ties meaningfully |
Hybrid Reranking Strategy
Combine fast scoring with selective LLM reranking:
async def smart_rerank(
self,
query: str,
products: list[RankedProduct],
conversation: ConversationHistory,
) -> list[RankedProduct]:
"""Use LLM reranking only when beneficial."""
# Check if LLM reranking is needed
needs_llm = (
# Complex query (not just product name)
len(query.split()) > 3 or
# Has conversation context
len(conversation.messages) > 1 or
# Top results have similar scores (ambiguous)
self._scores_are_close(products[:5]) or
# Query contains subjective terms
any(term in query.lower() for term in
["cozy", "elegant", "nice", "good", "best", "perfect"])
)
if needs_llm and self.llm:
return await self.llm_reranker.rerank(
query, products, conversation
)
else:
return products[:10]
def _scores_are_close(self, products: list[RankedProduct]) -> bool:
"""Check if top products have similar scores."""
if len(products) < 2:
return False
scores = [p.final_score for p in products]
return (max(scores) - min(scores)) < 0.1 # Within 10%
Query Decomposition for Complex Requests
E-commerce queries often contain multiple parts: "I need a navy jacket AND matching shoes." Classical search treats this as one query. Intelligent search decomposes it.
Why decomposition is essential: Consider what happens when you search "navy jacket and matching shoes" as a single query. Vector search creates one embedding capturing both concepts. BM25 looks for documents containing both "jacket" and "shoes." Neither approach works well—you won't find products that are both jackets AND shoes. Decomposition recognizes this is two searches that need coordination.
The four query types we decompose:
| Query Type | Example | Decomposition Strategy |
|---|---|---|
| Outfit requests | "Style me for a wedding" | Search across complementary categories (dress + shoes + bag + jewelry) |
| Complementary | "Shoes to go with this dress" | Find products that complement a reference item |
| Similar/Alternative | "Something like this but cheaper" | Find products similar to a reference with modified constraints |
| Multi-item | "Jacket and shoes" | Split into independent searches, coordinate results |
The coordination challenge: Decomposition isn't just splitting—it's maintaining coherence. If the user wants "a navy jacket and matching shoes," the jacket search and shoe search need to share constraints. Both should filter on navy-compatible colors. Both should share the occasion context. The SubqueryDecomposer handles this by propagating shared filters to all subqueries.
Keyword-based detection vs. LLM detection: We use simple keyword matching to identify query types because it's fast and reliable for common patterns. "Outfit" almost always means multi-category outfit request. "Matching" almost always means complementary search. For ambiguous cases (e.g., "I want something nice for summer"), we fall back to LLM-based intent classification.
import re

class SubqueryDecomposer:
"""Decompose complex requests into executable subqueries."""
OUTFIT_KEYWORDS = {"outfit", "look", "ensemble"}
COMPLEMENTARY_KEYWORDS = {"matching", "complementary", "go with", "pair with"}
SIMILAR_KEYWORDS = {"similar", "like", "alternative", "cheaper"}
def decompose(
self,
normalized_query: str,
intent: IntentType,
filters: Filters,
constraints: Constraints,
    ) -> list[SubQuery]:
text = normalized_query.lower()
subqueries: list[SubQuery] = []
# Outfit requests need multi-category search
if intent == IntentType.outfit or any(k in text for k in self.OUTFIT_KEYWORDS):
subqueries.append(
SubQuery(type=SubQueryType.outfit, query=normalized_query, filters=filters)
)
return subqueries
# Complementary product requests
if any(k in text for k in self.COMPLEMENTARY_KEYWORDS):
subqueries.append(
SubQuery(type=SubQueryType.complementary, query=normalized_query, filters=filters)
)
return subqueries
# Similar product requests
if any(k in text for k in self.SIMILAR_KEYWORDS):
subqueries.append(
SubQuery(type=SubQueryType.similar, query=normalized_query, filters=filters)
)
return subqueries
# Multi-item requests: "jacket and shoes"
parts = self._split_multi(text)
if len(parts) > 1:
for part in parts:
subqueries.append(
SubQuery(type=SubQueryType.product, query=part.strip(), filters=filters)
)
return subqueries
# Single product search
subqueries.append(
SubQuery(type=SubQueryType.product, query=normalized_query, filters=filters)
)
return subqueries
def _split_multi(self, text: str) -> list[str]:
"""Split on 'and' and commas."""
candidates = re.split(r"\band\b|,", text)
return [c for c in (cand.strip() for cand in candidates) if c]
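A hedged usage sketch of the decomposer; the IntentType.product member and the empty Filters()/Constraints() values are assumptions about your schema rather than code from this article.
decomposer = SubqueryDecomposer()

# A multi-item request splits on "and" into two independent product searches
subqueries = decomposer.decompose(
    normalized_query="navy wool jacket and brown leather boots",
    intent=IntentType.product,   # assumed enum member
    filters=Filters(),           # shared filters propagate to every subquery
    constraints=Constraints(),
)
for sq in subqueries:
    print(sq.type, "->", sq.query)
# e.g. SubQueryType.product -> navy wool jacket
# e.g. SubQueryType.product -> brown leather boots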
Outfit Building
For outfit requests, we search across multiple product categories and ensure compatibility.
What makes outfit building hard: Users don't just want four random products—they want a coordinated look. A navy blazer, white shirt, khaki pants, and brown loafers work together. A navy blazer, neon pink top, plaid shorts, and white sneakers don't. Outfit building requires understanding:
- Color coordination: Which colors complement vs. clash
- Style coherence: All items should share a style vocabulary (casual, formal, streetwear)
- Occasion appropriateness: Wedding outfit differs from beach vacation outfit
- Layering logic: Outerwear goes over tops, not under
The role-based approach: We define outfit "roles" (top, bottom, outerwear, footwear, accessories) rather than rigid categories. This flexibility lets us handle various requests: a summer outfit might skip outerwear; a formal outfit might add a tie and pocket square. The roles guide search but don't constrain it.
Compatibility scoring (simplified in this example): Production outfit builders use compatibility models trained on curated outfit data. Given two products, the model predicts how well they work together. We use these pairwise scores to select items that maximize overall outfit coherence, not just individual item quality.
The orchestration pattern: Notice how OutfitService delegates to the existing SearchService for each role. This reuses all our query understanding, hybrid RAG, and ranking infrastructure. Outfit building is an orchestration layer on top of single-product search, not a separate system.
class OutfitService:
"""Build complete outfits across product categories."""
DEFAULT_ROLES = ["top", "bottom", "outerwear", "footwear"]
def __init__(self, search_service: SearchService) -> None:
self.search_service = search_service
def build_outfit(self, req: BuildOutfitRequest) -> BuildOutfitResponse:
filters = Filters(**req.constraints) if req.constraints else Filters()
candidates = []
for role in self.DEFAULT_ROLES:
# Search for items in each category
search_req = ProductSearchRequest(query=role, filters=filters)
product_resp = self.search_service.search_products(search_req)
if product_resp.items:
candidates.append(OutfitItemCandidate(
role=role,
product=product_resp.items[0],
score=0.7
))
outfit = OutfitSuggestion(
items=candidates,
reasoning="Coordinated outfit based on color palette and style."
)
return BuildOutfitResponse(outfits=[outfit])
Conversation Management for Multi-Turn Search
Real shopping isn't single-shot. Users refine, compare, and iterate.
Why multi-turn matters: Studies show that 60-70% of successful e-commerce searches involve multiple queries. Users rarely find what they want on the first try. They search, browse, refine their criteria, and search again. A system that treats each query independently wastes this iterative refinement—the user has to re-specify "navy" and "jacket" every time instead of just saying "cheaper" or "in black."
The context problem: When a user says "cheaper ones," what does "ones" refer to? Without conversation context, we have no idea. With context, we know they mean "cheaper navy puffer jackets from the previous search." This pronoun resolution is trivial for humans but requires explicit engineering in search systems.
Three types of multi-turn queries:
| Query Type | Example | What System Must Do |
|---|---|---|
| Refinement | "Under $100" | Filter previous results by new constraint |
| Replacement | "Actually, show me dresses instead" | New search, carry over style/occasion context |
| Reference | "The third one in blue" | Resolve reference to specific product, find variant |
Here's how a multi-turn conversation flows:
User: "Show me navy puffer jackets"
System: [Returns 10 jackets]
User: "Something more affordable"
System: [Should filter previous results by price, not start fresh]
User: "Do any of these come in black?"
System: [Should search for black versions of the same styles]
Stateless Architecture with Context
We pass conversation history with each request rather than maintaining server-side state. This scales horizontally and survives server restarts.
Why stateless over stateful: A stateful server stores conversation history in memory. This seems simpler—no need to pass context with each request. But it creates problems at scale:
- Sticky sessions required: User must hit the same server for their entire session
- Server failure = lost context: If that server restarts, all conversations lose history
- Memory pressure: 10,000 concurrent conversations × average history size = significant RAM
- Scaling complexity: Adding servers doesn't help users on busy servers
The stateless approach: Instead, the client (frontend) stores conversation history and passes it with each search request. The server is completely stateless—any server can handle any request. This is how all modern scalable systems work (think: JWT tokens instead of server sessions).
Trade-offs: Stateless means larger request payloads (conversation history adds 1-5KB). But network bandwidth is cheap; horizontal scaling is expensive. We also get natural history limits—clients only send recent messages, preventing unbounded context growth.
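Concretely, the stateless contract means each request carries its own context. The request schema below is an illustrative sketch; the Pydantic models and field names are assumptions, not the article's actual API.
from pydantic import BaseModel

class ConversationMessageIn(BaseModel):
    role: str                        # "user" or "assistant"
    content: str
    result_ids: list[str] = []       # product IDs shown on this turn (enables "the third one")

class SearchRequest(BaseModel):
    query: str                                      # the new user query
    conversation: list[ConversationMessageIn] = []  # client-held history, oldest first
    user_id: str | None = None                      # optional, for personalization

# Any server instance can serve this request: everything needed to interpret
# "cheaper ones" travels inside the payload itself.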
@dataclass
class ConversationHistory:
"""Conversation context for multi-turn search."""
messages: list[ConversationMessage]
def get_context_window(self, max_messages: int = 10) -> list[ConversationMessage]:
"""Get recent messages for context."""
return self.messages[-max_messages:]
def to_prompt_format(self) -> list[dict]:
"""Format for LLM prompt."""
return [
{"role": msg.role.value, "content": msg.content}
for msg in self.messages
]
Understanding the ConversationHistory design:
The message structure: Each ConversationMessage contains a role (user or assistant), content (the actual text), and timestamp. We also store metadata like result IDs returned by the system—this enables reference resolution when users say "the third one."
Why limit to 10 messages: The get_context_window(max_messages=10) limits context for several reasons:
- Token limits: LLM context windows are finite. A 10-message conversation with results descriptions could be 2,000+ tokens. More messages mean less room for the actual search task.
- Relevance decay: Older messages are usually less relevant. The user's intent 15 messages ago probably doesn't matter for the current query.
- Cost control: More context = more tokens = higher LLM costs. At $0.01 per 1K tokens, every extra message adds up across millions of queries.
- Latency: Longer prompts take longer to process. Keeping context lean improves response times.
The to_prompt_format method: This transforms our internal message format to the format LLMs expect (role + content dictionaries). The abstraction keeps our code LLM-agnostic—we can switch between OpenAI, Anthropic, or local models by changing the formatter.
What's NOT stored in ConversationHistory: We don't store the actual search results (products, images, prices). That would bloat the context enormously. Instead, we store result IDs and summary text. When we need to resolve "the third one," we look up result IDs from the previous turn, not full product data.
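Because each assistant turn keeps the result IDs it showed, resolving "the third one" becomes a lookup rather than another search. A minimal sketch follows, assuming messages expose a result_ids list as described above; the helper name and ordinal handling are illustrative.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}

def resolve_reference(query: str, history: ConversationHistory) -> str | None:
    """Map phrases like 'the third one' to a product ID from the last shown results."""
    position = next((ORDINALS[w] for w in query.lower().split() if w in ORDINALS), None)
    if position is None:
        return None
    # Walk backwards to the most recent assistant turn that returned results
    for msg in reversed(history.messages):
        role = getattr(msg.role, "value", msg.role)  # role may be stored as an enum
        result_ids = getattr(msg, "result_ids", None) or []
        if role == "assistant" and result_ids:
            return result_ids[position - 1] if position <= len(result_ids) else None
    return None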
Context-Aware Query Rewriting
The LLM rewrites queries to be standalone while incorporating conversation context.
Why query rewriting is necessary: Our vector search and BM25 systems don't understand conversation—they receive a single query string. If the user says "cheaper ones," these systems have no idea what "ones" refers to. Query rewriting transforms context-dependent queries into standalone queries that work with our retrieval systems.
The rewriting task: Given conversation history and a new query, produce a standalone query that:
- Resolves all pronouns ("it," "them," "ones") to their referents
- Carries forward relevant constraints from previous turns (color, category, style)
- Drops irrelevant context (if user switched topics)
- Is optimized for vector similarity search (descriptive, attribute-rich)
Why we use an LLM for this: Rule-based rewriting is fragile. You'd need patterns for every pronoun, every way users reference previous results, every way constraints carry forward. LLMs handle this naturally because they understand language. The ~150ms latency is worth the accuracy improvement.
The LLM rewrites queries using this prompt structure:
# In intent_parser.py
async def _parse_with_llm(self, query: UserQuery) -> ParsedIntent:
"""Parse using LLM with conversation context."""
# Build conversation context
context_messages = []
for msg in query.conversation_history.get_context_window(5):
context_messages.append(f"{msg.role.value}: {msg.content}")
context_str = "\n".join(context_messages) if context_messages else "No previous conversation."
prompt = f"""Parse the following query:
User Query: "{query.query}"
Conversation History:
{context_str}
For 'semantic_query', generate a standalone, context-aware search query that:
- Resolves pronouns (it, they, ones) to the objects they refer to
- Incorporates implied constraints from previous turns (e.g., color, category)
- Is optimized for vector similarity search
Output format: {INTENT_PARSING_FORMAT}
Parse the query and respond with JSON only:"""
Example transformation:
| Turn | User Says | Semantic Query Generated |
|---|---|---|
| 1 | "Show me navy puffer jackets" | "navy puffer jackets down jackets quilted coats" |
| 2 | "Something more affordable" | "navy puffer jackets affordable budget-friendly under 100" |
| 3 | "Do any come in black?" | "black puffer jackets down jackets affordable" |
The semantic query incorporates context without requiring the user to repeat themselves.
Extended Conversation Example
Here's a realistic multi-turn shopping conversation showing how context flows through the system:
┌─────────────────────────────────────────────────────────────────────┐
│ TURN 1 │
├─────────────────────────────────────────────────────────────────────┤
│ User: "I need a jacket for a business trip to Berlin next week" │
│ │
│ Intent Parser extracts: │
│ - category: jackets │
│ - occasion: business, travel │
│ - location_context: Berlin (cold in winter) │
│ - implicit: professional, versatile, warm │
│ │
│ Semantic query: "professional business jacket warm winter travel │
│ versatile smart casual" │
│ │
│ System: Shows 10 wool blazers, smart puffer jackets, trench coats │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ TURN 2 │
├─────────────────────────────────────────────────────────────────────┤
│ User: "Something warmer, it's going to be really cold" │
│ │
│ Context from Turn 1: │
│ - Already searching for business jackets │
│ - Berlin trip context │
│ │
│ Intent Parser extracts: │
│ - refinement: warmer → warmth_rating >= 4 │
│ - maintains: business, professional │
│ │
│ Semantic query: "warm winter business jacket professional │
│ heavy insulated down puffer cold weather" │
│ │
│ System: Filters to down jackets, wool coats with warm ratings │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ TURN 3 │
├─────────────────────────────────────────────────────────────────────┤
│ User: "I like the third one but do you have it in navy?" │
│ │
│ Context: │
│ - "third one" → references product ID from Turn 2 results │
│ - color preference: navy │
│ │
│ Intent Parser: │
│ - intent: product_variant_search │
│ - base_product: [resolved from Turn 2, position 3] │
│ - color_filter: navy │
│ │
│ System: Searches for same product in navy, or similar styles │
└─────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ TURN 4 │
├─────────────────────────────────────────────────────────────────────┤
│ User: "Perfect! Now I need matching dress shoes" │
│ │
│ Context: │
│ - Just selected navy business jacket │
│ - Trip to Berlin │
│ - "matching" → should complement the jacket │
│ │
│ Intent Parser: │
│ - intent: complementary_search │
│ - category: dress_shoes │
│ - style_match: business, professional │
│ - color_match: navy-compatible (black, brown, burgundy) │
│ │
│ Semantic query: "dress shoes business professional formal │
│ men leather oxford derby navy compatible" │
└─────────────────────────────────────────────────────────────────────┘
This conversation demonstrates:
- Context accumulation: Berlin trip context persists across turns
- Pronoun resolution: "third one" → specific product
- Implicit constraints: "matching" → color and style compatibility
- Intent transitions: search → refine → variant → complementary
User Personalization
Logged-in users have rich history that dramatically improves search relevance. Personalization layers on top of query understanding.
The personalization opportunity: Two users searching "casual shoes" want completely different things. A sneaker enthusiast wants the latest Nike release. A business professional wants comfortable loafers for casual Fridays. A college student wants budget-friendly canvas shoes. Without personalization, you show all three users the same results and hope one set works.
Explicit vs. implicit preferences: Users rarely tell you their preferences directly. Instead, you infer preferences from behavior:
| Signal Type | Examples | Reliability | Privacy Concern |
|---|---|---|---|
| Explicit | Favorite brands, size profile, style quiz answers | High | Low (user-provided) |
| Implicit - Strong | Purchases, wishlist additions | High | Medium |
| Implicit - Medium | Cart additions, product page views > 30s | Medium | Medium |
| Implicit - Weak | Search clicks, scroll depth | Low | Low |
The cold start problem: New users have no history. We handle this with: (1) demographic defaults (location-based trends), (2) session behavior (learn from the first few interactions), (3) explicit onboarding (optional style quiz). Most systems over-index on logged-in personalization and forget that 40-60% of e-commerce traffic is anonymous.
Balancing personalization and discovery: Over-personalization creates filter bubbles. A user who bought three navy items shouldn't only see navy forever. We balance preference matching with diversity injection—ensuring some results come from outside the user's established preferences.
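One simple way to implement that diversity injection is slot reservation: hold back a couple of positions in the final page for products outside the user's established preferences. The sketch below assumes the Product, RankedProduct, and UserProfile types from this article; the is_outside_preferences heuristic is an illustrative stand-in for whatever definition of "outside the bubble" you choose.
def is_outside_preferences(product: Product, user: UserProfile) -> bool:
    """Illustrative heuristic: unfamiliar brand AND unfamiliar colors."""
    unfamiliar_brand = (
        product.brand not in user.favorite_brands
        and product.brand not in user.brand_affinity
    )
    unfamiliar_colors = not any(c in user.color_affinity for c in product.colors)
    return unfamiliar_brand and unfamiliar_colors

def inject_diversity(
    ranked: list[RankedProduct],
    user: UserProfile,
    page_size: int = 10,
    discovery_slots: int = 2,
) -> list[RankedProduct]:
    """Reserve a few result slots for products outside the preference bubble."""
    in_bubble = [r for r in ranked if not is_outside_preferences(r.product, user)]
    discovery = [r for r in ranked if is_outside_preferences(r.product, user)]
    page = in_bubble[: page_size - discovery_slots] + discovery[:discovery_slots]
    # Top up from the main pool if there weren't enough discovery candidates
    for r in in_bubble[page_size - discovery_slots:]:
        if len(page) >= page_size:
            break
        page.append(r)
    return page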
User Profile Schema
The profile captures everything we know about a user's preferences, both stated and inferred:
@dataclass
class UserProfile:
"""User preferences and history for personalization."""
user_id: str
# === Explicit Preferences ===
favorite_brands: list[str] # ["Nike", "Zara", "COS"]
preferred_styles: list[str] # ["minimalist", "casual"]
disliked_styles: list[str] # ["flashy", "logo-heavy"]
size_profile: dict[str, str] # {"tops": "M", "bottoms": "32", "shoes": "42"}
# === Implicit Preferences (learned) ===
color_affinity: dict[str, float] # {"navy": 0.8, "black": 0.7, "red": 0.2}
price_range: tuple[float, float] # (50, 200) - typical spend
brand_affinity: dict[str, float] # Learned from behavior
# === Behavioral Signals ===
recently_viewed: list[str] # Product IDs
recently_purchased: list[str] # Product IDs
cart_items: list[str] # Current cart
wishlist: list[str] # Saved items
# === Context ===
location: Optional[str] # For weather-aware recommendations
last_active: datetime
Personalization-Aware Ranking
Integrate user preferences into the ranking formula.
Why 20% personalization weight? We deliberately keep personalization as a minority signal (20%) rather than a dominant one. The reasoning:
- Relevance still matters most: A user who loves Nike shouldn't see Nike products when searching for "formal leather dress shoes"—Nike doesn't make those
- Preferences are probabilistic: Just because a user bought blue before doesn't mean they want blue now
- Discovery matters: Some serendipitous finds come from outside the preference bubble
- Data quality varies: Implicit signals can be noisy; we shouldn't over-weight uncertain data
The layered architecture: Notice how PersonalizedRanker extends ResultRanker. It first gets the base ranking (semantic + constraint + metadata + quality), then adds personalization scoring. This composition means personalization is an enhancement, not a replacement. Anonymous users get the base ranking; logged-in users get personalized results.
Personalization scoring breakdown:
| Signal | Impact | Logic |
|---|---|---|
| Favorite brand | +0.20 | Direct match to stated preference |
| Brand affinity | +0.00-0.15 | Learned from behavior, weighted by confidence |
| Style match | +0.10 per overlap | Products matching preferred styles |
| Style conflict | -0.15 per conflict | Products matching disliked styles (penalty!) |
| Color affinity | +0.00-0.10 | Based on purchase/cart history |
| Price range fit | +0.10 / -0.10 | Within vs. far outside typical spend |
| In wishlist | +0.15 | Strong signal of interest |
| Recently purchased | -0.30 | Avoid showing what they already have |
class PersonalizedRanker(ResultRanker):
"""Extends ResultRanker with user personalization."""
def __init__(
self,
semantic_weight: float = 0.30,
constraint_weight: float = 0.20,
metadata_weight: float = 0.20,
quality_weight: float = 0.10,
personalization_weight: float = 0.20, # New!
):
super().__init__(
semantic_weight, constraint_weight,
metadata_weight, quality_weight
)
self._weights["personalization"] = personalization_weight
def rank_personalized(
self,
products: list[Product],
rag_results: list[RetrievalResult],
constraints: QueryConstraints,
user_profile: UserProfile,
) -> list[RankedProduct]:
"""Rank with user personalization signals."""
# Get base ranking
ranked = self.rank(products, rag_results, constraints)
# Apply personalization boost
for rp in ranked:
pers_score = self._calculate_personalization_score(
rp.product, user_profile
)
rp.personalization_score = pers_score
# Recalculate final score with personalization
rp.final_score = (
rp.semantic_score * self._weights["semantic"] +
rp.constraint_score * self._weights["constraint"] +
rp.metadata_score * self._weights["metadata"] +
rp.quality_score * self._weights["quality"] +
pers_score * self._weights["personalization"]
)
# Re-sort
ranked.sort(key=lambda x: x.final_score, reverse=True)
return ranked
def _calculate_personalization_score(
self,
product: Product,
user: UserProfile
) -> float:
"""Score based on user preferences and history."""
score = 0.5 # Neutral baseline
# Brand affinity
if product.brand in user.favorite_brands:
score += 0.2
elif product.brand in user.brand_affinity:
score += user.brand_affinity[product.brand] * 0.15
# Style match
style_overlap = set(product.style_tags) & set(user.preferred_styles)
score += len(style_overlap) * 0.1
# Style mismatch penalty
style_conflict = set(product.style_tags) & set(user.disliked_styles)
score -= len(style_conflict) * 0.15
# Color affinity
for color in product.colors:
if color in user.color_affinity:
score += user.color_affinity[color] * 0.1
# Price range fit
if user.price_range:
min_price, max_price = user.price_range
if min_price <= product.price <= max_price:
score += 0.1
elif product.price > max_price * 1.5:
score -= 0.1 # Significantly over budget
# Recency signals
if product.id in user.recently_viewed:
score += 0.05 # Slight boost for re-discovery
if product.id in user.wishlist:
score += 0.15 # Strong signal of interest
# Avoid showing recently purchased (unless consumable)
if product.id in user.recently_purchased:
score -= 0.3 # Usually don't want same item again
return max(0.0, min(1.0, score))
Learning User Preferences
Personalization only works if preferences stay current. We continuously learn from user behavior, treating every interaction as a signal.
The interaction hierarchy: Not all interactions are equal signals. A purchase is a strong commitment—the user spent money. A wishlist addition is a strong signal of interest. Cart additions are meaningful but often abandoned. Views are weak signals—users might view and immediately reject.
| Interaction | Weight | Signal Meaning |
|---|---|---|
| Purchase | 0.50 | "I definitely want things like this" |
| Wishlist | 0.40 | "I really like this, saving for later" |
| Cart addition | 0.30 | "Considering this seriously" |
| View (>30s) | 0.10 | "This caught my attention" |
Exponential moving averages for preference updates: We don't simply count interactions. Instead, we use exponential moving averages (EMA) that give more weight to recent behavior while maintaining memory of past preferences. The formula new_value = old_value * 0.9 + signal * 0.1 means recent interactions shift preferences gradually—a single purchase doesn't completely redefine someone's taste.
Why gradual updates matter: Fashion preferences change over time. A user who bought exclusively streetwear in college might transition to business casual after getting an office job. EMA lets us track this evolution without whiplash from individual purchases.
Price range learning: We update expected price ranges from purchase/cart behavior using the same EMA approach. This handles gradual lifestyle changes (user starts earning more → budget increases) while ignoring outlier purchases (one expensive gift doesn't mean they want $500 items).
class PreferenceLearner:
"""Learn user preferences from behavior."""
def update_from_interaction(
self,
user: UserProfile,
product: Product,
interaction: str # "view", "cart", "purchase", "wishlist"
) -> UserProfile:
"""Update user profile based on interaction."""
# Interaction weights
weights = {
"view": 0.1,
"cart": 0.3,
"wishlist": 0.4,
"purchase": 0.5
}
weight = weights.get(interaction, 0.1)
# Update color affinity
for color in product.colors:
current = user.color_affinity.get(color, 0.5)
user.color_affinity[color] = current * 0.9 + weight * 0.1
# Update brand affinity
current = user.brand_affinity.get(product.brand, 0.5)
user.brand_affinity[product.brand] = current * 0.9 + weight * 0.1
# Update price range (exponential moving average)
if interaction in ("cart", "purchase"):
min_p, max_p = user.price_range or (0, 500)
user.price_range = (
min_p * 0.95 + product.price * 0.05,
max_p * 0.95 + product.price * 0.05
)
return user
Putting It All Together: The Full Pipeline
Here's the complete flow from user query to ranked results.
The seven-stage pipeline: Every search request flows through seven stages, each adding intelligence:
User Query → [1. Parse] → [2. LLM Intent] → [3. Expand] → [4. Retrieve] → [5. Rank] → [6. Filter] → [7. Rerank] → Results
| Stage | What Happens | Latency | When Skipped |
|---|---|---|---|
| 1. Query Parse | Normalize, extract attributes, classify intent | ~5ms | Never |
| 2. LLM Intent | Deep semantic understanding | ~150ms | Simple queries |
| 3. Expansion | Synonym and semantic expansion | ~2ms | Never |
| 4. Hybrid Retrieve | BM25 + Vector search + RRF merge | ~50ms | Never |
| 5. Multi-Signal Rank | Score on 4 signals, sort | ~10ms | Never |
| 6. Hard Filter | Remove violating products | ~2ms | No hard constraints |
| 7. LLM Rerank | Holistic reasoning on top results | ~200ms | Simple queries |
The composition pattern: Notice how SearchPipeline composes specialized services rather than implementing everything inline. Each component (QueryUnderstandingPipeline, RAGService, ResultRanker, LLMClient) can be developed, tested, and optimized independently. This modularity is essential for a system this complex.
Graceful degradation: The pipeline handles missing components gracefully. If llm_client is None, we skip LLM-powered steps. If caching fails, we compute results fresh. If one retrieval method fails, we fall back to the other. Search should never completely fail—there's always a reasonable fallback.
The merge pattern: Step 2 (_merge_intents) combines rule-based parsing with LLM parsing. Rules give us fast, reliable extraction of structured attributes. The LLM adds semantic understanding and handles edge cases. Merging lets both contribute to the final intent.
class SearchPipeline:
"""End-to-end LLM-powered search pipeline."""
def __init__(
self,
query_pipeline: QueryUnderstandingPipeline,
rag_service: RAGService,
ranker: ResultRanker,
llm_client: LLMClient,
):
self.query_pipeline = query_pipeline
self.rag_service = rag_service
self.ranker = ranker
self.llm = llm_client
async def search(
self,
raw_query: str,
conversation: ConversationHistory,
user_context: Optional[UserContext] = None,
) -> SearchResponse:
"""Execute full search pipeline."""
# Step 1: Query Understanding
parsed = self.query_pipeline.parse(raw_query)
# Step 2: LLM Intent Parsing (for complex queries)
if self.llm:
intent = await self._parse_with_llm(raw_query, conversation)
parsed = self._merge_intents(parsed, intent)
# Step 3: Query Expansion
expanded_terms = self.query_pipeline.expander.expand_query(
parsed.normalized_query
)
# Step 4: Hybrid RAG Search
rag_results = await self.rag_service.search(
query=parsed.semantic_query or parsed.normalized_query,
expanded_terms=expanded_terms,
filters=self._build_metadata_filters(parsed.filters),
top_k=50,
)
# Step 5: Multi-Signal Ranking
ranked = self.ranker.rank(
products=rag_results.products,
rag_results=rag_results,
constraints=parsed.constraints,
)
# Step 6: Apply hard filters
filtered = self.ranker.filter_by_hard_constraints(
ranked,
parsed.constraints
)
# Step 7: Optional LLM re-ranking for top results
if self.llm and len(filtered) > 5:
filtered = await self._llm_rerank(
filtered[:15],
raw_query,
conversation
)
return SearchResponse(
results=filtered[:10],
query_understanding=parsed,
suggestions=self._generate_refinements(parsed, filtered),
)
Understanding the SearchPipeline implementation in detail:
This is the orchestration layer that ties everything together. Let's examine each step and the design decisions behind it:
Step 1 - Query Understanding (query_pipeline.parse): This is always fast (~5ms) because it uses deterministic rules. It extracts colors, materials, price ranges, and classifies intent. Even if the LLM fails later, we have a usable parsed intent from rules.
Step 2 - LLM Intent Parsing (conditional): Notice the if self.llm: check. This allows running the pipeline without LLM (for testing, fallback, or cost control). The _parse_with_llm call is async because it involves network I/O. The _merge_intents function combines rule-based and LLM extractions—rules provide reliable structured data, LLM provides semantic understanding.
Why merge instead of replace? Rule-based extraction is 100% deterministic: if "navy" is in the query and in our color list, we extract it. LLM extraction is probabilistic—it might miss "navy" or hallucinate "blue." By merging, we get the reliability of rules with the nuance of LLMs. Specifically:
- Union extracted colors (rules ∪ LLM)
- Union extracted materials
- Take LLM's semantic_query (rules don't generate this)
- Use rules' price range if present, else LLM's
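A minimal sketch of what _merge_intents could look like under these rules, assuming ParsedIntent is a Pydantic-style model (as the parse_raw/json calls later in this article suggest) with the filters, constraints, and semantic_query fields used elsewhere; the exact field names are illustrative:
def _merge_intents(self, rules: ParsedIntent, llm: ParsedIntent) -> ParsedIntent:
    """Combine deterministic rule extraction with LLM extraction (sketch)."""
    merged = rules.copy(deep=True)
    # Union the attribute extractions: rules give precision, the LLM adds recall
    merged.filters.colors = sorted(set(rules.filters.colors) | set(llm.filters.colors))
    merged.filters.materials = sorted(set(rules.filters.materials) | set(llm.filters.materials))
    # Only the LLM produces a natural-language semantic query for vector search
    merged.semantic_query = llm.semantic_query
    # Prefer the deterministic price range when rules found one
    if merged.constraints.price_range is None:
        merged.constraints.price_range = llm.constraints.price_range
    return merged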
Step 3 - Query Expansion: This adds synonyms and related terms. "Navy" becomes "navy navy-blue midnight-blue dark-blue." This happens AFTER LLM parsing because we want to expand the final, merged intent. The expansion feeds into BM25 (more keyword variations) but not vector search (embedding handles synonyms naturally).
Step 4 - Hybrid RAG Search: The semantic_query or normalized_query pattern is a fallback—if LLM didn't generate a semantic query, use the normalized version. The top_k=50 retrieves more candidates than we'll show because ranking will reorder them and filters will remove some.
Step 5 - Multi-Signal Ranking: This is where we apply our four-signal formula (semantic, constraint, metadata, quality). The ranker receives:
- products: the actual product objects
- rag_results: includes retrieval scores and matched terms
- constraints: the parsed constraints for filtering
Step 6 - Hard Filter Application: After soft ranking, we apply hard filters. "Under €200" is a hard constraint—products over €200 should never appear, regardless of ranking score. This happens AFTER ranking because we want highly-ranked products to appear first among those that satisfy constraints.
Why filter after ranking, not before? If we filtered first, we'd pass a smaller candidate set to ranking. This seems more efficient but hurts quality. Some excellent semantic matches might be at €201—filtering them before ranking means we never consider them. Filtering after ranking means we rank everything, then remove violations. The user sees only constraint-satisfying products, but they're the best constraint-satisfying products.
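For illustration, filter_by_hard_constraints can be a simple pass over the already-ranked list that drops violations while preserving order. This is a sketch; RankedResult, SearchConstraints, and the price_max / excluded_categories fields are assumed names, not the article's exact types:
def filter_by_hard_constraints(
    self,
    ranked: list[RankedResult],
    constraints: SearchConstraints,
) -> list[RankedResult]:
    """Drop products that violate hard constraints, keeping rank order."""
    def satisfies(result: RankedResult) -> bool:
        product = result.product
        if constraints.price_max is not None and product.price > constraints.price_max:
            return False
        if constraints.excluded_categories and product.category in constraints.excluded_categories:
            return False
        return True
    return [r for r in ranked if satisfies(r)]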
Step 7 - LLM Re-ranking (conditional): Only triggers if: (1) LLM is available, and (2) we have more than 5 results. For short result lists, reranking adds latency without much value. We rerank top 15, not all results—LLM context is limited and expensive.
The SearchResponse object: The response includes:
- results: top 10 products, fully ranked
- query_understanding: the parsed intent (useful for debugging, UI display)
- suggestions: refinement suggestions ("Try filtering by: brand, size")
Error handling (not shown): In production, each step would have try/catch blocks. If LLM parsing fails, we use rule-based only. If vector search fails, we use BM25 only. If ranking fails, we return unranked results. The principle: always return something useful.
Observability hooks (not shown): Between each step, we'd emit metrics (latency, success/failure) and structured logs. This lets us answer questions like "How often does LLM parsing add useful information?" and "What's the latency distribution of Step 4?"
Classical vs LLM-Powered: A Direct Comparison
| Aspect | Classical (NER + ES) | LLM-Powered |
|---|---|---|
| Query: "cozy navy puffer" | Extracts "navy", searches keywords | Understands warmth, expands to down/quilted/padded, filters by color |
| Query: "something for work" | No extraction possible | Extracts occasion, filters by "office/business" tags |
| Query: "under €200" | Requires custom regex | LLM extracts price constraint reliably |
| Multi-turn: "cheaper ones" | Starts fresh search | Maintains context, filters previous results |
| Typos: "navvy jackket" | Fails to match | Vector search handles gracefully |
| Synonyms: "puffer vs down" | Misses unless manually mapped | Semantic similarity captures both |
| Complex: "jacket AND shoes" | Single query, mixed results | Decomposes into separate searches |
| Latency | 50-100ms | 200-500ms (with LLM calls) |
| Cost | Infrastructure only | Infrastructure + LLM API costs |
Production Considerations
Latency Optimization
LLM calls add latency. Without optimization, a single search request might take:
- Query parsing: 5ms (rules) + 200ms (LLM)
- BM25 search: 30ms
- Vector search: 50ms
- Ranking: 20ms
- LLM reranking: 250ms
- Total: 555ms (unacceptable for real-time search)
With optimization, we can achieve 200-300ms for most queries. Strategies to manage this:
- Parallel execution: Run LLM parsing and BM25 search simultaneously—don't wait for one to start the other
- Caching: Cache parsed intents for common queries (80% of queries are repeats)
- Tiered processing: Use rule-based for simple queries, LLM only for complex ones
- Streaming: Return initial results fast, refine with LLM asynchronously
The parallel execution pattern: The key insight is that BM25 search doesn't need LLM parsing to start—it can use the raw query. We start all three operations (rule parsing, BM25 search, LLM parsing) simultaneously. BM25 results arrive first, allowing us to show initial results while LLM parsing completes for refinement.
import asyncio

async def search_optimized(self, query: str) -> SearchResponse:
    """Run rule parsing, BM25 retrieval, and LLM parsing concurrently."""
    # Start all three operations in parallel
    rule_based_future = asyncio.create_task(
        asyncio.to_thread(self._rule_based_parse, query)  # sync and cheap; run off the event loop
    )
    bm25_future = asyncio.create_task(
        self.bm25_index.search(query, top_k=50)
    )
    llm_future = asyncio.create_task(
        self._llm_parse(query)  # slowest of the three
    )
    # The fast paths complete first
    rule_based = await rule_based_future
    bm25_results = await bm25_future
    # An initial ranking is available here; in a streaming setup it could be
    # sent to the client already, before the LLM call finishes
    initial_results = self._quick_rank(bm25_results, rule_based)
    # Refine with the LLM's deeper understanding once it arrives
    llm_parsed = await llm_future
    enhanced_results = self._rerank_with_llm(initial_results, llm_parsed)
    return enhanced_results
Cost Management
LLM API costs can add up. At scale, naive implementation can cost more than your entire infrastructure.
The cost reality check: Let's do the math for a mid-size e-commerce site:
- 1 million searches/day
- LLM parsing: ~500 tokens/query ≈ $75/day
- LLM reranking: ~2,000 tokens/query ≈ $300/day (if called on every query)
- Naive total: ~$11,250/month just for LLM calls
With optimization, we can reduce this by 80%:
| Strategy | Savings | How It Works |
|---|---|---|
| Intent classification first | 60% | Only use LLM for complex queries (~40% need it) |
| Batch processing | 10% | Group similar queries, reduce per-call overhead |
| Model selection | 50% | GPT-4o-mini costs roughly one-tenth as much as GPT-4 |
| Caching aggressively | 70% | Same query = cached parse result |
Combined impact: With all strategies applied, the LLM bill for 1M queries/day drops to roughly $2,000-2,500/month (about 80% below the naive $11,250). Still a real cost, but easily justified by conversion improvements.
Model selection guidance:
- Query parsing: GPT-4o-mini or Claude Haiku (simple task, speed matters)
- Complex reasoning: GPT-4o or Claude Sonnet (when quality matters)
- Reranking: GPT-4o-mini (structured output, moderate complexity)
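To make the tiered-processing idea concrete, here is one way a router could decide whether to call an LLM at all and which model to use. The thresholds and the choose_parsing_strategy name are assumptions for illustration, not a prescribed policy:
def choose_parsing_strategy(query: str, rule_parsed: ParsedIntent) -> str:
    """Route each query to the cheapest strategy that can handle it (sketch)."""
    tokens = query.split()
    # Short queries that rule-based parsing resolved confidently: skip the LLM
    if len(tokens) <= 3 and rule_parsed.confidence > 0.8:
        return "rules_only"
    # Moderate complexity: a small, fast model is enough
    if len(tokens) <= 10 and not rule_parsed.subqueries:
        return "gpt-4o-mini"
    # Long, multi-intent, or conversational queries: use the stronger model
    return "gpt-4o"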
Fallback Strategy
Always have fallbacks when LLM calls fail.
Why fallbacks are non-negotiable: LLM APIs are external dependencies you don't control. OpenAI has outages. Rate limits get hit during traffic spikes. Network issues happen. A search system that returns errors when the LLM is unavailable is unacceptable—users expect search to always work.
The fallback hierarchy:
- Primary: LLM-powered parsing with full semantic understanding
- Secondary: Rule-based parsing with synonym expansion (no LLM)
- Tertiary: Raw keyword search (no parsing at all)
Each level degrades gracefully. Rule-based parsing still extracts colors, prices, and categories—just without nuanced intent understanding. Raw keyword search is the last resort but still returns relevant products.
Timeout-based fallbacks: We don't wait forever for LLM responses. If parsing takes >500ms, we use whatever rule-based results we have. This bounds worst-case latency regardless of LLM performance.
async def _parse_with_fallback(self, query: str) -> ParsedIntent:
    try:
        # Bound worst-case latency: give the LLM at most 500ms
        return await asyncio.wait_for(self._llm_parse(query), timeout=0.5)
    except (asyncio.TimeoutError, TimeoutError, APIError) as e:
        logger.warning(f"LLM parsing failed: {e}, falling back to rules")
        return self._rule_based_parse(query)
Caching Strategies
Caching is critical for both performance and cost control.
The caching opportunity: E-commerce search has high query repetition. "Nike shoes," "black dress," and "winter jacket" appear thousands of times daily. Without caching, we'd call the LLM for intent parsing on every single request—expensive and slow. With caching, we parse once and reuse.
What to cache (and what not to):
- Cache: Intent parsing (same query = same intent), embeddings (expensive to compute), search results (within TTL)
- Don't cache: Personalized rankings (user-specific), real-time inventory (changes constantly), session-specific context
Cache key design matters: The key for search results must include the query AND all filters. "Navy jacket" with price_max=100 should not return cached results for "navy jacket" with price_max=200. We hash the query + filters combination to create unique keys.
Multi-layer caching: We use both in-process caching (Python's lru_cache for hot data) and distributed caching (Redis for shared data). In-process is faster (no network hop) but limited to one server. Redis is shared across all servers but adds ~1ms latency.
TTL strategy: Different data types need different expiration times. Intent parsing results are stable (1 hour TTL). Search results change with inventory (5 minute TTL). Embeddings almost never change (24 hour TTL).
from functools import lru_cache
from typing import Callable
import hashlib
import json
import redis
class SearchCache:
"""Multi-layer caching for search operations."""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
# TTLs by cache type
self.ttls = {
"intent_parse": 3600, # 1 hour - same query = same intent
"embeddings": 86400, # 24 hours - embeddings are stable
"search_results": 300, # 5 minutes - inventory changes
"llm_rerank": 600, # 10 minutes - expensive, worth caching
}
def _cache_key(self, prefix: str, *args) -> str:
"""Generate cache key from arguments."""
content = ":".join(str(a) for a in args)
hash_val = hashlib.md5(content.encode()).hexdigest()[:12]
return f"search:{prefix}:{hash_val}"
async def get_or_compute_intent(
self,
query: str,
compute_fn: Callable
) -> ParsedIntent:
"""Cache intent parsing results."""
key = self._cache_key("intent", query.lower().strip())
# Check cache
cached = self.redis.get(key)
if cached:
return ParsedIntent.parse_raw(cached)
# Compute and cache
result = await compute_fn(query)
self.redis.setex(key, self.ttls["intent_parse"], result.json())
return result
async def get_or_compute_embedding(
self,
text: str,
compute_fn: Callable
) -> list[float]:
"""Cache embedding computations."""
key = self._cache_key("embed", text[:500]) # Truncate for key
cached = self.redis.get(key)
if cached:
return json.loads(cached)
embedding = await compute_fn(text)
self.redis.setex(key, self.ttls["embeddings"], json.dumps(embedding))
return embedding
def cache_search_results(
self,
query: str,
filters: dict,
results: list[str] # Product IDs
) -> None:
"""Cache search result IDs (not full products)."""
key = self._cache_key("results", query, json.dumps(filters, sort_keys=True))
self.redis.setex(key, self.ttls["search_results"], json.dumps(results))
    def invalidate_product(self, product_id: str) -> None:
        """Invalidate result caches when a product changes."""
        # Brute force: drop every cached result set.
        # In practice, prefer Redis pub/sub, cache tags, or versioning (see below).
        pattern = "search:results:*"
        for key in self.redis.scan_iter(match=pattern):
            self.redis.delete(key)
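One way to avoid that brute-force sweep is a catalog version counter: bump a single Redis key whenever inventory changes and bake the version into every result-cache key, so stale entries are simply never read again and expire via their TTL. A sketch, with illustrative key names:
class VersionedResultCache:
    """Result caching keyed by a global catalog version (sketch)."""
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
    def _catalog_version(self) -> int:
        return int(self.redis.get("search:catalog_version") or 0)
    def result_key(self, query: str, filters: dict) -> str:
        payload = f"{query}:{json.dumps(filters, sort_keys=True)}:v{self._catalog_version()}"
        return "search:results:" + hashlib.md5(payload.encode()).hexdigest()[:12]
    def invalidate_catalog(self) -> None:
        # O(1) invalidation: old keys are never read again and fall out via TTL
        self.redis.incr("search:catalog_version")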
Caching decision matrix:
| Component | Cache? | TTL | Invalidation |
|---|---|---|---|
| Intent parsing | Yes | 1 hour | On query change only |
| Query embeddings | Yes | 24 hours | Rarely changes |
| Product embeddings | Yes | Until product update | On product change |
| BM25 results | Yes | 5 minutes | On inventory change |
| Vector results | Yes | 5 minutes | On inventory change |
| LLM reranking | Yes | 10 minutes | On result set change |
| Personalization | No | - | Too user-specific |
Monitoring and Observability
Production search requires comprehensive monitoring.
Why search monitoring is different: Unlike typical API monitoring (is it up? what's the latency?), search monitoring must answer qualitative questions: Are results good? Are users finding what they want? Is the LLM improving results or degrading them? These require custom metrics beyond standard observability.
The three pillars of search monitoring:
| Pillar | What It Answers | Example Metrics |
|---|---|---|
| Operational | Is the system healthy? | Latency p50/p99, error rate, throughput |
| Quality | Are results good? | Result scores, click-through rate, conversion |
| Cost | Is it economical? | LLM calls/query, cache hit rate, API spend |
Stage-level latency tracking: Total latency hides problems. If search takes 500ms, is it the LLM parsing (fixable with caching) or the vector search (needs index optimization)? We track latency for each pipeline stage separately: parse, retrieve, rank, rerank.
Result quality histograms: We track the distribution of result scores by position. If position #1 consistently scores 0.9 but position #10 scores 0.3, that's healthy—relevance degrades with rank. If position #1 averages 0.4, we have a ranking problem.
The structured logging pattern: Rather than log strings like "Search completed in 234ms", we log structured data with fields. This enables powerful queries: "Show me all searches where intent=outfit AND latency>500ms AND result_count=0."
from prometheus_client import Counter, Histogram, Gauge
import structlog
import uuid
logger = structlog.get_logger()
# === Metrics ===
search_requests = Counter(
"search_requests_total",
"Total search requests",
["intent_type", "has_conversation"]
)
search_latency = Histogram(
"search_latency_seconds",
"Search request latency",
["stage"], # "total", "parse", "retrieve", "rank", "rerank"
buckets=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)
llm_calls = Counter(
"llm_calls_total",
"LLM API calls",
["operation", "model", "status"] # parse, rerank; success, failure
)
cache_hits = Counter(
"cache_hits_total",
"Cache hit/miss",
["cache_type", "hit"] # intent, embedding, results; true, false
)
result_quality = Histogram(
"search_result_scores",
"Distribution of result scores",
["position"], # 1, 2, 3, ... 10
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
class InstrumentedSearchPipeline(SearchPipeline):
"""Search pipeline with full observability."""
async def search(
self,
raw_query: str,
conversation: ConversationHistory,
user_context: Optional[UserContext] = None,
) -> SearchResponse:
"""Execute search with metrics and logging."""
request_id = str(uuid.uuid4())[:8]
with search_latency.labels(stage="total").time():
# Log request
logger.info(
"search_request",
request_id=request_id,
query=raw_query,
conversation_turns=len(conversation.messages),
user_id=user_context.user_id if user_context else None
)
# Parse with timing
with search_latency.labels(stage="parse").time():
parsed = await self._parse_query(raw_query, conversation)
search_requests.labels(
intent_type=parsed.intent.value,
has_conversation=str(len(conversation.messages) > 0)
).inc()
# Retrieve with timing
with search_latency.labels(stage="retrieve").time():
candidates = await self._retrieve(parsed)
# Rank with timing
with search_latency.labels(stage="rank").time():
ranked = self._rank(candidates, parsed)
# Optional LLM rerank
if self._should_llm_rerank(parsed, ranked):
with search_latency.labels(stage="rerank").time():
ranked = await self._llm_rerank(ranked, raw_query, conversation)
# Log result quality
for i, result in enumerate(ranked[:10], 1):
result_quality.labels(position=str(i)).observe(result.final_score)
logger.info(
"search_response",
request_id=request_id,
result_count=len(ranked),
top_score=ranked[0].final_score if ranked else 0,
intent=parsed.intent.value
)
return SearchResponse(
results=ranked[:10],
query_understanding=parsed,
request_id=request_id
)
Key metrics to track:
| Metric | Alert Threshold | Action |
|---|---|---|
| p99 latency | > 2s | Investigate slow queries |
| LLM failure rate | > 5% | Check API status, enable fallbacks |
| Cache hit rate | < 70% | Review cache strategy |
| Empty result rate | > 10% | Improve recall, check index |
| Low score searches | > 20% results < 0.3 | Review ranking weights |
| Conversion by search type | Below baseline | A/B test changes |
Logging best practices:
# Structured logging for debugging
logger.info(
"query_understanding",
raw_query=raw_query,
normalized=parsed.normalized_query,
semantic_query=parsed.semantic_query,
extracted_colors=parsed.filters.colors,
extracted_occasions=parsed.filters.occasions,
intent=parsed.intent.value,
confidence=parsed.confidence,
expansion_terms=expanded_terms[:10],
subqueries=[sq.type.value for sq in parsed.subqueries]
)
Measuring Success
Offline Evaluation Metrics
| Metric | Description | Target | Measurement |
|---|---|---|---|
| Recall@10 | % of relevant items in top 10 | >80% | Human-labeled test set |
| Precision@10 | % of top 10 that are relevant | >70% | Human-labeled test set |
| MRR | Mean Reciprocal Rank of first relevant | >0.5 | Click-through data |
| NDCG@10 | Normalized Discounted Cumulative Gain | >0.7 | Graded relevance labels |
| Constraint Satisfaction | % of results matching extracted constraints | >90% | Automated check |
| Query Understanding Accuracy | % of intents correctly parsed | >85% | Human evaluation |
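These offline metrics are straightforward to compute once you have labeled (query, relevant product IDs) pairs. The sketch below uses binary relevance for NDCG as a simplification, whereas the table assumes graded labels:
import math
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)
def mrr(ranked_ids: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none is found)."""
    for rank, pid in enumerate(ranked_ids, 1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0
def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance labels."""
    dcg = sum(1.0 / math.log2(rank + 1) for rank, pid in enumerate(ranked_ids[:k], 1) if pid in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0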
Online Business Metrics
| Metric | Description | Target | Why It Matters |
|---|---|---|---|
| Conversion Rate | Searches → purchases | +15-25% | Revenue impact |
| Search Abandonment | Searches with no clicks | -25-35% | User satisfaction |
| Add-to-Cart Rate | Searches → cart adds | +20-30% | Engagement quality |
| Time to First Click | Seconds until first result click | <5s | Relevance signal |
| Pages per Search | Result pages viewed | -20% | Finding faster |
| Revenue per Search | Average order value from search | +10-20% | Business outcome |
A/B Testing Framework
Measure the impact of LLM-powered search systematically.
Why A/B testing is essential for search: Search quality is ultimately subjective—do users find what they want? Offline metrics (recall, precision) correlate with user satisfaction but don't measure it directly. A/B testing measures what matters: do users click more? buy more? abandon less?
The search A/B testing challenge: Unlike button color tests where you measure one metric, search A/B tests must track the entire funnel: impressions → clicks → cart → purchase. A change that increases clicks but decreases purchases isn't a win. We need statistical significance on multiple correlated metrics.
Consistent user assignment: Users must see the same variant throughout their session and across sessions. If a user searches "navy jacket" in treatment, then "cheaper ones" in control, the context is broken. We use deterministic hashing of user ID + experiment ID to ensure consistency.
Sample size requirements: Search experiments need large samples because conversion rates are low (2-5%). To detect a 10% relative improvement in conversion with 95% confidence, you typically need tens of thousands of users per variant (see the sketch below). Plan experiments to run 1-2 weeks, not days.
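A quick power calculation shows where numbers like this come from. The sketch uses the standard two-proportion formula with a two-sided alpha of 0.05 and 80% power; the 3% baseline in the example is illustrative:
import math
def sample_size_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users per variant to detect a relative lift in a conversion rate."""
    z_alpha, z_beta = 1.96, 0.84  # two-sided 95% confidence, 80% power
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return math.ceil(n)
# sample_size_per_variant(0.03, 0.10) -> roughly 53,000 users per variant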
Guardrail metrics: Beyond the metrics you're trying to improve, watch guardrails—things that shouldn't get worse. If your new ranking improves conversion but doubles latency, that's a failed experiment. Define guardrails upfront: latency p99 < 1s, error rate < 0.1%, etc.
class SearchABTest:
"""A/B testing framework for search experiments."""
def __init__(self, experiment_id: str, traffic_split: float = 0.5):
self.experiment_id = experiment_id
self.traffic_split = traffic_split
self.metrics_store = MetricsStore()
def get_variant(self, user_id: str) -> str:
"""Deterministically assign user to variant."""
# Consistent hashing ensures same user always gets same variant
hash_val = int(hashlib.md5(
f"{self.experiment_id}:{user_id}".encode()
).hexdigest(), 16)
return "treatment" if (hash_val % 100) < (self.traffic_split * 100) else "control"
async def search_with_experiment(
self,
query: str,
user_id: str,
control_pipeline: SearchPipeline,
treatment_pipeline: SearchPipeline,
) -> tuple[SearchResponse, str]:
"""Execute search with experiment tracking."""
variant = self.get_variant(user_id)
if variant == "treatment":
response = await treatment_pipeline.search(query)
else:
response = await control_pipeline.search(query)
# Track experiment assignment
self.metrics_store.record_experiment(
experiment_id=self.experiment_id,
user_id=user_id,
variant=variant,
query=query,
result_ids=[r.product.id for r in response.results],
scores=[r.final_score for r in response.results],
)
return response, variant
def track_conversion(
self,
user_id: str,
product_id: str,
event_type: str, # "click", "cart", "purchase"
revenue: Optional[float] = None
):
"""Track conversion events for experiment analysis."""
self.metrics_store.record_conversion(
experiment_id=self.experiment_id,
user_id=user_id,
product_id=product_id,
event_type=event_type,
revenue=revenue
)
def analyze_results(self) -> dict:
"""Compute experiment results with statistical significance."""
control_metrics = self.metrics_store.get_metrics(
self.experiment_id, "control"
)
treatment_metrics = self.metrics_store.get_metrics(
self.experiment_id, "treatment"
)
return {
"conversion_rate": {
"control": control_metrics.conversion_rate,
"treatment": treatment_metrics.conversion_rate,
"lift": self._calc_lift(
control_metrics.conversion_rate,
treatment_metrics.conversion_rate
),
"p_value": self._chi_square_test(
control_metrics, treatment_metrics, "conversions"
),
"significant": self._is_significant(
control_metrics, treatment_metrics, "conversions"
)
},
"revenue_per_search": {
"control": control_metrics.revenue_per_search,
"treatment": treatment_metrics.revenue_per_search,
"lift": self._calc_lift(
control_metrics.revenue_per_search,
treatment_metrics.revenue_per_search
),
"p_value": self._t_test(
control_metrics, treatment_metrics, "revenue"
),
},
"click_through_rate": {
"control": control_metrics.ctr,
"treatment": treatment_metrics.ctr,
"lift": self._calc_lift(control_metrics.ctr, treatment_metrics.ctr),
},
"sample_size": {
"control": control_metrics.n,
"treatment": treatment_metrics.n,
}
}
Deep dive into the A/B testing implementation:
The get_variant method and consistent hashing: The hashing approach deserves detailed explanation. We concatenate experiment_id and user_id, hash the result with MD5, convert to an integer, and take modulo 100. This produces a number 0-99 that's deterministic for each user-experiment pair.
Why MD5 and not something simpler? MD5 has excellent distribution properties—it spreads inputs uniformly across the hash space. Simpler approaches (like user_id % 100) would create patterns if user IDs are sequential. MD5 ensures a user ending in "7" isn't always in treatment.
Why include experiment_id in the hash? This allows running multiple experiments simultaneously with independent assignments. User #12345 might be in treatment for experiment A and control for experiment B.
The search_with_experiment method: Notice we call different pipeline instances, not toggle a feature flag. This is important for clean experimentation:
- Separate codepaths: Treatment and control pipelines can have completely different configurations
- No contamination: Treatment logic can't accidentally affect control results
- Easy rollback: If treatment fails, control is unaffected
The method returns both the response AND the variant. The variant is needed downstream—the frontend might show different UX for treatment users, or we might log it for analysis.
Recording experiment data: The metrics_store.record_experiment call captures everything needed for analysis:
- result_ids: which products were shown (for position-based analysis)
- scores: internal ranking scores (for debugging ranking changes)
- query: the actual query (for query-type segmentation)
This data lives in a specialized analytics store (ClickHouse, BigQuery, or Druid), not the operational database. We need fast writes (millions per day) and fast analytical queries (aggregations, cohort analysis).
The track_conversion method: This is called separately, often hours or days after the search. A user might search Monday and purchase Wednesday. The user_id links the conversion back to the experiment assignment. The product_id tells us whether they purchased a product from search results (attributable conversion) or something else entirely.
The analyze_results method: This computes statistical significance using chi-squared tests for proportions (conversion rates) and t-tests for continuous metrics (revenue). The key outputs:
- Lift: Percentage improvement (treatment - control) / control
- p-value: The probability of seeing a difference at least this large if there were no real effect. p < 0.05 is the conventional significance threshold.
- Confidence interval: Range of plausible true lifts (not shown, but important)
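The statistical helpers assumed above (_calc_lift, _chi_square_test, _t_test) can be thin wrappers around scipy. A sketch of the idea, with simplified signatures:
from scipy import stats
def calc_lift(control: float, treatment: float) -> float:
    """Relative improvement of treatment over control."""
    return (treatment - control) / control if control else 0.0
def proportion_test(conversions_c: int, n_c: int, conversions_t: int, n_t: int) -> float:
    """Chi-squared test on a 2x2 converted/not-converted table; returns the p-value."""
    table = [
        [conversions_c, n_c - conversions_c],
        [conversions_t, n_t - conversions_t],
    ]
    _, p_value, _, _ = stats.chi2_contingency(table)
    return p_value
def revenue_test(revenue_per_user_c: list[float], revenue_per_user_t: list[float]) -> float:
    """Welch's t-test on per-user revenue; returns the p-value."""
    _, p_value = stats.ttest_ind(revenue_per_user_c, revenue_per_user_t, equal_var=False)
    return p_value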
Common A/B testing pitfalls we avoid:
- Peeking: Checking results daily and stopping when significant. This inflates false positives. We commit to a sample size upfront.
- Multiple comparisons: Testing 10 metrics means 1 will be "significant" by chance. We designate a primary metric and treat others as exploratory.
- Selection bias: Only analyzing users who searched in both periods. We include all users assigned to each variant.
- Simpson's paradox: A change that helps overall might hurt a key segment. We slice results by user type, query type, and device.
Benchmark Results: Classical vs LLM-Powered
Based on testing with a 500K product fashion catalog:
Retrieval Quality (Offline Evaluation)
| Query Type | Classical (ES) | Hybrid RAG | +LLM Rerank | Improvement |
|---|---|---|---|---|
| Simple ("navy jacket") | 82% R@10 | 85% R@10 | 86% R@10 | +5% |
| Semantic ("cozy winter look") | 41% R@10 | 78% R@10 | 84% R@10 | +105% |
| Multi-constraint ("navy puffer under €200") | 52% R@10 | 81% R@10 | 88% R@10 | +69% |
| Multi-turn refinement | 38% R@10 | 74% R@10 | 82% R@10 | +116% |
| Average | 53% R@10 | 80% R@10 | 85% R@10 | +60% |
Latency Breakdown (p50 / p99)
| Stage | Latency p50 | Latency p99 |
|---|---|---|
| Intent Parsing (rule-based) | 5ms | 15ms |
| Intent Parsing (LLM) | 120ms | 350ms |
| BM25 Search | 15ms | 45ms |
| Vector Search | 25ms | 80ms |
| Hybrid Merge (RRF) | 3ms | 8ms |
| Initial Ranking | 10ms | 30ms |
| LLM Reranking | 180ms | 450ms |
| Total (without LLM) | 58ms | 178ms |
| Total (with LLM) | 358ms | 978ms |
Business Impact (A/B Test, 4 weeks, 100K users)
| Metric | Control (Classical) | Treatment (LLM) | Lift | Significance |
|---|---|---|---|---|
| Search CTR | 34.2% | 41.8% | +22.2% | p < 0.001 |
| Add-to-Cart Rate | 8.1% | 10.4% | +28.4% | p < 0.001 |
| Conversion Rate | 2.8% | 3.5% | +25.0% | p < 0.01 |
| Revenue/Search | €4.20 | €5.45 | +29.8% | p < 0.01 |
| Search Abandonment | 28.5% | 19.2% | -32.6% | p < 0.001 |
Cost Analysis (1M searches/day)
| Component | Cost/Month | Notes |
|---|---|---|
| Vector Store (Qdrant) | $200 | Self-hosted, 500K products |
| Embeddings (cached) | $50 | OpenAI, high cache hit rate |
| LLM Parsing (tiered) | $150 | GPT-4o-mini, 30% of queries |
| LLM Reranking (selective) | $300 | GPT-4o-mini, 20% of queries |
| Total LLM costs | $500/month | |
| Revenue lift | +$45,000/month | Based on €5.45 vs €4.20 |
| ROI | 90x | Revenue lift ÷ total LLM costs |
Conclusion
Classical e-commerce search—NER feeding Elasticsearch—served us well for exact matches and explicit filters. But users don't search in keywords; they search in intent.
LLM-powered search bridges this gap:
- Understanding what users mean, not just what they type
- Expanding queries with domain-specific synonyms
- Combining semantic and keyword retrieval
- Ranking with multiple signals beyond relevance
- Conversing across multiple turns
The architecture we've built—query understanding pipeline, fashion-specific rules, hybrid RAG, multi-signal ranking, and conversation management—transforms e-commerce search from keyword matching to intent fulfillment.
The future of e-commerce search isn't better keyword matching. It's systems that understand shopping as a conversation, not a database query.