Training Embedding Models: From Contrastive Learning to Production Retrieval
A comprehensive guide to training text embedding models—from contrastive learning fundamentals to hard negative mining, multi-stage training, and the architectures behind E5, BGE, and GTE. Understanding the foundation of modern retrieval systems.
Why Embedding Models Matter
Every RAG system, semantic search engine, and retrieval-based application depends on embedding models. These models transform text into dense vectors where semantic similarity corresponds to geometric proximity. Yet while practitioners obsess over chunking strategies and prompt engineering, the embedding model itself often receives little attention—it's treated as a black box API call.
This matters because embedding quality fundamentally bounds retrieval quality. No amount of sophisticated re-ranking, query expansion, or hybrid search can recover from an embedding model that fails to capture the semantic relationships your application needs. If your embedding model doesn't understand that "myocardial infarction" and "heart attack" are semantically identical, your medical RAG system will miss half the relevant documents.
Understanding how embedding models are trained unlocks several capabilities: fine-tuning for your domain, selecting the right model for your use case, diagnosing retrieval failures, and making informed decisions about the quality-cost-latency trade-offs inherent in embedding selection.
The Embedding Landscape in 2025
The embedding model space has matured significantly. A few years ago, practitioners chose between OpenAI's ada-002 and a handful of open-source options. Today, the landscape includes:
Proprietary APIs:
- OpenAI text-embedding-3-large (3072 dimensions, strong general performance)
- Cohere embed-v3 (multiple size options, multilingual strength)
- Voyage AI (domain-specific variants for code, legal, finance)
- Google's Gecko (tight Vertex AI integration)
Open-Source Leaders:
- BGE (BAAI General Embedding) family from Beijing Academy of AI
- E5 family from Microsoft Research
- GTE (General Text Embeddings) from Alibaba
- Nomic Embed (fully open weights and training data)
- Jina Embeddings (8K context, multilingual)
Specialized Models:
- CodeBERT and derivatives for code
- SciBERT, PubMedBERT for scientific domains
- Legal-BERT, FinBERT for specialized domains
The performance gap between open-source and proprietary models has narrowed dramatically. On the MTEB (Massive Text Embedding Benchmark) leaderboard, open models now match or exceed proprietary options on many tasks. The decision often comes down to factors beyond raw performance: latency requirements, data privacy, fine-tuning needs, and cost at scale.
The Architecture of Embedding Models
Most modern text embedding models share a common architecture: a transformer encoder that processes input text and produces contextual representations, followed by a pooling operation that collapses the sequence into a single vector.
Encoder Architecture
The encoder is typically a BERT-style transformer, though the specific architecture varies:
BERT-base variants (110M parameters) offer good performance-to-cost ratios and remain popular for production deployments where latency matters.
BERT-large variants (340M parameters) provide better quality but significantly higher latency and cost.
Modern architectures like the E5-mistral models use decoder-only transformers (Mistral 7B) as the backbone, achieving state-of-the-art results but with substantial computational requirements.
The encoder processes input text through multiple transformer layers. Each layer applies self-attention (allowing tokens to attend to each other) followed by feed-forward networks. The output is a sequence of contextualized token representations—one vector per input token.
Pooling Strategies
Raw encoder output is a sequence of vectors. To get a single embedding, we must pool these vectors together. The choice of pooling strategy significantly impacts quality:
CLS Token Pooling: Use the representation of the special [CLS] token (first position). This was BERT's original design—the idea being that the [CLS] token, through attention, aggregates information from the entire sequence. In practice this works reasonably well, but relying on a single token can leave information from the rest of the sequence underused.
Mean Pooling: Average all token representations (excluding padding). This is now the dominant approach for embedding models. It ensures every token contributes to the final representation and proves more robust across different text lengths.
Last Token Pooling: Use the final token's representation. This works well for decoder-only models (like E5-mistral) where the last token naturally aggregates information through causal attention.
Attention-Weighted Pooling: Learn attention weights over tokens and compute a weighted average. More expressive but adds parameters and complexity.
The evolution from CLS pooling to mean pooling represents a shift in understanding: rather than trusting a single special token to aggregate information, mean pooling explicitly incorporates all tokens, proving more reliable in practice.
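To make mean pooling concrete, here is a minimal sketch with a Hugging Face encoder (the model name is just an example; any BERT-style encoder works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)         # [batch, 1]
    return summed / counts

batch = tokenizer(["The quick brown fox jumps over the lazy dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embedding = mean_pool(outputs.last_hidden_state, batch["attention_mask"])
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)  # L2-normalize for cosine similarity
```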
┌─────────────────────────────────────────────────────────────────────────────┐
│ EMBEDDING MODEL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input Text: "The quick brown fox jumps over the lazy dog" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKENIZATION │ │
│ │ [CLS] The quick brown fox jumps over the lazy dog [SEP] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN EMBEDDINGS │ │
│ │ Each token → 768-dimensional vector (from embedding table) │ │
│ │ + Position embeddings (learned or sinusoidal) │ │
│ │ + Token type embeddings (for multi-segment inputs) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRANSFORMER ENCODER (12 layers) │ │
│ │ │ │
│ │ Each layer: │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Multi-Head Self-Attention │ │ │
│ │ │ (tokens attend to all other tokens) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Feed-Forward Network │ │ │
│ │ │ (independent per-token transformations) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Output: Contextualized representations for each token │ │
│ │ Shape: [sequence_length, hidden_size] = [11, 768] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ POOLING │ │
│ │ │ │
│ │ Option 1: CLS Pooling → Take first token [768] │ │
│ │ Option 2: Mean Pooling → Average all tokens [768] │ │
│ │ Option 3: Attention Pooling → Learned weighted average [768] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ NORMALIZATION │ │
│ │ L2 normalize: vector / ||vector||₂ │ │
│ │ Result: Unit vector on hypersphere │ │
│ │ Cosine similarity = dot product (after normalization) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ │
│ Final Embedding: [0.023, -0.156, 0.089, ..., 0.045] (768 dimensions) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Dimensionality Considerations
Embedding dimensionality involves trade-offs:
Higher dimensions (1024, 1536, 3072) can capture more nuanced semantic distinctions but increase storage costs, retrieval latency, and memory requirements. They also require more training data to avoid overfitting.
Lower dimensions (256, 384, 512) are more efficient but may lose semantic precision. However, modern training techniques have pushed quality at lower dimensions remarkably high.
Matryoshka Representation Learning (MRL) trains embeddings so that truncated prefixes remain useful. A 1024-dimension embedding trained with MRL can be truncated to 256 dimensions with graceful degradation rather than catastrophic failure. This enables runtime flexibility—use full dimensions for high-precision needs, truncated for speed.
OpenAI's text-embedding-3 models support this natively: you can request any dimension up to 3072, and the API returns appropriately truncated embeddings.
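For models trained this way, truncation at inference time is just slicing followed by re-normalization; a quick sketch (dimensions chosen for illustration):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1024)           # stand-in for a 1024-d MRL-trained embedding
short = truncate_embedding(full, 256)  # 256-d version, still usable for cosine similarity
```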
Contrastive Learning: The Foundation
The dominant paradigm for training embedding models is contrastive learning. The core idea is elegantly simple: train the model to produce similar embeddings for semantically similar texts and dissimilar embeddings for unrelated texts.
The Contrastive Objective
Contrastive learning requires pairs (or groups) of texts with known relationships:
Positive pairs: Texts that should have similar embeddings. Examples include:
- A query and its relevant document
- A sentence and its paraphrase
- An article title and its body text
- A question and its answer
Negative pairs: Texts that should have dissimilar embeddings. These are typically other texts in the batch that happen to be unrelated.
The training objective pushes positive pairs together in embedding space while pushing negative pairs apart.
InfoNCE Loss
The most common contrastive loss is InfoNCE (an objective derived from noise-contrastive estimation), also known as NT-Xent (Normalized Temperature-scaled Cross Entropy) in some papers:
┌─────────────────────────────────────────────────────────────────────────────┐
│ InfoNCE LOSS EXPLAINED │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Given: │
│ - Query embedding: q │
│ - Positive document embedding: d⁺ │
│ - Negative document embeddings: d₁⁻, d₂⁻, ..., dₙ⁻ │
│ - Temperature parameter: τ (typically 0.01-0.1) │
│ │
│ Similarity function: sim(a, b) = cosine_similarity(a, b) / τ │
│ │
│ Loss = -log [ exp(sim(q, d⁺)) / (exp(sim(q, d⁺)) + Σᵢ exp(sim(q, dᵢ⁻))) ] │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Intuition: │
│ This is softmax cross-entropy where we want to "classify" the query │
│ as belonging to the positive document among all candidates. │
│ │
│ - Numerator: similarity to positive (should be high) │
│ - Denominator: similarity to positive + all negatives (normalization) │
│ │
│ Temperature τ: │
│ - Lower τ (0.01): sharper distribution, harder optimization │
│ - Higher τ (0.1): softer distribution, easier optimization │
│ - Typical: 0.02-0.05 for embedding training │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Visual representation: │
│ │
│ Before Training After Training │
│ │
│ d₁⁻ d₁⁻ │
│ · · │
│ q · ·d₂⁻ │
│ · │
│ d⁺ d₂⁻ q·d⁺ │
│ · · │
│ d₃⁻ d₃⁻· │
│ │
│ Query and positive are scattered Query and positive are close │
│ Negatives are mixed in Negatives are pushed away │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
In-Batch Negatives
A key efficiency trick is in-batch negatives: within a batch of N query-document pairs, each query's positive document becomes a negative for all other queries. This gives you N-1 negatives "for free" without additional computation.
For a batch size of 1024, each query gets 1023 negative examples. This is remarkably efficient but has implications:
- Larger batch sizes provide more negatives and generally better training
- Training often uses gradient accumulation or distributed training to achieve large effective batch sizes
- The negatives are "random" (whatever else is in the batch), not intentionally selected
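In code, in-batch InfoNCE reduces to cross-entropy over the batch-level similarity matrix; a minimal PyTorch sketch, assuming L2-normalized query and document embeddings from a batch of aligned pairs (the temperature value is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: [batch, dim]; row i of doc_emb is the positive for row i of query_emb."""
    # Similarity of every query against every document in the batch
    logits = query_emb @ doc_emb.T / temperature   # [batch, batch]
    # The diagonal holds the positives; every other entry acts as an in-batch negative
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```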
The Dual Encoder Architecture
For retrieval applications, we often want to encode queries and documents independently (so we can pre-compute document embeddings). This leads to the dual encoder or bi-encoder architecture:
Two separate encoders (or a shared encoder) process queries and documents independently. Their outputs are compared with cosine similarity or dot product.
This enables efficient retrieval: encode all documents offline, then at query time, encode only the query and find nearest neighbors using vector similarity search.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DUAL ENCODER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ QUERY ENCODER │ │ DOCUMENT ENCODER │ │
│ │ │ │ │ │
│ │ "What is RLHF?" │ │ "RLHF stands for │ │
│ │ │ │ │ Reinforcement..." │ │
│ │ ▼ │ │ │ │ │
│ │ [Transformer] │ │ [Transformer] │ │
│ │ │ │ │ │ │ │
│ │ ▼ │ │ ▼ │ │
│ │ [Mean Pool] │ │ [Mean Pool] │ │
│ │ │ │ │ │ │ │
│ │ ▼ │ │ ▼ │ │
│ │ q ∈ ℝ⁷⁶⁸ │ │ d ∈ ℝ⁷⁶⁸ │ │
│ └──────────┬──────────────┘ └──────────┬──────────────┘ │
│ │ │ │
│ └───────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ SIMILARITY SCORE │ │
│ │ │ │
│ │ score = q · d │ │
│ │ ───────── │ │
│ │ ||q|| ||d|| │ │
│ │ │ │
│ │ (cosine similarity) │ │
│ └────────────────────────┘ │
│ │
│ Key insight: Encoders can be DIFFERENT or SHARED │
│ │
│ Shared weights: Same transformer encodes both query and document │
│ Simpler, fewer parameters, works well in practice │
│ │
│ Separate weights: Different encoders for query vs document │
│ Can specialize (short queries vs long documents) │
│ More parameters, sometimes better quality │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
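In practice a bi-encoder is only a few lines with the Sentence Transformers library; a rough sketch (the model name is one reasonable open-source choice, not a requirement):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example open-source bi-encoder

# Offline: embed the corpus once and cache the vectors
corpus = [
    "RLHF stands for Reinforcement Learning from Human Feedback...",
    "BM25 is a lexical ranking function...",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

# Online: embed only the query, then compare against the cached corpus vectors
query_emb = model.encode("What is RLHF?", normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)  # [1, num_docs] similarity matrix
```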
Hard Negative Mining: The Critical Ingredient
Random in-batch negatives are easy to obtain but often too easy to distinguish. If your query is "symptoms of diabetes" and your random negatives include documents about "JavaScript frameworks" and "Roman history," the model doesn't learn much—the distinction is trivial.
Hard negatives are documents that are superficially similar to the query but actually irrelevant. They force the model to learn subtle semantic distinctions.
Types of Hard Negatives
BM25 Hard Negatives: Use lexical search (BM25) to find documents with high word overlap but that aren't actually relevant. For "symptoms of diabetes," BM25 might return documents about "diabetes medications" or "diabetes prevention"—topically related but not answering the query.
Dense Retrieval Hard Negatives: Use an existing embedding model to find documents that are close in embedding space but irrelevant. These are the hardest negatives—the current model thinks they're similar, but they shouldn't be.
Cross-Encoder Hard Negatives: Use a cross-encoder (which sees query and document together) to find challenging cases. Cross-encoders are more accurate but slower than bi-encoders.
LLM-Mined Hard Negatives: Use an LLM to generate plausible but incorrect answers or to identify near-miss documents.
The Hard Negative Mining Pipeline
Hard negative mining is typically an iterative process:
- Initial training: Train with random negatives or BM25 negatives
- Mining: Use the current model to retrieve hard negatives from a corpus
- Filtering: Remove false negatives (documents that are actually relevant)
- Re-training: Train on the mixture of original data plus hard negatives
- Repeat: Mine harder negatives with the improved model
This iterative approach is used by most state-of-the-art embedding models. E5 and BGE both employ multiple rounds of hard negative mining during training.
┌─────────────────────────────────────────────────────────────────────────────┐
│ HARD NEGATIVE MINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Round 1: Random/Easy Negatives │
│ ─────────────────────────────── │
│ Query: "How to train a neural network" │
│ Positive: Tutorial on backpropagation │
│ Easy Negatives: │
│ - Article about cooking recipes ← Trivially different │
│ - Document about ancient history ← Trivially different │
│ - News about sports ← Trivially different │
│ │
│ Model easily learns to distinguish → Limited learning signal │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Round 2: BM25 Hard Negatives │
│ ─────────────────────────────── │
│ Query: "How to train a neural network" │
│ Positive: Tutorial on backpropagation │
│ BM25 Negatives (high lexical overlap): │
│ - "Neural network architecture overview" ← Related but not how-to │
│ - "Training data requirements" ← About training, not how │
│ - "Network security training course" ← Wrong sense of "train" │
│ │
│ Model must learn semantic nuance → Better learning signal │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Round 3: Dense Retrieval Hard Negatives │
│ ─────────────────────────────────────── │
│ Use current model to find near-misses: │
│ - "Deep learning optimization techniques" ← Semantically close │
│ - "PyTorch training loop example" ← Related code example │
│ - "Gradient descent explained" ← Sub-topic, not full answer │
│ │
│ These are documents the current model ranks highly but shouldn't │
│ Training on these teaches fine-grained distinctions │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ CRITICAL: False Negative Filtering │
│ ────────────────────────────────── │
│ Some "hard negatives" are actually relevant! │
│ "PyTorch training loop example" might BE a valid answer. │
│ │
│ Solutions: │
│ - Cross-encoder scoring: High cross-encoder score → might be relevant │
│ - LLM verification: Ask LLM if document answers query │
│ - Human spot-checking: Sample and verify │
│ - Conservative threshold: Only use very low-similarity negatives │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The False Negative Problem
A critical challenge in hard negative mining is false negatives—documents mislabeled as negative that are actually relevant. If your mining retrieves "PyTorch tutorial for training neural networks" as a hard negative for "how to train a neural network," you're teaching the model to push apart things that should be similar.
False negatives are pernicious because they actively degrade model quality. Solutions include:
- Using cross-encoder scores to filter potential false negatives
- LLM-based relevance verification
- Human verification of mined negatives (expensive but reliable)
- Conservative thresholds (only use negatives the model already ranks low)
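Putting the mining and filtering steps together, here is a simplified single-round sketch. The model names, brute-force retrieval, and score threshold are illustrative assumptions; in particular, the cross-encoder's score scale depends on the model and must be calibrated:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")              # current retrieval model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # relevance checker

def mine_hard_negatives(query, positive, corpus, corpus_emb, top_k=50, max_score=0.3):
    """Return passages the bi-encoder ranks highly but the cross-encoder judges irrelevant."""
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    candidate_idx = np.argsort(-(corpus_emb @ q_emb))[:top_k]          # nearest neighbors by cosine
    candidates = [corpus[i] for i in candidate_idx if corpus[i] != positive]
    # False-negative filter: drop anything the cross-encoder thinks might actually be relevant.
    # max_score is on the cross-encoder's own score scale and must be tuned per model.
    scores = cross_encoder.predict([(query, c) for c in candidates])
    return [c for c, s in zip(candidates, scores) if s < max_score]
```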
Multi-Stage Training: The Modern Recipe
State-of-the-art embedding models aren't trained in one shot. They follow a multi-stage recipe, each stage serving a specific purpose:
Stage 1: Pre-training (Self-Supervised)
The foundation is a pre-trained language model, typically BERT or similar. This provides:
- Rich linguistic knowledge
- Contextual understanding
- General semantic representations
Some embedding models use standard BERT pre-training. Others use retrieval-oriented pre-training objectives like:
Inverse Cloze Task (ICT): Treat a randomly sampled sentence as a pseudo-query and train the model to retrieve the passage it was taken from.
Contrastive Span Prediction: Predict which spans come from the same document.
Title-Body Prediction: Predict which title matches which document body.
These objectives provide weak but abundant retrieval signal from unlabeled text.
Stage 2: Large-Scale Weak Supervision
Train on massive paired datasets with noisy but abundant signal:
Web-mined pairs:
- Title-body pairs from web pages
- Query-click pairs from search logs (if available)
- Anchor text and linked pages
- Reddit posts and their top comments
Synthetic pairs:
- Questions generated by LLMs from passages
- Paraphrases generated by back-translation or LLM
This stage trains on millions to billions of pairs. The signal is noisy, but scale compensates.
Stage 3: High-Quality Fine-tuning
Train on smaller, high-quality datasets:
Human-annotated retrieval datasets:
- MS MARCO (1M queries with human relevance judgments)
- Natural Questions, TriviaQA (question-answer pairs)
- HotpotQA (multi-hop reasoning)
Task-specific data:
- STS (Semantic Textual Similarity) benchmarks
- NLI (Natural Language Inference) pairs
- Paraphrase datasets
This stage uses hard negative mining intensively.
Stage 4: Instruction Tuning (Optional)
Recent models like E5-mistral add instruction-following capability:
Instead of raw text, the input includes task instructions:
- "Retrieve documents that answer this question: {query}"
- "Find passages similar in meaning to: {text}"
- "Identify documents on the same topic as: {text}"
This allows a single model to handle different retrieval tasks with task-specific behavior.
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-STAGE TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: Pre-training │
│ ────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Masked Language Modeling (or retrieval-oriented pre-training) │ │
│ │ │ │
│ │ Data: Wikipedia, Books, Common Crawl (100B+ tokens) │ │
│ │ Objective: Predict masked tokens / contrastive document matching │ │
│ │ Purpose: Learn language structure and basic semantics │ │
│ │ Duration: Days to weeks on large GPU clusters │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 2: Weak Supervision at Scale │
│ ─────────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Contrastive learning on noisy pairs │ │
│ │ │ │
│ │ Data: 100M-1B pairs from: │ │
│ │ - Web page title + body │ │
│ │ - Anchor text + linked page │ │
│ │ - QA pairs from forums │ │
│ │ - LLM-generated questions from passages │ │
│ │ │ │
│ │ Negatives: In-batch (large batch sizes: 16K-65K) │ │
│ │ Purpose: Learn retrieval basics from abundant noisy signal │ │
│ │ Duration: Days on large GPU clusters │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 3: High-Quality Fine-tuning │
│ ───────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Contrastive learning with hard negatives │ │
│ │ │ │
│ │ Data: 1M-10M pairs from: │ │
│ │ - MS MARCO, Natural Questions, HotpotQA │ │
│ │ - STS benchmarks │ │
│ │ - NLI datasets (as soft positives/negatives) │ │
│ │ │ │
│ │ Negatives: Hard negatives mined from previous stage model │ │
│ │ Multiple rounds of mining → training → mining │ │
│ │ Purpose: Learn fine-grained semantic distinctions │ │
│ │ Duration: Hours to days │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 4: Instruction Tuning (Optional) │
│ ────────────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Train with task-specific instructions │ │
│ │ │ │
│ │ Input format: "Instruct: {task}\nQuery: {text}" │ │
│ │ - "Retrieve relevant documents" │ │
│ │ - "Find similar sentences" │ │
│ │ - "Identify passages that answer the question" │ │
│ │ │ │
│ │ Purpose: Enable task-specific behavior with single model │ │
│ │ Duration: Hours │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Production Embedding Model │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
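At inference time, the instruction or prefix is plain string formatting applied before encoding; a sketch of two common conventions (exact templates are model-specific, so check each model's documentation):

```python
# E5-style prefixes: queries and passages get different markers
query_text = "query: how does RLHF work"
passage_text = "passage: RLHF stands for Reinforcement Learning from Human Feedback..."

# Instruction-tuned style (E5-mistral-like): a task description prepended to the query only
task = "Given a web search query, retrieve relevant passages that answer the query"
instructed_query = f"Instruct: {task}\nQuery: how does RLHF work"
```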
Cross-Encoders vs Bi-Encoders
Understanding the distinction between cross-encoders and bi-encoders is crucial for building retrieval systems.
Bi-Encoders (Dual Encoders)
Bi-encoders process query and document independently, producing separate embeddings that are compared with dot product or cosine similarity.
Advantages:
- Document embeddings can be pre-computed and cached
- Fast retrieval via approximate nearest neighbor search
- Scales to billions of documents
Disadvantages:
- No cross-attention between query and document
- May miss nuanced relevance signals
- Less accurate than cross-encoders
Cross-Encoders
Cross-encoders process query and document together in a single forward pass, with full attention between them.
Advantages:
- Full attention allows rich query-document interaction
- Significantly more accurate (typically 5-15% better on benchmarks)
- Can capture subtle relevance signals
Disadvantages:
- Must compute for every query-document pair
- Cannot pre-compute document representations
- Too slow for initial retrieval at scale
The Practical Solution: Retrieve and Re-rank
Most production systems combine both:
- First stage: Bi-encoder retrieves top-k candidates (k=100-1000) using fast vector similarity
- Second stage: Cross-encoder re-ranks candidates by computing exact scores
This gives you the efficiency of bi-encoders with accuracy approaching cross-encoders.
┌─────────────────────────────────────────────────────────────────────────────┐
│ BI-ENCODER VS CROSS-ENCODER COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BI-ENCODER (Dual Encoder) │
│ ───────────────────────── │
│ │
│ Query: "What is RLHF?" Document: "RLHF is a technique..." │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │Encoder A│ │Encoder B│ (may be same weights) │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ▼ ▼ │
│ [0.2, -0.1, ...] [0.3, -0.2, ...] │
│ │ │ │
│ └─────────┬─────────────────┘ │
│ ▼ │
│ cosine_similarity = 0.89 │
│ │
│ ✓ Encodings are INDEPENDENT │
│ ✓ Documents can be pre-computed │
│ ✓ Fast nearest neighbor search │
│ ✗ No query-document attention │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ CROSS-ENCODER │
│ ───────────── │
│ │
│ Input: "[CLS] What is RLHF? [SEP] RLHF is a technique... [SEP]" │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Transformer │ │
│ │ (full cross- │ │
│ │ attention) │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ [CLS] → Linear → σ → 0.92 │
│ │
│ ✓ Full attention between query and document │
│ ✓ More accurate relevance modeling │
│ ✗ Must compute for EVERY pair │
│ ✗ Cannot pre-compute document representations │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ PRACTICAL PIPELINE: Retrieve then Re-rank │
│ ───────────────────────────────────────── │
│ │
│ Query ──▶ Bi-Encoder ──▶ ANN Search ──▶ Top 100 ──▶ Cross-Encoder │
│ │ │ docs re-ranking │
│ ▼ ▼ │ │
│ Query embedding 1B documents ▼ │
│ (pre-computed) Top 10 results │
│ (re-ordered) │
│ │
│ Latency: ~10ms + ~100ms = ~110ms total │
│ Accuracy: Near cross-encoder quality │
│ Scalability: Billions of documents │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
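A compact sketch of the two-stage pipeline shown above using Sentence Transformers (model names and k values are illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query, corpus, corpus_emb, retrieve_k=100, final_k=10):
    # Stage 1: fast vector retrieval over pre-computed corpus embeddings
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=retrieve_k)[0]
    # Stage 2: exact cross-encoder scores on the small candidate set
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:final_k]
```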
ColBERT and Late Interaction
ColBERT (Contextualized Late Interaction over BERT) represents a middle ground between bi-encoders and cross-encoders.
The ColBERT Approach
Instead of pooling token representations into a single vector, ColBERT keeps all token embeddings:
- Query: N token embeddings (one per token)
- Document: M token embeddings (one per token)
Similarity is computed as MaxSim: for each query token, find the maximum similarity to any document token, then sum:
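score(q, d) = Σᵢ maxⱼ sim(qᵢ, dⱼ), where i ranges over query tokens and j over document tokens.

In code this is one similarity matrix and a row-wise max; a minimal sketch assuming L2-normalized token embeddings as PyTorch tensors:

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: [n_query_tokens, dim]; doc_tokens: [n_doc_tokens, dim]; both L2-normalized."""
    sim = query_tokens @ doc_tokens.T   # cosine similarity of every query/document token pair
    return sim.max(dim=1).values.sum()  # best document match per query token, summed over query tokens
```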
Why This Works
MaxSim captures a form of soft term matching:
- Query token "diabetes" will have high similarity to document token "diabetes" or "diabetic"
- Query token "symptoms" will match document tokens like "symptoms," "signs," "manifestations"
- Each query term finds its best match in the document
This is more expressive than a single-vector comparison but still allows document pre-computation.
Trade-offs
Advantages:
- Better accuracy than bi-encoders (often close to cross-encoders)
- Documents can still be pre-indexed
- Interpretable: you can see which tokens matched
Disadvantages:
- Storage: must store all token embeddings, not just one vector per document
- Retrieval complexity: requires specialized indexing (like PLAID)
- Higher latency than single-vector bi-encoders
ColBERT v2 and PLAID indexing have made this practical for large-scale deployment, and it's increasingly popular for applications where retrieval quality justifies the overhead.
Matryoshka Embeddings: Flexible Dimensionality
Matryoshka Representation Learning (MRL) is an elegant technique that makes embedding dimensions adaptive.
The Problem
Different applications have different dimension requirements:
- High-precision retrieval might want 1024+ dimensions
- Mobile deployment might need 256 dimensions
- Some use cases fall in between
Traditionally, you'd train separate models for each dimension or truncate (which usually degrades quality significantly).
The Matryoshka Solution
MRL trains embeddings so that the first d dimensions form a valid d-dimensional embedding for any d less than the full dimension.
During training, the contrastive loss is computed at multiple truncation points and summed:

L_MRL = Σ_d w_d · L_contrastive(q[:d], doc[:d])

where d ranges over a set of truncation dimensions such as [64, 128, 256, 512, 1024, ...], q[:d] denotes the first d dimensions of the embedding (re-normalized before computing similarity), and w_d are optional per-dimension weights.
This forces the model to encode the most important information in early dimensions. The first 256 dimensions must be a good embedding on their own, with dimensions 257-512 adding refinement, and so on.
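A rough sketch of this idea on top of the in-batch InfoNCE loss from earlier (truncation dimensions and the temperature are illustrative; truncated vectors are re-normalized before computing similarities):

```python
import torch
import torch.nn.functional as F

def matryoshka_info_nce(query_emb, doc_emb, dims=(64, 128, 256, 512, 1024), temperature=0.05):
    """Average the in-batch InfoNCE loss over several truncation points of the same embeddings."""
    total = 0.0
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=1)  # first d dimensions, re-normalized
        p = F.normalize(doc_emb[:, :d], dim=1)
        logits = q @ p.T / temperature
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```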
Practical Benefits
Runtime flexibility: Choose dimension based on latency/quality trade-off without retraining.
Efficient search: Search with low dimensions first, then refine with full dimensions.
Progressive retrieval: Quickly eliminate candidates with low-dimensional comparison, then use full dimensions for top-k.
OpenAI's text-embedding-3 models support MRL natively, and several open-source models (like nomic-embed-text) also support it.
Training Data Sources
The quality of embedding models depends critically on training data. Here's where that data comes from:
Natural Paired Data
Title-Body Pairs: Web page titles are often good summaries of content. Billions of pairs available from Common Crawl.
Question-Answer Pairs: Stack Overflow, Quora, Reddit—any Q&A platform provides natural pairs where questions and accepted answers should be similar.
Anchor Text: The text of a hyperlink often describes the linked page. "Click here for the RLHF paper" linked to the RLHF paper creates a natural pair.
Search Logs: If you have access to search logs with clicks, query-clicked-document pairs are gold (but rarely available outside search companies).
Synthetic Data Generation
LLM-Generated Questions: Given a passage, ask an LLM to generate questions that the passage answers. This creates abundant question-passage pairs from any corpus.
Paraphrase Generation: Use back-translation (English → French → English) or LLM paraphrasing to create semantic equivalents.
Hard Negative Generation: Ask an LLM to generate plausible but wrong answers, creating high-quality hard negatives.
Human-Annotated Datasets
MS MARCO: 1 million queries with human relevance judgments. The gold standard for passage retrieval.
Natural Questions: 300K+ real Google search queries with Wikipedia answers.
BEIR Benchmark: Collection of diverse retrieval tasks (scientific, financial, news, etc.) for evaluation.
STS Benchmark: Human-annotated semantic similarity scores for sentence pairs.
Fine-Tuning for Your Domain
General embedding models work well for general text, but domain-specific applications often benefit from fine-tuning.
When to Fine-Tune
Fine-tuning makes sense when:
- Your domain has specialized vocabulary (medical, legal, scientific terms)
- Semantic relationships differ from general text ("side effects" and "adverse events" are synonymous in medical contexts)
- You have domain-specific query patterns
- General models underperform on your evaluation set
Collecting Fine-Tuning Data
Query logs: If your application has users, log queries and what they ultimately found useful.
Expert curation: Have domain experts create or validate query-document pairs.
Synthetic generation: Use domain documents + LLM to generate domain-specific questions.
Transfer from similar domains: Medical search data might help with biotech; legal search might help with compliance.
Fine-Tuning Strategies
Full fine-tuning: Update all model weights. Most flexible but requires more data (thousands of pairs minimum) and risks overfitting.
LoRA/QLoRA: Train low-rank adapters, keeping base weights frozen. Efficient, works with less data, easier to experiment.
Sentence Transformers library: Provides high-level APIs for fine-tuning embedding models with various loss functions.
Staged fine-tuning: First fine-tune on abundant weak signal, then on smaller high-quality data.
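As a starting point, here is a minimal Sentence Transformers fine-tuning sketch using MultipleNegativesRankingLoss, which implements the in-batch-negative contrastive objective described earlier (the data, model name, and hyperparameters are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each example is a (query, relevant_passage) pair; other pairs in the batch act as negatives
train_examples = [
    InputExample(texts=["symptoms of diabetes", "Common symptoms of diabetes include..."]),
    InputExample(texts=["how to train a neural network", "Backpropagation works by..."]),
    # ... thousands of domain pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="domain-tuned-embedder",
)
```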
Avoiding Catastrophic Forgetting
Fine-tuning on narrow domain data can degrade general performance. Mitigations:
- Mix domain data with general data during fine-tuning
- Use lower learning rates
- Early stopping based on validation on both domain and general benchmarks
- LoRA naturally limits forgetting by keeping base weights frozen
Evaluation: How to Measure Embedding Quality
Evaluating embedding models requires understanding what "good" means for your application.
Retrieval Metrics
Recall@k: What fraction of relevant documents appear in the top k results? Critical for retrieval systems where a later re-ranker will refine results.
NDCG (Normalized Discounted Cumulative Gain): Accounts for ranking order—relevant documents ranked higher contribute more.
MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant result. Useful when users typically want one good answer.
MAP (Mean Average Precision): Average precision at each relevant document, averaged across queries.
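These metrics are straightforward to compute from ranked result lists; a small sketch for Recall@k and MRR, where the inputs are hypothetical ranked document IDs and per-query relevant-ID sets:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaged over all queries in an evaluation set:
# mean_recall = sum(recall_at_k(r, rel, 10) for r, rel in runs) / len(runs)
```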
Similarity Metrics
Spearman Correlation: For STS tasks, how well do model similarities correlate with human similarity judgments?
Benchmark Suites
MTEB (Massive Text Embedding Benchmark): 56 datasets across 8 task types. The standard for comparing embedding models.
BEIR: 18 diverse retrieval datasets. Tests generalization across domains.
Custom evaluation: Build an evaluation set from your actual use case. Ultimately, performance on your data matters most.
Evaluation Pitfalls
Leakage: If your training data overlaps with evaluation data, metrics are meaningless. MS MARCO is in many training sets—don't treat it as unbiased evaluation.
Distribution shift: Academic benchmarks may not match your production distribution.
Metric gaming: Optimizing for one metric can hurt others. Track multiple metrics.
Practical Recommendations
Based on the current landscape, here are practical recommendations for different scenarios:
For Most Applications
Start with a strong general-purpose model:
- OpenAI text-embedding-3-large for best quality (if you can use proprietary APIs)
- BGE-large-en-v1.5 or E5-large-v2 for strong open-source options
- nomic-embed-text if you need fully open (Apache 2.0) weights and training data
Use the retrieve-and-rerank pattern:
- Bi-encoder for initial retrieval (top 100)
- Cross-encoder re-ranker for final ordering (top 10-20)
For Latency-Sensitive Applications
- Use smaller models (BGE-small, E5-small)
- Consider Matryoshka truncation to reduce dimensions
- Quantize embeddings (int8 often works well)
- Use GPU inference or optimized CPU inference (ONNX)
For Domain-Specific Applications
- Evaluate general models first—they might be good enough
- If not, fine-tune with domain data using Sentence Transformers
- Start with LoRA fine-tuning before full fine-tuning
- Build a domain-specific evaluation set to measure progress
For Multilingual Applications
- Cohere embed-v3 (multilingual by default)
- BGE-M3 (multilingual, supports multiple retrieval modes)
- multilingual-e5-large
For Code Search
- Voyage Code (proprietary)
- CodeBERT or GraphCodeBERT (open-source)
- Consider fine-tuning general models on code pairs
The Future of Embedding Models
Several trends are shaping the future:
Larger context windows: Models like jina-embeddings-v2 support 8K tokens, enabling embedding of longer documents without chunking.
Multi-vector representations: ColBERT-style approaches are becoming more practical, offering better quality with reasonable efficiency.
Instruction-following embeddings: Models that can adjust behavior based on task instructions, enabling a single model to handle diverse retrieval needs.
Integration with LLMs: Using LLMs to generate embeddings (like E5-mistral) or to supervise embedding training with synthetic data.
Domain specialization: More models tailored for specific verticals (medical, legal, financial, code).
Efficiency improvements: Better quantization, distillation, and architecture innovations to reduce costs while maintaining quality.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
LLM-Powered Search for E-Commerce: Beyond NER and Elasticsearch
A deep dive into building intelligent e-commerce search systems that understand natural language, leverage metadata effectively, and support multi-turn conversations—moving beyond classical NER + Elasticsearch approaches.
Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach
A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.