Training Embedding Models: From Contrastive Learning to Production Retrieval
A comprehensive guide to training text embedding models—from contrastive learning fundamentals to hard negative mining, multi-stage training, and the architectures behind E5, BGE, and GTE. Understanding the foundation of modern retrieval systems.
Why Embedding Models Matter
Every RAG system, semantic search engine, and retrieval-based application depends on embedding models. These models transform text into dense vectors where semantic similarity corresponds to geometric proximity. Yet while practitioners obsess over chunking strategies and prompt engineering, the embedding model itself often receives little attention—it's treated as a black box API call.
This matters because embedding quality fundamentally bounds retrieval quality. No amount of sophisticated re-ranking, query expansion, or hybrid search can recover from an embedding model that fails to capture the semantic relationships your application needs. If your embedding model doesn't understand that "myocardial infarction" and "heart attack" are semantically identical, your medical RAG system will miss half the relevant documents.
Understanding how embedding models are trained unlocks several capabilities: fine-tuning for your domain, selecting the right model for your use case, diagnosing retrieval failures, and making informed decisions about the quality-cost-latency trade-offs inherent in embedding selection.
The Embedding Landscape in 2025
The embedding model space has matured significantly. A few years ago, practitioners chose between OpenAI's ada-002 and a handful of open-source options. Today, the landscape includes:
Proprietary APIs:
- OpenAI text-embedding-3-large (3072 dimensions, strong general performance)
- Cohere embed-v3 (multiple size options, multilingual strength)
- Voyage AI (domain-specific variants for code, legal, finance)
- Google's Gecko (tight Vertex AI integration)
Open-Source Leaders:
- BGE (BAAI General Embedding) family from Beijing Academy of AI
- E5 family from Microsoft Research
- GTE (General Text Embeddings) from Alibaba
- Nomic Embed (fully open weights and training data)
- Jina Embeddings (8K context, multilingual)
Specialized Models:
- CodeBERT and derivatives for code
- SciBERT, PubMedBERT for scientific domains
- Legal-BERT, FinBERT for specialized domains
The performance gap between open-source and proprietary models has narrowed dramatically. On the MTEB (Massive Text Embedding Benchmark) leaderboard, open models now match or exceed proprietary options on many tasks. The decision often comes down to factors beyond raw performance: latency requirements, data privacy, fine-tuning needs, and cost at scale.
The Architecture of Embedding Models
Most modern text embedding models share a common architecture: a transformer encoder that processes input text and produces contextual representations, followed by a pooling operation that collapses the sequence into a single vector.
Encoder Architecture
The encoder is typically a BERT-style transformer, though the specific architecture varies:
BERT-base variants (110M parameters) offer good performance-to-cost ratios and remain popular for production deployments where latency matters.
BERT-large variants (340M parameters) provide better quality but significantly higher latency and cost.
Modern architectures like the E5-mistral models use decoder-only transformers (Mistral 7B) as the backbone, achieving state-of-the-art results but with substantial computational requirements.
The encoder processes input text through multiple transformer layers. Each layer applies self-attention (allowing tokens to attend to each other) followed by feed-forward networks. The output is a sequence of contextualized token representations—one vector per input token.
Pooling Strategies
Raw encoder output is a sequence of vectors. To get a single embedding, we must pool these vectors together. The choice of pooling strategy significantly impacts quality:
CLS Token Pooling: Use the representation of the special [CLS] token (first position). This was BERT's original design—the idea being that the [CLS] token, through attention, aggregates information from the entire sequence. In practice this works reasonably well, but relying on a single token can leave information from the rest of the sequence underused.
Mean Pooling: Average all token representations (excluding padding). This is now the dominant approach for embedding models. It ensures every token contributes to the final representation and proves more robust across different text lengths.
Last Token Pooling: Use the final token's representation. This works well for decoder-only models (like E5-mistral) where the last token naturally aggregates information through causal attention.
Attention-Weighted Pooling: Learn attention weights over tokens and compute a weighted average. More expressive but adds parameters and complexity.
The evolution from CLS pooling to mean pooling represents a shift in understanding: rather than trusting a single special token to aggregate information, mean pooling explicitly incorporates all tokens, proving more reliable in practice.
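To make mean pooling concrete, here is a minimal sketch with a Hugging Face encoder (the model name is just an example; any BERT-style encoder works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)         # [batch, 1]
    return summed / counts

batch = tokenizer(["The quick brown fox jumps over the lazy dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embedding = mean_pool(outputs.last_hidden_state, batch["attention_mask"])
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)  # L2-normalize for cosine similarity
```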
┌─────────────────────────────────────────────────────────────────────────────┐
│ EMBEDDING MODEL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input Text: "The quick brown fox jumps over the lazy dog" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKENIZATION │ │
│ │ [CLS] The quick brown fox jumps over the lazy dog [SEP] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOKEN EMBEDDINGS │ │
│ │ Each token → 768-dimensional vector (from embedding table) │ │
│ │ + Position embeddings (learned or sinusoidal) │ │
│ │ + Token type embeddings (for multi-segment inputs) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRANSFORMER ENCODER (12 layers) │ │
│ │ │ │
│ │ Each layer: │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Multi-Head Self-Attention │ │ │
│ │ │ (tokens attend to all other tokens) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Feed-Forward Network │ │ │
│ │ │ (independent per-token transformations) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Output: Contextualized representations for each token │ │
│ │ Shape: [sequence_length, hidden_size] = [11, 768] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ POOLING │ │
│ │ │ │
│ │ Option 1: CLS Pooling → Take first token [768] │ │
│ │ Option 2: Mean Pooling → Average all tokens [768] │ │
│ │ Option 3: Attention Pooling → Learned weighted average [768] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ NORMALIZATION │ │
│ │ L2 normalize: vector / ||vector||₂ │ │
│ │ Result: Unit vector on hypersphere │ │
│ │ Cosine similarity = dot product (after normalization) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ │
│ Final Embedding: [0.023, -0.156, 0.089, ..., 0.045] (768 dimensions) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Dimensionality Considerations
Embedding dimensionality involves trade-offs:
Higher dimensions (1024, 1536, 3072) can capture more nuanced semantic distinctions but increase storage costs, retrieval latency, and memory requirements. They also require more training data to avoid overfitting.
Lower dimensions (256, 384, 512) are more efficient but may lose semantic precision. However, modern training techniques have pushed quality at lower dimensions remarkably high.
Matryoshka Representation Learning (MRL) trains embeddings so that truncated prefixes remain useful. A 1024-dimension embedding trained with MRL can be truncated to 256 dimensions with graceful degradation rather than catastrophic failure. This enables runtime flexibility—use full dimensions for high-precision needs, truncated for speed.
OpenAI's text-embedding-3 models support this natively: you can request any dimension up to 3072, and the API returns appropriately truncated embeddings.
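For models trained this way, truncation at inference time is just slicing followed by re-normalization; a quick sketch (dimensions chosen for illustration):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1024)           # stand-in for a 1024-d MRL-trained embedding
short = truncate_embedding(full, 256)  # 256-d version, still usable for cosine similarity
```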
Contrastive Learning: The Foundation
The dominant paradigm for training embedding models is contrastive learning. The core idea is elegantly simple: train the model to produce similar embeddings for semantically similar texts and dissimilar embeddings for unrelated texts.
The Contrastive Objective
Contrastive learning requires pairs (or groups) of texts with known relationships:
Positive pairs: Texts that should have similar embeddings. Examples include:
- A query and its relevant document
- A sentence and its paraphrase
- An article title and its body text
- A question and its answer
Negative pairs: Texts that should have dissimilar embeddings. These are typically other texts in the batch that happen to be unrelated.
The training objective pushes positive pairs together in embedding space while pushing negative pairs apart.
InfoNCE Loss
The most common contrastive loss is InfoNCE (an objective derived from noise-contrastive estimation), also known as NT-Xent (Normalized Temperature-scaled Cross Entropy) in some papers:
┌─────────────────────────────────────────────────────────────────────────────┐
│ InfoNCE LOSS EXPLAINED │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Given: │
│ - Query embedding: q │
│ - Positive document embedding: d⁺ │
│ - Negative document embeddings: d₁⁻, d₂⁻, ..., dₙ⁻ │
│ - Temperature parameter: τ (typically 0.01-0.1) │
│ │
│ Similarity function: sim(a, b) = cosine_similarity(a, b) / τ │
│ │
│ Loss = -log [ exp(sim(q, d⁺)) / (exp(sim(q, d⁺)) + Σᵢ exp(sim(q, dᵢ⁻))) ] │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Intuition: │
│ This is softmax cross-entropy where we want to "classify" the query │
│ as belonging to the positive document among all candidates. │
│ │
│ - Numerator: similarity to positive (should be high) │
│ - Denominator: similarity to positive + all negatives (normalization) │
│ │
│ Temperature τ: │
│ - Lower τ (0.01): sharper distribution, harder optimization │
│ - Higher τ (0.1): softer distribution, easier optimization │
│ - Typical: 0.02-0.05 for embedding training │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Visual representation: │
│ │
│ Before Training After Training │
│ │
│ d₁⁻ d₁⁻ │
│ · · │
│ q · ·d₂⁻ │
│ · │
│ d⁺ d₂⁻ q·d⁺ │
│ · · │
│ d₃⁻ d₃⁻· │
│ │
│ Query and positive are scattered Query and positive are close │
│ Negatives are mixed in Negatives are pushed away │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
In-Batch Negatives
A key efficiency trick is in-batch negatives: within a batch of N query-document pairs, each query's positive document becomes a negative for all other queries. This gives you N-1 negatives "for free" without additional computation.
For a batch size of 1024, each query gets 1023 negative examples. This is remarkably efficient but has implications:
- Larger batch sizes provide more negatives and generally better training
- Training often uses gradient accumulation or distributed training to achieve large effective batch sizes
- The negatives are "random" (whatever else is in the batch), not intentionally selected
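In code, in-batch InfoNCE reduces to cross-entropy over the batch-level similarity matrix; a minimal PyTorch sketch, assuming L2-normalized query and document embeddings from a batch of aligned pairs (the temperature value is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: [batch, dim]; row i of doc_emb is the positive for row i of query_emb."""
    # Similarity of every query against every document in the batch
    logits = query_emb @ doc_emb.T / temperature   # [batch, batch]
    # The diagonal holds the positives; every other entry acts as an in-batch negative
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```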
The Dual Encoder Architecture
For retrieval applications, we often want to encode queries and documents independently (so we can pre-compute document embeddings). This leads to the dual encoder or bi-encoder architecture:
Two separate encoders (or a shared encoder) process queries and documents independently. Their outputs are compared with cosine similarity or dot product.
This enables efficient retrieval: encode all documents offline, then at query time, encode only the query and find nearest neighbors using vector similarity search.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DUAL ENCODER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ QUERY ENCODER │ │ DOCUMENT ENCODER │ │
│ │ │ │ │ │
│ │ "What is RLHF?" │ │ "RLHF stands for │ │
│ │ │ │ │ Reinforcement..." │ │
│ │ ▼ │ │ │ │ │
│ │ [Transformer] │ │ [Transformer] │ │
│ │ │ │ │ │ │ │
│ │ ▼ │ │ ▼ │ │
│ │ [Mean Pool] │ │ [Mean Pool] │ │
│ │ │ │ │ │ │ │
│ │ ▼ │ │ ▼ │ │
│ │ q ∈ ℝ⁷⁶⁸ │ │ d ∈ ℝ⁷⁶⁸ │ │
│ └──────────┬──────────────┘ └──────────┬──────────────┘ │
│ │ │ │
│ └───────────┬────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ SIMILARITY SCORE │ │
│ │ │ │
│ │ score = q · d │ │
│ │ ───────── │ │
│ │ ||q|| ||d|| │ │
│ │ │ │
│ │ (cosine similarity) │ │
│ └────────────────────────┘ │
│ │
│ Key insight: Encoders can be DIFFERENT or SHARED │
│ │
│ Shared weights: Same transformer encodes both query and document │
│ Simpler, fewer parameters, works well in practice │
│ │
│ Separate weights: Different encoders for query vs document │
│ Can specialize (short queries vs long documents) │
│ More parameters, sometimes better quality │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
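In practice a bi-encoder is only a few lines with the Sentence Transformers library; a rough sketch (the model name is one reasonable open-source choice, not a requirement):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example open-source bi-encoder

# Offline: embed the corpus once and cache the vectors
corpus = [
    "RLHF stands for Reinforcement Learning from Human Feedback...",
    "BM25 is a lexical ranking function...",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

# Online: embed only the query, then compare against the cached corpus vectors
query_emb = model.encode("What is RLHF?", normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)  # [1, num_docs] similarity matrix
```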
Hard Negative Mining: The Critical Ingredient
Random in-batch negatives are easy to obtain but often too easy to distinguish. If your query is "symptoms of diabetes" and your random negatives include documents about "JavaScript frameworks" and "Roman history," the model doesn't learn much—the distinction is trivial.
Hard negatives are documents that are superficially similar to the query but actually irrelevant. They force the model to learn subtle semantic distinctions.
Types of Hard Negatives
BM25 Hard Negatives: Use lexical search (BM25) to find documents with high word overlap but that aren't actually relevant. For "symptoms of diabetes," BM25 might return documents about "diabetes medications" or "diabetes prevention"—topically related but not answering the query.
Dense Retrieval Hard Negatives: Use an existing embedding model to find documents that are close in embedding space but irrelevant. These are the hardest negatives—the current model thinks they're similar, but they shouldn't be.
Cross-Encoder Hard Negatives: Use a cross-encoder (which sees query and document together) to find challenging cases. Cross-encoders are more accurate but slower than bi-encoders.
LLM-Mined Hard Negatives: Use an LLM to generate plausible but incorrect answers or to identify near-miss documents.
The Hard Negative Mining Pipeline
Hard negative mining is typically an iterative process:
- Initial training: Train with random negatives or BM25 negatives
- Mining: Use the current model to retrieve hard negatives from a corpus
- Filtering: Remove false negatives (documents that are actually relevant)
- Re-training: Train on the mixture of original data plus hard negatives
- Repeat: Mine harder negatives with the improved model
This iterative approach is used by most state-of-the-art embedding models. E5 and BGE both employ multiple rounds of hard negative mining during training.
┌─────────────────────────────────────────────────────────────────────────────┐
│ HARD NEGATIVE MINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Round 1: Random/Easy Negatives │
│ ─────────────────────────────── │
│ Query: "How to train a neural network" │
│ Positive: Tutorial on backpropagation │
│ Easy Negatives: │
│ - Article about cooking recipes ← Trivially different │
│ - Document about ancient history ← Trivially different │
│ - News about sports ← Trivially different │
│ │
│ Model easily learns to distinguish → Limited learning signal │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Round 2: BM25 Hard Negatives │
│ ─────────────────────────────── │
│ Query: "How to train a neural network" │
│ Positive: Tutorial on backpropagation │
│ BM25 Negatives (high lexical overlap): │
│ - "Neural network architecture overview" ← Related but not how-to │
│ - "Training data requirements" ← About training, not how │
│ - "Network security training course" ← Wrong sense of "train" │
│ │
│ Model must learn semantic nuance → Better learning signal │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Round 3: Dense Retrieval Hard Negatives │
│ ─────────────────────────────────────── │
│ Use current model to find near-misses: │
│ - "Deep learning optimization techniques" ← Semantically close │
│ - "PyTorch training loop example" ← Related code example │
│ - "Gradient descent explained" ← Sub-topic, not full answer │
│ │
│ These are documents the current model ranks highly but shouldn't │
│ Training on these teaches fine-grained distinctions │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ CRITICAL: False Negative Filtering │
│ ────────────────────────────────── │
│ Some "hard negatives" are actually relevant! │
│ "PyTorch training loop example" might BE a valid answer. │
│ │
│ Solutions: │
│ - Cross-encoder scoring: High cross-encoder score → might be relevant │
│ - LLM verification: Ask LLM if document answers query │
│ - Human spot-checking: Sample and verify │
│ - Conservative threshold: Only use very low-similarity negatives │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The False Negative Problem
A critical challenge in hard negative mining is false negatives—documents mislabeled as negative that are actually relevant. If your mining retrieves "PyTorch tutorial for training neural networks" as a hard negative for "how to train a neural network," you're teaching the model to push apart things that should be similar.
False negatives are pernicious because they actively degrade model quality. Solutions include:
- Using cross-encoder scores to filter potential false negatives
- LLM-based relevance verification
- Human verification of mined negatives (expensive but reliable)
- Conservative thresholds (only use negatives the model already ranks low)
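Putting the mining and filtering steps together, here is a simplified single-round sketch. The model names, brute-force retrieval, and score threshold are illustrative assumptions; in particular, the cross-encoder's score scale depends on the model and must be calibrated:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")              # current retrieval model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # relevance checker

def mine_hard_negatives(query, positive, corpus, corpus_emb, top_k=50, max_score=0.3):
    """Return passages the bi-encoder ranks highly but the cross-encoder judges irrelevant."""
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    candidate_idx = np.argsort(-(corpus_emb @ q_emb))[:top_k]          # nearest neighbors by cosine
    candidates = [corpus[i] for i in candidate_idx if corpus[i] != positive]
    # False-negative filter: drop anything the cross-encoder thinks might actually be relevant.
    # max_score is on the cross-encoder's own score scale and must be tuned per model.
    scores = cross_encoder.predict([(query, c) for c in candidates])
    return [c for c, s in zip(candidates, scores) if s < max_score]
```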
Multi-Stage Training: The Modern Recipe
State-of-the-art embedding models aren't trained in one shot. They follow a multi-stage recipe, each stage serving a specific purpose:
Stage 1: Pre-training (Self-Supervised)
The foundation is a pre-trained language model, typically BERT or similar. This provides:
- Rich linguistic knowledge
- Contextual understanding
- General semantic representations
Some embedding models use standard BERT pre-training. Others use retrieval-oriented pre-training objectives like:
Inverse Cloze Task (ICT): Treat a randomly sampled sentence as a pseudo-query and train the model to retrieve the passage it was taken from.
Contrastive Span Prediction: Predict which spans come from the same document.
Title-Body Prediction: Predict which title matches which document body.
These objectives provide weak but abundant retrieval signal from unlabeled text.
Stage 2: Large-Scale Weak Supervision
Train on massive paired datasets with noisy but abundant signal:
Web-mined pairs:
- Title-body pairs from web pages
- Query-click pairs from search logs (if available)
- Anchor text and linked pages
- Reddit posts and their top comments
Synthetic pairs:
- Questions generated by LLMs from passages
- Paraphrases generated by back-translation or LLM
This stage trains on millions to billions of pairs. The signal is noisy, but scale compensates.
Stage 3: High-Quality Fine-tuning
Train on smaller, high-quality datasets:
Human-annotated retrieval datasets:
- MS MARCO (1M queries with human relevance judgments)
- Natural Questions, TriviaQA (question-answer pairs)
- HotpotQA (multi-hop reasoning)
Task-specific data:
- STS (Semantic Textual Similarity) benchmarks
- NLI (Natural Language Inference) pairs
- Paraphrase datasets
This stage uses hard negative mining intensively.
Stage 4: Instruction Tuning (Optional)
Recent models like E5-mistral add instruction-following capability:
Instead of raw text, the input includes task instructions:
- "Retrieve documents that answer this question: {query}"
- "Find passages similar in meaning to: {text}"
- "Identify documents on the same topic as: {text}"
This allows a single model to handle different retrieval tasks with task-specific behavior.
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-STAGE TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: Pre-training │
│ ────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Masked Language Modeling (or retrieval-oriented pre-training) │ │
│ │ │ │
│ │ Data: Wikipedia, Books, Common Crawl (100B+ tokens) │ │
│ │ Objective: Predict masked tokens / contrastive document matching │ │
│ │ Purpose: Learn language structure and basic semantics │ │
│ │ Duration: Days to weeks on large GPU clusters │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 2: Weak Supervision at Scale │
│ ─────────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Contrastive learning on noisy pairs │ │
│ │ │ │
│ │ Data: 100M-1B pairs from: │ │
│ │ - Web page title + body │ │
│ │ - Anchor text + linked page │ │
│ │ - QA pairs from forums │ │
│ │ - LLM-generated questions from passages │ │
│ │ │ │
│ │ Negatives: In-batch (large batch sizes: 16K-65K) │ │
│ │ Purpose: Learn retrieval basics from abundant noisy signal │ │
│ │ Duration: Days on large GPU clusters │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 3: High-Quality Fine-tuning │
│ ───────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Contrastive learning with hard negatives │ │
│ │ │ │
│ │ Data: 1M-10M pairs from: │ │
│ │ - MS MARCO, Natural Questions, HotpotQA │ │
│ │ - STS benchmarks │ │
│ │ - NLI datasets (as soft positives/negatives) │ │
│ │ │ │
│ │ Negatives: Hard negatives mined from previous stage model │ │
│ │ Multiple rounds of mining → training → mining │ │
│ │ Purpose: Learn fine-grained semantic distinctions │ │
│ │ Duration: Hours to days │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STAGE 4: Instruction Tuning (Optional) │
│ ────────────────────────────────────── │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Train with task-specific instructions │ │
│ │ │ │
│ │ Input format: "Instruct: {task}\nQuery: {text}" │ │
│ │ - "Retrieve relevant documents" │ │
│ │ - "Find similar sentences" │ │
│ │ - "Identify passages that answer the question" │ │
│ │ │ │
│ │ Purpose: Enable task-specific behavior with single model │ │
│ │ Duration: Hours │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Production Embedding Model │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
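At inference time, the instruction or prefix is plain string formatting applied before encoding; a sketch of two common conventions (exact templates are model-specific, so check each model's documentation):

```python
# E5-style prefixes: queries and passages get different markers
query_text = "query: how does RLHF work"
passage_text = "passage: RLHF stands for Reinforcement Learning from Human Feedback..."

# Instruction-tuned style (E5-mistral-like): a task description prepended to the query only
task = "Given a web search query, retrieve relevant passages that answer the query"
instructed_query = f"Instruct: {task}\nQuery: how does RLHF work"
```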
Cross-Encoders vs Bi-Encoders
Understanding the distinction between cross-encoders and bi-encoders is crucial for building retrieval systems.
Bi-Encoders (Dual Encoders)
Bi-encoders process query and document independently, producing separate embeddings that are compared with dot product or cosine similarity.
Advantages:
- Document embeddings can be pre-computed and cached
- Fast retrieval via approximate nearest neighbor search
- Scales to billions of documents
Disadvantages:
- No cross-attention between query and document
- May miss nuanced relevance signals
- Less accurate than cross-encoders
Cross-Encoders
Cross-encoders process query and document together in a single forward pass, with full attention between them.
Advantages:
- Full attention allows rich query-document interaction
- Significantly more accurate (typically 5-15% better on benchmarks)
- Can capture subtle relevance signals
Disadvantages:
- Must compute for every query-document pair
- Cannot pre-compute document representations
- Too slow for initial retrieval at scale
The Practical Solution: Retrieve and Re-rank
Most production systems combine both:
- First stage: Bi-encoder retrieves top-k candidates (k=100-1000) using fast vector similarity
- Second stage: Cross-encoder re-ranks candidates by computing exact scores
This gives you the efficiency of bi-encoders with accuracy approaching cross-encoders.
┌─────────────────────────────────────────────────────────────────────────────┐
│ BI-ENCODER VS CROSS-ENCODER COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BI-ENCODER (Dual Encoder) │
│ ───────────────────────── │
│ │
│ Query: "What is RLHF?" Document: "RLHF is a technique..." │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │Encoder A│ │Encoder B│ (may be same weights) │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ▼ ▼ │
│ [0.2, -0.1, ...] [0.3, -0.2, ...] │
│ │ │ │
│ └─────────┬─────────────────┘ │
│ ▼ │
│ cosine_similarity = 0.89 │
│ │
│ ✓ Encodings are INDEPENDENT │
│ ✓ Documents can be pre-computed │
│ ✓ Fast nearest neighbor search │
│ ✗ No query-document attention │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ CROSS-ENCODER │
│ ───────────── │
│ │
│ Input: "[CLS] What is RLHF? [SEP] RLHF is a technique... [SEP]" │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Transformer │ │
│ │ (full cross- │ │
│ │ attention) │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ [CLS] → Linear → σ → 0.92 │
│ │
│ ✓ Full attention between query and document │
│ ✓ More accurate relevance modeling │
│ ✗ Must compute for EVERY pair │
│ ✗ Cannot pre-compute document representations │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ PRACTICAL PIPELINE: Retrieve then Re-rank │
│ ───────────────────────────────────────── │
│ │
│ Query ──▶ Bi-Encoder ──▶ ANN Search ──▶ Top 100 ──▶ Cross-Encoder │
│ │ │ docs re-ranking │
│ ▼ ▼ │ │
│ Query embedding 1B documents ▼ │
│ (pre-computed) Top 10 results │
│ (re-ordered) │
│ │
│ Latency: ~10ms + ~100ms = ~110ms total │
│ Accuracy: Near cross-encoder quality │
│ Scalability: Billions of documents │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
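A compact sketch of the two-stage pipeline shown above using Sentence Transformers (model names and k values are illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query, corpus, corpus_emb, retrieve_k=100, final_k=10):
    # Stage 1: fast vector retrieval over pre-computed corpus embeddings
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=retrieve_k)[0]
    # Stage 2: exact cross-encoder scores on the small candidate set
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:final_k]
```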
ColBERT and Late Interaction
ColBERT (Contextualized Late Interaction over BERT) represents a middle ground between bi-encoders and cross-encoders.
The ColBERT Approach
Instead of pooling token representations into a single vector, ColBERT keeps all token embeddings:
- Query: N token embeddings (one per token)
- Document: M token embeddings (one per token)
Similarity is computed as MaxSim: for each query token, find the maximum similarity to any document token, then sum:
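score(q, d) = Σᵢ maxⱼ sim(qᵢ, dⱼ), where i ranges over query tokens and j over document tokens.

In code this is one similarity matrix and a row-wise max; a minimal sketch assuming L2-normalized token embeddings as PyTorch tensors:

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: [n_query_tokens, dim]; doc_tokens: [n_doc_tokens, dim]; both L2-normalized."""
    sim = query_tokens @ doc_tokens.T   # cosine similarity of every query/document token pair
    return sim.max(dim=1).values.sum()  # best document match per query token, summed over query tokens
```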
Why This Works
MaxSim captures a form of soft term matching:
- Query token "diabetes" will have high similarity to document token "diabetes" or "diabetic"
- Query token "symptoms" will match document tokens like "symptoms," "signs," "manifestations"
- Each query term finds its best match in the document
This is more expressive than a single-vector comparison but still allows document pre-computation.
Trade-offs
Advantages:
- Better accuracy than bi-encoders (often close to cross-encoders)
- Documents can still be pre-indexed
- Interpretable: you can see which tokens matched
Disadvantages:
- Storage: must store all token embeddings, not just one vector per document
- Retrieval complexity: requires specialized indexing (like PLAID)
- Higher latency than single-vector bi-encoders
ColBERT v2 and PLAID indexing have made this practical for large-scale deployment, and it's increasingly popular for applications where retrieval quality justifies the overhead.
Matryoshka Embeddings: Flexible Dimensionality
Matryoshka Representation Learning (MRL) is an elegant technique that makes embedding dimensions adaptive.
The Problem
Different applications have different dimension requirements:
- High-precision retrieval might want 1024+ dimensions
- Mobile deployment might need 256 dimensions
- Some use cases fall in between
Traditionally, you'd train separate models for each dimension or truncate (which usually degrades quality significantly).
The Matryoshka Solution
MRL trains embeddings so that the first d dimensions form a valid d-dimensional embedding for any d less than the full dimension.
During training, the contrastive loss is computed at multiple truncation points and summed:

L_MRL = Σ_d w_d · L_contrastive(q[:d], doc[:d])

where d ranges over a set of truncation dimensions such as [64, 128, 256, 512, 1024, ...], q[:d] denotes the first d dimensions of the embedding (re-normalized before computing similarity), and w_d are optional per-dimension weights.
This forces the model to encode the most important information in early dimensions. The first 256 dimensions must be a good embedding on their own, with dimensions 257-512 adding refinement, and so on.
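A rough sketch of this idea on top of the in-batch InfoNCE loss from earlier (truncation dimensions and the temperature are illustrative; truncated vectors are re-normalized before computing similarities):

```python
import torch
import torch.nn.functional as F

def matryoshka_info_nce(query_emb, doc_emb, dims=(64, 128, 256, 512, 1024), temperature=0.05):
    """Average the in-batch InfoNCE loss over several truncation points of the same embeddings."""
    total = 0.0
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=1)  # first d dimensions, re-normalized
        p = F.normalize(doc_emb[:, :d], dim=1)
        logits = q @ p.T / temperature
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```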
Practical Benefits
Runtime flexibility: Choose dimension based on latency/quality trade-off without retraining.
Efficient search: Search with low dimensions first, then refine with full dimensions.
Progressive retrieval: Quickly eliminate candidates with low-dimensional comparison, then use full dimensions for top-k.
OpenAI's text-embedding-3 models support MRL natively, and several open-source models (like nomic-embed-text) also support it.
Training Data Sources
The quality of embedding models depends critically on training data. Here's where that data comes from:
Natural Paired Data
Title-Body Pairs: Web page titles are often good summaries of content. Billions of pairs available from Common Crawl.
Question-Answer Pairs: Stack Overflow, Quora, Reddit—any Q&A platform provides natural pairs where questions and accepted answers should be similar.
Anchor Text: The text of a hyperlink often describes the linked page. "Click here for the RLHF paper" linked to the RLHF paper creates a natural pair.
Search Logs: If you have access to search logs with clicks, query-clicked-document pairs are gold (but rarely available outside search companies).
Synthetic Data Generation
LLM-Generated Questions: Given a passage, ask an LLM to generate questions that the passage answers. This creates abundant question-passage pairs from any corpus.
Paraphrase Generation: Use back-translation (English → French → English) or LLM paraphrasing to create semantic equivalents.
Hard Negative Generation: Ask an LLM to generate plausible but wrong answers, creating high-quality hard negatives.
Human-Annotated Datasets
MS MARCO: 1 million queries with human relevance judgments. The gold standard for passage retrieval.
Natural Questions: 300K+ real Google search queries with Wikipedia answers.
BEIR Benchmark: Collection of diverse retrieval tasks (scientific, financial, news, etc.) for evaluation.
STS Benchmark: Human-annotated semantic similarity scores for sentence pairs.
Fine-Tuning for Your Domain
General embedding models work well for general text, but domain-specific applications often benefit from fine-tuning.
When to Fine-Tune
Fine-tuning makes sense when:
- Your domain has specialized vocabulary (medical, legal, scientific terms)
- Semantic relationships differ from general text ("side effects" and "adverse events" are synonymous in medical contexts)
- You have domain-specific query patterns
- General models underperform on your evaluation set
Collecting Fine-Tuning Data
Query logs: If your application has users, log queries and what they ultimately found useful.
Expert curation: Have domain experts create or validate query-document pairs.
Synthetic generation: Use domain documents + LLM to generate domain-specific questions.
Transfer from similar domains: Medical search data might help with biotech; legal search might help with compliance.
Fine-Tuning Strategies
Full fine-tuning: Update all model weights. Most flexible but requires more data (thousands of pairs minimum) and risks overfitting.
LoRA/QLoRA: Train low-rank adapters, keeping base weights frozen. Efficient, works with less data, easier to experiment.
Sentence Transformers library: Provides high-level APIs for fine-tuning embedding models with various loss functions.
Staged fine-tuning: First fine-tune on abundant weak signal, then on smaller high-quality data.
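As a starting point, here is a minimal Sentence Transformers fine-tuning sketch using MultipleNegativesRankingLoss, which implements the in-batch-negative contrastive objective described earlier (the data, model name, and hyperparameters are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each example is a (query, relevant_passage) pair; other pairs in the batch act as negatives
train_examples = [
    InputExample(texts=["symptoms of diabetes", "Common symptoms of diabetes include..."]),
    InputExample(texts=["how to train a neural network", "Backpropagation works by..."]),
    # ... thousands of domain pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="domain-tuned-embedder",
)
```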
Avoiding Catastrophic Forgetting
Fine-tuning on narrow domain data can degrade general performance. Mitigations:
- Mix domain data with general data during fine-tuning
- Use lower learning rates
- Early stopping based on validation on both domain and general benchmarks
- LoRA naturally limits forgetting by keeping base weights frozen
Evaluation: How to Measure Embedding Quality
Evaluating embedding models requires understanding what "good" means for your application.
Retrieval Metrics
Recall@k: What fraction of relevant documents appear in the top k results? Critical for retrieval systems where a later re-ranker will refine results.
NDCG (Normalized Discounted Cumulative Gain): Accounts for ranking order—relevant documents ranked higher contribute more.
MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant result. Useful when users typically want one good answer.
MAP (Mean Average Precision): Average precision at each relevant document, averaged across queries.
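These metrics are straightforward to compute from ranked result lists; a small sketch for Recall@k and MRR, where the inputs are hypothetical ranked document IDs and per-query relevant-ID sets:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaged over all queries in an evaluation set:
# mean_recall = sum(recall_at_k(r, rel, 10) for r, rel in runs) / len(runs)
```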
Similarity Metrics
Spearman Correlation: For STS tasks, how well do model similarities correlate with human similarity judgments?
Benchmark Suites
MTEB (Massive Text Embedding Benchmark): 56 datasets across 8 task types. The standard for comparing embedding models.
BEIR: 18 diverse retrieval datasets. Tests generalization across domains.
Custom evaluation: Build an evaluation set from your actual use case. Ultimately, performance on your data matters most.
Evaluation Pitfalls
Leakage: If your training data overlaps with evaluation data, metrics are meaningless. MS MARCO is in many training sets—don't treat it as unbiased evaluation.
Distribution shift: Academic benchmarks may not match your production distribution.
Metric gaming: Optimizing for one metric can hurt others. Track multiple metrics.
Practical Recommendations
Based on the current landscape, here are practical recommendations for different scenarios:
For Most Applications
Start with a strong general-purpose model:
- OpenAI text-embedding-3-large for best quality (if you can use proprietary APIs)
- BGE-large-en-v1.5 or E5-large-v2 for strong open-source options
- nomic-embed-text if you need fully open (Apache 2.0) weights and training data
Use the retrieve-and-rerank pattern:
- Bi-encoder for initial retrieval (top 100)
- Cross-encoder re-ranker for final ordering (top 10-20)
For Latency-Sensitive Applications
- Use smaller models (BGE-small, E5-small)
- Consider Matryoshka truncation to reduce dimensions
- Quantize embeddings (int8 often works well)
- Use GPU inference or optimized CPU inference (ONNX)
For Domain-Specific Applications
- Evaluate general models first—they might be good enough
- If not, fine-tune with domain data using Sentence Transformers
- Start with LoRA fine-tuning before full fine-tuning
- Build a domain-specific evaluation set to measure progress
For Multilingual Applications
- Cohere embed-v3 (multilingual by default)
- BGE-M3 (multilingual, supports multiple retrieval modes)
- multilingual-e5-large
For Code Search
- Voyage Code (proprietary)
- CodeBERT or GraphCodeBERT (open-source)
- Consider fine-tuning general models on code pairs
The Future of Embedding Models
Several trends are shaping the future:
Larger context windows: Models like jina-embeddings-v2 support 8K tokens, enabling embedding of longer documents without chunking.
Multi-vector representations: ColBERT-style approaches are becoming more practical, offering better quality with reasonable efficiency.
Instruction-following embeddings: Models that can adjust behavior based on task instructions, enabling a single model to handle diverse retrieval needs.
Integration with LLMs: Using LLMs to generate embeddings (like E5-mistral) or to supervise embedding training with synthetic data.
Domain specialization: More models tailored for specific verticals (medical, legal, financial, code).
Efficiency improvements: Better quantization, distillation, and architecture innovations to reduce costs while maintaining quality.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
LLM-Powered Search for E-Commerce: Beyond NER and Elasticsearch
A deep dive into building intelligent e-commerce search systems that understand natural language, leverage metadata effectively, and support multi-turn conversations—moving beyond classical NER + Elasticsearch approaches.
Building Semantic Memory for LLM Conversations: A Hierarchical RAG Approach
A practical guide to building a semantic search system for your LLM conversation history using hierarchical chunking, HyDE retrieval, knowledge graphs, and agentic research patterns.