Embedding Models & Strategies: Choosing and Optimizing Embeddings for AI Applications
Comprehensive guide to embedding models for RAG, search, and AI applications. Comparison of text-embedding-3, BGE, E5, Cohere Embed v4, and Voyage with guidance on fine-tuning, dimensionality, multimodal embeddings, and production optimization.
Embedding Models & Strategies
Embeddings are the foundation of modern AI applications. They transform text, images, and other data into dense vector representations where semantic similarity becomes geometric proximity. The quality of your embeddings directly determines the quality of your retrieval, search, and recommendation systems.
Yet embedding model selection is often treated as an afterthought—teams default to whatever their framework suggests without understanding the tradeoffs. This guide provides comprehensive coverage of embedding models in 2025: how they work, how to choose between them, when to fine-tune, and how to optimize for production.
How Embeddings Work
Embeddings convert discrete inputs (words, sentences, documents, images) into continuous vector spaces. In these spaces, similar items cluster together, and the distance between vectors reflects semantic relationships.
The Transformation
When you embed the sentence "How do I reset my password?", the model produces a vector—typically 768 to 3072 floating-point numbers. This vector captures the meaning of the sentence in a way that enables comparison with other sentences.
The sentence "I forgot my login credentials" produces a different vector, but one that's geometrically close to the first. Meanwhile, "What's the weather today?" produces a vector far from both. This geometric relationship enables semantic search: find vectors closest to a query vector, and you find semantically similar content.
Training Objectives
Modern embedding models are trained using contrastive learning. The model learns to produce similar embeddings for semantically related pairs (a question and its answer, a query and relevant document) while pushing apart embeddings for unrelated pairs.
This training creates embeddings that capture semantic relationships far beyond keyword overlap. "Automobile" and "car" produce similar embeddings despite sharing no characters. "Bank" in financial and river contexts produces different embeddings based on surrounding words.
Dimensions and Information Capacity
Embedding dimensions determine how much information a vector can encode. Higher dimensions capture more nuance but require more storage and computation. Common dimensions range from 384 (lightweight models) to 3072 (high-capacity models).
The relationship between dimensions and quality isn't linear. Doubling dimensions doesn't double quality—there are diminishing returns. For most applications, 768-1536 dimensions provide an excellent balance between quality and efficiency.
The 2025 Embedding Model Landscape
The embedding model landscape has evolved rapidly. Open-source models now rival commercial APIs, multilingual capabilities have improved dramatically, and multimodal embeddings enable cross-modal search.
OpenAI text-embedding-3
OpenAI's text-embedding-3 family represents their third generation of embedding models, offering significant improvements over ada-002.
text-embedding-3-large: 3072 dimensions (can be truncated to lower dimensions), trained with Matryoshka Representation Learning. This technique front-loads important information in early dimensions, allowing you to use fewer dimensions with graceful quality degradation.
text-embedding-3-small: 1536 dimensions, optimized for cost and speed while maintaining strong quality.
Strengths: Strong general-purpose performance, Matryoshka dimensionality flexibility, extensive API ecosystem integration.
Weaknesses: Some benchmarks show lower accuracy than alternatives for specialized tasks. The compression tradeoff forces the model to prioritize general information over specific details, which can hurt precision for strict constraints.
Pricing: $0.13 per million tokens (large), $0.02 per million tokens (small).
Best for: Teams already using OpenAI APIs, general-purpose applications, and use cases where dimensionality flexibility matters.
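A sketch of requesting a Matryoshka-truncated embedding through the OpenAI SDK's dimensions parameter (assumes OPENAI_API_KEY is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I reset my password?"],
    dimensions=1024,  # Matryoshka truncation: ask for fewer than the full 3072 dims
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```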
Cohere Embed v4
Cohere's Embed v4 represents a significant advancement in commercial embeddings, particularly for multimodal and multilingual applications.
Capabilities: True multimodal embedding—can embed text and images into the same vector space. 1536 dimensions with 100+ language support. Massive context window simplifies processing of complex documents.
Strengths: Multimodal capabilities enable image-text search without separate models. Excellent multilingual performance across 100+ languages. Large context window handles long documents without chunking.
Weaknesses: Paid API only—no self-hosting option. Pricing can accumulate for high-volume applications.
Pricing: $0.12 per million tokens.
Best for: Multimodal applications, multilingual deployments, and use cases with complex visually-rich documents.
BGE (Beijing Academy of Artificial Intelligence)
The BGE family has become the go-to choice for teams wanting commercial-quality embeddings without API dependencies.
BGE-M3: The flagship model supporting dense, sparse, and multi-vector retrieval in a single model. 1024 dimensions with excellent multilingual support (Chinese-English-Japanese particularly strong). Apache 2.0 license enables full commercial use.
BGE-large-en-v1.5: English-focused model with strong performance on retrieval benchmarks.
Strengths: Open-source with permissive licensing. Self-hostable for privacy and cost control. Excellent performance rivaling commercial APIs. Multi-functionality (dense + sparse + multi-vector) unique to BGE-M3.
Weaknesses: Requires infrastructure to host. Smaller context window than some alternatives.
Best for: Self-hosted deployments, privacy-sensitive applications, cost-conscious teams at scale, and applications needing hybrid dense/sparse retrieval.
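A sketch of BGE-M3's combined dense, sparse, and multi-vector output via the FlagEmbedding library; the parameter and output key names follow the FlagEmbedding documentation as I recall it, so verify against the current release:

```python
from FlagEmbedding import BGEM3FlagModel

# Downloads BAAI/bge-m3 from Hugging Face on first use.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["How do I reset my password?"],
    return_dense=True,         # 1024-dim dense vectors
    return_sparse=True,        # lexical weights for sparse/hybrid retrieval
    return_colbert_vecs=True,  # per-token multi-vectors
)
print(output["dense_vecs"].shape)         # (1, 1024)
print(len(output["lexical_weights"][0]))  # number of weighted tokens
print(output["colbert_vecs"][0].shape)    # (num_tokens, dim)
```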
E5 (Microsoft)
Microsoft's E5 family provides strong open-source options with unique architectural choices.
E5-Mistral: Builds on the Mistral-7B language model (a decoder-only architecture) fine-tuned with E5's contrastive training objective. Much larger than the other E5 models (7B parameters, 4096-dimension embeddings), with open weights and strong benchmark performance.
E5-large-v2: Established workhorse model with proven performance across benchmarks.
Strengths: Open-source with strong multilingual support. Particularly good for question-answering where short queries retrieve relevant passages.
Weaknesses: Requires specific prompt formatting for optimal performance ("query: " and "passage: " prefixes).
Best for: Question-answering systems, document retrieval, and multilingual applications requiring self-hosting.
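The prefix requirement mentioned above looks like this in practice (a sketch with sentence-transformers; the example texts are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 models expect "query: " and "passage: " prefixes at encode time.
query = "query: how to reset a password"
passages = [
    "passage: To reset your password, open Settings and choose 'Forgot password'.",
    "passage: Today's forecast calls for light rain in the afternoon.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(p_emb @ q_emb)  # the first passage should score noticeably higher
```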
Voyage AI
Voyage AI has emerged as a strong competitor focusing on retrieval quality and specialized domain models.
voyage-3: General-purpose model optimized for retrieval tasks.
voyage-code-3: Specialized for code retrieval and understanding.
voyage-law-2, voyage-finance-2: Domain-specific models for legal and financial applications.
Strengths: Domain-specific models outperform general-purpose alternatives in their specialties. Strong retrieval optimization.
Weaknesses: Smaller ecosystem than OpenAI or Cohere. Limited self-hosting options.
Best for: Domain-specific applications (legal, finance, code) where specialized embeddings provide meaningful quality improvements.
Model Comparison Summary
| Model | Dimensions | Multilingual | Self-Host | Price (per 1M tokens) |
|---|---|---|---|---|
| text-embedding-3-large | 3072 (flexible) | Good | No | $0.13 |
| text-embedding-3-small | 1536 | Good | No | $0.02 |
| Cohere Embed v4 | 1536 | Excellent (100+) | No | $0.12 |
| BGE-M3 | 1024 | Excellent | Yes (Apache 2.0) | Free |
| E5-Mistral | 4096 | Good | Yes | Free |
| Voyage-3 | 1024 | Good | No | $0.06 |
Choosing the Right Model
Decision Framework
Start with constraints:
| Constraint | Recommendation |
|---|---|
| Must self-host | BGE-M3 or E5 |
| Multimodal (text + images) | Cohere Embed v4 |
| Budget-sensitive at scale | BGE-M3 (self-hosted) or text-embedding-3-small |
| Maximum quality, cost secondary | Cohere Embed v4 or text-embedding-3-large |
| Domain-specific (legal/finance/code) | Voyage domain models |
| Multilingual (100+ languages) | Cohere Embed v4 or BGE-M3 |
Common Patterns
Startup MVP: Use text-embedding-3-small for simplicity and low cost. Upgrade to larger models or fine-tuning only after validating your approach.
Enterprise self-hosted: Deploy BGE-M3 on your infrastructure. The Apache 2.0 license allows commercial use without API costs or data leaving your environment.
Multilingual SaaS: Cohere Embed v4 for best-in-class multilingual quality, or BGE-M3 if self-hosting is required.
Code search: Voyage-code-3 for specialized code understanding, or fine-tuned BGE on your codebase.
Fine-Tuning Embeddings
Off-the-shelf embeddings work well for general applications, but fine-tuning can dramatically improve performance for specialized domains.
When to Fine-Tune
Fine-tuning makes sense when:
Domain vocabulary differs significantly: Medical, legal, financial, and technical domains use specialized terminology that general models may not represent well.
Your retrieval pairs have specific characteristics: If your queries are short questions and your documents are long technical manuals, fine-tuning on similar pairs improves alignment.
You have quality training data: Fine-tuning requires pairs of (query, relevant_document). If you have click logs, labeled relevance data, or can generate synthetic pairs, fine-tuning becomes viable.
Marginal quality improvements matter: In production search systems, a 5% improvement in recall@10 can significantly impact user experience and business metrics.
Fine-Tuning Approaches
Contrastive fine-tuning: Train on (query, positive, negative) triplets where the model learns to bring query and positive closer while pushing away negatives. This is the most common and effective approach.
Supervised fine-tuning: Train on (query, document, relevance_score) pairs where the model learns to predict relevance scores. Useful when you have graded relevance data.
Distillation: Train your embedding model to mimic a larger, more capable model on your domain data. This can improve smaller models significantly.
Tools for Fine-Tuning
Sentence Transformers: The standard library for embedding fine-tuning. Supports contrastive learning, distillation, and various loss functions.
LlamaIndex: Provides fine-tuning utilities integrated with their retrieval framework.
Cohere: Offers fine-tuning for enterprise customers on their Embed models.
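As a concrete example, the contrastive approach can be sketched with the classic Sentence Transformers fit API, assuming you already have (query, relevant_document) pairs; MultipleNegativesRankingLoss treats the other documents in each batch as negatives. The model name and training pairs below are illustrative:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# (query, relevant_document) pairs from your own data; these are placeholders.
pairs = [
    ("how do I reset my password", "Open Settings > Security > Reset password ..."),
    ("refund policy for annual plans", "Annual subscriptions can be refunded within 30 days ..."),
]
train_examples = [InputExample(texts=[q, d]) for q, d in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other document in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("bge-base-finetuned")
```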
Fine-Tuning Best Practices
Quality over quantity: 10,000 high-quality pairs often outperform 100,000 noisy pairs. Focus on representative examples of your actual use case.
Include hard negatives: Random negatives are too easy—the model learns little from them. Include negatives that are superficially similar but semantically different (same topic, different answer).
Validate on held-out data: Fine-tuning can overfit. Always measure performance on data the model hasn't seen.
Start from strong base: Fine-tuning a good base model (BGE, E5) produces better results than fine-tuning a weaker model, even with the same training data.
Dimensionality Strategies
Embedding dimensions significantly impact storage, compute, and quality. Understanding the tradeoffs enables informed optimization.
Matryoshka Embeddings
OpenAI's text-embedding-3 models use Matryoshka Representation Learning, which front-loads important information in early dimensions. You can truncate a 3072-dimensional embedding to 1024 or 512 dimensions with graceful quality degradation.
Use case: Tiered retrieval systems. Use low-dimensional embeddings for initial retrieval (fast, high recall), then high-dimensional embeddings for re-ranking (slower, high precision).
Tradeoff: Compression forces the model to prioritize general information over specific details. For tasks requiring fine-grained distinctions, full dimensions may be necessary.
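If you already store full-size Matryoshka vectors, truncation is just slicing plus re-normalization. A sketch (the random vector stands in for a real text-embedding-3-large output):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)          # stand-in for a real 3072-dim embedding
short = truncate_matryoshka(full, 512)
print(short.shape)  # (512,)
```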
Dimensionality Reduction
For models without Matryoshka training, you can apply dimensionality reduction:
PCA: Principal Component Analysis projects embeddings to lower dimensions while preserving maximum variance. Simple and effective but requires fitting on your data.
Random projection: Faster than PCA with theoretical guarantees on distance preservation. Works well for approximate applications.
Learned projection: Train a small neural network to project high-dimensional embeddings to lower dimensions while preserving similarity relationships.
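For example, a PCA projection with scikit-learn; this sketch uses random stand-in vectors in place of real corpus embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for your own corpus embeddings, e.g. (n_docs, 1024) from BGE-M3.
corpus_embeddings = np.random.randn(10_000, 1024).astype(np.float32)

pca = PCA(n_components=256)
reduced = pca.fit_transform(corpus_embeddings)   # fit on your own data
print(pca.explained_variance_ratio_.sum())       # variance retained at 256 dims

# Queries must be projected with the SAME fitted PCA before comparison.
query_reduced = pca.transform(np.random.randn(1, 1024).astype(np.float32))
```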
Practical Recommendations
Storage-constrained: Use 512-768 dimensions. Quality loss is modest for most applications.
Balanced: Use 1024-1536 dimensions. Standard choice for production RAG systems.
Maximum quality: Use full dimensions (3072 for text-embedding-3-large). Worth the cost for applications where retrieval quality directly impacts outcomes.
Multimodal Embeddings
Multimodal embeddings map different modalities (text, images, audio) into a shared vector space, enabling cross-modal search and retrieval.
How They Work
Multimodal models train on paired data (image-caption pairs, for example) to align representations across modalities. After training, you can embed a text query and retrieve relevant images, or embed an image and find related text descriptions.
Available Options
Cohere Embed v4: Text and image embeddings in a shared space. Best-in-class for production multimodal RAG.
CLIP (OpenAI): The original multimodal embedding model. Open weights available. Good quality but showing age compared to newer options.
SigLIP: Google's improved version of CLIP with better training methodology.
Nomic Embed Vision: Open-source multimodal embeddings with permissive licensing.
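A sketch of cross-modal search with CLIP through the sentence-transformers wrapper (the image path is illustrative; any local image works):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into one shared vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode([Image.open("product_photo.jpg")])  # illustrative local file
text_emb = model.encode(["a red leather handbag", "a mountain bike"])

print(util.cos_sim(text_emb, image_emb))  # higher score = better text match for the image
```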
Use Cases
Visual search: Users upload images and find similar items (e-commerce, stock photos).
Image-text RAG: Retrieve relevant images alongside text documents based on semantic queries.
Document understanding: Process documents with figures, charts, and images alongside text.
Considerations
Alignment quality: Different modalities may not align perfectly. Test retrieval quality across modal boundaries.
Dimension consistency: Ensure text and image embeddings have the same dimensions for storage in the same index.
Processing cost: Image embedding is typically more expensive than text embedding.
Late Chunking: Preserving Document Context
Late chunking is a novel method that addresses a fundamental problem with traditional RAG: when you chunk documents before embedding, each chunk loses the context of the surrounding document.
The Problem with Traditional Chunking
Traditional "naive chunking" splits documents first, then embeds each chunk independently. This loses critical long-distance context. In a document about Paris, the phrase "the city" might end up in a different chunk from where "Paris" is mentioned. Without the full context, the embedding model can't link these references, producing less accurate embeddings.
How Late Chunking Works
Late chunking reverses the order of operations:
- Embed entire document: Apply the transformer part of the embedding model to the entire text (or the largest portion that fits in the context window)
- Get token embeddings: This generates vector representations for each token that encompass textual information from the entire text
- Chunk after embedding: Apply mean pooling to smaller segments of the token sequence, producing chunk embeddings that retain full document context
The key insight is that each token's embedding already contains information from the entire document through the transformer's attention mechanism. By chunking after the transformer (but before pooling), each chunk embedding benefits from document-wide context.
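A simplified sketch of the mechanism, using a long-context Jina model loaded through Hugging Face Transformers. The chunk boundaries here are illustrative token indices; real implementations derive them from tokenizer offsets of the chosen text chunks:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any long-context embedding model with accessible token embeddings works in principle.
model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

document = "Paris is the capital of France. ... The city is known for its museums."
inputs = tokenizer(document, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)

# Chunk AFTER the transformer: pool token spans that each attended to the whole document.
chunk_token_spans = [(0, 9), (9, token_embeddings.shape[0])]  # illustrative boundaries
chunk_embeddings = [
    token_embeddings[start:end].mean(dim=0)  # mean pooling per span
    for start, end in chunk_token_spans
]
print(len(chunk_embeddings), chunk_embeddings[0].shape)
```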
Performance Benefits
Traditional chunking shows similarity scores of 70-75% when matching related concepts. With late chunking, which maintains the context of the entire document, these scores rise to 82-84%. This represents a meaningful improvement in retrieval quality, particularly for documents with complex cross-references.
Late Chunking vs. Contextual Retrieval
Anthropic's Contextual Retrieval addresses the same problem differently: it sends each chunk to an LLM along with the full document to add relevant context. This is essentially context enrichment where global context is explicitly hardcoded into each chunk.
| Approach | Mechanism | Cost | Storage |
|---|---|---|---|
| Late Chunking | Single embedding pass, chunk after | Low | Normal |
| Contextual Retrieval | LLM call per chunk | High | Increased |
Late chunking achieves similar benefits without the cost of LLM calls for each chunk. However, contextual retrieval combined with BM25 and reranking achieves up to 67% reduction in retrieval failures—see Contextual Retrieval: Solving RAG's Hidden Context Problem for complete implementation details.
Implementation
Jina AI's jina-embeddings-v3 and v4 support late chunking natively. Enable it by including late_chunking=True in your request. The API concatenates all sentences and feeds them as a single string to the model, then returns separate embeddings for each chunk.
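A hedged sketch of what such a request might look like against Jina's embeddings HTTP endpoint; the URL and response shape are assumptions based on Jina's public API conventions, so check the current API reference before relying on them:

```python
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",  # assumed endpoint; verify in Jina's docs
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v3",
        "late_chunking": True,  # embed the whole input first, chunk afterwards
        "input": [
            "Paris is the capital of France.",
            "The city is known for its museums and cafes.",
        ],
    },
)
embeddings = [item["embedding"] for item in resp.json()["data"]]
```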
ColBERT: Multi-Vector Retrieval
ColBERT (Contextualized Late Interaction over BERT) represents text using token-level vector embeddings rather than a single vector per document. This enables more nuanced matching between queries and documents.
Single-Vector vs. Multi-Vector
Traditional embedding models produce a single vector for an entire document. This works well for general similarity but can miss specific term matches. If a query asks about "machine learning optimization techniques" and a document discusses these terms in different sections, a single vector may not capture the relevance well.
ColBERT produces one vector per token, enabling fine-grained matching. The relevance score is computed by finding the maximum similarity between each query token and all document tokens, then summing these scores. This "late interaction" captures detailed semantic relationships that single vectors miss.
How It Works
- Encoding: Both query and document are encoded through BERT, producing token-level embeddings
- Indexing: Document token embeddings are stored (compressed for efficiency)
- Search: For each query token, find the most similar document token
- Scoring: Sum the maximum similarities to produce a relevance score
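The scoring step (MaxSim plus sum) is compact enough to show directly; the random arrays below stand in for token embeddings produced by the ColBERT encoder:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT late interaction: for each query token, take its best-matching
    document token, then sum those maxima. Assumes L2-normalized token vectors."""
    sim_matrix = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens) cosine similarities
    return float(sim_matrix.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Stand-in token embeddings (real ones come from the ColBERT encoder).
query_vecs = np.random.randn(6, 128)
query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
doc_vecs = np.random.randn(300, 128)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

print(maxsim_score(query_vecs, doc_vecs))
```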
RAGatouille: Easy ColBERT Integration
RAGatouille is a lightweight Python package that brings ColBERT-style late interaction retrieval into real-world RAG pipelines. It's open source, easy to install, and compatible with LangChain, LlamaIndex, and other frameworks.
Capabilities:
- Simple indexing and search APIs
- Fine-tuning support via RAGTrainer
- Integration with existing RAG pipelines
While ColBERT provides excellent retrieval quality, it has higher compute cost and indexing time than single-vector approaches. RAGatouille requires substantial compute and memory for indexing.
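A sketch of the RAGatouille workflow; the index name and documents are placeholders, and the exact result keys follow RAGatouille's documentation as of writing:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index a small collection (documents are chunked and encoded token-by-token).
RAG.index(
    collection=[
        "ColBERT scores documents with token-level late interaction.",
        "Single-vector models pool everything into one embedding.",
    ],
    index_name="demo_index",
)

results = RAG.search(query="how does late interaction scoring work?", k=2)
print(results[0]["content"], results[0]["score"])
```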
Jina ColBERT v2
Jina ColBERT v2 improves on the original with:
- 89 language support
- User-controlled output dimensions
- 8192 token context length
- Improved multilingual retrieval performance
Infrastructure Support
Several vector databases now support ColBERT natively:
- Vespa: ColBERT embedder with compression support and long-context implementation
- Qdrant: Multi-vector generation and indexing via FastEmbed library
- LanceDB: Native ColBERT support for multi-vector retrieval
When to Use ColBERT
| Use Case | Single-Vector | ColBERT |
|---|---|---|
| General semantic search | ✅ Sufficient | Overkill |
| Precision-critical retrieval | ❌ May miss details | ✅ Better matching |
| Long documents with diverse content | ❌ Single vector loses nuance | ✅ Token-level matching |
| Latency-critical applications | ✅ Faster | ❌ More computation |
| Storage-constrained | ✅ One vector per doc | ❌ Many vectors per doc |
Embedding Compression
For large-scale deployments, embedding compression reduces storage costs and improves query performance.
Matryoshka Representation Learning (MRL)
As noted under Dimensionality Strategies, OpenAI's text-embedding-3 models are trained with Matryoshka Representation Learning, so a 3072-dimensional embedding can be truncated to 1024 or 512 dimensions with graceful quality degradation.
How it works: During training, the model is optimized to produce useful embeddings at multiple dimension truncation points. The most important information is encoded in the first dimensions, with additional dimensions adding refinement.
Benefits:
- Reduce storage by 50-80% with modest quality loss
- Speed up search by reducing dimension comparisons
- Flexible quality-cost tradeoffs at query time
Quantization for Embeddings
Vector quantization (covered in the Vector Database guide) also applies to embeddings:
Scalar quantization: Convert FP32 embeddings to INT8, reducing storage 4x with minimal quality loss.
Binary quantization: Convert to 1-bit per dimension, reducing storage 32x. Best for high-dimensional embeddings (1024+).
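Both forms of quantization are simple to prototype in NumPy before committing to a vector database's built-in support. A sketch (random vectors stand in for real embeddings; production systems usually calibrate per dimension on a representative sample and rescore top candidates with the original vectors):

```python
import numpy as np

def scalar_quantize(embeddings: np.ndarray):
    """FP32 -> INT8 per dimension using min/max calibration (4x smaller)."""
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    scale = (hi - lo) / 255.0
    quantized = np.round((embeddings - lo) / scale).astype(np.uint8)
    return quantized, lo, scale  # keep lo/scale to dequantize or rescore later

def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    """FP32 -> 1 bit per dimension: keep only the sign (32x smaller)."""
    return np.packbits(embeddings > 0, axis=1)

vecs = np.random.randn(1000, 1024).astype(np.float32)
q8, lo, scale = scalar_quantize(vecs)
q1 = binary_quantize(vecs)
print(q8.shape, q1.shape)  # (1000, 1024) uint8, (1000, 128) packed bits
```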
Dimensionality Reduction Techniques
For models without native MRL support, the same projection techniques covered under Dimensionality Strategies apply: PCA (fit on your own data), random projection (faster, with theoretical guarantees on distance preservation), or a small learned projection network trained to preserve similarity relationships.
Compression Strategy Recommendations
| Scenario | Recommended Approach |
|---|---|
| Using text-embedding-3 | Use native Matryoshka truncation |
| High-dimensional embeddings (1024+) | Binary quantization + rescoring |
| Moderate compression needed | Scalar quantization (INT8) |
| Maximum compression | PCA to 256-512 dims + scalar quantization |
| Query-time flexibility | Store full embeddings, compress at query time |
Production Optimization
Batching
Embedding models process batches more efficiently than individual items. Optimal batch sizes depend on the model and hardware:
API-based models: Batch sizes of 100-1000 items typically maximize throughput while staying within rate limits.
Self-hosted models: Batch sizes are constrained by GPU memory. Start with 32 and increase until memory is fully utilized.
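For a self-hosted model, the batch size is a single parameter on the encode call. A sketch with sentence-transformers, starting from the batch size of 32 suggested above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
documents = [f"document number {i}" for i in range(10_000)]  # placeholder corpus

embeddings = model.encode(
    documents,
    batch_size=32,               # increase until GPU memory is fully utilized
    normalize_embeddings=True,
    show_progress_bar=True,
)
print(embeddings.shape)
```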
Caching
Embeddings are deterministic—the same input always produces the same output. Aggressive caching prevents redundant computation:
Query caching: Cache embeddings of common queries. Even short TTLs help for repeated queries.
Document caching: Cache document embeddings permanently (or until documents change). This is the primary use case for caching.
Semantic caching: Cache not just exact matches but semantically similar queries. More complex but higher hit rates.
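A minimal content-hash cache for document embeddings, sketched with SQLite; `embed_fn` is whatever embedding call your stack uses, and the schema is illustrative:

```python
import hashlib
import sqlite3
import numpy as np

db = sqlite3.connect("embedding_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (doc_hash TEXT PRIMARY KEY, embedding BLOB)")

def embed_with_cache(text: str, embed_fn) -> np.ndarray:
    """Return a cached embedding if the content hash is known; otherwise embed and store."""
    doc_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT embedding FROM cache WHERE doc_hash = ?", (doc_hash,)).fetchone()
    if row is not None:
        return np.frombuffer(row[0], dtype=np.float32)
    vector = np.asarray(embed_fn(text), dtype=np.float32)
    db.execute("INSERT INTO cache VALUES (?, ?)", (doc_hash, vector.tobytes()))
    db.commit()
    return vector
```

The same hash check doubles as the deduplication strategy described under Cost Management: unchanged documents never reach the embedding model.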
Async Processing
For applications ingesting documents continuously:
Background embedding: Queue documents for embedding asynchronously rather than blocking on embed calls.
Batch accumulation: Accumulate documents and embed in batches rather than one-at-a-time.
Priority queues: Prioritize embedding of high-value or time-sensitive documents.
Cost Management
Embedding costs can accumulate at scale:
Model selection: text-embedding-3-small costs 6.5x less than large with modest quality reduction.
Self-hosting: For high volumes, self-hosted BGE-M3 eliminates per-token costs entirely.
Dimensionality reduction: Lower dimensions reduce storage costs in vector databases (often billed by dimension-hours).
Deduplication: Don't re-embed unchanged documents. Track document hashes and skip embedding if content hasn't changed.
Evaluation
Retrieval Metrics
Recall@k: What fraction of relevant documents appear in the top k results? The primary metric for retrieval quality.
Precision@k: What fraction of top k results are relevant? Important when users see all retrieved results.
MRR (Mean Reciprocal Rank): Average of 1/rank for the first relevant result. Emphasizes ranking the best result highly.
NDCG: Normalized Discounted Cumulative Gain. Accounts for graded relevance and position-weighted importance.
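Recall@k and MRR are short enough to implement directly for your own evaluation sets; the document IDs below are illustrative:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_7", "doc_2", "doc_9"]
relevant = {"doc_2", "doc_4"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: one of two relevant docs found
print(mrr(retrieved, relevant))               # 0.5: first relevant doc at rank 2
```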
Benchmark Datasets
MTEB (Massive Text Embedding Benchmark): Comprehensive benchmark covering retrieval, classification, clustering, and semantic similarity. The standard for comparing embedding models.
BEIR: Benchmark for zero-shot retrieval across diverse domains. Tests generalization without fine-tuning.
Domain-specific benchmarks: Legal, medical, and financial domains have specialized benchmarks that better reflect performance in those areas.
Evaluation Best Practices
Use your data: Public benchmarks don't reflect your specific use case. Create evaluation sets from your actual queries and documents.
Measure what matters: If your application primarily serves question-answering, optimize for retrieval metrics on question-document pairs, not general similarity.
Test edge cases: Include difficult queries (ambiguous, multi-hop, negation) in your evaluation set.
Monitor in production: Offline evaluation doesn't capture everything. Track user behavior signals (click-through, reformulation) as proxy metrics.
Related Articles
Vector Databases: A Comprehensive Guide to Pinecone, Weaviate, Qdrant, Milvus & Chroma
Deep dive into vector database architecture, indexing algorithms, and production considerations. Comprehensive comparison of Pinecone vs Weaviate vs Qdrant vs Milvus vs Chroma with benchmarks, pricing, and use case recommendations for 2025.
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Hybrid Search Strategies: Combining BM25 and Vector Search for Better Retrieval
Deep dive into hybrid search combining lexical (BM25) and semantic (vector) retrieval. Covers RRF fusion, linear combination, query routing, reranking, and production best practices for RAG systems in 2025.
Fine-Tuning Workflows & Best Practices: A Practical Guide for LLM Customization
Comprehensive guide to fine-tuning LLMs including LoRA, QLoRA, and full fine-tuning. Covers data preparation, hyperparameter selection, evaluation strategies, common pitfalls, and 2025 tools like Unsloth, Axolotl, and LLaMA-Factory.