
Hybrid Search Strategies: Combining BM25 and Vector Search for Better Retrieval

Deep dive into hybrid search combining lexical (BM25) and semantic (vector) retrieval. Covers RRF fusion, linear combination, query routing, reranking, and production best practices for RAG systems in 2025.



Pure vector search captures meaning but misses keywords. Pure lexical search finds exact matches but misses semantics. Hybrid search combines both approaches, and when done right it produces far better results than either alone, with real-world recall improvements of 15-30%.

This guide covers hybrid search comprehensively: why it works, how to implement it, fusion techniques, when to route queries differently, and production best practices for RAG systems.


Why Hybrid Search Works

Vector search and lexical search have complementary strengths and weaknesses. Combining them creates a retrieval system more robust than either component.

Vector search excels at semantic similarity—understanding that "automobile" and "car" refer to the same concept, or that "How do I fix a broken pipe?" relates to plumbing even without mentioning "plumbing." But it has critical blind spots:

Exact term matching: Embedding search might miss identifiers like "TS-01" because a dense embedding compresses the whole passage into a single vector, with no guarantee that any individual token remains retrievable. Product codes, error messages, and specific technical terms can be lost in the embedding space.

Rare terms: Uncommon words may not be well-represented in embedding training data. Domain-specific jargon, new terminology, and proper nouns often embed poorly.

Negation and specificity: "Python but not Django" is difficult to express in embedding space. The semantic meaning of negation doesn't always translate to geometric distance.

BM25 and other lexical methods excel at exact matching but miss semantic relationships:

Vocabulary mismatch: Users search for "heart attack" but documents say "myocardial infarction." Perfect relevance, zero keyword overlap.

Synonyms and paraphrasing: "How to terminate a process" and "killing a running program" have the same meaning but share few keywords.

Context blindness: Lexical search treats "bank" identically whether the document discusses rivers or finances.

The Hybrid Advantage

Hybrid search catches what each approach misses alone. When a user searches for "error TS-01 in authentication module":

  • Vector search finds documents about authentication errors with similar symptoms
  • Lexical search finds documents containing the exact error code "TS-01"
  • Hybrid search returns both, ranking documents that match on both dimensions highest

The combination is particularly powerful for:

  • Technical documentation with specific identifiers
  • E-commerce search with product codes and descriptive queries
  • Customer support where users mix error messages with natural language descriptions
  • Legal and medical domains with precise terminology

Fusion Techniques

The core challenge of hybrid search is combining results from different retrieval methods. Each method produces its own ranking, and fusion techniques merge these into a single coherent ranking.

Reciprocal Rank Fusion (RRF)

RRF has become the standard fusion technique for hybrid search, requiring no pretraining, weight tuning, or knowledge of score ranges.

How it works: RRF assigns each document a score based on its rank position in each result list, then combines these scores.

The score formula is: score = Σ (1 / (k + rank)), summed over every result list in which the document appears, where k is a constant (typically 60) and rank is the document's position in that list (starting at 1).

Why k=60?: This constant was determined experimentally and works well across diverse datasets. It controls how quickly scores decay with rank—higher k values give more weight to lower-ranked documents.

Example: A document ranked #1 in vector search and #5 in BM25:

  • Vector contribution: 1/(60+1) = 0.0164
  • BM25 contribution: 1/(60+5) = 0.0154
  • Combined RRF score: 0.0318
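
A minimal Python sketch of RRF (the reciprocal_rank_fusion helper and the document IDs are ours, purely for illustration):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best-first.

    Returns (doc_id, score) pairs sorted by combined RRF score, highest first.
    """
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# doc_a is ranked #1 by vector search and #5 by BM25, so it scores
# 1/61 + 1/65 ≈ 0.0318, matching the worked example above.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_d", "doc_e", "doc_f", "doc_a"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```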

Strengths:

  • Resilient to outliers and domain shifts because it ignores absolute score magnitudes
  • Eliminates the need for score normalization
  • No tuning required—works well out of the box
  • Adaptable to changing environments without continuous fine-tuning

Weaknesses:

  • Ignores score magnitudes that might carry useful information
  • Equal weighting may not be optimal for all query types
  • Less flexibility than learned approaches

When to use: RRF is the default choice when you don't have labeled data for calibration or when robustness matters more than maximum precision.

Linear Combination

Linear combination directly merges scores from each retriever using learned or hand-tuned weights.

How it works: combined_score = α * bm25_score + (1-α) * vector_score, where α is the weight on the lexical (BM25) score.

Critical requirement: Scores must be normalized to comparable ranges before combination. Vector similarity might range from 0.7 to 0.95, while BM25 scores might range from 5 to 50. Without normalization, the higher-magnitude scores dominate.

Normalization approaches:

  • Min-max scaling: Transform scores to [0, 1] range
  • Z-score normalization: Transform to zero mean and unit variance
  • Sigmoid scaling: Apply sigmoid function for soft bounds
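
As a sketch, min-max normalization followed by the weighted sum looks like this (α is the BM25 weight, matching the formula above; the helper names are ours):

```python
import numpy as np

def min_max_normalize(scores):
    """Scale raw scores to [0, 1]; all-equal inputs map to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return np.zeros_like(scores) if span == 0 else (scores - scores.min()) / span

def linear_combination(bm25_scores, vector_scores, alpha=0.3):
    """combined = alpha * BM25 + (1 - alpha) * vector, after min-max normalization."""
    return alpha * min_max_normalize(bm25_scores) + (1 - alpha) * min_max_normalize(vector_scores)

# Raw BM25 scores (roughly 5-50) and cosine similarities (roughly 0.7-0.95) are
# only comparable after normalization.
bm25_scores = [42.0, 12.0, 5.0]
vector_scores = [0.71, 0.95, 0.83]
print(linear_combination(bm25_scores, vector_scores, alpha=0.3))
```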

Strengths:

  • Higher potential accuracy when tuned
  • Score magnitudes provide additional signal
  • Flexible weighting based on query characteristics

Weaknesses:

  • Requires normalization (adds complexity and potential for errors)
  • Weights need tuning on labeled data
  • Sensitive to score distribution changes

When to use: Linear combination is worth the complexity when you have labeled data for tuning and maximum precision matters.

Learned Fusion

For sophisticated applications, machine learning can learn optimal fusion strategies.

Learning-to-rank: Train a model to predict relevance given features from multiple retrievers (scores, ranks, query characteristics). Models like LambdaMART or neural rankers can learn complex interactions.

Cross-encoder reranking: Pass query-document pairs through a cross-encoder that considers their interaction directly. More accurate than fusion but computationally expensive.

When to use: Learned fusion makes sense when you have substantial labeled data and retrieval quality directly impacts business metrics.


Query Routing

Not all queries benefit equally from hybrid search. Some queries are clearly lexical (searching for an exact error code), while others are clearly semantic (asking a conceptual question). Query routing sends different queries to different retrieval strategies.

Query Classification

Keyword-heavy queries: Contain specific identifiers, codes, or technical terms. Route primarily to lexical search.

  • "Error code 0x80070005"
  • "ProductID SKU-12345"
  • "function malloc() in C"

Conceptual queries: Ask about ideas, relationships, or explanations. Route primarily to vector search.

  • "How does memory management work?"
  • "What causes authentication failures?"
  • "Explain the difference between REST and GraphQL"

Mixed queries: Contain both specific terms and conceptual elements. Use full hybrid search.

  • "Why does malloc() fail with error 0x80070005?"
  • "SKU-12345 shipping issues"

Routing Strategies

Rule-based routing: Use heuristics based on query characteristics:

  • Presence of codes/identifiers → lexical
  • Question words (how, why, what) → semantic
  • Short queries with technical terms → lexical
  • Longer natural language queries → semantic

Classifier-based routing: Train a classifier to predict optimal retrieval strategy based on query features. Requires labeled data showing which strategy works best for which queries.

Adaptive weighting: Rather than routing exclusively, adjust fusion weights based on query type. Keyword-heavy queries get higher lexical weight; conceptual queries get higher vector weight.
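
A rule-based router can be as simple as a regex check plus a few heuristics. The sketch below returns an adaptive α (the BM25 weight) rather than routing exclusively; the patterns and thresholds are illustrative starting points, not tuned values:

```python
import re

# Illustrative patterns: hex codes like 0x80070005 and identifiers like TS-01 or SKU-12345.
CODE_PATTERN = re.compile(r"\b(0x[0-9a-fA-F]+|[A-Z]{2,}-\d+)\b")
QUESTION_WORDS = {"how", "why", "what", "explain"}

def route_alpha(query, default_alpha=0.4):
    """Return the BM25 weight alpha for the linear combination, based on simple heuristics."""
    tokens = query.lower().split()
    if CODE_PATTERN.search(query):
        return 0.7           # codes/identifiers present: lean lexical
    if tokens and tokens[0] in QUESTION_WORDS:
        return 0.2           # conceptual question: lean semantic
    if len(tokens) <= 3:
        return 0.6           # short technical queries tend to be lexical
    return default_alpha     # mixed or unclear: stay close to balanced hybrid

print(route_alpha("Error code 0x80070005"))              # 0.7
print(route_alpha("How does memory management work?"))   # 0.2
```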

Implementation Considerations

Fallback behavior: If the primary strategy returns poor results, fall back to hybrid search. Users shouldn't suffer from incorrect routing predictions.

Confidence thresholds: Only route with high confidence. Uncertain queries should use full hybrid search as the safe default.

Monitoring: Track routing decisions and outcomes. If a routed query type consistently underperforms, adjust the routing logic.


Three-Way Hybrid Search

Recent IBM research compared various retrieval combinations and found that three-way retrieval, combining BM25, dense vectors, and sparse vectors, produces the best results for RAG.

Dense vs. Sparse Vectors

Dense vectors: Standard embeddings from models like text-embedding-3. Every dimension has a non-zero value. Capture semantic meaning through distributed representation.

Sparse vectors: Most dimensions are zero, with non-zero values for specific features. Examples include SPLADE (learned sparse representations) and traditional TF-IDF. Excel at exact term matching while retaining some learned semantics.

The Three-Way Approach

Combining all three provides complementary strengths:

Method | Strength
BM25 | Exact term matching, well-understood
Dense vectors | Semantic understanding, conceptual matching
Sparse vectors | Learned term importance, hybrid characteristics

Fusion: Apply RRF across all three result sets. Documents appearing highly in multiple lists receive the highest combined scores.
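
Three-way fusion is the same RRF computation applied to three lists. Reusing the reciprocal_rank_fusion sketch from the RRF section above (document IDs are made up for illustration):

```python
bm25_hits = ["doc_7", "doc_2", "doc_9"]
dense_hits = ["doc_2", "doc_4", "doc_7"]
sparse_hits = ["doc_2", "doc_7", "doc_1"]

# doc_2 and doc_7 appear in all three lists, so they dominate the fused ranking.
print(reciprocal_rank_fusion([bm25_hits, dense_hits, sparse_hits], k=60))
```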

When Three-Way Makes Sense

Three-way hybrid adds complexity. It's worth considering when:

  • Maximum retrieval quality is critical
  • You have diverse query types (some lexical, some semantic)
  • Your documents contain both precise terminology and conceptual content
  • You've already optimized two-way hybrid and need further improvement

For most applications, two-way hybrid (BM25 + dense vectors) provides excellent results with less complexity.


Reranking

Hybrid search retrieves candidates; reranking improves their ordering. The two-stage approach—retrieve many, rerank few—enables using expensive but accurate models efficiently.

Cross-Encoder Reranking

Cross-encoders process query-document pairs jointly, enabling deep interaction modeling. They're much more accurate than bi-encoder similarity but too slow for initial retrieval.

How it works: Retrieve top 50-100 documents with hybrid search. Pass each (query, document) pair through a cross-encoder that outputs a relevance score. Reorder by cross-encoder scores.

Models: Cohere Rerank, Jina Reranker, BGE-reranker, ColBERT. Each offers different quality-speed tradeoffs.

Improvement: Cross-encoder reranking with hybrid retrieval yields substantial improvement over hybrid alone, particularly for complex queries where initial ranking is noisy.
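
A reranking sketch using the sentence-transformers CrossEncoder class; the specific checkpoint is just one common lightweight choice, and in practice the candidates come from hybrid retrieval:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder checkpoint works; ms-marco-MiniLM is a common lightweight choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, documents, top_k=10):
    """Score each (query, document) pair jointly and return the top_k documents."""
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# In practice, candidates are the top 50-100 documents from hybrid retrieval.
candidates = [
    "Doc describing auth token expiry and renewal...",
    "Doc mentioning error TS-01 in the authentication module...",
    "Unrelated doc about shipping policies...",
]
for doc, score in rerank("error TS-01 in authentication module", candidates, top_k=2):
    print(round(float(score), 3), doc)
```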

ColBERT

ColBERT (Contextualized Late Interaction over BERT) provides a middle ground between bi-encoders (fast) and cross-encoders (accurate).

How it works: ColBERT encodes queries and documents separately (like bi-encoders) but retains token-level embeddings. Relevance is computed through token-level interaction, capturing more nuance than single-vector similarity.

Advantages: Faster than cross-encoders (documents can be pre-computed), more accurate than bi-encoders, supports efficient retrieval and reranking.

Use case: ColBERT works as both retriever and reranker. As a reranker, it's more efficient than cross-encoders while capturing fine-grained relevance.

Reranking Best Practices

Retrieve generously: Retrieve 3-5x more documents than you'll ultimately return. Reranking can't surface documents that weren't retrieved.

Rerank selectively: Reranking is expensive. Apply it to the top candidates (50-100), not thousands.

Cache reranker results: For repeated queries, cache reranking outcomes.

Latency budget: Reranking adds latency (typically 50-200ms). Ensure total latency meets user expectations.

Cross-Encoder Model Comparison

Model | Languages | Latency (50 docs) | Quality (NDCG) | Best For
Cohere Rerank v3 | 100+ | ~150ms | Excellent | Production, multilingual
Jina Reranker v2 | 100+ | ~120ms | Very Good | Self-hosted, multilingual
BGE-reranker-v2-m3 | 100+ | ~100ms | Very Good | Open source, self-hosted
ms-marco-MiniLM | English | ~50ms | Good | Low-latency, English-only
ColBERT v2 | English | ~80ms | Very Good | Hybrid retriever+reranker

Notes: Latencies are approximate for 50 document pairs on GPU. Quality varies by domain—benchmark on your data.


Tuning Examples

Example 1: E-commerce Search

Scenario: Users search with product names, codes, and descriptive queries.

Configuration:

  • α = 0.3 (30% BM25, 70% vector) for general queries
  • α = 0.7 (70% BM25, 30% vector) when query contains product codes
  • Query routing: Detect product codes via regex, adjust α dynamically

Results: 23% improvement in click-through rate vs. pure vector search.

Example 2: Technical Documentation

Scenario: Developers search for error messages, function names, and conceptual explanations.

Configuration:

  • Three-way hybrid: BM25 + dense vectors + SPLADE sparse vectors
  • RRF fusion with k=60
  • Cross-encoder reranking on top 50 results

Results: 31% improvement in MRR for error message queries (exact terms critical), 18% improvement for conceptual queries.

Example 3: Legal Research

Scenario: Lawyers search for case law, statutes, and legal concepts.

Configuration:

  • α = 0.5 (balanced) as default
  • Phrase-aware BM25 for exact legal phrases
  • Domain-fine-tuned embeddings (legal-BERT based)
  • Reranking with BGE-reranker

Results: 27% improvement in recall@10, critical for ensuring no relevant precedents are missed.

Tuning α: A Systematic Approach

When tuning the linear combination weight α:

  1. Grid search: Test α values from 0.0 to 1.0 in increments of 0.1
  2. Evaluate on labeled data: For each α, measure recall@k and NDCG
  3. Identify optimal range: Usually a range of α values perform similarly well
  4. Consider query types: Different query types may have different optimal α values
  5. Implement dynamic routing: If patterns emerge, route queries to different α values based on query characteristics
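
A sketch of steps 1-3, assuming a retrieve_fn(query, alpha) callable that returns ranked document IDs and an evaluation set of (query, relevant_ids) pairs (both are placeholders you would supply):

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def grid_search_alpha(eval_queries, retrieve_fn, k=10):
    """eval_queries: (query, relevant_ids) pairs; retrieve_fn(query, alpha) -> ranked IDs."""
    results = {}
    for alpha in np.round(np.arange(0.0, 1.01, 0.1), 1):
        recalls = [recall_at_k(retrieve_fn(query, alpha), relevant, k)
                   for query, relevant in eval_queries]
        results[float(alpha)] = sum(recalls) / len(recalls)
    best_alpha = max(results, key=results.get)
    return best_alpha, results
```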

Common findings:

  • Keyword-heavy queries: α closer to 1.0 (more BM25)
  • Conceptual queries: α closer to 0.0 (more vector)
  • Mixed queries: α around 0.3-0.5 performs well

Production Best Practices

Index Configuration

Separate indexes: Maintain separate indexes for lexical and vector search. This allows independent optimization and scaling.

Consistent document IDs: Ensure the same document has the same ID across indexes. Fusion requires mapping results to the same documents.

Synchronization: Keep indexes in sync. A document added to the vector index must also appear in the lexical index (and vice versa for deletions).

Performance Optimization

Parallel retrieval: Execute lexical and vector search in parallel. Fusion happens after both complete. Total latency is max(lexical, vector) + fusion, not sum.
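
A minimal sketch of parallel retrieval with a thread pool; lexical_search, vector_search, and fuse are placeholders for your own functions:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_retrieve(query, lexical_search, vector_search, fuse, k=50):
    """Run both retrievers concurrently, then fuse the two ranked lists.

    Total latency is roughly max(lexical, vector) plus fusion time.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical_future = pool.submit(lexical_search, query, k)
        vector_future = pool.submit(vector_search, query, k)
        return fuse([lexical_future.result(), vector_future.result()])
```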

Result limits: Retrieve enough candidates for quality fusion (50-100 per method) but not so many that fusion becomes slow.

Caching: Cache both individual retriever results and fused results for repeated queries.

Tuning Workflow

  1. Establish baseline: Measure pure vector and pure lexical performance on your evaluation set.

  2. Implement RRF: Start with RRF (k=60). Measure improvement over best single method.

  3. Tune weights: If you have labeled data, experiment with linear combination and varying weights. Optimize α on a validation set.

  4. Add reranking: If retrieval quality is still insufficient, add cross-encoder reranking on top candidates.

  5. Consider three-way: If still needed, add sparse vectors for three-way hybrid.

  6. Monitor continuously: Track retrieval metrics in production. Query patterns change, and tuning may need adjustment.

Evaluation Metrics

Recall@k: Primary metric—what fraction of relevant documents appear in top k results?

MRR: Mean Reciprocal Rank—how highly is the first relevant document ranked?
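
For reference, a straightforward MRR implementation (the helper name and sample data are ours):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document per query (0 if none retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# First query: first relevant doc at rank 2 (RR = 0.5); second query: at rank 1 (RR = 1.0).
print(mean_reciprocal_rank([["d3", "d1"], ["d7", "d2"]], [{"d1"}, {"d7"}]))  # 0.75
```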

NDCG: Normalized Discounted Cumulative Gain—comprehensive metric accounting for graded relevance and position.

Latency: P50 and P95 retrieval latency. Hybrid search is slower than single-method; ensure it's still acceptable.


Implementation Patterns

Vector Database Native Hybrid

Most vector databases now support hybrid search natively:

Weaviate: Excellent hybrid search with configurable alpha weighting between BM25 and vector components.

Qdrant: Supports combining vector search with full-text search through its filtering and query APIs.

Pinecone: Native sparse-dense hybrid search combining dense vectors with sparse (BM25-style) representations.

Milvus: Supports hybrid search through its query interface with multiple vector fields.

Using native hybrid search simplifies architecture—one database handles both retrieval methods.
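
As an illustration, a hybrid query with the Weaviate Python client v4 might look like this (the "Docs" collection is assumed to exist, and the exact API can differ across client versions):

```python
import weaviate  # Weaviate Python client v4

client = weaviate.connect_to_local()
docs = client.collections.get("Docs")  # assumes this collection already exists

# Note: Weaviate's alpha weights the vector component (alpha=1.0 is pure vector,
# alpha=0.0 is pure BM25), the reverse of the alpha convention used earlier.
response = docs.query.hybrid(query="error TS-01 in authentication module",
                             alpha=0.5, limit=10)
for obj in response.objects:
    print(obj.properties)

client.close()
```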

External Fusion

For more control, maintain separate systems and fuse externally:

Architecture:

  • Elasticsearch/OpenSearch for lexical search
  • Vector database for semantic search
  • Application layer for fusion

Advantages:

  • Optimize each system independently
  • Use best-in-class for each retrieval type
  • More flexibility in fusion logic

Disadvantages:

  • More infrastructure to maintain
  • Synchronization complexity
  • Higher latency from multiple round-trips

LangChain/LlamaIndex Integration

Both frameworks provide hybrid search abstractions:

LangChain: EnsembleRetriever combines multiple retrievers with configurable weights. Supports any retriever implementing the base interface.

LlamaIndex: QueryFusionRetriever implements various fusion strategies including RRF. Integrates with their node/index abstractions.

These abstractions simplify implementation but may limit advanced customization.
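
A hedged LangChain sketch (import paths move between LangChain versions, BM25Retriever requires the rank_bm25 package, and the document content and weights here are ours):

```python
from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

def build_hybrid_retriever(vector_retriever, documents):
    """vector_retriever: any LangChain retriever backed by your vector store,
    e.g. vector_store.as_retriever(search_kwargs={"k": 20})."""
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = 20
    # EnsembleRetriever fuses the component rankings with weighted Reciprocal Rank Fusion.
    return EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.4, 0.6],
    )

documents = [Document(page_content="Error TS-01 occurs when the auth token expires.")]
# hybrid = build_hybrid_retriever(my_vector_retriever, documents)
# results = hybrid.invoke("error TS-01 in authentication module")
```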



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
