Hybrid Search: Combining Lexical and Semantic Retrieval for Better Search Results

An educational overview of hybrid search architecture combining BM25, dense vectors, neural reranking, and diversity optimization—exploring the concepts and trade-offs behind modern information retrieval systems.

Modern search systems face a fundamental tension: lexical matching (finding documents with the exact words a user types) versus semantic understanding (finding documents that mean what the user intends). Neither approach alone is sufficient for complex domains, where users might search using different terminology than what appears in documents, or where queries need results spanning multiple topic areas.

This post explores the concepts behind hybrid search architecture, combining multiple retrieval strategies with neural reranking and diversity optimization.

Why Hybrid Search?

Traditional keyword search excels at precision—when users know exactly what terms to search for, lexical matching delivers accurate results quickly. But it fails when:

  • Users search with different vocabulary than the documents contain (synonyms, colloquialisms)
  • The query expresses an intent rather than specific terms
  • Relevant documents discuss the concept without using query keywords

Semantic search with embeddings solves these problems by matching on meaning rather than exact words. But it has its own weaknesses:

  • Can miss documents that use the exact query terms prominently
  • May return semantically related but not directly relevant content
  • Embedding models have knowledge cutoffs and domain limitations

Hybrid search combines both approaches, getting the best of both worlds.

The Hybrid Retrieval Pipeline

A typical hybrid search pipeline operates in stages:

User Query
    │
    ▼
┌────────────────────────────┐
│  Query Processing          │
│  (Embedding generation)    │
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│  Parallel Retrieval        │
│  ┌──────────┐ ┌──────────┐ │
│  │  BM25    │ │  Vector  │ │
│  │  Search  │ │  Search  │ │
│  └──────────┘ └──────────┘ │
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│  Score Fusion (RRF)        │
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│  Neural Reranking          │
│  (Optional but recommended)│
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│  Diversity Optimization    │
│  (MMR or similar)          │
└────────────────────────────┘
    │
    ▼
Final Results

Let's explore each stage.

Query Embedding Generation

The first step transforms the user's query into a dense vector representation. Most embedding APIs support different "task types" that optimize the embedding for its intended use:

  • RETRIEVAL_QUERY: Optimized for the query side of asymmetric retrieval
  • RETRIEVAL_DOCUMENT: Optimized for the document/corpus side

This asymmetric approach recognizes that queries are typically short and intent-focused, while documents are longer and information-dense. Using task-specific embeddings improves retrieval quality.

Popular embedding services include Google's Vertex AI text embeddings, OpenAI's embedding API, and open-source models like Sentence Transformers. The key is using the same model for both indexing documents and embedding queries.

The Lexical Component: BM25

BM25 (Best Match 25) remains the gold standard for lexical retrieval. It scores documents based on:

  • Term frequency (TF): How often query terms appear in a document
  • Inverse document frequency (IDF): How rare query terms are across the corpus
  • Document length normalization: Adjusting for document size
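
The three signals above can be sketched in a few lines of Python. This is a minimal, illustrative BM25 over pre-tokenized documents (using the Lucene-style non-negative IDF), not a tuned implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with classic BM25."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)          # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rarer terms weigh more
        tf = doc_terms.count(term)                        # term frequency
        norm = 1 - b + b * len(doc_terms) / avg_len       # document length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score
```

Real engines layer field boosting, phrase matching, and fuzziness on top of this core formula.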

Search engines like Elasticsearch implement BM25 natively. Key configuration decisions include:

Field boosting: Matches in titles typically indicate higher relevance than matches in body text. Most systems weight title matches 2-3x higher than content matches.

Multi-field search: Queries should search across multiple fields (title, abstract, content, metadata) with appropriate weights.

Phrase matching: Exact phrase matches are strong relevance signals. A query like "high blood pressure" matching that exact phrase should score higher than documents with those words scattered throughout.

Fuzziness handling: Typo tolerance improves user experience but requires careful tuning to avoid false matches.

The Semantic Component: Vector Search

Vector search finds documents whose embeddings are closest to the query embedding, typically scored with cosine similarity or dot product.

Modern search engines support approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World) that make vector search practical at scale. These algorithms trade perfect accuracy for speed—you might miss some relevant documents, but queries return in milliseconds rather than seconds.

Key considerations for vector search:

Embedding dimensions: Models offer different dimensionality (384, 768, 1536, 3072). Higher dimensions capture more nuance but require more storage and compute. Many practitioners find 768 dimensions sufficient for most use cases.

Hierarchical embeddings: Rather than embedding entire documents as single vectors, consider embedding at multiple levels—document-level for broad matching, section-level or paragraph-level for targeted retrieval. This is particularly valuable for long documents where different sections address different topics.

Candidate count: ANN algorithms use a "num_candidates" parameter controlling how many candidates to consider. Higher values improve recall at the cost of latency.
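
As a reference point for what ANN approximates, here is exact brute-force nearest-neighbor search over a small in-memory collection; an HNSW index returns (approximately) the same ranking orders of magnitude faster at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn(query_vec, doc_vecs, k=10):
    """Exact nearest-neighbor search by brute force; ANN indexes approximate this."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```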

Combining Scores: Reciprocal Rank Fusion

BM25 and vector similarity scores exist on different scales and distributions. Raw score combination doesn't work well. Reciprocal Rank Fusion (RRF) solves this by combining rankings rather than scores.

The RRF formula:

RRF_score(d) = Σ_i 1/(k + rank_i(d))

Where:

  • k is a constant (typically 60)
  • rank_i(d) is the rank of document d in result list i

Documents appearing high in both BM25 and vector rankings get substantial score boosts. Documents appearing high in only one ranking still surface but with lower combined scores.
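
In code, the fusion step reduces to a few lines. A sketch, operating on ranked lists of document ids:

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with RRF; no score normalization needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked first in both lists contributes 2/(k+1); a document appearing in only one list still enters the fused ranking, just lower down.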

Elasticsearch 8.x+ supports RRF natively through the rank parameter, making it straightforward to implement hybrid queries without custom score combination logic.

The beauty of RRF is its simplicity—no score normalization required, and it's robust to the different score distributions of different retrieval methods.

Neural Reranking

Initial retrieval (BM25 + vectors) optimizes for recall—getting relevant documents into the candidate set. Neural reranking optimizes for precision—putting the most relevant documents at the top.

Rerankers use cross-encoder architecture, examining query-document pairs jointly rather than encoding them independently. This enables deeper semantic understanding at the cost of computational expense.

Cross-encoders can understand nuanced relevance that bi-encoder retrieval might miss. For example, understanding that a document about "medication side effects" is relevant to a query about "what to expect after starting treatment" even without lexical overlap.

Cost-performance tradeoff: Reranking is computationally expensive. The standard approach is cascade architecture:

  1. Retrieve many candidates quickly (50-100 documents)
  2. Rerank only the top candidates (10-20 documents)
  3. Return the reranked top results

This achieves near-full-reranking quality while keeping latency manageable.
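
The cascade can be sketched as a small orchestration function. Here `retrieve` and `rerank_score` are assumed callables standing in for the hybrid retriever and a cross-encoder:

```python
def cascade_search(query, retrieve, rerank_score,
                   n_retrieve=100, n_rerank=20, n_final=10):
    """Retrieve broadly, rerank only the head of the list, return the top results.

    retrieve(query, n) -> ranked list of candidate docs (e.g. fused BM25 + vector)
    rerank_score(query, doc) -> float relevance from a cross-encoder
    """
    candidates = retrieve(query, n_retrieve)
    head, tail = candidates[:n_rerank], candidates[n_rerank:]
    # Only the head pays the cross-encoder cost; the tail keeps its retrieval order.
    reranked = sorted(head, key=lambda d: rerank_score(query, d), reverse=True)
    return (reranked + tail)[:n_final]
```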

Services like Google's Vertex AI Discovery Engine, Cohere Rerank, and open-source models like ColBERT provide reranking capabilities. The choice depends on your latency requirements, cost constraints, and quality needs.

Diversity with Maximal Marginal Relevance

Sometimes the most relevant results are redundant—multiple documents covering the same aspect of a topic. For comprehensive information needs, result diversity matters.

Maximal Marginal Relevance (MMR) balances relevance against diversity:

MMR_score(d) = λ * Relevance(d, q) - (1-λ) * max(Similarity(d, d') for d' in selected)

At each step, MMR selects the document that best balances being relevant to the query while being different from already-selected documents.

The λ parameter controls the tradeoff:

  • λ = 1.0: Pure relevance (no diversity consideration)
  • λ = 0.5: Equal weight to relevance and diversity
  • λ = 0.7: Prioritize relevance but penalize redundancy (common default)

For broad queries where users need comprehensive information, lower λ values ensure variety. For precise queries where users want the single best answer, higher λ values are appropriate.
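
The greedy selection loop is short enough to sketch directly. Here `relevance` maps document ids to query-relevance scores and `similarity` is an assumed pairwise-similarity callable:

```python
def mmr_select(relevance, similarity, lam=0.7, n=5):
    """Greedily pick n documents balancing query relevance against redundancy.

    relevance: dict mapping doc id -> relevance score for the query
    similarity: callable (doc_a, doc_b) -> pairwise similarity in [0, 1]
    """
    candidates = set(relevance)
    selected = []
    while candidates and len(selected) < n:
        def mmr_score(d):
            # Penalize similarity to whatever is already selected.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```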

Geographic Search

Many search applications need location awareness—finding providers, stores, or services near a user's location.

Search engines support geo queries through:

  • Distance filtering: Only return results within a specified radius
  • Distance decay scoring: Rank closer results higher using functions like Gaussian decay
  • Distance band aggregations: Group results by distance (within 5km, 5-15km, etc.)

Geographic scoring can be combined with text relevance, typically through function_score queries that multiply text relevance by geographic relevance. This produces results that balance being close and being relevant.
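
As a sketch of the decay-and-multiply pattern, here is a Gaussian decay calibrated so the score falls to `decay` at `scale_km` (the same behavior Elasticsearch's gauss decay targets, though the parameterization here is my own):

```python
import math

def gaussian_decay(distance_km, scale_km=10.0, decay=0.5):
    """Returns 1.0 at zero distance, `decay` at `scale_km`, falling off smoothly."""
    sigma2 = scale_km ** 2 / math.log(1.0 / decay)
    return math.exp(-(distance_km ** 2) / sigma2)

def geo_text_score(text_score, distance_km):
    # Multiply text relevance by geographic relevance, as a function_score query would.
    return text_score * gaussian_decay(distance_km)
```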

Performance Considerations

Hybrid search adds complexity compared to simple keyword search. Typical latency breakdown:

Stage               Typical latency
Query embedding     50-100ms
Hybrid retrieval    30-80ms
Neural reranking    150-400ms
MMR computation     10-30ms
Total               250-600ms

Optimization strategies:

Caching: Query embeddings for common queries can be cached. Result caching with short TTL helps for popular queries.

Selective reranking: Skip reranking for queries where initial retrieval confidence is high, or for simple navigational queries.

Async architecture: In some applications, showing initial results quickly and refining with reranking provides better perceived performance.
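
For the embedding cache, Python's standard library is often enough. A minimal sketch, where `embed_query` is a toy stand-in for a real embedding call:

```python
from functools import lru_cache

def embed_query(query):
    """Stand-in for a real embedding call (normally a model or API request)."""
    return [float(len(query)), float(query.count(" "))]  # toy vector, not a real embedding

@lru_cache(maxsize=10_000)
def cached_query_embedding(query):
    # lru_cache keys on the exact query string; a tuple keeps the result hashable.
    return tuple(embed_query(query))
```

Repeated identical queries then skip the embedding call entirely; production systems typically use a shared cache (e.g. Redis) with a short TTL instead of a per-process one.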

Key Takeaways

Building effective hybrid search requires thoughtful combination of retrieval strategies:

  1. Hybrid retrieval (BM25 + vectors) consistently outperforms either approach alone. Lexical matching catches exact terminology; semantic matching understands intent.

  2. Reciprocal Rank Fusion provides a principled way to combine rankings without score normalization—elegant and effective.

  3. Neural reranking is worth the latency cost for information-seeking queries. The quality improvement from cross-encoder scoring is substantial.

  4. MMR diversity prevents result redundancy, important for broad queries where users need comprehensive information.

  5. Cascade architecture (retrieval → reranking → diversity) keeps latency manageable while achieving high quality.

  6. Geographic queries integrate naturally with semantic search for location-aware applications.

These concepts form the foundation for modern information retrieval systems. The specific implementation details vary by platform and use case, but the architectural patterns are broadly applicable.


For implementation details, consult the official documentation for your search platform. Elasticsearch's hybrid search documentation and RRF documentation are excellent starting points.

Note: The patterns discussed here are intentionally generalized, drawn from industry experience but presented as transferable concepts rather than specific proprietary implementations.