Retrieval

Intro

Retrieval is the stage that decides what evidence enters the prompt. In most RAG systems, generation quality plateaus at the quality of retrieval — no prompt engineering or model upgrade compensates for missing or wrong context. The goal is to balance recall (find everything relevant), precision (exclude everything irrelevant), and latency across query types: semantic paraphrases, exact identifiers, and multi-constraint requests.

The mechanism: the user query is converted into one or more search representations — a vector (for semantic search), a set of weighted terms (for keyword search), or both. The vector is matched against pre-indexed chunk vectors in a vector database. The terms are matched via a keyword index (BM25). The top-k candidates from one or both paths are fused into a single ranked list and passed to reranking or directly to the generator.

```mermaid
flowchart LR
    Q[User query] --> D[Vector search]
    Q --> S[Keyword search]
    D --> F[Fuse results]
    S --> F
    F --> R[Top-k candidates]
    R --> N[Reranker or generator]
```
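The flow in the diagram can be sketched end to end. This is a toy illustration only: the character-count `embed()`, the whitespace keyword scorer, the tiny corpus, and the fusion constant are stand-in assumptions, not a real embedding model or search engine.

```python
def embed(text):
    # Stand-in for a real embedding model: a bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query, corpus, k):
    # Semantic path: nearest neighbors of the query vector.
    qv = embed(query)
    scored = [(cosine(qv, embed(doc)), doc_id) for doc_id, doc in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

def keyword_search(query, corpus, k):
    # Lexical path: rank by exact-token overlap with the query.
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc_id)
              for doc_id, doc in corpus.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True)[:k] if s > 0]

def fuse(ranked_a, ranked_b, k=60):
    # Reciprocal Rank Fusion over the two candidate lists.
    scores = {}
    for ranked in (ranked_a, ranked_b):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Either path alone can serve the generator; fusing them is what covers the two failure modes described next.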

Example: a user asks "rate limit error 429 behavior in partner tier." Vector search captures the semantic intent — rate limiting behavior — but may miss the exact token 429. Keyword search catches 429 and partner tier via exact word match but misses semantically related content phrased differently. Running both in parallel and fusing results covers both failure modes. Though hybrid is the safe default, it is not universally better — see Pitfalls.

Retrieval Modes

Dense Retrieval — Vector Search

How it works: the query and every chunk are mapped into one shared vector space by the same embedding model; retrieval is a nearest-neighbor lookup over the pre-indexed chunk vectors, typically through an approximate index such as HNSW.

Where it fits: natural-language questions and semantic paraphrases over homogeneous corpora, where wording varies but meaning is stable.

Main risk: exact identifiers (error codes, ticket numbers, function names) are poorly represented in the embedding, and the resulting misses look plausible rather than obviously wrong.

Sparse Retrieval — Keyword Search (BM25)

If you have used full-text search in PostgreSQL (to_tsvector/to_tsquery) or Elasticsearch, BM25 is the same core idea — it is the algorithm behind those search indexes. Think of it as grep with ranking: it finds documents containing your exact words, then sorts them by how distinctive those words are in the corpus. A query for NullReferenceException in a .NET codebase hits exactly the files that contain that string, and files where it appears in a focused context (short file, rare term) rank above a 10,000-line log dump that happens to mention it once.

How it works: each document is scored by how often the query terms occur in it (term frequency), discounted by how common each term is across the corpus (inverse document frequency) and by document length, so a rare term in a short document outranks the same term buried in a long one.

Where it fits: identifier-heavy domains with a stable vocabulary (error codes, SKUs, API names, legal citations) where queries contain the exact tokens that appear in the target documents.

Main risk: zero recall on paraphrases; if the query says "too many requests" and the document says "rate limit exceeded," BM25 finds nothing.
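The ranking idea can be shown in plain Python. This is a minimal sketch of the BM25 formula with common default parameters (k1 = 1.5, b = 0.75), assuming naive whitespace tokenization; real engines add analyzers, stemming, and an inverted index.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Score every doc against the query; higher = more relevant.
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # docs containing term
            if df == 0:
                continue
            # Rare terms get a larger IDF weight.
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            # Length normalization: long docs need more occurrences to rank.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Note how the length-normalization term reproduces the behavior described above: a short file mentioning the term once outranks a long dump mentioning it once.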

Hybrid Retrieval — Vector + Keyword

Hybrid retrieval is like running both a full-text search (WHERE body @@ to_tsquery('error & 429')) and a vector similarity search against the same query, then merging the two result sets. RRF merges by rank position — like taking two independently sorted result lists and boosting any item that appears near the top of both. If document A is #2 in vector search and #5 in keyword search, while document B is #1 in keyword search but #200 in vector search, RRF ranks A higher because both retrievers agree it is relevant. Linear combination is the same idea but lets you explicitly set how much you trust each retriever — like a weighted UNION ALL with a tunable ratio.

How it works: the vector and keyword searches run in parallel on the same query; their ranked lists are fused either by Reciprocal Rank Fusion, which uses only rank positions, or by a linear combination of normalized scores with a tunable weight (alpha).

Where it fits: mixed query patterns that span semantic paraphrases and exact identifiers; the default for most production systems.

Main risk: two indexes to build, operate, and keep consistent, plus fusion weights that must be tuned and re-evaluated per domain.
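The document-A-versus-document-B example above can be checked numerically. A minimal sketch using the conventional RRF constant k = 60; the rank values and the alpha default are illustrative assumptions.

```python
def rrf_scores(rankings, k=60):
    # rankings: one {doc_id: 1-based rank} mapping per retriever.
    scores = {}
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def linear_combo(vec_score, kw_score, alpha=0.5):
    # Weighted sum of per-retriever scores (assumes both normalized to [0, 1]).
    return alpha * vec_score + (1 - alpha) * kw_score

scores = rrf_scores([{"A": 2, "B": 200}, {"A": 5, "B": 1}])
# A: 1/62 + 1/65 ≈ 0.0315 beats B: 1/61 + 1/260 ≈ 0.0202,
# because both retrievers place A near the top.
```

RRF needs no score normalization because it ignores raw scores entirely; the linear combination needs normalized inputs but gives you the explicit trust knob.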

Indexing and Filtering

The vector index determines the latency-recall tradeoff for vector search: a flat (exact) index guarantees full recall but scans every vector; graph-based indexes such as HNSW answer in milliseconds at the cost of approximate results, with recall governed by parameters like ef_search; IVF-style indexes probe only a subset of clusters, trading further recall for speed.

Metadata filtering is equally critical: pre-filtering (applying the filter before or during the vector search) preserves recall but requires index support, while post-filtering (discarding non-matching candidates afterward) can silently return far fewer than k results when most neighbors fail the filter.

Pitfalls

Silent Recall Degradation at Scale

HNSW recall degrades as the corpus grows — no errors, no latency spike, just worse context fed to the LLM. At a fixed ef_search value, the index becomes less accurate as more vectors crowd the space. Infrastructure dashboards show healthy metrics while answer quality silently declines. Long-tail and rare-entity queries degrade first.

Detection: maintain ground-truth query-chunk pairs and run Recall@k checks on a schedule. Latency and error-rate monitoring alone will not catch recall regression.
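Such a check fits in a few lines. A minimal sketch, assuming ground truth is a list of (query, relevant-chunk-ids) pairs and `retrieve` is whatever retrieval function your pipeline exposes:

```python
def recall_at_k(ground_truth, retrieve, k=10):
    # ground_truth: list of (query, set_of_relevant_chunk_ids) pairs.
    # retrieve: fn(query, k) -> ranked list of chunk ids.
    hits = total = 0
    for query, relevant in ground_truth:
        retrieved = set(retrieve(query, k))
        hits += len(relevant & retrieved)   # relevant chunks actually surfaced
        total += len(relevant)
    return hits / total if total else 0.0
```

Run it on a schedule and alert when the number drifts below a baseline; a healthy latency dashboard will not show this regression.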

Embedding Model Migration Debt

Embedding models produce incompatible vector spaces. Upgrading models means re-embedding the entire corpus — you cannot query a new model's vectors against an old model's index. At scale, this means parallel infrastructure costs, downtime risk, and potential regression even when benchmark scores improve. API providers (like OpenAI) can deprecate models on their schedule, forcing emergency re-embedding.

Mitigation: treat embedding model selection as a long-term infrastructure decision. Store the model version alongside each vector. Set upgrade thresholds based on domain-specific metrics, not MTEB deltas. Use collection aliases and shadow traffic to validate before cutover.
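The store-the-model-version advice can be made concrete. A sketch only: the model name `acme-embed-v2` and the record shape are hypothetical, not any specific database's schema.

```python
EMBEDDING_MODEL = "acme-embed-v2"  # hypothetical model identifier

def make_record(chunk_id, text, vector, model=EMBEDDING_MODEL):
    # Persist the model version next to every vector so mismatches are detectable.
    return {"id": chunk_id, "text": text, "vector": vector, "embedding_model": model}

def assert_compatible(records, query_model):
    # Refuse to search when stored vectors came from a different model:
    # querying a new model's vector against an old model's index fails silently.
    stale = [r["id"] for r in records if r["embedding_model"] != query_model]
    if stale:
        raise ValueError(f"re-embed before querying; stale chunks: {stale}")
```

A guard like this turns a silent relevance collapse into a loud, diagnosable error during migration.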

Aggregate Metrics Hiding Segment Failures

Overall recall of 70% can mask 5% recall on the query types that matter most (multi-hop, date-filtered, identifier-heavy). Without segmentation, you cannot distinguish inventory failures (data missing from corpus) from capability failures (data exists but retrieval cannot surface it).

Detection: segment retrieval metrics by query type, tenant, locale, and domain. Alert on per-segment degradation, not just aggregate.
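The segmentation itself is mostly bookkeeping. A sketch that tallies recall per segment from already-evaluated queries; the segment names and counts are illustrative:

```python
from collections import defaultdict

def recall_by_segment(evaluations):
    # evaluations: (segment, hits, relevant_count) per evaluated query.
    agg = defaultdict(lambda: [0, 0])
    for segment, hits, relevant in evaluations:
        agg[segment][0] += hits
        agg[segment][1] += relevant
    return {seg: h / t for seg, (h, t) in agg.items() if t}

per_segment = recall_by_segment([("semantic", 68, 80), ("identifier", 2, 20)])
# Aggregate recall is (68 + 2) / 100 = 0.70, yet the identifier segment sits at 0.10.
```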

Vector Search Failing Silently on Identifiers

Vector search returns topically related but operationally wrong chunks for identifier-heavy queries. Unlike keyword search misses that return obviously unrelated content, vector search misses look plausible — the LLM synthesizes a confident answer from wrong evidence.

Mitigation: use hybrid retrieval for identifier-heavy corpora. Explicitly test retrieval on identifier-based queries during evaluation. If vector search is responsible for most failures in your pipeline, inspect fusion weights — identifier-heavy domains often need higher keyword weight.

Tradeoffs

| Mode | Recall profile | Latency | Operational complexity | Best for |
|---|---|---|---|---|
| Vector only | Strong on semantic paraphrases; weak on exact identifiers | Low (single index lookup) | Moderate (embedding model and vector database required) | Homogeneous semantic corpora with natural-language queries |
| Keyword only (BM25) | Strong on exact terms; weak on paraphrases | Lowest (keyword index lookup) | Low (no embedding model or vector database) | Identifier-heavy domains with stable vocabulary |
| Hybrid (RRF) | Broad; covers semantic and lexical queries | Moderate (two parallel searches plus fusion) | Higher (two indexes and fusion logic) | Mixed query patterns; default for most production systems |
| Hybrid (linear combination) | Tunable; weight toward dominant search mode | Moderate (same as RRF) | Highest (requires alpha tuning per domain) | When one search mode is consistently stronger and you want explicit weighting |

Decision rule: start with hybrid retrieval (RRF) and conservative top-k (5-20). Evaluate against single-mode baselines on your actual corpus and query distribution — hybrid is the safe default but not always the winner. Add reranking only after baseline retrieval is stable and precision at the top of the ranked list is the dominant error mode.

Questions

References


What's next