Re-ranking
Intro
Re-ranking is a second-stage scoring pass that takes the candidate set from retrieval and reorders it using a more expensive, more accurate model before context reaches the generator. Retrieval optimizes for recall at speed — find plausible candidates from millions of chunks. Re-ranking optimizes for precision — push the most relevant candidates to the top of a small list.
The mechanism: first-stage retrieval (dense, sparse, or hybrid) returns a candidate set of 20–100 chunks ranked by approximate similarity. The reranker then scores each candidate against the query using a model that can read both query and document together (joint encoding), producing a more accurate relevance score. The reranked top-k goes to the generator.
sequenceDiagram
participant Q as Query
participant R as First-Stage Retrieval
participant RR as Reranker
participant G as Generator
Q->>R: Retrieve top-N candidates
R->>RR: N candidates for rescoring
Note over RR: Score each candidate jointly with query
RR->>G: Top-k reranked chunks

Example: a hybrid retrieval returns 50 candidates for "what are the SLA penalties for tier-2 partners." Ten candidates mention SLAs generally, three mention tier-2 specifically, and the rest are noise about partner onboarding. A cross-encoder reranker reads each candidate alongside the query and pushes the three tier-2 SLA documents to positions 1–3, where the generator uses them. Without reranking, the generator might receive mostly generic SLA content and produce a vague answer.
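The retrieve-then-rerank flow can be sketched in Python. The scoring functions here are stand-ins (token overlap and Jaccard similarity) so the sketch runs on its own; in a real pipeline the second stage would call a cross-encoder model, for example via the sentence-transformers library, in place of `rerank_score`.

```python
# Sketch of a two-stage retrieve-then-rerank pipeline.
# Both scoring functions are stand-ins for real models: the first stage
# would normally be dense/sparse/hybrid retrieval, and `rerank_score`
# would be joint cross-encoder inference per (query, document) pair.

def first_stage_retrieve(query, corpus, n=50):
    """Cheap first stage: rank all docs by shared-token count (recall-oriented)."""
    q_tokens = set(query.lower().split())
    scored = [(len(q_tokens & set(d.lower().split())), d) for d in corpus]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored[:n]]

def rerank_score(query, doc):
    """Stand-in relevance score; replace with cross-encoder inference."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / (len(q | d) or 1)  # Jaccard overlap

def rerank(query, candidates, k=3):
    """Second stage: rescore each candidate jointly with the query (precision-oriented)."""
    scored = [(rerank_score(query, d), d) for d in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored[:k]]

corpus = [
    "SLA penalties for tier-2 partners are 5% of fees",
    "partner onboarding checklist and timeline",
    "general SLA definitions for all partners",
]
query = "SLA penalties for tier-2 partners"
candidates = first_stage_retrieve(query, corpus)
top = rerank(query, candidates, k=1)
```

The shape is the point: a cheap pass over the whole corpus, then an expensive pass over a small candidate set.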
Reranking Approaches
Cross-Encoder Reranking
A cross-encoder takes the query and a single document as a concatenated input, passes them through a transformer together, and outputs a relevance score. Unlike bi-encoders (which embed query and document independently), cross-encoders perform full token-level attention between query and document. This joint encoding captures fine-grained interactions — negation, qualifier scope, entity co-reference — that independent embeddings miss.
The tradeoff is speed: a cross-encoder must run inference once per query-document pair. Scoring 50 candidates means 50 forward passes. This makes cross-encoders impractical for first-stage retrieval over millions of chunks, but well-suited for rescoring a small candidate set.
SBERT provides pretrained cross-encoder models across a speed-quality spectrum. At one end, cross-encoder/ms-marco-TinyBERT-L-2-v2 scores ~9000 docs/sec with moderate quality. At the other, cross-encoder/ms-marco-MiniLM-L-12-v2 scores ~960 docs/sec with substantially higher nDCG and MRR on MS MARCO.
Cohere Rerank offers cross-encoder reranking as a managed API. Models like rerank-v3.5 and rerank-v4.0 accept JSON and semi-structured data natively, handle multilingual queries, and require no infrastructure. The tradeoff is per-query API cost and network latency.
Late Interaction — ColBERT
ColBERT (Contextualized Late Interaction over BERT) encodes query and document independently into per-token embeddings, then scores relevance using MaxSim: for each query token, find the maximum cosine similarity to any document token, then sum across all query tokens. This is "late interaction" — token representations are pre-computed independently, but scoring considers token-level alignment.
The key advantage over cross-encoders: document embeddings are pre-computed and stored at index time. At query time, only the query needs encoding. Scoring is a matrix operation (MaxSim) over pre-stored document token vectors, which is significantly faster than full cross-encoder inference per candidate.
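MaxSim reduces to a single matrix product per candidate. The sketch below assumes query and document token embeddings are already L2-normalized, so dot products are cosine similarities; the toy 2-dimensional vectors are illustrative, not real model outputs.

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT late-interaction score.
    query_tokens: (num_query_tokens, dim), rows L2-normalized
    doc_tokens:   (num_doc_tokens, dim), rows L2-normalized
    For each query token, take the max cosine similarity against all
    document tokens, then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T      # (nq, nd) cosine similarity matrix
    return float(sim.max(axis=1).sum())    # best doc-token match per query token

# Toy example: two query tokens, three pre-stored document tokens.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
score = maxsim(q, d)  # each query token finds a perfect match, so 1.0 + 1.0 = 2.0
```

Because `d` is computed and stored at index time, only `q` requires a model forward pass at query time.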
ColBERTv2 adds residual compression that reduces per-document storage by 6–10x while retaining most of the quality. On BEIR zero-shot benchmarks, ColBERTv2 achieves competitive nDCG with full cross-encoders at a fraction of the latency.
The tradeoff: ColBERT requires multi-vector storage (one vector per token per document), which standard single-vector stores do not support natively. Dedicated engines like PLAID (ColBERTv2's retrieval engine) or vector stores with multi-vector support are needed.
Score Fusion — RRF and Alternatives
Score fusion combines ranked lists from multiple retrievers into a single ordering. This is not reranking in the cross-encoder sense — no new model scores relevance — but it serves the same purpose of improving ranking quality before generation.
Reciprocal Rank Fusion (RRF) is the most common fusion method. For each document, sum the reciprocal of its rank in each input list:
flowchart LR
D[Document d] --> S1[Rank 3 in dense retrieval]
D --> S2[Rank 1 in BM25]
S1 --> F[RRF = 1 over 63 + 1 over 61 = 0.032]
S2 --> F
F --> R[Combined score 0.032]

The formula: RRF_score(d) = sum over input lists i of 1 / (k + rank_i(d)), where k = 60 is the standard constant from the original paper. RRF is rank-based, not score-based — it does not need score normalization across retrievers with different scales, which makes it robust.
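A minimal RRF implementation over ranked lists, with k = 60 as in the original paper. The document names are illustrative; `doc_d` reproduces the worked example above (rank 3 in dense, rank 1 in sparse).

```python
def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: for each doc, sum 1/(k + rank) across lists.
    Ranks are 1-based; docs absent from a list contribute nothing for it."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

dense = ["doc_a", "doc_b", "doc_d"]   # doc_d at rank 3 in dense retrieval
sparse = ["doc_d", "doc_c", "doc_b"]  # doc_d at rank 1 in BM25
fused = rrf([dense, sparse])
# doc_d scores 1/63 + 1/61, approximately 0.032, and ranks first
```

Note that no raw retrieval scores appear anywhere: only rank positions enter the computation.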
Linear combination normalizes scores from each retriever to a common range and computes a weighted sum: score = alpha * dense_score + (1 - alpha) * sparse_score. This preserves score magnitude but requires choosing alpha and handling score distributions that shift across query types.
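A sketch of linear combination with min-max normalization. The alpha value and the score dictionaries are illustrative; in practice alpha must be tuned, and documents missing from one retriever's list need a policy (here they score 0 for that retriever).

```python
def minmax(scores):
    """Normalize a dict of raw scores to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def linear_fuse(dense_scores, sparse_scores, alpha=0.7):
    """score = alpha * dense + (1 - alpha) * sparse, after normalization."""
    d, s = minmax(dense_scores), minmax(sparse_scores)
    docs = set(d) | set(s)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

dense = {"doc_a": 0.92, "doc_b": 0.81, "doc_c": 0.78}   # e.g. cosine similarities
sparse = {"doc_b": 14.2, "doc_a": 9.1, "doc_d": 7.5}    # e.g. BM25 scores
fused = linear_fuse(dense, sparse)
```

The normalization step is exactly the fragility RRF avoids: if one retriever's score distribution shifts across query types, the effective weighting shifts with it.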
When to use which: RRF is the safer default because it only depends on rank ordering, not score distributions. Linear combination is worth trying when one retriever is consistently more reliable than the other and you want to weight it explicitly. In both cases, score fusion is a complement to model-based reranking, not a replacement — fuse first, then rerank the fused list.
Pitfalls
Latency Budget Exhaustion
Cross-encoder reranking adds 50–200ms per query depending on candidate count and model size. In a pipeline with a 500ms total SLA, reranking can consume 10–40% of the budget. Teams add reranking for quality, then discover that p95 latency exceeds the SLA under production load.
Mitigation: set a hard candidate cap (20–50 documents) and choose the reranker model size based on your latency budget, not just quality benchmarks. Profile reranking latency under realistic batch sizes and concurrency, not just single-query benchmarks.
Candidate Count Reduction Under Load
Under traffic pressure, teams reduce the candidate count passed to the reranker (from 100 to 20) to stay within latency budgets. This silently kills recall — if the relevant document was at position 35 in the first-stage results, reducing to top-20 means the reranker never sees it.
Detection: monitor first-stage recall@N at the candidate count you actually pass to the reranker, not the theoretical maximum. If recall@20 is significantly lower than recall@100, the candidate cut is the bottleneck, not the reranker.
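Given labeled relevant documents per query, this check is a few lines. The sketch below constructs the failure mode described above: a single relevant document at first-stage rank 35, which a top-20 candidate cap silently drops.

```python
def recall_at_n(ranked, relevant, n):
    """Fraction of relevant docs that appear in the top-n of a ranked list."""
    return len(set(ranked[:n]) & set(relevant)) / len(relevant)

# First-stage ranking for one query; the only relevant doc sits at rank 35.
ranked = [f"doc_{i}" for i in range(100)]
relevant = ["doc_34"]  # 0-indexed position 34 = rank 35

r20 = recall_at_n(ranked, relevant, 20)    # 0.0 -- the reranker never sees it
r100 = recall_at_n(ranked, relevant, 100)  # 1.0 -- the candidate cap is the bottleneck
```

In production, average these over an evaluation query set and alert when recall at the actual candidate cap diverges from recall at a larger N.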
Reranker-Retriever Distribution Mismatch
A reranker trained on MS MARCO (short web passages, English) may underperform on your domain (long technical documents, multilingual). The reranker's relevance judgments are calibrated to its training distribution — out-of-distribution documents get unreliable scores.
Mitigation: evaluate the reranker on your own query-document pairs before committing. If domain-specific recall degrades after reranking (reranker demotes relevant documents), the reranker is hurting, not helping. Consider domain-adapted or multilingual reranker models.
Over-Reliance on Reranking to Fix Retrieval
Reranking can only reorder what retrieval found. If the relevant document is not in the candidate set at all (recall failure), no amount of reranking will surface it. Teams sometimes add rerankers expecting them to fix retrieval coverage problems, when the actual fix is better chunking, embedding model selection, or query expansion.
Diagnostic: if reranking improves precision but not recall, the pipeline has a first-stage recall problem, not a ranking problem.
Tradeoffs
| Approach | Quality | Latency per query | Infrastructure | Best for |
|---|---|---|---|---|
| No reranking | Baseline -- retrieval order only | Lowest -- no extra scoring | None | Simple corpora where first-stage ranking is sufficient |
| Score fusion only -- RRF | Moderate -- better ordering from multiple signals | Minimal -- arithmetic on ranks | None -- works with any retriever pair | Hybrid retrieval where dense and sparse complement each other |
| Cross-encoder -- small model | High -- joint query-document attention | 50-100ms for 20-50 candidates | GPU or CPU inference | Quality-sensitive pipelines with moderate latency budgets |
| Cross-encoder -- large model | Highest -- deepest token interaction | 100-300ms for 20-50 candidates | GPU required | High-stakes domains where quality justifies latency |
| ColBERT late interaction | High -- token-level alignment with pre-computed docs | 20-50ms for 20-50 candidates | Multi-vector storage -- specialized index | Latency-sensitive pipelines needing better-than-bi-encoder quality |
| Managed API -- Cohere or Azure | High -- provider-tuned cross-encoders | Network round-trip + provider latency | None -- API call | Teams without ML infrastructure or needing fast integration |
Decision rule: start without reranking and measure retrieval quality. Add score fusion (RRF) when combining multiple retrievers. Add model-based reranking only when precision failures at the top of the ranked list are the dominant error mode — not when recall is the problem.
Questions
The improvement may be in ranking positions that the generator does not use. If the generator only reads the top-3 chunks, improvements at positions 4–5 are invisible to users. Evaluate whether the reranker changes the top-k composition that actually enters the prompt, not just overall nDCG. Also verify that the eval set reflects production query distribution — gains on eval-set query types may not represent the queries users actually send.
When the reranker is out-of-distribution for your domain. A reranker trained on short web passages may misjudge relevance on long technical documents, internal terminology, or multilingual content — demoting actually relevant documents. Also when the candidate count is too small: if first-stage recall@N is already low, the reranker only reorders noise. Always compare recall and precision before and after reranking on domain-specific queries.
ColBERT pre-computes per-token document embeddings at index time and stores them. At query time, only the query tokens need encoding (one forward pass regardless of candidate count). Scoring is a MaxSim matrix operation over pre-stored vectors, not a full transformer forward pass per document. Cross-encoders must run a complete forward pass for each query-document pair because they jointly encode the concatenated input. The tradeoff is that ColBERT needs multi-vector storage, which uses more space and requires specialized indexes.
References
- Retrieve and rerank pipeline — bi-encoder retrieval plus cross-encoder reranking (SBERT)
- Pretrained cross-encoder models — speed and quality benchmarks (SBERT)
- ColBERT — efficient and effective passage search via contextualized late interaction (SIGIR 2020)
- ColBERTv2 — residual compression and denoised supervision (NAACL 2022)
- Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods (SIGIR 2009)
- Semantic ranking in Azure AI Search — L2 reranking with language understanding (Microsoft Learn)
- Rerank API — models and semi-structured data support (Cohere)
- Late interaction overview — ColBERT, ColPali, and production context (Weaviate)
- Voyage Rerank-2 benchmarks — 93-dataset comparison across reranker models (Voyage AI)
- MTEB leaderboard — reranking task results (Hugging Face)