Re-ranking

Intro

Re-ranking is a second-stage scoring pass that takes the candidate set from retrieval and reorders it using a more expensive, more accurate model before context reaches the generator. Retrieval optimizes for recall at speed — find plausible candidates from millions of chunks. Re-ranking optimizes for precision — push the most relevant candidates to the top of a small list.

The mechanism: first-stage retrieval (dense, sparse, or hybrid) returns a candidate set of 20–100 chunks ranked by approximate similarity. The reranker then scores each candidate against the query using a model that can read both query and document together (joint encoding), producing a more accurate relevance score. The reranked top-k goes to the generator.

sequenceDiagram
    participant Q as Query
    participant R as First-Stage Retrieval
    participant RR as Reranker
    participant G as Generator
    Q->>R: Retrieve top-N candidates
    R->>RR: N candidates for rescoring
    Note over RR: Score each candidate jointly with query
    RR->>G: Top-k reranked chunks

Example: a hybrid retrieval returns 50 candidates for "what are the SLA penalties for tier-2 partners." Ten candidates mention SLAs generally, three mention tier-2 specifically, and the rest are noise about partner onboarding. A cross-encoder reranker reads each candidate alongside the query and pushes the three tier-2 SLA documents to positions 1–3, where the generator uses them. Without reranking, the generator might receive mostly generic SLA content and produce a vague answer.

Reranking Approaches

Cross-Encoder Reranking

A cross-encoder takes the query and a single document as a concatenated input, passes them through a transformer together, and outputs a relevance score. Unlike bi-encoders (which embed query and document independently), cross-encoders perform full token-level attention between query and document. This joint encoding captures fine-grained interactions — negation, qualifier scope, entity co-reference — that independent embeddings miss.

The tradeoff is speed: a cross-encoder must run inference once per query-document pair. Scoring 50 candidates means 50 forward passes. This makes cross-encoders impractical for first-stage retrieval over millions of chunks, but well-suited for rescoring a small candidate set.

SBERT provides pretrained cross-encoder models across a speed-quality spectrum. At one end, cross-encoder/ms-marco-TinyBERT-L-2-v2 scores ~9000 docs/sec with moderate quality. At the other, cross-encoder/ms-marco-MiniLM-L-12-v2 scores ~960 docs/sec with substantially higher nDCG and MRR on MS MARCO.
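Those throughput figures translate directly into a per-query latency budget. A quick back-of-envelope sketch; the `rerank_latency_ms` helper is illustrative, not a library function, and the docs/sec figures are the approximate ones quoted above:

```python
# Estimate reranking latency from model throughput and candidate count.
def rerank_latency_ms(candidates: int, docs_per_sec: float) -> float:
    """Milliseconds to score `candidates` docs at `docs_per_sec` throughput."""
    return candidates / docs_per_sec * 1000.0

# 50 candidates, using the approximate SBERT throughput numbers above:
tiny = rerank_latency_ms(50, 9000)  # TinyBERT-L-2: roughly 5-6 ms
mini = rerank_latency_ms(50, 960)   # MiniLM-L-12: roughly 52 ms
```

The gap matters when the reranker sits inside a tight end-to-end SLA: the larger model buys ranking quality at roughly 10x the scoring time.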

Cohere Rerank offers cross-encoder reranking as a managed API. Models like rerank-v3.5 and rerank-v4.0 accept JSON and semi-structured data natively, handle multilingual queries, and require no infrastructure. The tradeoff is per-query API cost and network latency.

Late Interaction — ColBERT

ColBERT (Contextualized Late Interaction over BERT) encodes query and document independently into per-token embeddings, then scores relevance using MaxSim: for each query token, find the maximum cosine similarity to any document token, then sum across all query tokens. This is "late interaction" — token representations are pre-computed independently, but scoring considers token-level alignment.

The key advantage over cross-encoders: document embeddings are pre-computed and stored at index time. At query time, only the query needs encoding. Scoring is a matrix operation (MaxSim) over pre-stored document token vectors, which is significantly faster than full cross-encoder inference per candidate.
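MaxSim itself is a small matrix operation. A minimal NumPy sketch, assuming toy 2-dimensional per-token embeddings rather than real BERT outputs:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take the maximum cosine
    similarity against all document tokens, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                    # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: two query tokens, two document tokens.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8]])
score = maxsim(query_vecs, doc_vecs)  # per-query maxes 1.0 and 0.8, sum 1.8
```

In a real deployment `doc_vecs` is pre-computed at index time, so query-time cost is one query encoding plus this matrix product per candidate.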

ColBERTv2 adds residual compression that reduces per-document storage by 6–10x while retaining most of the quality. On BEIR zero-shot benchmarks, ColBERTv2 achieves competitive nDCG with full cross-encoders at a fraction of the latency.

The tradeoff: ColBERT requires multi-vector storage (one vector per token per document), which standard single-vector stores do not support natively. Dedicated engines like PLAID (ColBERTv2's retrieval engine) or vector stores with multi-vector support are needed.

Score Fusion — RRF and Alternatives

Score fusion combines ranked lists from multiple retrievers into a single ordering. This is not reranking in the cross-encoder sense — no new model scores relevance — but it serves the same purpose of improving ranking quality before generation.

Reciprocal Rank Fusion (RRF) is the most common fusion method. For each document, sum the reciprocal of its rank in each input list:

flowchart LR
    D[Document d] --> S1[Rank 3 in dense retrieval]
    D --> S2[Rank 1 in BM25]
    S1 --> F[RRF = 1/63 + 1/61 = 0.032]
    S2 --> F
    F --> R[Combined score 0.032]

The formula: RRF_score(d) = sum over retrievers i of 1 / (k + rank_i(d)), where k = 60 is the standard constant from the original paper. RRF is rank-based, not score-based: it does not need score normalization across retrievers with different scales, which makes it robust.
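The whole method is a few lines of arithmetic. A minimal sketch over ranked lists of document IDs; the lists here are toy data:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each list a doc
    appears in, then sort by descending fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d5", "d1"]  # dense retrieval, best first
bm25 = ["d1", "d3", "d5"]   # sparse retrieval, best first
fused = rrf_fuse([dense, bm25])  # d1 wins: 1/63 + 1/61, as in the diagram
```

Documents that appear high in multiple lists float to the top; documents found by only one retriever are retained but discounted.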

Linear combination normalizes scores from each retriever to a common range and computes a weighted sum: score = alpha * dense_score + (1 - alpha) * sparse_score. This preserves score magnitude but requires choosing alpha and handling score distributions that shift across query types.
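A sketch of linear combination with min-max normalization; the `alpha` value and the raw scores are illustrative, and real pipelines must decide how to normalize when the two retrievers return different candidate sets:

```python
def minmax(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def linear_fuse(dense: list[float], sparse: list[float],
                alpha: float = 0.7) -> list[float]:
    """Weighted sum of min-max-normalized dense and sparse scores."""
    d, s = minmax(dense), minmax(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

# Cosine similarities vs. BM25 scores live on different scales;
# normalization puts them on common footing before weighting.
fused = linear_fuse([0.9, 0.5, 0.1], [12.0, 3.0, 7.0], alpha=0.5)
```

Note the failure mode RRF avoids: if one retriever's score distribution shifts across query types, a fixed `alpha` over normalized scores can over- or under-weight it.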

When to use which: RRF is the safer default because it only depends on rank ordering, not score distributions. Linear combination is worth trying when one retriever is consistently more reliable than the other and you want to weight it explicitly. In both cases, score fusion is a complement to model-based reranking, not a replacement — fuse first, then rerank the fused list.

Pitfalls

Latency Budget Exhaustion

Cross-encoder reranking adds 50–200ms per query depending on candidate count and model size. In a pipeline with a 500ms total SLA, reranking can consume 10–40% of the budget. Teams add reranking for quality, then discover that p95 latency exceeds the SLA under production load.

Mitigation: set a hard candidate cap (20–50 documents) and choose the reranker model size based on your latency budget, not just quality benchmarks. Profile reranking latency under realistic batch sizes and concurrency, not just single-query benchmarks.

Candidate Count Reduction Under Load

Under traffic pressure, teams reduce the candidate count passed to the reranker (from 100 to 20) to stay within latency budgets. This silently kills recall — if the relevant document was at position 35 in the first-stage results, reducing to top-20 means the reranker never sees it.

Detection: monitor first-stage recall@N at the candidate count you actually pass to the reranker, not the theoretical maximum. If recall@20 is significantly lower than recall@100, the candidate cut is the bottleneck, not the reranker.
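Measuring recall at the candidate count you actually pass downstream takes only a few lines. A sketch assuming binary relevance labels per query:

```python
def recall_at_n(ranked_ids: list[str], relevant_ids: set[str], n: int) -> float:
    """Fraction of relevant docs in the top-n of the first-stage ranking.
    This is the ceiling on what any reranker can recover."""
    hits = sum(1 for doc_id in ranked_ids[:n] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# The scenario above: the only relevant doc sits at position 35.
ranked = [f"d{i}" for i in range(1, 101)]
relevant = {"d35"}
r20 = recall_at_n(ranked, relevant, 20)    # 0.0 -- reranker never sees it
r100 = recall_at_n(ranked, relevant, 100)  # 1.0 -- candidate cut is the bottleneck
```

Track this metric at the production candidate count; a large gap between recall@20 and recall@100 means the cut, not the reranker, is costing you answers.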

Reranker-Retriever Distribution Mismatch

A reranker trained on MS MARCO (short web passages, English) may underperform on your domain (long technical documents, multilingual). The reranker's relevance judgments are calibrated to its training distribution — out-of-distribution documents get unreliable scores.

Mitigation: evaluate the reranker on your own query-document pairs before committing. If domain-specific recall degrades after reranking (reranker demotes relevant documents), the reranker is hurting, not helping. Consider domain-adapted or multilingual reranker models.

Over-Reliance on Reranking to Fix Retrieval

Reranking can only reorder what retrieval found. If the relevant document is not in the candidate set at all (recall failure), no amount of reranking will surface it. Teams sometimes add rerankers expecting them to fix retrieval coverage problems, when the actual fix is better chunking, embedding model selection, or query expansion.

Diagnostic: check whether the relevant documents appear anywhere in the candidate set before reranking. If they are absent, the pipeline has a first-stage recall problem, not a ranking problem, and a reranker cannot fix it.

Tradeoffs

| Approach | Quality | Latency per query | Infrastructure | Best for |
|---|---|---|---|---|
| No reranking | Baseline (retrieval order only) | Lowest (no extra scoring) | None | Simple corpora where first-stage ranking is sufficient |
| Score fusion only (RRF) | Moderate (better ordering from multiple signals) | Minimal (arithmetic on ranks) | None (works with any retriever pair) | Hybrid retrieval where dense and sparse complement each other |
| Cross-encoder, small model | High (joint query-document attention) | 50-100ms for 20-50 candidates | GPU or CPU inference | Quality-sensitive pipelines with moderate latency budgets |
| Cross-encoder, large model | Highest (deepest token interaction) | 100-300ms for 20-50 candidates | GPU required | High-stakes domains where quality justifies latency |
| ColBERT late interaction | High (token-level alignment with pre-computed docs) | 20-50ms for 20-50 candidates | Multi-vector storage, specialized index | Latency-sensitive pipelines needing better-than-bi-encoder quality |
| Managed API (Cohere or Azure) | High (managed cross-encoder) | Network round-trip + provider latency | None (API call) | Teams without ML infrastructure or needing fast integration |

Decision rule: start without reranking and measure retrieval quality. Add score fusion (RRF) when combining multiple retrievers. Add model-based reranking only when precision failures at the top of the ranked list are the dominant error mode — not when recall is the problem.

Questions

References


What's next