Re-ranking

Intro

Re-ranking is a second-stage scoring pass that takes the candidate set from retrieval and reorders it using a more expensive, more accurate model before context reaches the generator. Retrieval optimizes for recall at speed — find plausible candidates from millions of chunks. Re-ranking optimizes for precision — push the most relevant candidates to the top of a small list.

The mechanism: first-stage retrieval (dense, sparse, or hybrid) returns a candidate set of 20–100 chunks ranked by approximate similarity. The reranker then scores each candidate against the query using a model that can read both query and document together (joint encoding), producing a more accurate relevance score. The reranked top-k goes to the generator.

sequenceDiagram
    participant Q as Query
    participant R as First-Stage Retrieval
    participant RR as Reranker
    participant G as Generator
    Q->>R: Retrieve top-N candidates
    R->>RR: N candidates for rescoring
    Note over RR: Score each candidate jointly with query
    RR->>G: Top-k reranked chunks

Example: a hybrid retrieval returns 50 candidates for "what are the SLA penalties for tier-2 partners." Ten candidates mention SLAs generally, three mention tier-2 specifically, and the rest are noise about partner onboarding. A cross-encoder reranker reads each candidate alongside the query and pushes the three tier-2 SLA documents to positions 1–3, where the generator uses them. Without reranking, the generator might receive mostly generic SLA content and produce a vague answer.

Reranking Approaches

Cross-Encoder Reranking

A cross-encoder takes the query and a single document as a concatenated input, passes them through a transformer together, and outputs a relevance score. Unlike bi-encoders (which embed query and document independently), cross-encoders perform full token-level attention between query and document. This joint encoding captures fine-grained interactions — negation, qualifier scope, entity co-reference — that independent embeddings miss.

The tradeoff is speed: a cross-encoder must run inference once per query-document pair. Scoring 50 candidates means 50 forward passes. This makes cross-encoders impractical for first-stage retrieval over millions of chunks, but well-suited for rescoring a small candidate set.

SBERT provides pretrained cross-encoder models across a speed-quality spectrum. At one end, cross-encoder/ms-marco-TinyBERT-L-2-v2 scores ~9000 docs/sec with moderate quality. At the other, cross-encoder/ms-marco-MiniLM-L-12-v2 scores ~960 docs/sec with substantially higher nDCG and MRR on MS MARCO.
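Those throughput figures translate directly into a per-query latency budget. A quick back-of-envelope sketch; the `rerank_latency_ms` helper is illustrative, not a library function, and the docs/sec figures are the approximate ones quoted above:

```python
# Estimate reranking latency from model throughput and candidate count.
def rerank_latency_ms(candidates: int, docs_per_sec: float) -> float:
    """Milliseconds to score `candidates` docs at `docs_per_sec` throughput."""
    return candidates / docs_per_sec * 1000.0

# 50 candidates, using the approximate SBERT throughput numbers above:
tiny = rerank_latency_ms(50, 9000)  # TinyBERT-L-2: roughly 5-6 ms
mini = rerank_latency_ms(50, 960)   # MiniLM-L-12: roughly 52 ms
```

The gap matters when the reranker sits inside a tight end-to-end SLA: the larger model buys ranking quality at roughly 10x the scoring time.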

Cohere Rerank offers cross-encoder reranking as a managed API. Models like rerank-v3.5 and rerank-v4.0 accept JSON and semi-structured data natively, handle multilingual queries, and require no infrastructure. The tradeoff is per-query API cost and network latency.

Late Interaction — ColBERT

ColBERT (Contextualized Late Interaction over BERT) encodes query and document independently into per-token embeddings, then scores relevance using MaxSim: for each query token, find the maximum cosine similarity to any document token, then sum across all query tokens. This is "late interaction" — token representations are pre-computed independently, but scoring considers token-level alignment.

The key advantage over cross-encoders: document embeddings are pre-computed and stored at index time. At query time, only the query needs encoding. Scoring is a matrix operation (MaxSim) over pre-stored document token vectors, which is significantly faster than full cross-encoder inference per candidate.
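MaxSim itself is a small matrix operation. A minimal NumPy sketch, assuming toy 2-dimensional per-token embeddings rather than real BERT outputs:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take the maximum cosine
    similarity against all document tokens, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                    # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: two query tokens, two document tokens.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8]])
score = maxsim(query_vecs, doc_vecs)  # per-query maxes 1.0 and 0.8, sum 1.8
```

In a real deployment `doc_vecs` is pre-computed at index time, so query-time cost is one query encoding plus this matrix product per candidate.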

ColBERTv2 adds residual compression that reduces per-document storage by 6–10x while retaining most of the quality. On BEIR zero-shot benchmarks, ColBERTv2 achieves competitive nDCG with full cross-encoders at a fraction of the latency.

The tradeoff: ColBERT requires multi-vector storage (one vector per token per document), which standard single-vector stores do not support natively. Dedicated engines like PLAID (ColBERTv2's retrieval engine) or vector stores with multi-vector support are needed.

Score Fusion — RRF and Alternatives

Score fusion combines ranked lists from multiple retrievers into a single ordering. This is not reranking in the cross-encoder sense — no new model scores relevance — but it serves the same purpose of improving ranking quality before generation.

Reciprocal Rank Fusion (RRF) is the most common fusion method. For each document, sum the reciprocal of its rank in each input list:

flowchart LR
    D[Document d] --> S1[Rank 3 in dense retrieval]
    D --> S2[Rank 1 in BM25]
    S1 --> F[RRF = 1/63 + 1/61 = 0.032]
    S2 --> F
    F --> R[Combined score 0.032]

The formula: RRF_score(d) = sum over retrievers i of 1 / (k + rank_i(d)), where k = 60 is the standard constant from the original paper. RRF is rank-based, not score-based: it does not need score normalization across retrievers with different scales, which makes it robust.
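The whole method is a few lines of arithmetic. A minimal sketch over ranked lists of document IDs; the lists here are toy data:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each list a doc
    appears in, then sort by descending fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d5", "d1"]  # dense retrieval, best first
bm25 = ["d1", "d3", "d5"]   # sparse retrieval, best first
fused = rrf_fuse([dense, bm25])  # d1 wins: 1/63 + 1/61, as in the diagram
```

Documents that appear high in multiple lists float to the top; documents found by only one retriever are retained but discounted.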

Linear combination normalizes scores from each retriever to a common range and computes a weighted sum: score = alpha * dense_score + (1 - alpha) * sparse_score. This preserves score magnitude but requires choosing alpha and handling score distributions that shift across query types.
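A sketch of linear combination with min-max normalization; the `alpha` value and the raw scores are illustrative, and real pipelines must decide how to normalize when the two retrievers return different candidate sets:

```python
def minmax(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def linear_fuse(dense: list[float], sparse: list[float],
                alpha: float = 0.7) -> list[float]:
    """Weighted sum of min-max-normalized dense and sparse scores."""
    d, s = minmax(dense), minmax(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

# Cosine similarities vs. BM25 scores live on different scales;
# normalization puts them on common footing before weighting.
fused = linear_fuse([0.9, 0.5, 0.1], [12.0, 3.0, 7.0], alpha=0.5)
```

Note the failure mode RRF avoids: if one retriever's score distribution shifts across query types, a fixed `alpha` over normalized scores can over- or under-weight it.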

When to use which: RRF is the safer default because it only depends on rank ordering, not score distributions. Linear combination is worth trying when one retriever is consistently more reliable than the other and you want to weight it explicitly. In both cases, score fusion is a complement to model-based reranking, not a replacement — fuse first, then rerank the fused list.

Pitfalls

Latency Budget Exhaustion

Cross-encoder reranking adds 50–200ms per query depending on candidate count and model size. In a pipeline with a 500ms total SLA, reranking can consume 10–40% of the budget. Teams add reranking for quality, then discover that p95 latency exceeds the SLA under production load.

Mitigation: set a hard candidate cap (20–50 documents) and choose the reranker model size based on your latency budget, not just quality benchmarks. Profile reranking latency under realistic batch sizes and concurrency, not just single-query benchmarks.

Candidate Count Reduction Under Load

Under traffic pressure, teams reduce the candidate count passed to the reranker (from 100 to 20) to stay within latency budgets. This silently kills recall — if the relevant document was at position 35 in the first-stage results, reducing to top-20 means the reranker never sees it.

Detection: monitor first-stage recall@N at the candidate count you actually pass to the reranker, not the theoretical maximum. If recall@20 is significantly lower than recall@100, the candidate cut is the bottleneck, not the reranker.
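Measuring recall at the candidate count you actually pass downstream takes only a few lines. A sketch assuming binary relevance labels per query:

```python
def recall_at_n(ranked_ids: list[str], relevant_ids: set[str], n: int) -> float:
    """Fraction of relevant docs in the top-n of the first-stage ranking.
    This is the ceiling on what any reranker can recover."""
    hits = sum(1 for doc_id in ranked_ids[:n] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# The scenario above: the only relevant doc sits at position 35.
ranked = [f"d{i}" for i in range(1, 101)]
relevant = {"d35"}
r20 = recall_at_n(ranked, relevant, 20)    # 0.0 -- reranker never sees it
r100 = recall_at_n(ranked, relevant, 100)  # 1.0 -- candidate cut is the bottleneck
```

Track this metric at the production candidate count; a large gap between recall@20 and recall@100 means the cut, not the reranker, is costing you answers.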

Reranker-Retriever Distribution Mismatch

A reranker trained on MS MARCO (short web passages, English) may underperform on your domain (long technical documents, multilingual). The reranker's relevance judgments are calibrated to its training distribution — out-of-distribution documents get unreliable scores.

Mitigation: evaluate the reranker on your own query-document pairs before committing. If domain-specific recall degrades after reranking (reranker demotes relevant documents), the reranker is hurting, not helping. Consider domain-adapted or multilingual reranker models.

Over-Reliance on Reranking to Fix Retrieval

Reranking can only reorder what retrieval found. If the relevant document is not in the candidate set at all (recall failure), no amount of reranking will surface it. Teams sometimes add rerankers expecting them to fix retrieval coverage problems, when the actual fix is better chunking, embedding model selection, or query expansion.

Diagnostic: check whether the relevant documents appear anywhere in the candidate set before reranking. If they are absent, the pipeline has a first-stage recall problem, not a ranking problem, and a reranker cannot fix it.

Tradeoffs

| Approach | Quality | Latency per query | Infrastructure | Best for |
|---|---|---|---|---|
| No reranking | Baseline (retrieval order only) | Lowest (no extra scoring) | None | Simple corpora where first-stage ranking is sufficient |
| Score fusion only (RRF) | Moderate (better ordering from multiple signals) | Minimal (arithmetic on ranks) | None (works with any retriever pair) | Hybrid retrieval where dense and sparse complement each other |
| Cross-encoder, small model | High (joint query-document attention) | 50-100ms for 20-50 candidates | GPU or CPU inference | Quality-sensitive pipelines with moderate latency budgets |
| Cross-encoder, large model | Highest (deepest token interaction) | 100-300ms for 20-50 candidates | GPU required | High-stakes domains where quality justifies latency |
| ColBERT late interaction | High (token-level alignment with pre-computed docs) | 20-50ms for 20-50 candidates | Multi-vector storage, specialized index | Latency-sensitive pipelines needing better-than-bi-encoder quality |
| Managed API (Cohere or Azure) | High (managed cross-encoder) | Network round-trip + provider latency | None (API call) | Teams without ML infrastructure or needing fast integration |

Decision rule: start without reranking and measure retrieval quality. Add score fusion (RRF) when combining multiple retrievers. Add model-based reranking only when precision failures at the top of the ranked list are the dominant error mode — not when recall is the problem.

Questions

References


What's next