What role does document reranking play alongside decisions about whether to retrieve?
This explores two decisions that sit at opposite ends of a retrieval pipeline: reranking decides what survives and in what order *after* you've pulled candidates, while the retrieve-or-not decision happens *before* anything is fetched. The corpus suggests these aren't competing fixes. They address different failure points, and the strongest systems make both decisions smarter rather than picking one.
On the reranking side, the recurring theme is that ordering by raw similarity is the weak link. Embeddings measure *association*, not *relevance*, a structural mismatch flagged directly in "Where do retrieval systems fail and why?", so a reranker's real job is to repair that gap. "Can rationale-driven selection beat similarity re-ranking for evidence?" makes this concrete: having an LLM generate rationales to flag evidence was 33% more accurate than similarity reranking while using half the chunks. "Can verification separate structural near-misses from topical matches?" pushes the same logic into a two-stage shape: cheap recall first, then a learned verifier that catches the structural near-misses the first pass can't. That is reranking reframed as a distinct verification step rather than a sort.
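The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not the method from any of the cited notes: the `recall_then_verify` function, its parameters, the toy documents, and the keyword-based `toy_verifier` are all hypothetical stand-ins, and a real verifier would be a learned model rather than a string check.

```python
import math
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[float], str]  # (doc_id, embedding, text)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_then_verify(
    query_vec: Sequence[float],
    docs: List[Doc],
    verifier: Callable[[str], float],
    recall_k: int = 2,
    keep_threshold: float = 0.5,
) -> List[str]:
    """Return ids of recalled docs that the verifier accepts, best first."""
    # Stage 1: cheap recall by embedding similarity (association).
    candidates = sorted(
        docs, key=lambda d: cosine(query_vec, d[1]), reverse=True
    )[:recall_k]
    # Stage 2: the verifier rescores each candidate (relevance) and may reject it.
    scored = [(verifier(text), doc_id) for doc_id, _, text in candidates]
    kept = [pair for pair in scored if pair[0] >= keep_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in kept]


# Toy data: "a" has the closest vector but is a structural near-miss.
docs = [
    ("a", [1.0, 0.0], "contract terminated in 2020"),
    ("b", [0.9, 0.1], "contract signed in 2020"),
    ("c", [0.0, 1.0], "weather report"),
]


def toy_verifier(text: str) -> float:
    """Hypothetical stand-in for a learned verifier."""
    return 1.0 if "signed" in text else 0.0


print(recall_then_verify([1.0, 0.0], docs, toy_verifier))
```

The point of the design is that the second stage can reject a candidate outright, so the top similarity hit is not guaranteed a place in the final context.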
On the whether-to-retrieve side, the corpus is increasingly skeptical of always retrieving. "When should language models retrieve external knowledge versus use internal knowledge?" frames each reasoning step as a choice between fetching externally and trusting the model's own parametric knowledge, gaining ~22% largely by targeting retrieval better and *not* retrieving when it would only add noise. "Where do retrieval systems fail and why?" calls fixed-interval triggering a core architectural flaw. And "Can models decide better than retrievers which tools to use?" hands the decision to the model entirely: it emits requests when reasoning demands them, rather than a retriever guessing in one passive round.
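A per-step retrieve-or-not gate might look like the sketch below. It is a deliberate simplification: the cited work learns this decision, whereas this example assumes a fixed confidence threshold. The `run_steps` function, the `confidence` and `retrieve` callables, and the 0.7 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    retrieved: bool
    context: List[str] = field(default_factory=list)


def run_steps(
    steps: List[str],
    confidence: Callable[[str], float],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.7,
) -> List[StepResult]:
    """Decide per reasoning step whether to fetch or rely on internal knowledge."""
    results = []
    for step in steps:
        if confidence(step) >= threshold:
            # Confident enough: skip retrieval so no extra context is added.
            results.append(StepResult(step, retrieved=False))
        else:
            # Not confident: fetch external evidence for this step only.
            results.append(StepResult(step, retrieved=True, context=retrieve(step)))
    return results


# Toy stand-ins (assumptions): a real system would estimate confidence
# from the model itself and call an actual retriever.
def toy_confidence(step: str) -> float:
    return 0.9 if "capital of France" in step else 0.2


def toy_retrieve(step: str) -> List[str]:
    return [f"doc for: {step}"]


results = run_steps(
    ["capital of France", "2023 revenue of Acme"], toy_confidence, toy_retrieve
)
for r in results:
    print(r.step, "->", "retrieve" if r.retrieved else "skip")
```

Unlike fixed-interval triggering, the gate fires only on steps where the confidence estimate falls below the threshold.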
The interesting part is where the two decisions blur. "Can a model's partial response guide what to retrieve next?" uses a model's partial answer to decide *what to retrieve next*, so the generation itself becomes the trigger and the reranking signal at once. "Does supervising retrieval steps outperform final answer rewards?" shows that rewarding the *intermediate* retrieval steps (which chains were good, which were dead ends) beats scoring only the final answer, meaning the system learns both when to fetch and what to keep from the same feedback. "Do hierarchical retrieval architectures outperform flat ones on complex queries?" separates planning from synthesis so these decisions don't interfere with each other.
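The idea of letting a draft answer steer the next retrieval can be sketched as a short loop. This is a toy illustration that assumes the draft is simply concatenated to the question to form the next query. The `iterative_retrieve_generate` function, the keyword-matching `toy_retrieve`, and the two-entry `corpus` are hypothetical, and `toy_generate` just echoes the retrieved text in place of a real model.

```python
from typing import Callable, Dict, List, Tuple


def iterative_retrieve_generate(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    iterations: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation, feeding each draft into the next query."""
    query = question
    draft = ""
    docs: List[str] = []
    for _ in range(iterations):
        docs = retrieve(query)
        draft = generate(question, docs)
        # The draft surfaces entities the original question never named,
        # so the next retrieval can reach documents the first pass missed.
        query = f"{question} {draft}"
    return draft, docs


# Toy corpus keyed by the entity each passage is about (assumption).
corpus: Dict[str, str] = {
    "Inception": "Inception was directed by Christopher Nolan.",
    "Nolan": "Christopher Nolan was born in London.",
}


def toy_retrieve(query: str) -> List[str]:
    return [text for key, text in corpus.items() if key in query]


def toy_generate(question: str, docs: List[str]) -> str:
    return " ".join(docs)


question = "Where was the director of Inception born?"
draft, docs = iterative_retrieve_generate(question, toy_retrieve, toy_generate)
print(draft)
```

With a single iteration the second passage is never retrieved, because the question alone does not mention the director's name. The first draft supplies it.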
The quiet takeaway: reranking and retrieve-or-not are both ways of controlling a *noise budget*. Every irrelevant chunk that survives reranking, and every unnecessary retrieval that fires, costs the same thing: context polluted with material that degrades the answer. That's why "Can RAG systems refuse to answer without reliable evidence?" treats refusal as a first-class option. Sometimes the right rerank is to keep nothing and the right retrieval decision is to admit you have no good evidence. Worth knowing too, from the human side: "Do users trust citations more when there are simply more of them?" found that irrelevant citations raise user preference almost as much as relevant ones, which means sloppy reranking isn't just inefficient, it can actively manufacture misplaced trust.
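Treating the context as a noise budget, with refusal as a valid outcome, could be sketched like this. The threshold, the chunk cap, the `REFUSAL` message, and both function names are assumptions made for illustration. The scores are presumed to come from whatever reranker or verifier the pipeline already has.

```python
from typing import Callable, List, Tuple

REFUSAL = "I don't have reliable evidence to answer that."


def select_evidence(
    scored_chunks: List[Tuple[str, float]],
    min_score: float = 0.6,
    max_chunks: int = 3,
) -> List[str]:
    """Keep at most max_chunks chunks whose score clears min_score, best first."""
    relevant = [chunk for chunk in scored_chunks if chunk[1] >= min_score]
    relevant.sort(key=lambda chunk: chunk[1], reverse=True)
    return [text for text, _ in relevant[:max_chunks]]


def answer_or_refuse(
    scored_chunks: List[Tuple[str, float]],
    answer_fn: Callable[[List[str]], str],
    min_score: float = 0.6,
    max_chunks: int = 3,
) -> str:
    """Answer only from evidence that survives selection; otherwise refuse."""
    evidence = select_evidence(scored_chunks, min_score, max_chunks)
    if not evidence:
        # Keeping nothing is a legitimate rerank outcome.
        return REFUSAL
    return answer_fn(evidence)


chunks = [("a", 0.9), ("b", 0.2), ("c", 0.7)]
print(select_evidence(chunks))
print(answer_or_refuse([("b", 0.2)], lambda ev: " ".join(ev)))
```

Capping the chunk count and enforcing a score floor are separate controls: the cap bounds how much context is spent, while the floor decides whether any of it is worth spending.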
Sources (10 notes)
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.