Are uncertainty estimation and external feature signals complementary for retrieval?
This note explores whether two competing ways to decide *when* a model should reach for retrieval (the model's own sense of uncertainty versus cheap signals read off the question itself) work better together or are really rivals. The corpus frames them less as complements than as two camps that each claim the same ground, and the interesting finding is how close they come to a tie.
On one side, the model's self-knowledge wins. Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop questions and matches it on multi-hop, while spending a fraction of the model and retriever calls; the model's own confidence turns out to be a more reliable trigger than external heuristics (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). On the other side, you don't need to look inside the model at all: a learned predictor built from 27 lightweight, external question features matches those uncertainty methods on overall performance for far less cost, and actually *outperforms* them on complex questions across 6 QA datasets (see "Can question features alone predict when to retrieve?"). So the honest answer to "are they complementary?" is that the corpus shows them as near-substitutes that diverge by question type: uncertainty is strongest on simple single-hop lookups, while question features pull ahead on the hard, compositional ones. That divergence is exactly where complementarity could live: route by the kind of question, not by one universal signal.
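The routing idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of either study: the function names, the thresholds, the `question_is_complex` flag, and the `feature_score` input (standing in for the output of a learned question-feature predictor) are all hypothetical, and the confidence measure is a plain geometric mean of token probabilities with no calibration step.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer, in [0, 1]."""
    if not token_logprobs:
        return 0.0  # no draft answer: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(
    question_is_complex: bool,
    token_logprobs: Sequence[float],
    feature_score: float,
    confidence_threshold: float = 0.7,  # hypothetical value, would need tuning
    feature_threshold: float = 0.5,     # hypothetical value, would need tuning
) -> bool:
    """Route by question type instead of relying on one universal signal."""
    if question_is_complex:
        # Complex questions: defer to the external question-feature score.
        return feature_score >= feature_threshold
    # Simple lookups: retrieve only when the model's own confidence is low.
    return mean_token_confidence(token_logprobs) < confidence_threshold
```

For example, `should_retrieve(False, [-0.05, -0.05], 0.0)` skips retrieval because the draft answer's mean token probability is about 0.95, while `should_retrieve(True, [-0.01], 0.8)` retrieves regardless of the model's confidence.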
What makes the trade-off sharper is that "when to retrieve" is itself a learnable decision, not a fixed schedule. DeepRAG treats each reasoning step as a Markov decision process and learns, step by step, whether to pull external knowledge or trust the model's parametric memory, and the reported 21.99% improvement comes as much from *not* retrieving (cutting noise from unnecessary lookups) as from retrieving well (see "When should language models retrieve external knowledge versus use internal knowledge?"). Read alongside the two trigger studies, this suggests the real prize isn't picking uncertainty or question features as the better oracle; it's that both are inputs to a policy that decides retrieval per step.
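To make "both signals as inputs to a per-step policy" concrete, here is a small sketch. It is not DeepRAG's algorithm: the `StepSignals` type, the `linear_policy` function, and its hand-set weights are hypothetical stand-ins for a policy that would in practice be learned from data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepSignals:
    confidence: float     # model's confidence in answering this step from memory, in [0, 1]
    feature_score: float  # external question-feature score favouring retrieval, in [0, 1]


def linear_policy(
    signals: StepSignals,
    w_unc: float = 1.0,   # hypothetical weight on model uncertainty
    w_feat: float = 1.0,  # hypothetical weight on the external feature score
    bias: float = -1.0,   # hypothetical offset; more negative means fewer retrievals
) -> bool:
    """Retrieve when the combined evidence for retrieval is positive."""
    score = w_unc * (1.0 - signals.confidence) + w_feat * signals.feature_score + bias
    return score > 0.0


def plan_retrieval(
    steps: List[StepSignals],
    policy: Callable[[StepSignals], bool] = linear_policy,
) -> List[str]:
    """Decide, for each reasoning step, whether to retrieve or use parametric memory."""
    return ["retrieve" if policy(step) else "parametric" for step in steps]
```

With the default weights, `plan_retrieval([StepSignals(0.95, 0.1), StepSignals(0.3, 0.6)])` keeps the first step parametric and retrieves for the second. Because the policy is passed in as a callable, a learned model could replace `linear_policy` without changing the loop.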
It helps to know *why* the trigger decision matters so much. A structural account of where RAG breaks names adaptive triggering as one of three independent failure levels (fixed-interval retrieval simply wastes context), sitting beside semantic-task mismatch and the hard mathematical limits of embedding dimension (see "Where do retrieval systems fail and why?"). In other words, getting the *when* wrong is a distinct failure from getting the *what* wrong, which is why a cheap, accurate trigger signal is worth so much. And once you do retrieve, refusing to answer without grounded evidence becomes the backstop that keeps a bad trigger from turning into a confident hallucination (see "Can RAG systems refuse to answer without reliable evidence?").
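The refusal backstop can be illustrated with a simple gate. Note the swap: the source note describes a grounded-refusal *prompt*, whereas this sketch applies the same idea as a code-level check after retrieval. The function name, the refusal wording, the `(text, score)` passage format, and the `min_score` threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical refusal message; the source does not specify any wording.
REFUSAL = "I cannot answer this from the available evidence."


def answer_or_refuse(
    passages: List[Tuple[str, float]],  # (passage text, relevance score) pairs
    draft_answer: Optional[str],
    min_score: float = 0.5,             # hypothetical relevance cutoff
) -> str:
    """Return the draft answer only when at least one passage clears the cutoff."""
    supported = [text for text, score in passages if score >= min_score]
    if not supported or not draft_answer:
        return REFUSAL
    return draft_answer
```

The gate trades coverage for integrity: a question whose retrieved passages all score below the cutoff gets a refusal even if the model has a draft answer ready.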
The thing you may not have expected to learn: the cheaper signal is often the better one. The lightweight external-feature predictor isn't a fallback for when you can't probe the model; it beats the introspective method precisely on the complex questions you'd assume demand the model's own judgment. The frontier question the corpus implies but doesn't yet answer is whether feeding both signals into a learned per-step policy beats either alone.
Sources (5 notes)
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.