Can adaptive per-step decisions outperform uniform retrieval policies across different reasoning tasks?
This explores whether letting a model decide at each step — retrieve or not, which structure to pull, how hard to think — beats applying one fixed retrieval rule to every query, and the corpus has a surprisingly layered answer.
The corpus says yes, but the more interesting finding is *which kind* of adaptivity pays off, and where the cheapest version wins. The clearest case is DeepRAG, which frames each reasoning step as a Markov decision process: at every step the model chooses whether to reach for external knowledge or trust what it already knows. That selective switching delivers a 21.99% accuracy gain, and the gain comes as much from *not* retrieving (eliminating noise from unnecessary lookups) as from retrieving well (see "When should language models retrieve external knowledge versus use internal knowledge?"). So adaptivity isn't just 'retrieve more cleverly'; it's knowing when retrieval would hurt.
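The per-step framing above can be sketched as a loop in which each subquery triggers one binary action: retrieve, or answer from parametric knowledge. This is only an illustrative skeleton, not DeepRAG's implementation; `State`, `run_episode`, and the toy policy, retriever, and answerer are all hypothetical stand-ins, and in the real system the policy is learned rather than hand-written.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """Hypothetical MDP state: the question plus (subquery, answer, retrieved) steps so far."""
    question: str
    steps: List[Tuple[str, str, bool]] = field(default_factory=list)


def run_episode(
    question: str,
    subqueries: List[str],
    should_retrieve: Callable[[State, str], bool],
    retrieve: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> State:
    """Walk the subqueries, taking one retrieve-or-not action per step."""
    state = State(question)
    for sub in subqueries:
        use_retrieval = should_retrieve(state, sub)  # the per-step action
        evidence = retrieve(sub) if use_retrieval else ""
        state.steps.append((sub, answer(sub, evidence), use_retrieval))
    return state


# Toy stand-ins (assumptions for illustration only).
MEMORY = {"q1": "a1"}   # what the "model" already knows
CORPUS = {"q2": "a2"}   # what only the retriever can supply
retrieval_calls: List[str] = []


def toy_policy(state: State, sub: str) -> bool:
    # Retrieve only when the subquery is not covered by parametric memory.
    return sub not in MEMORY


def toy_retrieve(sub: str) -> str:
    retrieval_calls.append(sub)
    return CORPUS.get(sub, "")


def toy_answer(sub: str, evidence: str) -> str:
    return evidence or MEMORY.get(sub, "unknown")


final = run_episode("toy question", ["q1", "q2"], toy_policy, toy_retrieve, toy_answer)
```

In this toy run the retriever is called only for `q2`, which mirrors the point that skipped lookups are part of the policy's value, not a failure of it.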
That theme repeats one level up, at the question of *what* to retrieve. StructRAG shows that routing each query to a task-matched knowledge structure (a table, a graph, an algorithm, a catalogue, or plain chunks) beats uniform chunk-based RAG on knowledge-intensive reasoning. It grounds this in cognitive-fit theory: different reasoning tasks want different representations, so one-size-fits-all retrieval leaves accuracy on the table (see "Can routing queries to task-matched structures improve RAG reasoning?"). The same 'separate and route' instinct shows up architecturally, where splitting query planning from answer synthesis reduces interference and lifts multi-hop performance (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?").
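A rough sketch of the routing idea follows. StructRAG's router is DPO-trained according to the source note; the keyword table here is a deliberately crude, hypothetical substitute that only shows the shape of the decision (query in, structure type out, chunks as the fallback). The `Structure` enum, `CUES`, and `route` are illustrative names, not part of StructRAG.

```python
from enum import Enum


class Structure(Enum):
    TABLE = "table"
    GRAPH = "graph"
    ALGORITHM = "algorithm"
    CATALOGUE = "catalogue"
    CHUNK = "chunk"


# Hypothetical keyword cues standing in for a learned router.
CUES = [
    (Structure.TABLE, ("compare", "versus", "statistics")),
    (Structure.GRAPH, ("relationship", "connected", "chain")),
    (Structure.ALGORITHM, ("steps", "procedure", "plan")),
    (Structure.CATALOGUE, ("summarize", "overview", "list")),
]


def route(query: str) -> Structure:
    """Return the first structure whose cue appears in the query, else plain chunks."""
    q = query.lower()
    for structure, cues in CUES:
        if any(cue in q for cue in cues):
            return structure
    return Structure.CHUNK


table_choice = route("Compare revenue of A versus B")
graph_choice = route("How are X and Y connected?")
default_choice = route("Who wrote it?")
```

The fallback to `Structure.CHUNK` is a design choice for the sketch: uniform chunk retrieval becomes one option among several rather than the only policy.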
But here's the twist: the *fanciest* adaptivity isn't always the winner. A calibrated, cheap uncertainty estimate, which just reads the model's own token-probability confidence to decide whether to retrieve, consistently beats elaborate multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop, using a fraction of the LM and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). The lesson: adaptivity helps, but the model's own self-knowledge is often a better signal than external heuristics dressed up as adaptivity. Per-step decisions beat uniform policies; complicated per-step machinery doesn't automatically beat simple per-step machinery.
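As a minimal sketch of such a gate, assume the model returns per-token log-probabilities for a draft answer. The geometric-mean confidence and the 0.8 threshold below are assumptions chosen for illustration; the source does not specify the estimator, and a usable threshold would have to be calibrated on held-out data.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer (1.0 means fully confident)."""
    if not token_logprobs:
        return 0.0  # no evidence of confidence, so treat as maximally uncertain
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.8) -> bool:
    """Retrieve only when the model's own confidence falls below the threshold."""
    return mean_token_confidence(token_logprobs) < threshold


confident_draft = [math.log(0.95)] * 4           # every token near-certain
unsure_draft = [math.log(0.9), math.log(0.3)]    # one shaky token drags confidence down

skip_lookup = should_retrieve(confident_draft)
do_lookup = should_retrieve(unsure_draft)
```

The gate costs one pass over numbers the model already produced, which is why it needs no extra LM or retriever calls to make the decision.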
The same 'learn when, not just how' pattern extends beyond retrieval into reasoning itself. Thinkless trains a single model to route between extended thinking and quick direct answers without needing difficulty labels, so reasoning depth is adaptive and self-calibrated (see "Can models learn when to think versus respond quickly?"). And there's a mechanistic hint at *why* per-step decisions carry so much weight: in RLVR training, only ~20% of tokens are high-entropy 'forking points' where the real reasoning decisions happen, and training on just those matches or exceeds full-gradient performance (see "Do high-entropy tokens drive reasoning model improvements?"). Decisions are concentrated at a few pivotal steps, which is exactly the territory adaptive per-step policies are built to exploit.
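To make the 'forking token' idea concrete, the sketch below computes Shannon entropy for each token's next-token distribution and marks the top 20% by entropy. It covers only the selection step, not RLVR training itself; `forking_mask` and the toy distributions are hypothetical, and the 0.2 default simply echoes the ~20% figure quoted above.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(token_dists: Sequence[Sequence[float]], top_fraction: float = 0.2) -> List[bool]:
    """Mark the highest-entropy positions (at least one when the input is non-empty)."""
    ents = [entropy(d) for d in token_dists]
    k = max(1, round(len(ents) * top_fraction))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(ents))]


# Toy distributions over a 4-token vocabulary (illustrative values only).
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # uniform: the model is genuinely undecided here
    [0.90, 0.05, 0.03, 0.02],
    [1.00, 0.00, 0.00, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]
mask = forking_mask(dists)
```

A mask like this could, under the stated assumptions, restrict a loss or an analysis to the pivotal positions while ignoring the low-entropy majority.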
One caveat the corpus adds for 'across different reasoning tasks': adaptivity has a budget cost. Unrestricted reasoning inside a single search turn can eat the context needed for later retrieval rounds, so long-horizon tasks benefit from per-turn reasoning limits, not just smarter per-step choices (see "Does limiting reasoning per turn improve multi-turn search quality?"). Put together: per-step adaptive policies do outperform uniform ones, but the best ones are calibrated to the model's own uncertainty, matched to the task's structure, and disciplined about their compute budget rather than maximally elaborate.
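One way to picture a per-turn reasoning limit is as simple context arithmetic: reserve room for the evidence expected in the remaining turns, then split what is left evenly. The function below is a hypothetical sketch under that assumption; the source does not say how budgets are set, and the token figures in the example are made up for illustration.

```python
def per_turn_reasoning_cap(
    context_limit: int,
    used_tokens: int,
    turns_remaining: int,
    evidence_per_turn: int,
) -> int:
    """Max reasoning tokens for this turn after reserving space for future evidence."""
    if turns_remaining < 1:
        raise ValueError("turns_remaining must be at least 1")
    free = context_limit - used_tokens
    reserved = evidence_per_turn * turns_remaining  # room for retrieved passages
    return max(0, (free - reserved) // turns_remaining)


# Illustrative numbers: 32k context, 2k already used, 5 turns left, 2k of evidence per turn.
cap = per_turn_reasoning_cap(32_000, 2_000, 5, 2_000)

# When evidence alone would overflow the context, the cap bottoms out at zero.
exhausted = per_turn_reasoning_cap(8_000, 6_000, 3, 2_000)
```

Recomputing the cap at the start of every turn keeps an early burst of reasoning from silently consuming the space later retrieval rounds depend on.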
Sources (7 notes)
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.