INQUIRING LINE

How can per-step decisions about knowledge retrieval improve reasoning over uniform policies?

This explores whether letting a model decide retrieval moment-by-moment during reasoning — pulling external knowledge only at the steps that need it — beats applying one fixed retrieval rule everywhere.


The corpus's clearest answer comes from framing retrieval as a sequence of choices rather than a switch you flip once. DeepRAG frames step-by-step reasoning as a Markov Decision Process, learning at every step whether to consult external sources or trust what the model already knows; the payoff is a ~22% accuracy gain that comes as much from *not* retrieving when retrieval would only inject noise as from retrieving when it helps (see "When should language models retrieve external knowledge versus use internal knowledge?"). The lesson is counterintuitive: uniform 'always retrieve' policies don't just waste effort, they actively degrade reasoning by drowning good internal knowledge in irrelevant fetched text.
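The retrieve-or-recall choice described above can be sketched as a loop that decides separately at every step. This is a minimal illustration, not DeepRAG's method: DeepRAG learns its decision policy, whereas here a fixed confidence threshold stands in for it. The names `run_steps`, `confidence`, `retrieve`, and the `0.7` cutoff are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, Optional[str]]  # (subquery, action, evidence)


def run_steps(
    subqueries: List[str],
    confidence: Callable[[str], float],  # hypothetical self-confidence score in [0, 1]
    retrieve: Callable[[str], str],      # hypothetical retriever
    threshold: float = 0.7,              # illustrative cutoff, not a DeepRAG value
) -> List[Step]:
    """Decide retrieve-vs-recall separately for every reasoning step."""
    trace: List[Step] = []
    for sq in subqueries:
        if confidence(sq) < threshold:
            # Low confidence: consult the external source for this step only.
            trace.append((sq, "retrieve", retrieve(sq)))
        else:
            # High confidence: rely on internal knowledge and fetch nothing.
            trace.append((sq, "recall", None))
    return trace


# Toy stand-ins for a model's confidence estimate and a retriever.
known = {"capital of France": 0.95, "2023 population of Lyon": 0.30}
trace = run_steps(
    list(known),
    confidence=lambda q: known[q],
    retrieve=lambda q: f"<doc for: {q}>",
)
actions = [action for _, action, _ in trace]
```

In this toy run the first subquery is answered from internal knowledge and only the second triggers a fetch, which is the behaviour a uniform 'always retrieve' policy cannot express.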

The same 'choose the structure per query' instinct shows up one level higher. StructRAG routes each query to a task-appropriate knowledge format (a table, a graph, an algorithm, a catalogue, or a plain chunk) depending on what the question actually demands, and it beats uniform retrieval by grounding the choice in cognitive-fit theory: different reasoning tasks fit different representations, so forcing one shape on all of them is a mismatch (see "Can routing queries to task-matched structures improve RAG reasoning?"). Per-step and per-query selectivity are the same idea applied at different granularities: match the retrieval action to the local demand instead of standardizing it.
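Per-query routing can be illustrated with a toy dispatcher. StructRAG uses a trained router; the keyword heuristic below is only a stand-in to show the shape of the decision, and the cue words and the `route` function are hypothetical.

```python
# The five structure types named in the StructRAG note.
STRUCTURES = ("table", "graph", "algorithm", "catalogue", "chunk")

# Hypothetical cue phrases; a trained router would replace this lookup.
CUES = {
    "table": ("compare", "versus", "statistics"),
    "graph": ("related to", "connected", "path between"),
    "algorithm": ("steps", "procedure", "how to"),
    "catalogue": ("summarize", "overview", "list all"),
}


def route(query: str) -> str:
    """Pick a knowledge structure for one query; fall back to plain chunks."""
    q = query.lower()
    for structure, cues in CUES.items():
        if any(cue in q for cue in cues):
            return structure
    return "chunk"


choice_a = route("Compare revenue of A versus B")
choice_b = route("Who founded the company?")
```

The point of the sketch is the fallback: a simple lookup question stays on plain chunks, while a comparison question is sent to a structure that suits it.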

There's a deeper reason selectivity wins, visible if you look at *which* steps matter. Work on RLVR (reinforcement learning with verifiable rewards) finds that only about 20% of tokens are high-entropy 'forking points' where the model genuinely decides where reasoning goes, and that training on just those matches or exceeds full-gradient performance (see "Do high-entropy tokens drive reasoning model improvements?"). Retrieval decisions plausibly cluster at exactly these junctions: most steps are low-stakes continuations where external lookup adds nothing, while a few pivotal steps are where fresh knowledge changes the trajectory. A uniform policy spends equally on both; a per-step policy concentrates effort where the fork actually is. Graph-O1 makes this concrete in the retrieval setting itself: instead of ingesting a whole knowledge graph, it learns a step-by-step traversal policy with MCTS and RL, deciding which edge to follow next rather than reading everything (see "Can learned traversal policies beat exhaustive graph reading?").
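To make 'high-entropy forking points' concrete, the sketch below computes Shannon entropy for each position's next-token distribution and keeps the top fraction. It assumes you already have per-position probability vectors; the function names and the toy distributions are illustrative, and the 0.2 default simply mirrors the roughly 20% figure quoted above.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(dists: List[Sequence[float]], fraction: float = 0.2) -> List[int]:
    """Return indices of the highest-entropy positions (top `fraction`)."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions: position 1 is uniform (a genuine fork), the rest are peaked.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
forks = forking_positions(dists)
```

A per-step retrieval policy built on this idea would consider a lookup only at the positions in `forks`, which is a hypothesis the paragraph above raises rather than a result it reports.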

Selectivity also has to be budgeted, not just toggled. Agentic deep research shows search behaves like a test-time scaling axis with diminishing returns, so the question isn't only *whether* to retrieve at a step but *how much* budget to spend across steps (see "Does search budget scale like reasoning tokens for answer quality?"). And long-horizon research suffers when any single step over-spends: capping per-turn reasoning preserves the context window for later retrieval rounds, which is a per-step discipline rather than a global time limit (see "Does limiting reasoning per turn improve multi-turn search quality?"). There's a subtle trap worth naming, though: chain-of-thought reasoning degrades predictably off-distribution, producing fluent but invalid logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"), so a per-step policy is only as trustworthy as the step-level judgments driving it. This is where step-wise evaluation matters: generative judges that reason *about* each reasoning step outperform classifier-style scorers (see "Can judges that reason about reasoning outperform classifier rewards?"), and good per-step retrieval needs good per-step evaluation to know which steps were actually pivotal.
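The context-erosion argument above is simple arithmetic, sketched here with made-up numbers. The function `rounds_that_fit`, the token counts, and the 8000-token window are all illustrative assumptions, not figures from the cited work.

```python
from typing import List, Optional


def rounds_that_fit(
    context_limit: int,
    reasoning_per_turn: List[int],   # tokens the agent would like to spend per turn
    evidence_per_turn: int,          # tokens of retrieved text added each turn
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count how many search rounds fit in the context window."""
    used = 0
    rounds = 0
    for wanted in reasoning_per_turn:
        spend = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        if used + spend + evidence_per_turn > context_limit:
            break  # no room left to reason over new evidence
        used += spend + evidence_per_turn
        rounds += 1
    return rounds


wanted = [3000, 3000, 3000, 3000]
uncapped = rounds_that_fit(8000, wanted, evidence_per_turn=1000)
capped = rounds_that_fit(8000, wanted, evidence_per_turn=1000, per_turn_cap=1000)
```

With these toy numbers the uncapped agent fits two rounds and the capped agent fits four. The cap buys later retrieval rounds at the cost of shorter reasoning in each one, which is the trade-off the paragraph describes.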

The thread across all of this: the corpus keeps finding that one fixed policy applied uniformly is the wrong default, and that the real gains live in learning *where the decision points are* and acting differently at each one — whether the decision is retrieve-vs-recall, which structure to fetch, which graph edge to walk, or how much budget to burn before moving on.


Sources (8 notes)

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can learned traversal policies beat exhaustive graph reading?

Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether per-step retrieval decisions outperform uniform retrieval policies in LLM reasoning — a question a curated library explored across 2021–2025.

What a curated library found — and when (dated claims, not current truth):
• DeepRAG (2025-02) reports ~22% accuracy gain by treating each reasoning step as a choice: retrieve or rely on internal knowledge, with the payoff coming as much from *not* retrieving (avoiding noise) as from retrieving.
• StructRAG (2024-10) routes queries to task-appropriate knowledge formats (table, graph, algorithm, text) based on cognitive fit; uniform retrieval underperforms because different reasoning tasks demand different representations.
• High-entropy minority tokens (2025-06) — roughly 20% of tokens — are the critical forking points where reasoning direction is decided; per-step policies concentrate retrieval effort at these junctions, not on low-stakes continuations.
• Graph-based selective traversal (2025-06, implied) replaces whole-graph ingestion with step-by-step MCTS+RL policies that decide which edge to follow, not which subgraph to read.
• Agentic deep research (2025-06) exhibits test-time scaling with diminishing returns; the question becomes not whether to retrieve but *how much budget* to spend per step, and long-horizon tasks require per-turn reasoning caps to preserve context for later retrieval rounds.

Anchor papers (verify; mind their dates):
• DeepRAG (arXiv:2502.01142, 2025-02) — foundational MDP framing of per-step retrieval
• StructRAG (arXiv:2410.08815, 2024-10) — cognitive-fit routing across knowledge formats
• High-Entropy Minority Tokens (arXiv:2506.01939, 2025-06) — identifies forking-point concentration
• Agentic Deep Research (arXiv:2506.18959, 2025-06) — test-time budget scaling in retrieval

Your task:
(1) RE-TEST EACH CONSTRAINT: For DeepRAG's 22% gain, has subsequent work (last 6 months) widened or narrowed the gap with uniform policies under newer model scales, fine-tuning regimes, or retrieval indices? Does StructRAG's cognitive-fit routing hold across multimodal or long-context models? Investigate whether the 20%-token high-entropy claim holds under different RL objectives or prompt styles — is this a robust property or training-regime artifact? Separate the durable insight (per-step selectivity should beat uniformity in principle) from perishable constraints (specific gains, indices, model sizes).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: look for papers arguing uniform retrieval is cheaper/sufficient under certain scaling laws, or showing per-step overhead outweighs selectivity gains, or proposing that end-to-end fine-tuning (without explicit per-step policy) learns selectivity implicitly.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Do multimodal or vision-language models exhibit the same high-entropy forking-point concentration, and does per-step retrieval of images/embeddings beat uniform multimodal augmentation? (b) In very long-context windows (100K+ tokens), does per-step retrieval remain more efficient than batching retrieval at fixed checkpoints, or does amortization flip the economics?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
