INQUIRING LINE

Should retrieval be triggered always or only for difficult questions?

This explores whether a RAG system should fetch external documents on every query, or only when the model actually needs help — and how the corpus says to make that call.


The corpus lands firmly on the second view, with a twist: the best signal for "hard" comes from the model's own uncertainty, not from a fixed schedule. The clearest statement is FLARE's: retrieval should trigger when the model's next-token confidence drops, not at fixed intervals or once at the start [When should retrieval happen during model generation?]. Low confidence is the model telling you it has hit a genuine knowledge gap; retrieving everywhere else just injects noise. DeepRAG reaches the same conclusion from a different angle. It frames each reasoning step as a decision (retrieve, or rely on what I already know) and reports a roughly 22% improvement, largely from *not* retrieving when parametric knowledge suffices [When should language models retrieve external knowledge versus use internal knowledge?]. So "always" isn't just wasteful; it can actively hurt by drowning good answers in irrelevant context.
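A minimal sketch of the confidence-gated trigger described above, assuming the generator exposes the probability it assigned to each token of a tentative next sentence. The function name `should_retrieve` and the default threshold of 0.4 are illustrative assumptions, not values taken from FLARE.

```python
from typing import Sequence


def should_retrieve(token_probs: Sequence[float], threshold: float = 0.4) -> bool:
    """Return True when any token in the tentative sentence is low-confidence.

    token_probs: probability the model assigned to each generated token.
    threshold: hypothetical cutoff; it would need tuning on held-out data.
    """
    if not token_probs:
        return False  # nothing generated yet, so nothing to gate on
    return min(token_probs) < threshold


confident = [0.93, 0.88, 0.97, 0.91]
uncertain = [0.93, 0.21, 0.97, 0.91]
print(should_retrieve(confident))  # False
print(should_retrieve(uncertain))  # True
```

Gating on the minimum rather than the mean means a single shaky token (often the entity or number the model does not know) is enough to fire retrieval.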

The surprising practical finding is how cheaply you can detect "difficult." You might expect deciding when to retrieve to require an elaborate adaptive controller, but calibrated token-probability uncertainty beats those multi-call adaptive schemes on single-hop questions and matches them on multi-hop, using a fraction of the LM and retriever calls [Can simple uncertainty estimates beat complex adaptive retrieval?]. The model's self-knowledge turns out to be a more reliable trigger than external heuristics. If you take only one thing away: a simple confidence threshold is often the whole answer to "when."
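One way to choose such a threshold is to fit it on held-out questions. The sketch below is an assumption about how that could be done, not the procedure from the cited work: each sample pairs an uncertainty score with whether the model answered correctly without retrieval, and `pick_threshold` (a hypothetical helper) returns the cutoff that best separates the two groups.

```python
from typing import Sequence, Tuple


def pick_threshold(samples: Sequence[Tuple[float, bool]]) -> float:
    """Pick the uncertainty cutoff that best predicts no-retrieval failures.

    samples: (uncertainty, answered_correctly_without_retrieval) pairs.
    A sample counts as a hit when "uncertainty > cutoff" agrees with
    "the model was wrong without retrieval".
    """
    if not samples:
        raise ValueError("need at least one held-out sample")
    candidates = sorted({u for u, _ in samples})
    best_cutoff, best_hits = candidates[0], -1
    for cutoff in candidates:
        hits = sum((u > cutoff) == (not ok) for u, ok in samples)
        if hits > best_hits:
            best_cutoff, best_hits = cutoff, hits
    return best_cutoff


held_out = [(0.1, True), (0.2, True), (0.7, False), (0.9, False)]
print(pick_threshold(held_out))  # 0.2
```

At query time, retrieval would fire only when the uncertainty score exceeds the fitted cutoff.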

But "difficulty" isn't a single dial; it also depends on *what kind* of question is being asked. One line of work shows question type determines retrieval strategy: simple evidence-lookup questions suit plain RAG, while comparison or debate questions need aspect-specific retrieval, and reason/experience questions need decomposition before retrieval even makes sense [Does question type determine the right retrieval strategy?]. So the real decision isn't binary (retrieve / don't) but routed: easy factoids may need nothing, and the genuinely hard ones may need *more* retrieval, structured differently. Sometimes that means iterating, with a persistent memory workspace that revisits evidence across cycles to resolve contradictions [Can reasoning systems maintain memory across retrieval cycles?].
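The routing idea can be sketched as a lookup from question type to retrieval strategy. The type names follow the ones mentioned in the text; the strategy labels and the `route` helper are hypothetical, and the step that classifies a question into a type is assumed to exist elsewhere.

```python
from enum import Enum


class QuestionType(Enum):
    EVIDENCE = "evidence"
    COMPARISON = "comparison"
    DEBATE = "debate"
    REASON = "reason"
    EXPERIENCE = "experience"


# Hypothetical strategy labels; a real system would map these to pipelines.
STRATEGY = {
    QuestionType.EVIDENCE: "plain_rag",
    QuestionType.COMPARISON: "aspect_specific",
    QuestionType.DEBATE: "aspect_specific",
    QuestionType.REASON: "decompose_then_retrieve",
    QuestionType.EXPERIENCE: "decompose_then_retrieve",
}


def route(question_type: QuestionType) -> str:
    """Return the retrieval strategy label for a classified question."""
    return STRATEGY[question_type]


print(route(QuestionType.DEBATE))  # aspect_specific
```

Keeping the mapping in a table rather than in branching code makes it easy to add a "no retrieval" route for easy factoids once a confidence gate is in place.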

There's also a quieter argument for letting the model drive the trigger rather than a passive retriever. MCP-Zero shows that models which proactively emit structured requests for what they need outperform single-shot semantic matching, because the model refines its ask as reasoning unfolds [Can models decide better than retrievers which tools to use?]. That's the same instinct as uncertainty-gating (trust the model to know when it's stuck), extended from "whether" to "what." And once you accept selective retrieval, you can train the retrieval steps themselves: process-level supervision that rewards good intermediate retrieval decisions substantially outperforms grading only the final answer [Does supervising retrieval steps outperform final answer rewards?].

It is worth flagging the opposite edge case the corpus raises: when sources are noisy or unreliable, the safer move can be to retrieve aggressively but constrain *generation*, refusing to answer unless the evidence is solid [Can RAG systems refuse to answer without reliable evidence?]. So the full picture isn't "retrieve less" but "retrieve deliberately": fire on uncertainty, route by question type, let the model ask for what it needs, and decide separately how much to trust what comes back.
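A rough sketch of "retrieve aggressively, answer conservatively". The source describes a prompt-level grounded-refusal constraint; this example swaps in a simpler score gate placed in front of the generator, which is an assumption. The helper `grounded_evidence`, the relevance scores, and both default cutoffs are hypothetical.

```python
from typing import List, Optional, Tuple


def grounded_evidence(
    passages: List[Tuple[str, float]],
    min_score: float = 0.6,
    min_passages: int = 2,
) -> Optional[List[str]]:
    """Return passages that clear the score bar, or None to signal refusal.

    passages: (text, relevance_score) pairs from a wide retrieval pass.
    """
    strong = [text for text, score in passages if score >= min_score]
    return strong if len(strong) >= min_passages else None


retrieved = [("passage A", 0.82), ("passage B", 0.41), ("passage C", 0.77)]
evidence = grounded_evidence(retrieved)
print("refuse" if evidence is None else evidence)  # ['passage A', 'passage C']
```

The retrieval pass can stay broad because the gate, not the retriever, decides whether the generator is allowed to answer.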


Sources (8 notes)

When should retrieval happen during model generation?

Active retrieval triggered by low token probability improves both accuracy and efficiency compared to one-shot or continuous retrieval. FLARE demonstrates that models signal genuine knowledge gaps through low confidence, enabling dynamic budget allocation to actual information needs.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Can reasoning systems maintain memory across retrieval cycles?

ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether retrieval-augmented generation (RAG) should fire on every query or selectively. The question remains open: what is the optimal retrieval trigger strategy as model and retrieval capabilities evolve?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:
• Model confidence/uncertainty is a better retrieval trigger than fixed schedules or single-shot retrieval; FLARE (2023) and Adaptive Retrieval Without Self-Knowledge (2025-01) show token-probability thresholding beats adaptive heuristics at lower compute.
• Selective retrieval beats always-on: DeepRAG (2025-02) reports 22% accuracy gains by routing per-step decisions — retrieve only when parametric knowledge gaps emerge.
• Question type matters: Typed-RAG (2025-03) shows factoid vs. comparison vs. reasoning questions demand different retrieval strategies; one-size-fits-all fails.
• Models that proactively request tools/context outperform passive retrieval: MCP-Zero (2025-06) shows structured model-driven asks beat semantic matching.
• Process-level supervision on intermediate retrieval decisions substantially outperforms outcome-only grading (inferred from path ~2025–2026 direction).
• Edge case: noisy sources may require aggressive retrieval + constrained generation (refuse unless evidence is solid) rather than less retrieval.

Anchor papers (verify; mind their dates):
• arXiv:2305.06983 (2023-05) — Active Retrieval Augmented Generation (FLARE)
• arXiv:2502.01142 (2025-02) — DeepRAG: Thinking to Retrieval Step by Step
• arXiv:2503.15879 (2025-03) — Typed-RAG: Type-aware Multi-Aspect Decomposition
• arXiv:2506.01056 (2025-06) — MCP-Zero: Proactive Toolchain Construction

Your task:
(1) RE-TEST EACH CONSTRAINT. For selective-retrieval-on-uncertainty, probe whether recent models (o1, Claude 3.5+, Llama 3.2+) still exhibit measurable confidence drop at knowledge gaps, or whether scale/training has flattened that signal. For process-level supervision, check whether newer RL/synthetic-data methods have superseded gradient-based retrieval-step training. For question-type routing, verify whether large in-context examples or instruction-tuning have dissolved type-specific retrieval needs.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers arguing always-on retrieval, dense-retrieval-only, or end-to-end retrieval learning that bypass explicit trigger logic.
(3) Propose 2 research questions that assume the regime has moved: (a) If model confidence is no longer a clean signal (e.g., due to training-induced calibration issues), what alternative self-monitoring signal should trigger retrieval? (b) Can multi-modal or cross-lingual context change the cost–benefit of selective vs. always-on retrieval?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
