When should retrieval happen during model generation?
Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
The default RAG paradigm retrieves once before generation and never again. This works for short-form factoid questions where the information need is fully expressed in the query. It fails for long-form generation where information needs emerge as the text develops — you cannot know in advance what you will need to support page three of a document.
FLARE (Forward-Looking Active Retrieval) introduces a principled trigger: retrieve only when the model generates low-probability tokens. The assumption is that large language models are reasonably well-calibrated — low confidence signals genuine knowledge gaps rather than stylistic uncertainty. When the model starts guessing, it should look something up. When it is confident, retrieval would add noise.
The mechanism: generate a tentative next sentence, check its token probabilities, and if any token falls below a confidence threshold, retrieve and regenerate the sentence with the retrieved context. The retrieval query is the tentative sentence itself, so it is forward-looking rather than backward-looking. This "what I am about to say" framing captures future information needs better than "what I was asked."
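The loop described above can be sketched in a few lines. This is a minimal illustration, not the FLARE authors' implementation: `generate` and `retrieve` are hypothetical callables supplied by the caller, and the threshold value of 0.4 and the "lowest token probability" test are assumptions chosen for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not a real library API):
#   generate(prompt, passages) -> (next sentence, per-token probabilities);
#                                 an empty sentence means generation is finished
#   retrieve(query)            -> list of retrieved passages
Generate = Callable[[str, List[str]], Tuple[str, List[float]]]
Retrieve = Callable[[str], List[str]]


def active_generate(
    question: str,
    generate: Generate,
    retrieve: Retrieve,
    threshold: float = 0.4,   # assumed value, tune per model
    max_sentences: int = 20,
) -> Tuple[str, int]:
    """Generate sentence by sentence, retrieving only for low-confidence drafts.

    Returns the final text and the number of retrieval calls made.
    """
    sentences: List[str] = []
    retrievals = 0
    for _ in range(max_sentences):
        prompt = " ".join([question] + sentences)
        # Step 1: draft a tentative next sentence without retrieved context.
        draft, probs = generate(prompt, [])
        if not draft:
            break
        # Step 2: trigger retrieval if any token in the draft is low-probability.
        if probs and min(probs) < threshold:
            # Step 3: the draft itself is the (forward-looking) query.
            passages = retrieve(draft)
            retrievals += 1
            # Step 4: regenerate the sentence with the retrieved passages.
            draft, _ = generate(prompt, passages)
            if not draft:
                break
        sentences.append(draft)
    return " ".join(sentences), retrievals
```

Confident sentences pass through with a single generation call; only uncertain ones pay for a retrieval and a second generation, which is where the cost of this design concentrates.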
The distinction between short-form and long-form generation matters architecturally. Short-form (factoid QA) has clear information needs explicit in the query — single retrieval is appropriate. Long-form (summaries, essays, reports) has evolving information needs that only become clear during generation — iterative retrieval is necessary. Treating both the same way is the failure mode of standard RAG.
The practical consequence: retrieval becomes a dynamic resource, not a fixed setup cost. Active retrieval systems naturally allocate more retrieval budget to uncertain passages and none to passages the model handles confidently. This aligns retrieval investment with actual knowledge gaps.
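To make the budget argument concrete, the sketch below compares where a fixed-interval policy and an uncertainty-triggered policy would retrieve on the same draft. The per-sentence confidence values are made up for illustration, and both function names are hypothetical.

```python
from typing import List


def fixed_interval_triggers(n_sentences: int, every: int) -> List[int]:
    """Sentence indices where a fixed-interval policy retrieves."""
    return [i for i in range(n_sentences) if i % every == 0]


def uncertainty_triggers(min_probs: List[float], threshold: float) -> List[int]:
    """Sentence indices whose lowest token probability is below the threshold."""
    return [i for i, p in enumerate(min_probs) if p < threshold]


# Made-up lowest token probability for each of six drafted sentences.
min_probs = [0.92, 0.88, 0.31, 0.27, 0.90, 0.85]

print(fixed_interval_triggers(len(min_probs), every=2))  # [0, 2, 4]
print(uncertainty_triggers(min_probs, threshold=0.4))    # [2, 3]
```

In this toy case the fixed-interval policy spends two of its three retrievals on sentences the model was already confident about and skips one uncertain sentence, while the uncertainty trigger spends both of its retrievals on the uncertain ones.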
Step-level retrieval for reasoning chains (Search-o1): The active retrieval principle extends from long-form generation to step-wise reasoning. Search-o1 integrates an agentic search workflow into o1-like reasoning chains: when the model encounters knowledge uncertainty at any reasoning step, it generates a search query to retrieve external knowledge. Standard problem-level RAG does not address this, because it retrieves once at the start while knowledge needs vary step by step in complex reasoning. The frequency of uncertainty markers (e.g., "perhaps" averaging 30+ occurrences per reasoning chain) signals that knowledge gaps are pervasive in extended reasoning, not isolated. A separate Reason-in-Documents module filters retrieved content before injection, addressing the noise problem: raw retrieved documents are verbose and can disrupt reasoning coherence.
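The step-level pattern can be sketched as a loop over reasoning steps with an optional search and a filtering pass after each step. This is an assumption-laden illustration, not the Search-o1 code: `next_step`, `make_query`, `search`, and `refine` are hypothetical callables, with `refine` standing in for the document-filtering role described above.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; none of these names come from the Search-o1 code base.
NextStep = Callable[[str, List[str], List[str]], Optional[str]]  # None = finished
MakeQuery = Callable[[str], Optional[str]]    # None = no knowledge gap in this step
Search = Callable[[str], List[str]]           # returns raw, possibly verbose documents
Refine = Callable[[str, List[str], List[str]], str]  # distills documents to evidence


def reason_with_search(
    question: str,
    next_step: NextStep,
    make_query: MakeQuery,
    search: Search,
    refine: Refine,
    max_steps: int = 10,
) -> Tuple[List[str], List[str]]:
    """Run a reasoning chain, searching only at steps that expose a knowledge gap."""
    steps: List[str] = []
    evidence: List[str] = []
    for _ in range(max_steps):
        step = next_step(question, steps, evidence)
        if step is None:
            break
        steps.append(step)
        query = make_query(step)
        if query is not None:
            docs = search(query)
            # Inject only the distilled evidence, never the raw documents,
            # so verbose retrieved text does not disrupt the reasoning chain.
            evidence.append(refine(query, docs, steps))
    return steps, evidence
```

Keeping `refine` separate from `search` means later steps only ever see condensed evidence, which is the point of filtering before injection.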
Inquiring lines that use this note as a source 16
This note is a source for these synthesized inquiries.
- Why do one-shot transparency studies miss the temporal reversal entirely?
- Why does long-form generation need different retrieval than factoid questions?
- Why does model confidence correlate with robustness to prompt variations?
- Should retrieval be triggered always or only for difficult questions?
- When should a system decide to retrieve versus reason alone?
- How do RAG and prompting techniques differ in supporting each granularity level?
- How does routing decide between models before generation happens?
- Should retrieval be triggered by model uncertainty or fixed intervals?
- Does uncertainty trigger retrieval better than fixed-interval tool calls?
- How does response content compare to model confidence as a retrieval trigger?
- How should retrieval systems decide when to fetch new information?
- What threshold combinations for uncertainty and rarity signals maximize RAG performance?
- How much does retrieval budget improve when triggered by dual signals instead of fixed intervals?
- Can adaptive retrieval triggered by model uncertainty improve RAG reliability?
- How should retrieval triggers use model uncertainty instead of fixed intervals?
- What concrete failures happen when RAG ignores temporal relevance?
Related concepts in this collection 10
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Relation: same adaptive allocation principle; here applied to retrieval rather than reasoning tokens
- Does search budget scale like reasoning tokens for answer quality?
  Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
  Relation: search budget as an inference-compute axis; active retrieval is the trigger mechanism that determines where that budget goes
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relation: related failure: models that cannot say "I don't know" also cannot identify when to retrieve
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  Relation: active retrieval is the constructive response to detected knowledge gaps; overthinking is the pathological alternative when the model lacks a retrieval escape and spirals with its own reasoning instead
- When should an agent actually stop and deliberate?
  How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
  Relation: same uncertainty-triggered adaptive compute principle at different granularity: FLARE triggers retrieval on low-confidence tokens, SAND triggers deliberation on inconsistent action samples
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  Relation: iterative refinement fails partly because it re-reasons on the same information; uncertainty-triggered retrieval provides the missing ingredient by injecting new evidence when revision stalls
- Can interleaving reasoning with real-world feedback prevent hallucination?
  Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.
  Relation: FLARE refines ReAct's foundational interleaving principle: ReAct retrieves at every reasoning step unconditionally, while uncertainty-gated retrieval makes the trigger conditional on genuine knowledge gaps rather than mandatory at each step
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  Relation: both identify the cost of continuing computation past its useful threshold: FLARE gates retrieval on detected knowledge gaps rather than at fixed intervals; the overthinking note shows thinking tokens beyond the sweet spot harm accuracy; uncertainty-gating is the retrieval-level analog of the optimal thinking-token limit
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  Relation: Thinkless and FLARE solve the same when-to-invest-compute problem at different levels: Thinkless decides response-level (think vs. short), FLARE decides retrieval-level (retrieve vs. not); both use model uncertainty as the trigger signal
- Can simple uncertainty estimates beat complex adaptive retrieval?
  Does measuring a language model's own confidence on token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.
  Relation: validates and extends the FLARE principle: calibrated token-probability uncertainty estimation is sufficient for retrieval trigger decisions and outperforms more complex adaptive pipelines
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Active Retrieval Augmented Generation
- Deep Research: A Systematic Survey
- Chain-of-Retrieval Augmented Generation
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
- UR2: Unify RAG and Reasoning through Reinforcement Learning
- LLM-Independent Adaptive RAG: Let the Question Speak for Itself
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Original note title
active retrieval should trigger on model uncertainty not at fixed intervals