When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Synthesis note · 2026-02-22 · sourced from RAG

The default RAG paradigm retrieves once before generation and never again. This works for short-form factoid questions where the information need is fully expressed in the query. It fails for long-form generation where information needs emerge as the text develops — you cannot know in advance what you will need to support page three of a document.

FLARE (Forward-Looking Active Retrieval) introduces a principled trigger: retrieve only when the model generates low-probability tokens. The assumption is that large language models are reasonably well-calibrated — low confidence signals genuine knowledge gaps rather than stylistic uncertainty. When the model starts guessing, it should look something up. When it is confident, retrieval would add noise.

The mechanism: generate a tentative next sentence, check its token probabilities, retrieve if any token falls below a confidence threshold, then regenerate the sentence with the retrieved context. The retrieval query is the tentative sentence itself, so it is forward-looking rather than backward-looking. This "what I am about to say" framing captures future information needs better than "what I was asked."
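The loop described above can be sketched roughly as follows. This is a minimal illustration, not FLARE's reference implementation: `active_generate`, `generate_sentence`, `retrieve`, the `threshold` value of 0.4, and the toy stubs are all hypothetical names and values chosen for the example. A real system would put a language model and a search index behind those two callables.

```python
from typing import Callable, List, Tuple

# Assumed interface: a generator returns (sentence, per-token probabilities).
Sentence = Tuple[str, List[float]]


def active_generate(
    question: str,
    generate_sentence: Callable[[str, List[str], List[str]], Sentence],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.4,  # illustrative value, not taken from the source
    max_sentences: int = 8,
) -> Tuple[str, int]:
    """Return the generated text and the number of retrieval calls made."""
    written: List[str] = []
    retrieval_calls = 0
    for _ in range(max_sentences):
        # 1. Draft a tentative next sentence without retrieved context.
        sentence, probs = generate_sentence(question, written, [])
        if not sentence:
            break
        # 2. Trigger retrieval only when some token is low-probability.
        if probs and min(probs) < threshold:
            # 3. The tentative sentence itself is the query (forward-looking).
            docs = retrieve(sentence)
            retrieval_calls += 1
            # 4. Regenerate the same sentence with the retrieved context.
            sentence, _ = generate_sentence(question, written, docs)
        written.append(sentence)
    return " ".join(written), retrieval_calls


# Toy stand-ins so the sketch runs end to end (purely hypothetical data).
def toy_generator(question: str, written: List[str], docs: List[str]) -> Sentence:
    script = [
        ("Confident opening sentence.", [0.90, 0.95]),
        ("DRAFT: shaky claim.", [0.80, 0.20]),
    ]
    i = len(written)
    if i >= len(script):
        return "", []
    if docs and i == 1:
        return "GROUNDED: claim backed by retrieved text.", [0.90, 0.90]
    return script[i]


def toy_retrieve(query: str) -> List[str]:
    return ["retrieved passage for: " + query]


text, calls = active_generate("toy question", toy_generator, toy_retrieve)
print(calls, text)
```

In this toy run only the second sentence crosses the threshold, so only one retrieval call is made and the confident first sentence passes through untouched.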

The distinction between short-form and long-form generation matters architecturally. Short-form (factoid QA) has clear information needs explicit in the query — single retrieval is appropriate. Long-form (summaries, essays, reports) has evolving information needs that only become clear during generation — iterative retrieval is necessary. Treating both the same way is the failure mode of standard RAG.

The practical consequence: retrieval becomes a dynamic resource, not a fixed setup cost. Active retrieval systems naturally allocate more retrieval budget to uncertain passages and none to passages the model handles confidently. This aligns retrieval investment with actual knowledge gaps.

Step-level retrieval for reasoning chains (Search-o1): The active retrieval principle extends from long-form generation to step-wise reasoning. Search-o1 integrates an agentic search workflow into o1-like reasoning chains: when the model encounters knowledge uncertainty at any reasoning step, it generates a search query to retrieve external knowledge. Standard problem-level RAG does not address this, because it retrieves once at the start while knowledge needs vary step by step in complex reasoning. The frequency of uncertainty markers (e.g., "perhaps" averaging 30+ occurrences per reasoning chain) suggests that knowledge gaps are pervasive in extended reasoning, not isolated. A separate Reason-in-Documents module condenses retrieved content before it is injected into the chain, which addresses the noise problem: raw retrieved documents are verbose and can disrupt reasoning coherence.
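A rough sketch of step-level retrieval with a filtering stage is shown below. It makes two simplifying assumptions that should be read as substitutions rather than as Search-o1's method: the trigger is a keyword heuristic over uncertainty markers (in the description above the model itself issues the search query), and `condense` is a keyword-overlap stand-in for the Reason-in-Documents module. All names, the marker list, and the toy data are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Illustrative marker list (an assumption, not taken from the source).
UNCERTAINTY_MARKERS = ("perhaps", "maybe", "not sure")


def count_markers(text: str) -> int:
    """Count whole-word occurrences of uncertainty markers in a step."""
    lowered = text.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", lowered))
        for marker in UNCERTAINTY_MARKERS
    )


def condense(step: str, docs: List[str]) -> str:
    """Keep only document sentences that share a content word with the step."""
    terms = {w for w in re.findall(r"\w+", step.lower()) if len(w) > 3}
    kept: List[str] = []
    for doc in docs:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if terms & set(re.findall(r"\w+", sent.lower())):
                kept.append(sent.strip())
    return " ".join(kept)


def reason_with_search(
    steps: List[str], search: Callable[[str], List[str]]
) -> Tuple[List[str], int]:
    """Walk reasoning steps; search and inject filtered evidence when a step hedges."""
    chain: List[str] = []
    searches = 0
    for step in steps:
        chain.append(step)
        if count_markers(step) > 0:
            docs = search(step)
            searches += 1
            evidence = condense(step, docs)
            if evidence:
                # Inject only the condensed evidence, not the raw documents.
                chain.append(f"[evidence] {evidence}")
    return chain, searches


# Toy data so the sketch runs end to end (purely hypothetical).
toy_steps = [
    "The reaction needs a catalyst.",
    "Perhaps the catalyst is platinum.",
    "So the yield improves.",
]


def toy_search(query: str) -> List[str]:
    return [
        "Long preamble about the lab. "
        "The catalyst in this toy corpus is palladium. "
        "Unrelated trailing remark."
    ]


chain, searches = reason_with_search(toy_steps, toy_search)
print(searches)
for line in chain:
    print(line)
```

Only the hedged second step triggers a search, and only the one sentence that overlaps with that step is injected, which mirrors the point that raw retrieved documents are too verbose to insert wholesale.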

Inquiring lines that read this note 16

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do we evaluate AI systems when user perception misleads actual performance?

Why do one-shot transparency studies miss the temporal reversal entirely?

How should retrieval systems optimize for multi-step reasoning during inference?

Why does long-form generation need different retrieval than factoid questions?

Can model confidence signals reliably improve reasoning quality and calibration?

Why does model confidence correlate with robustness to prompt variations?

When should retrieval-augmented systems decide to fetch new information?

Why do reasoning models fail at systematic problem-solving and search?

When should a system decide to retrieve versus reason alone?

How do prompt structure and constraints affect model instruction reliability?

How do RAG and prompting techniques differ in supporting each granularity level?

Can model routing outperform monolithic scaling as an efficiency strategy?

How does routing decide between models before generation happens?

Related concepts in this collection 10

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
same adaptive allocation principle; here applied to retrieval rather than reasoning tokens
Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
search budget as an inference-compute axis; active retrieval is the trigger mechanism that determines where that budget goes
Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
related failure: models that cannot say "I don't know" also cannot identify when to retrieve
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
active retrieval is the constructive response to detected knowledge gaps; overthinking is the pathological alternative when the model lacks a retrieval escape and spirals with its own reasoning instead
When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
same uncertainty-triggered adaptive compute principle at different granularity: FLARE triggers retrieval on low-confidence tokens, SAND triggers deliberation on inconsistent action samples
Do iterative refinement methods suffer from overthinking? Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
iterative refinement fails partly because it re-reasons on the same information; uncertainty-triggered retrieval provides the missing ingredient by injecting new evidence when revision stalls
Can interleaving reasoning with real-world feedback prevent hallucination? Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.
FLARE refines ReAct's foundational interleaving principle: ReAct retrieves at every reasoning step unconditionally, while uncertainty-gated retrieval makes the trigger conditional on genuine knowledge gaps rather than mandatory at each step
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
both identify the cost of continuing computation past its useful threshold: FLARE gates retrieval on detected knowledge gaps rather than at fixed intervals; the overthinking note shows thinking tokens beyond the sweet spot harm accuracy; uncertainty-gating is the retrieval-level analog of the optimal thinking-token limit
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
Thinkless and FLARE solve the same when-to-invest-compute problem at different levels: Thinkless decides response-level (think vs. short), FLARE decides retrieval-level (retrieve vs. not); both use model uncertainty as the trigger signal
Can simple uncertainty estimates beat complex adaptive retrieval? Does measuring a language model's own confidence on token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.
validates and extends the FLARE principle: calibrated token-probability uncertainty estimation is sufficient for retrieval trigger decisions and outperforms more complex adaptive pipelines
