SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Synthesis note · 2026-02-22 · sourced from RAG

The default RAG paradigm retrieves once before generation and never again. This works for short-form factoid questions where the information need is fully expressed in the query. It fails for long-form generation where information needs emerge as the text develops — you cannot know in advance what you will need to support page three of a document.

FLARE (Forward-Looking Active REtrieval augmented generation) introduces a principled trigger: retrieve only when the model generates low-probability tokens. The assumption is that large language models are reasonably well calibrated, so low confidence signals a genuine knowledge gap rather than stylistic uncertainty. When the model starts guessing, it should look something up. When it is confident, retrieval would only add noise.
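The trigger described above can be sketched as a small check over per-token log-probabilities. This is a minimal illustration, not the FLARE reference code: the function name `should_retrieve`, the default threshold of 0.4, and the choice to fire on the single least confident token are all assumptions made for the example.

```python
import math
from typing import Sequence


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.4) -> bool:
    """Return True if any token's probability falls below `threshold`.

    `token_logprobs` holds natural-log probabilities, one per generated token.
    The threshold value is a hypothetical default, not a recommended setting.
    """
    return any(math.exp(lp) < threshold for lp in token_logprobs)


# Hypothetical log-probabilities for two drafted sentences.
confident_draft = [math.log(0.90), math.log(0.80), math.log(0.95)]
guessing_draft = [math.log(0.90), math.log(0.20), math.log(0.85)]

trigger_on_confident = should_retrieve(confident_draft)
trigger_on_guessing = should_retrieve(guessing_draft)
```

Raising the threshold makes retrieval fire more often; setting it to zero disables retrieval entirely, which recovers plain generation.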

The mechanism: generate a tentative next sentence, check its token probabilities, retrieve if any token's probability falls below a threshold, then regenerate the sentence with the retrieved context. The retrieval query is the tentative sentence itself, which makes it forward-looking rather than backward-looking. This "what I am about to say" framing captures future information needs better than "what I was asked."
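The draft, check, retrieve, regenerate loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: `active_generate`, the `generate` and `retrieve` callables, the threshold of 0.4, and the toy stubs are all hypothetical stand-ins for a real model and retriever.

```python
from typing import Callable, List, Tuple

# A draft is (sentence, per-token probabilities); an empty sentence ends generation.
Draft = Tuple[str, List[float]]
Generator = Callable[[str, str, List[str]], Draft]  # (question, text_so_far, docs)
Retriever = Callable[[str], List[str]]              # query -> documents


def active_generate(question: str, generate: Generator, retrieve: Retriever,
                    threshold: float = 0.4, max_sentences: int = 20) -> Tuple[str, int]:
    """Generate sentence by sentence, retrieving only for low-confidence drafts."""
    text = ""
    retrieval_count = 0
    for _ in range(max_sentences):
        sentence, probs = generate(question, text, [])   # tentative next sentence
        if not sentence:
            break
        if probs and min(probs) < threshold:             # the model is guessing
            docs = retrieve(sentence)                    # the draft itself is the query
            retrieval_count += 1
            sentence, _ = generate(question, text, docs)  # regenerate with context
        text = (text + " " + sentence).strip()
    return text, retrieval_count


# Toy stand-ins so the sketch runs without a model or a search index.
queries_seen: List[str] = []


def toy_generate(question: str, text_so_far: str, docs: List[str]) -> Draft:
    index = text_so_far.count(".")  # one period per sentence in this toy script
    if index == 0:
        return "The first claim is common knowledge.", [0.95, 0.90]
    if index == 1 and docs:
        return f"The second claim is supported by {docs[0]}.", [0.90, 0.90]
    if index == 1:
        return "The second claim is a guess.", [0.90, 0.15]
    return "", []


def toy_retrieve(query: str) -> List[str]:
    queries_seen.append(query)
    return ["source A"]


final_text, retrievals = active_generate("toy question", toy_generate, toy_retrieve)
```

In this toy run only the second sentence triggers a lookup, so the retrieval count tracks how many drafted sentences fell below the threshold rather than the length of the output.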

The distinction between short-form and long-form generation matters architecturally. Short-form (factoid QA) has clear information needs explicit in the query — single retrieval is appropriate. Long-form (summaries, essays, reports) has evolving information needs that only become clear during generation — iterative retrieval is necessary. Treating both the same way is the failure mode of standard RAG.

The practical consequence: retrieval becomes a dynamic resource, not a fixed setup cost. Active retrieval systems naturally allocate more retrieval budget to uncertain passages and none to passages the model handles confidently. This aligns retrieval investment with actual knowledge gaps.

Step-level retrieval for reasoning chains (Search-o1): The active retrieval principle extends from long-form generation to step-wise reasoning. Search-o1 integrates an agentic search workflow into o1-like reasoning chains: when the model encounters knowledge uncertainty at any reasoning step, it generates a search query to retrieve external knowledge. Standard problem-level RAG does not address this, because it retrieves once at the start while knowledge needs vary from step to step in complex reasoning. The frequency of uncertainty markers (e.g., "perhaps" averaging 30+ occurrences per reasoning chain) indicates that knowledge gaps are pervasive in extended reasoning, not isolated. A separate Reason-in-Documents module filters retrieved content before injection, which addresses the noise problem: raw retrieved documents are verbose and can disrupt reasoning coherence.
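A rough sketch of step-level retrieval with a filtering stage is shown below. It is an illustration of the idea rather than Search-o1's actual code: the `<search>` and `<result>` markers, the function `reason_with_search`, and the three callables are hypothetical, and `condense` only stands in for the role the Reason-in-Documents module plays.

```python
import re
from typing import Callable, List, Optional

# Hypothetical marker the model emits when it wants to look something up.
QUERY_PATTERN = re.compile(r"<search>(.*?)</search>", re.DOTALL)

StepFn = Callable[[str, List[str]], Optional[str]]    # returns None when finished
SearchFn = Callable[[str], List[str]]                 # query -> raw documents
CondenseFn = Callable[[str, List[str], List[str]], str]


def reason_with_search(question: str, next_step: StepFn, search: SearchFn,
                       condense: CondenseFn, max_steps: int = 10) -> List[str]:
    """Run a reasoning chain, searching whenever a step asks for knowledge."""
    chain: List[str] = []
    for _ in range(max_steps):
        step = next_step(question, chain)
        if step is None:
            break
        chain.append(step)
        match = QUERY_PATTERN.search(step)
        if match:
            query = match.group(1).strip()
            docs = search(query)
            # Inject a condensed result, not the verbose raw documents.
            chain.append("<result>" + condense(query, docs, chain) + "</result>")
    return chain


# Toy stand-ins so the sketch runs without a model or a search engine.
def toy_next_step(question: str, chain: List[str]) -> Optional[str]:
    script = [
        "Step 1: restate the problem.",
        "Step 2: perhaps I need the value of X. <search>value of X</search>",
        "Step 3: use the retrieved value to finish.",
    ]
    produced = sum(1 for s in chain if s.startswith("Step"))
    return script[produced] if produced < len(script) else None


def toy_search(query: str) -> List[str]:
    return [f"Long page about {query} with much unrelated text.", "Another verbose page."]


def toy_condense(query: str, docs: List[str], chain: List[str]) -> str:
    return f"Key fact for '{query}' drawn from {len(docs)} documents."


chain = reason_with_search("toy question", toy_next_step, toy_search, toy_condense)
```

Keeping `condense` as a separate callable mirrors the design point in the text: the reasoning chain only ever sees a short, relevant result, so retrieval can happen at any step without flooding the context.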

Original note title

active retrieval should trigger on model uncertainty not at fixed intervals