Why does long-form generation need different retrieval than factoid questions?
This explores why generating a long, multi-part answer demands a fundamentally different retrieval pattern than answering a single fact lookup — and the corpus suggests the difference is structural, not just a matter of fetching more.
The short version from the corpus: a factoid has one information need that the question itself fully expresses, so a single up-front fetch works. Long-form generation has *many* information needs, and most of them aren't visible in the original question; they only surface as the text unfolds. The retrieval problem shifts from "find the answer" to "keep discovering what you need next."
The clearest articulation of this is the finding that question type determines which retrieval strategy fits in the first place (see "Does question type determine the right retrieval strategy?"). Evidence-based factoid questions are well served by standard one-shot RAG. Comparison and debate questions need aspect-specific retrieval, and experience and reason questions need decomposition or filtering: you have to break the request into sub-needs, retrieve against each, then aggregate. A factoid asks for a node; a long-form answer asks for a structure.
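A minimal sketch of that routing, under loud assumptions: the toy corpus, the word-overlap `retrieve` scorer, the `gather_evidence` helper, and the question-type labels are all hypothetical stand-ins, not anything taken from the cited notes. It only illustrates the shape of the idea: one query for a factoid, one query per sub-need otherwise.

```python
import re

# Hypothetical toy corpus; a real system would query a search index instead.
CORPUS = [
    "Postgres supports ACID transactions and rich SQL joins.",
    "MongoDB stores schemaless JSON-like documents.",
    "Postgres scales reads with replicas.",
    "MongoDB scales writes with sharding.",
    "The Eiffel Tower is 330 metres tall.",
]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set, used as a crude relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return up to k passages sharing at least one word with the query."""
    q = tokens(query)
    hits = [p for p in CORPUS if q & tokens(p)]
    # sorted() is stable, so ties keep corpus order.
    return sorted(hits, key=lambda p: -len(q & tokens(p)))[:k]


def gather_evidence(question: str, qtype: str, sub_needs=()) -> dict[str, list[str]]:
    """Route by question type (labels here are illustrative assumptions)."""
    if qtype == "factoid":
        # One information need, fully expressed by the question: one fetch.
        return {question: retrieve(question)}
    # Comparison, debate, and similar: retrieve against each sub-need,
    # leaving aggregation to the caller.
    return {need: retrieve(need) for need in sub_needs}


factoid = gather_evidence("How tall is the Eiffel Tower", "factoid")
comparison = gather_evidence(
    "Postgres vs MongoDB", "comparison",
    sub_needs=["Postgres scales", "MongoDB scales"],
)
```

In this toy setup, a single fetch for "Postgres vs MongoDB" returns the first two database passages and misses both scaling passages, while the per-sub-need queries surface each of them.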
The deeper reason a single fetch fails is that the original query can't express needs it doesn't yet know it has. Two notes converge here from different angles. ITER-RETGEN shows that a model's partial response can itself serve as a better retrieval query than the question: generation surfaces implicit gaps the original question missed, so feeding the half-written answer back into retrieval substantially helps multi-hop reasoning (see "Can a model's partial response guide what to retrieve next?"). FLARE makes the timing concrete: retrieval should trigger when the model's own token confidence drops, not at fixed intervals. The model signals genuine knowledge gaps mid-generation, and a long answer hits many such gaps that a factoid never reaches (see "When should retrieval happen during model generation?"). Both reframe retrieval as something interleaved with generation rather than prior to it. That's the whole shift: factoid retrieval is a step before writing; long-form retrieval is a loop inside writing (see "How should systems retrieve and reason with external knowledge?").
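The loop described above can be sketched roughly as follows. This is a toy illustration loosely in the spirit of those two ideas, not the ITER-RETGEN or FLARE implementations: `toy_draft` is a hypothetical stand-in for a language-model call that returns a sentence plus per-token probabilities, and `PASSAGES`, `PARAMETRIC`, `retrieve`, and the 0.5 threshold are all assumptions made for the demo.

```python
import re

# Hypothetical external knowledge the writer can look up.
PASSAGES = [
    "FLARE triggers retrieval when token confidence drops.",
    "ITER-RETGEN feeds the generated response back as the retrieval query.",
]

# What the toy "model" can write confidently with no help (an assumption).
PARAMETRIC = {"rag": "RAG conditions generation on retrieved passages."}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank passages by word overlap with the query (crude stand-in)."""
    return sorted(PASSAGES, key=lambda p: -len(tokens(query) & tokens(p)))[:k]


def toy_draft(topic: str, evidence: list[str]) -> tuple[str, list[float]]:
    """Stand-in for an LM call: returns (sentence, per-token probabilities)."""
    for passage in evidence:
        if topic in tokens(passage):
            return passage, [0.9] * len(passage.split())
    if topic in PARAMETRIC:
        sentence = PARAMETRIC[topic]
        return sentence, [0.9] * len(sentence.split())
    guess = f"{topic} is a method related to retrieval."
    return guess, [0.3] * len(guess.split())  # low confidence: a knowledge gap


def write_long_form(topics: list[str], threshold: float = 0.5):
    answer, retrievals = [], 0
    for topic in topics:
        sentence, probs = toy_draft(topic, evidence=[])
        if min(probs) < threshold:
            # Confidence dropped: retrieve now, using the tentative sentence
            # (not the original question) as the query, then redraft.
            evidence = retrieve(sentence)
            retrievals += 1
            sentence, probs = toy_draft(topic, evidence)
        answer.append(sentence)
    return answer, retrievals


answer, retrievals = write_long_form(["rag", "flare", "retgen"])
```

The design point is that retrieval sits inside the writing loop: the first topic needs no fetch at all, while the other two each trigger one at the moment the draft becomes uncertain.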
There's also a failure mode specific to length that makes the naive fix, "just retrieve more and stuff it in the context," the wrong one. Reasoning accuracy degrades sharply with input length even far below the context-window limit, dropping from 92% to 68% with only about 3,000 tokens of padding (see "Does reasoning ability actually degrade with longer inputs?"). So you can't solve a long-form answer by dumping every plausibly relevant document up front; the volume itself corrodes the reasoning. This is why long-context models can quietly substitute for RAG on simple semantic lookups but stumble when the task needs structured, multi-part assembly (see "Can long-context LLMs replace retrieval-augmented generation systems?"), and why retrieval has to be targeted and timed, drawing in only what the current sub-need requires.
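One way to picture "only what the current sub-need requires" is a context builder with a hard budget. The sketch below is an assumption-laden illustration: `build_context` is a hypothetical helper, word overlap stands in for a real relevance score, and a word count stands in for a real tokenizer's token count.

```python
import re


def words(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_context(candidates: list[str], sub_need: str, max_tokens: int = 20):
    """Keep passages relevant to the current sub-need, within a budget.

    Irrelevant passages are dropped rather than appended as padding.
    """
    need = set(words(sub_need))
    ranked = sorted(candidates, key=lambda p: -len(need & set(words(p))))
    chosen, used = [], 0
    for passage in ranked:
        overlap = len(need & set(words(passage)))
        cost = len(words(passage))  # crude proxy for token count
        if overlap == 0 or used + cost > max_tokens:
            continue
        chosen.append(passage)
        used += cost
    return chosen, used


candidates = [
    "Sharding splits writes across nodes.",
    "Replicas serve read traffic.",
    "The office cafeteria opens at nine.",
    "Write throughput grows with more shards.",
]
context, used = build_context(candidates, "scaling writes with sharding", max_tokens=8)
```

With a budget of 8 words, only the sharding passage is kept; raising the budget admits the next most relevant passage, but passages with no overlap are never added.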
The quietly interesting payoff: generation and retrieval stop being separate stages. In long-form work the model's own output is the most accurate description of what to fetch next — the answer-in-progress is a better query than the question ever was. A factoid never gives you that loop because it's done before the second sentence.
Sources (6 notes)
Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
Active retrieval triggered by low token probability improves both accuracy and efficiency compared to one-shot or continuous retrieval. FLARE demonstrates that models signal genuine knowledge gaps through low confidence, enabling dynamic budget allocation to actual information needs.
Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.