Can retrieval be extended into multi-step chains like reasoning?
Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?
Standard RAG retrieves once and generates from what it found. Multi-hop reasoning tasks need information that can only be identified after earlier information has been retrieved and processed. A retriever limited to a single shot cannot know in advance what the intermediate steps will reveal to be needed.
CoRAG (Chain-of-Retrieval Augmented Generation) extends the chain-of-thought training paradigm to retrieval. For training, rejection sampling automatically generates intermediate retrieval chains (sequences of queries, retrieved documents, and intermediate answers), augmenting existing RAG datasets that provide only final answers. The model learns to plan retrieval steps, not just generate from retrieved context.
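The rejection-sampling idea can be sketched on a toy problem. Everything here is an illustrative assumption rather than CoRAG's actual procedure: `TOY_INDEX` stands in for a retriever, `sample_chain` stands in for sampling from a language model, and the acceptance rule (the last intermediate answer must equal the gold answer) is a deliberately simple filter chosen for the sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    query: str      # sub-query issued at this hop
    document: str   # text the (toy) retriever returned
    answer: str     # intermediate answer read off the document


# Hypothetical toy index: sub-query -> (document, intermediate answer).
TOY_INDEX = {
    "director of Inception": ("Inception was directed by Christopher Nolan.", "Christopher Nolan"),
    "birthplace of Christopher Nolan": ("Christopher Nolan was born in London.", "London"),
    "release year of Inception": ("Inception was released in 2010.", "2010"),
}


def sample_chain(rng, max_steps):
    """Stand-in for sampling a chain from a model: picks sub-queries at random."""
    queries = rng.sample(sorted(TOY_INDEX), k=max_steps)
    return [Step(q, *TOY_INDEX[q]) for q in queries]


def ends_with_gold(chain, gold_answer):
    """Acceptance rule assumed for this sketch: final hop yields the gold answer."""
    return chain[-1].answer == gold_answer


def rejection_sample(gold_answer, n_samples=20, max_steps=2, seed=0):
    """Sample candidate chains and keep only those that pass the acceptance rule."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        chain = sample_chain(rng, max_steps)
        if ends_with_gold(chain, gold_answer):
            accepted.append(chain)
    return accepted


# Only the final answer ("London") is given; the chains are generated.
chains = rejection_sample(gold_answer="London")
```

The accepted chains are what would be added to a dataset that originally held only the question and the final answer. Note that this filter checks the outcome only: it does not verify that every step in an accepted chain was necessary.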
At test time, retrieval chain length and chain count become dials. Greedy decoding (a single chain) is fast. Best-of-N sampling (multiple chains) improves accuracy. Tree search (branching at each retrieval decision) maximizes accuracy at higher cost. A token budget can thus be spent on retrieval steps, with depth (longer chains) traded against breadth (more chains) at test time.
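A minimal sketch of the three strategies on a toy search space makes the cost difference concrete. The `OPTIONS` table and its integer scores are hypothetical stand-ins for a model's estimate of how useful each retrieval step is, and `tree_search` here expands every branch exhaustively, which is a simplification assumed for the sketch rather than a description of CoRAG's search.

```python
import random

# Hypothetical search space: state -> list of (next_state, score) options.
# States without an entry are leaves (the chain stops there).
OPTIONS = {
    "start": [("a", 6), ("b", 5)],
    "a": [("a1", 1), ("a2", 2)],
    "b": [("b1", 9), ("b2", 3)],
}


def greedy(state="start"):
    """Single chain: always take the locally best-scoring step."""
    path, total, calls = [state], 0, 0
    while state in OPTIONS:
        state, score = max(OPTIONS[state], key=lambda option: option[1])
        path.append(state)
        total += score
        calls += 1
    return path, total, calls


def best_of_n(n, seed=0, start="start"):
    """Sample n independent chains and keep the one with the highest total."""
    rng = random.Random(seed)
    best_path, best_total, calls = None, float("-inf"), 0
    for _ in range(n):
        state, path, total = start, [start], 0
        while state in OPTIONS:
            state, score = rng.choice(OPTIONS[state])
            path.append(state)
            total += score
            calls += 1
        if total > best_total:
            best_path, best_total = path, total
    return best_path, best_total, calls


def tree_search(state="start"):
    """Expand every branch and return the best path, its total, and the call count."""
    if state not in OPTIONS:
        return [state], 0, 0
    best_path, best_total, calls = None, float("-inf"), 0
    for next_state, score in OPTIONS[state]:
        sub_path, sub_total, sub_calls = tree_search(next_state)
        calls += 1 + sub_calls
        if score + sub_total > best_total:
            best_path, best_total = [state] + sub_path, score + sub_total
    return best_path, best_total, calls


print(greedy())       # (['start', 'a', 'a2'], 8, 2)
print(tree_search())  # (['start', 'b', 'b1'], 14, 6)
```

In this toy, greedy commits to the locally best first step and misses the better chain through `b`, while the exhaustive search finds it at three times the number of retrieval calls. `best_of_n(n)` sits in between, spending `2 * n` calls here.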
The scaling relationship is the same as in reasoning: more retrieval budget yields better answers, up to a point. This extends the note "Does search budget scale like reasoning tokens for answer quality?" from agentic search behavior to explicitly trained retrieval models. The test-time scaling (TTS) framework is not about reasoning tokens specifically; it is about compute allocation in any iterative process.
The practical implication: RAG systems can now have a compute dial. Low-latency, low-cost serving uses greedy decoding. High-stakes queries use tree search. The dial was not available in single-shot RAG.
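One way to picture the dial is as a cap on retrieval calls per query. The sketch below is an assumption-laden illustration: the function names, the default chain length, chain count, and branching factor are all hypothetical, chains are assumed to have a fixed length, and tree search is assumed to expand a full unpruned tree. It picks the costliest strategy that fits the cap, following the note's ordering of greedy, best-of-N, and tree search by cost.

```python
def retrieval_calls(strategy, chain_length, num_chains=1, branching=1):
    """Worst-case number of retrieval calls for one query."""
    if strategy == "greedy":
        return chain_length
    if strategy == "best_of_n":
        return num_chains * chain_length
    if strategy == "tree_search":
        # Full tree: b + b^2 + ... + b^L calls for branching b and depth L.
        return sum(branching ** depth for depth in range(1, chain_length + 1))
    raise ValueError(f"unknown strategy: {strategy}")


def pick_strategy(max_calls, chain_length=4, num_chains=4, branching=2):
    """Return the costliest strategy that fits the call budget, or None."""
    for strategy in ("tree_search", "best_of_n", "greedy"):
        cost = retrieval_calls(strategy, chain_length, num_chains, branching)
        if cost <= max_calls:
            return strategy
    return None


print(pick_strategy(5))   # greedy
print(pick_strategy(20))  # best_of_n
print(pick_strategy(30))  # tree_search
```

A low-latency endpoint would pass a small `max_calls`; a high-stakes query would pass a larger one.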
Inquiring lines that use this note as a source (11)
This note is a source for these synthesized inquiries.
- Can retrieval improve multi-step reasoning by triggering at each uncertainty?
- Does parallel retrieval outperform sequential search chains at test time?
- How does query planning as a separate step improve multi-hop retrieval coherence?
- How do hierarchical query planning architectures improve multi-hop retrieval?
- Why does single-round retrieval fail on multi-step tasks across different domains?
- How do retrieval heads enable chain-of-thought reasoning to reference earlier context?
- Can knowledge graph structure be exploited for efficient multi-hop retrieval?
- Do expansion-reflection loops and chain-of-retrieval approaches solve the same problem?
- Can stateless multi-step retrieval capture evidence integration as well as dynamic memory?
- How much does retrieval budget improve when triggered by dual signals instead of fixed intervals?
- How should retrieval systems handle multi-hop reasoning and iterative information needs?
Related concepts in this collection (7)
- Does search budget scale like reasoning tokens for answer quality?
Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
same scaling law; CoRAG extends this to explicitly trained multi-step retrieval rather than agentic search behavior
- How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
tree search (branching) vs greedy decoding (sequential) is another instance of the parallel/sequential trade-off
- When should retrieval happen during model generation?
Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
CoRAG trains the model to decide when to retrieve; active retrieval uses confidence as a trigger; both address the when-to-retrieve problem from different angles
- Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
CoRAG adds retrieval chain length/count as a new compute-allocation dimension alongside prompt-level budget, token-level depth, sub-token granularity, and model selection
- Does limiting reasoning per turn improve multi-turn search quality?
When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?
CoRAG's parallel branching (best-of-N, tree search) offers a structural alternative to ASearcher's per-turn capping; parallel retrieval chains avoid the sequential context-pressure problem
- How do internal and external test-time scaling compare?
Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
CoRAG is a third-category hybrid: training (internal) teaches retrieval chain generation, but test-time scaling is applied externally by varying chain length/count; the retrieval chain is trained internally but the compute dial is external
- How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
CoRAG instantiates the same parallel/sequential trade-off at the retrieval level: best-of-N (parallel chains) vs. greedy decoding (sequential chain) vs. tree search (branching depth-first); the structural choice is identical to the reasoning token allocation trade-off
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Chain-of-Retrieval Augmented Generation
- UR2: Unify RAG and Reasoning through Reinforcement Learning
- Context Tuning for Retrieval Augmented Generation
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
- Rethinking with Retrieval: Faithful Large Language Model Inference
- Retrieval-augmented reasoning with lean language models
- Instruction Induction: From Few Examples to Natural Language Task Descriptions
Original note title
chain-of-retrieval augmented generation enables test-time scaling for retrieval-intensive tasks