How do parallel and sequential retrieval strategies compare in compute efficiency?
This explores whether running retrieval/reasoning steps in parallel (many independent shots, vote) or in sequence (each step builds on the last) is the better use of compute — and the corpus suggests the answer depends entirely on whether the problem's pieces are independent or interlocking.
This reads the question as: when you spend a fixed compute budget, is it better spent fanning out into many parallel attempts or marching through dependent sequential steps? The corpus has a sharp answer, and it isn't 'parallel is cheaper.' On problems whose solution genuinely requires accumulating intermediate results (graph connectivity, multi-step composition), sequential chain-of-thought beats parallel voting by an *exponential* margin in accuracy, because short parallel chains simply cannot reach a conclusion that depends on earlier sub-results (When does sequential reasoning beat parallel voting?). Parallel voting wins when the attempts are independent and you're averaging out noise; the moment the steps interlock, parallelism wastes compute re-guessing instead of building.
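To make the "averaging out noise" case concrete, here is a minimal sketch of how majority voting behaves when attempts are independent. It is a standard binomial calculation, not the experiment from the cited note, and it assumes a binary answer, a fixed per-attempt accuracy `p`, and an odd number of voters `n` (all illustrative choices).

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent attempts is correct.

    Assumptions (illustrative): the answer is binary, every attempt is
    correct with the same probability p, and n is odd so ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    majority = n // 2 + 1
    # Sum the binomial probabilities of getting at least `majority` correct votes.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(majority, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 5, 25):
        noisy_but_informed = majority_vote_accuracy(0.6, n)
        at_chance = majority_vote_accuracy(0.5, n)
        print(n, round(noisy_but_informed, 3), round(at_chance, 3))
```

Under these assumptions, extra voters help only when each attempt is already better than chance. If short parallel chains cannot reach the dependent conclusion at all, each vote stays at chance and adding more of them buys nothing.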
The more interesting twist is that compute efficiency in retrieval is usually decided *before* you pick parallel or sequential, by deciding how often to retrieve at all. One line of work shows that a simple calibrated uncertainty signal (just the model's own token probabilities) beats elaborate multi-call adaptive-retrieval schemes on single-hop tasks and matches them on multi-hop tasks, while using a fraction of the LM and retriever calls (Can simple uncertainty estimates beat complex adaptive retrieval?). In other words, the cheapest strategy is often the one that knows when *not* to fire a retrieval. DeepRAG reaches the same place from a different angle: by framing each reasoning step as a decision to retrieve or rely on memory, it cuts noise from unnecessary lookups and reports a roughly 22% (21.99%) improvement (When should language models retrieve external knowledge versus use internal knowledge?).
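A retrieve-only-when-uncertain gate can be sketched in a few lines. This is a hedged illustration, not the method from the cited notes: the function names, the geometric-mean aggregation of token probabilities, and the `0.8` threshold are all assumptions, and a real system would calibrate the threshold on held-out data.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities of a draft answer.

    Assumption: the model exposes per-token log-probabilities for its draft.
    """
    if not token_logprobs:
        return 0.0  # no evidence of confidence, so treat as maximally uncertain
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.8) -> bool:
    """Fire a retrieval only when the draft answer looks uncertain.

    The threshold is a hypothetical value chosen for illustration.
    """
    return sequence_confidence(token_logprobs) < threshold


if __name__ == "__main__":
    confident_draft = [math.log(0.95), math.log(0.9), math.log(0.92)]
    shaky_draft = [math.log(0.6), math.log(0.4), math.log(0.7)]
    print(should_retrieve(confident_draft))  # skips the retriever call
    print(should_retrieve(shaky_draft))      # falls back to retrieval
```

The saving comes from the skipped branch: every confident draft avoids both a retriever call and the extra LM call that would consume the retrieved passages.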
Sequential strategies do carry a cost that the parallel-versus-sequential framing obscures: they consume context. Long-horizon search agents degrade when a single sequential turn burns the context window that later retrieval rounds need; capping reasoning *per turn*, not just overall, preserves room for the next cycle (Does limiting reasoning per turn improve multi-turn search quality?). And the long-context bottleneck itself turns out to be compute, not memory: the expense is consolidating evicted context into usable state, and performance scales with how many consolidation passes you spend on it (Is long-context bottleneck really about memory or compute?). So 'sequential' isn't free even when it's correct: it trades parallel breadth for a serial context tax.
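The per-turn cap can be illustrated with simple token bookkeeping. The sketch below is a toy model under stated assumptions: a fixed context window, a fixed evidence cost per retrieval round, and hypothetical token counts. It does not reproduce the cited agent setup.

```python
from typing import Optional, Sequence


def completed_turns(
    reasoning_needs: Sequence[int],
    evidence_per_turn: int,
    context_window: int,
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count how many search turns fit in the context window.

    Each turn spends reasoning tokens (optionally capped) plus the tokens
    of the evidence it retrieves. All quantities are illustrative.
    """
    used = 0
    turns = 0
    for need in reasoning_needs:
        spend = need if per_turn_cap is None else min(need, per_turn_cap)
        if used + spend + evidence_per_turn > context_window:
            break  # no room left to take in new evidence
        used += spend + evidence_per_turn
        turns += 1
    return turns


if __name__ == "__main__":
    needs = [3000, 3000, 3000, 3000]  # hypothetical uncapped reasoning per turn
    print(completed_turns(needs, evidence_per_turn=500, context_window=8000))
    print(completed_turns(needs, evidence_per_turn=500, context_window=8000,
                          per_turn_cap=1000))
```

With these made-up numbers the uncapped agent exhausts its window after two turns, while the capped agent completes all four retrieval rounds. That is the context-erosion effect in miniature.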
A cross-cutting theme: the biggest efficiency wins come from *separating and routing*, not from picking one execution mode. Hierarchical architectures that split query planning from answer synthesis outperform flat ones on multi-hop queries by reducing interference (Do hierarchical retrieval architectures outperform flat ones on complex queries?), and StructRAG shows that routing each query to a task-appropriate knowledge structure beats applying uniform retrieval to everything (Can routing queries to task-matched structures improve RAG reasoning?). Read together, the corpus reframes your question: the real efficiency lever is matching execution shape to problem shape (parallel for independent noise, sequential for compositional dependency, and a router deciding which is which) rather than crowning one strategy as cheaper across the board.
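One way to picture "matching execution shape to problem shape" is a rule-based router over a sub-query dependency graph. This is a hypothetical sketch, not StructRAG's learned router: it assumes a planner has already produced the dependencies, and the function names and the `"parallel"` / `"sequential"` / `"mixed"` labels are invented for illustration. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sub-queries into waves; items in one wave have no unmet dependencies.

    `deps` maps each sub-query to the set of sub-queries it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def choose_mode(deps: Dict[str, Set[str]]) -> str:
    """Pick an execution shape from the dependency structure (illustrative rule)."""
    waves = execution_waves(deps)
    if len(waves) <= 1:
        return "parallel"    # nothing depends on anything else: fan out
    if all(len(wave) == 1 for wave in waves):
        return "sequential"  # a pure chain: each hop needs the previous one
    return "mixed"           # fan out within a wave, serialize across waves


if __name__ == "__main__":
    independent = {"a": set(), "b": set(), "c": set()}
    chain = {"hop2": {"hop1"}, "hop3": {"hop2"}}
    print(choose_mode(independent), choose_mode(chain))
```

The design choice here is that the router inspects structure rather than guessing from the query text. A learned router would replace `choose_mode` but could keep the same wave-based execution.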
Sources (7 notes)
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Research shows the long-context bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.