How do hierarchical query planning architectures improve multi-hop retrieval?
This explores why splitting retrieval into a planning layer and an execution layer — rather than retrieving everything in one flat pass — helps with questions that require chaining several facts together (multi-hop).
The corpus has a clear throughline here: multi-hop failure is usually architectural, not a tuning problem. Flat retrieval grabs a pile of chunks ranked by surface similarity, but compositional questions ('which director made the film that won the award X judged?') need the system to figure out *what to look for next* based on *what it just found*, and a single embedding pass can't do that. The cleanest statement of the hierarchical principle is that separating query planning from answer synthesis into distinct components reduces interference between the two jobs and improves multi-hop performance (see: Do hierarchical retrieval architectures outperform flat ones on complex queries?). The same logic shows up in why RAG breaks at all: embeddings measure association, not task-relevance, and there's even a hard mathematical ceiling on which sets of documents a fixed embedding dimension can represent (see: Where do retrieval systems fail and why?). Planning sits above that ceiling rather than fighting it.
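A minimal sketch of the plan-then-execute split described above, in Python. Everything in it is hypothetical: the toy `FACTS` store, the entity names, and the hard-coded `plan` function, which stands in for what would normally be a learned or LLM-driven planner. The only point it illustrates is that each lookup is conditioned on the previous result, which a single similarity-ranked pass cannot do.

```python
from typing import Dict, List, Optional, Tuple

# Toy fact store: (entity, relation) -> entity. All names are made up.
FACTS: Dict[Tuple[str, str], str] = {
    ("X", "judged"): "Golden Reel",
    ("Golden Reel", "won_by"): "Night Train",
    ("Night Train", "directed_by"): "A. Director",
}


def plan(question: str) -> List[str]:
    """Planning layer: decompose the question into an ordered list of hops.

    Hard-coded for illustration; a real planner would derive this from
    the question text.
    """
    return ["judged", "won_by", "directed_by"]


def lookup(entity: str, relation: str) -> Optional[str]:
    """Execution layer: one retrieval step against the fact store."""
    return FACTS.get((entity, relation))


def execute(start: str, relations: List[str]) -> List[str]:
    """Run the plan hop by hop, feeding each result into the next lookup."""
    trace = [start]
    for relation in relations:
        found = lookup(trace[-1], relation)
        if found is None:  # stop early if a hop has no supporting fact
            break
        trace.append(found)
    return trace


question = "Which director made the film that won the award X judged?"
trace = execute("X", plan(question))
print(" -> ".join(trace))  # X -> Golden Reel -> Night Train -> A. Director
```

Keeping `plan` and `execute` as separate functions mirrors the separation the notes argue for: the planner can be swapped or improved without touching how individual lookups are performed.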
What's interesting is that 'hierarchy' shows up in two different places, and they're worth distinguishing. One is hierarchy in the *control flow*: a planner that decides the sequence of sub-queries. The corpus frames the strongest version of this as tightly coupling retrieval and reasoning through a Markov Decision Process with step-level feedback, so each retrieval is a decision conditioned on the reasoning state so far (see: How should retrieval and reasoning integrate in RAG systems?). The other is hierarchy in the *knowledge structure itself*: instead of a flat chunk list, build a layered knowledge graph that runs from high-level summaries down to page-level detail, which lets the system answer cross-chapter, global questions that flat retrieval can't reach (see: Can multimodal knowledge graphs answer questions that flat retrieval cannot?).
The surprising twist is that good structure can collapse multiple hops back into a single step. HippoRAG converts the corpus into a knowledge graph and runs Personalized PageRank seeded from the query's concepts, traversing multi-hop paths in one retrieval pass and matching iterative methods while running 10–20x cheaper (see: Can knowledge graphs enable multi-hop reasoning in one retrieval step?). Hypergraph memory pushes this further by binding three or more entities into a single relation, preserving the joint constraints a question needs instead of fragmenting them across separate retrieved facts (see: Can hypergraphs capture multi-hop reasoning better than graphs?). So 'hierarchical planning' and 'better structure' are two routes to the same goal: one plans across many steps, the other front-loads the structure so fewer steps are needed.
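To make the single-pass idea concrete, here is a small Personalized PageRank sketch by power iteration in plain Python. It is not HippoRAG's implementation: the toy graph, the entity names, and the `restart` and `iters` values are all assumptions chosen for illustration, and the sketch only ranks entity nodes, without the later step of mapping node scores back to passages.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Toy undirected entity graph. All names are made up.
EDGES: List[Tuple[str, str]] = [
    ("X", "Golden Reel"),
    ("Golden Reel", "Night Train"),
    ("Night Train", "A. Director"),
    ("Other Award", "Other Film"),  # distractor component, unrelated to X
    ("Other Film", "B. Director"),
]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a symmetric adjacency map; every node ends up with >= 1 neighbour."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def personalized_pagerank(
    graph: Dict[str, Set[str]],
    seeds: List[str],
    restart: float = 0.15,  # arbitrary illustrative value
    iters: int = 50,
) -> Dict[str, float]:
    """Random walk with restart: mass always restarts at the seed nodes."""
    seed_mass = {node: 0.0 for node in graph}
    for seed in seeds:
        seed_mass[seed] = 1.0 / len(seeds)
    score = dict(seed_mass)
    for _ in range(iters):
        nxt = {node: restart * seed_mass[node] for node in graph}
        for node, neighbours in graph.items():
            # Spread the non-restart mass evenly over the node's neighbours.
            share = (1.0 - restart) * score[node] / len(neighbours)
            for neighbour in neighbours:
                nxt[neighbour] += share
        score = nxt
    return score


scores = personalized_pagerank(build_graph(EDGES), seeds=["X"])
for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{node:12s} {scores[node]:.3f}")
```

Seeding from "X" gives nonzero score to "A. Director", three hops away, in one ranking pass, while nodes in the disconnected distractor component stay at zero. That is the sense in which graph structure can stand in for an explicit multi-step plan.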
There's also a routing dimension that's really a planning decision in disguise. StructRAG trains a router to pick the *type* of knowledge structure (table, graph, algorithm, catalogue, or plain chunks) based on what the query demands, grounding the choice in cognitive-fit theory from cognitive science (see: Can routing queries to task-matched structures improve RAG reasoning?). That's the planning layer deciding not just what to retrieve but *how to represent* it before reasoning begins, and routing-as-a-lever shows up elsewhere as a stronger move than scaling a single model (see: Can routing beat building one better model?).
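As a rough illustration of routing as a planning decision, the sketch below picks a structure type from surface cues in the query. This is a deliberate simplification: StructRAG's router is trained, whereas the keyword rules, the `ROUTES` table, and the `route` function here are hypothetical stand-ins that only show where the decision sits in the pipeline.

```python
from typing import List, Tuple

# Illustrative cue phrases per structure type. These are assumptions,
# not StructRAG's learned routing policy.
ROUTES: List[Tuple[str, Tuple[str, ...]]] = [
    ("table", ("compare", "average", "how many")),
    ("graph", ("relationship", "connected to", "chain")),
    ("algorithm", ("steps", "procedure", "how do i")),
    ("catalogue", ("summarize", "overview", "outline")),
]


def route(query: str) -> str:
    """Return the first structure type whose cue appears in the query.

    Falls back to plain chunks when nothing matches.
    """
    lowered = query.lower()
    for structure, cues in ROUTES:
        if any(cue in lowered for cue in cues):
            return structure
    return "chunks"


for q in (
    "Compare revenue across the three reports",
    "Summarize the main findings",
    "What is the capital of France?",
):
    print(f"{route(q):10s} <- {q}")
```

The router runs before any retrieval or reasoning, so its output determines the representation the later stages work with.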
One thing you might not expect: hierarchy isn't always the answer. CoRAG treats retrieval like chain-of-thought, generating intermediate retrieval chains and giving you a compute dial: short greedy chains for speed, tree search for hard questions (see: Can retrieval be extended into multi-step chains like reasoning?). But the corpus also pushes back: a calibrated uncertainty estimate from the model's own token probabilities can beat elaborate multi-call adaptive retrieval on single-hop tasks and *match* it on multi-hop, at a fraction of the cost (see: Can simple uncertainty estimates beat complex adaptive retrieval?). The lesson across these notes is that the planning layer's real value is knowing when a question actually needs multiple hops, and sometimes the cheapest planner is the model asking itself whether it already knows enough to stop.
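The "model asking itself" idea can be sketched as a confidence gate over the token log-probabilities of a draft answer. The functions `mean_token_confidence` and `should_retrieve`, the geometric-mean scoring, and the 0.6 threshold are all assumptions for illustration; the calibration step the notes refer to is omitted, and in practice the threshold would be tuned on held-out data.

```python
import math
from typing import List


def mean_token_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability of a draft answer."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: List[float], threshold: float = 0.6) -> bool:
    """Trigger retrieval only when the model's own confidence is low.

    The threshold is an arbitrary illustrative value.
    """
    return mean_token_confidence(token_logprobs) < threshold


# Made-up per-token probabilities for two draft answers.
confident_draft = [math.log(p) for p in (0.9, 0.9, 0.9)]
unsure_draft = [math.log(p) for p in (0.9, 0.2, 0.3)]

print(should_retrieve(confident_draft))  # False: answer directly
print(should_retrieve(unsure_draft))     # True: fetch more evidence
```

A gate like this costs one scoring pass per draft answer, which is why it can be far cheaper than a planner that issues several retrieval calls before deciding whether to stop.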
Sources (10 notes)
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
Research shows that tight coupling between retrieval and reasoning—via Markov Decision Processes and step-level feedback—substantially improves accuracy and efficiency. Graph-based retrieval and metacognitive monitoring address limitations of vector embeddings and prevent retrieval failures on compositional tasks.
MegaRAG builds hierarchical multimodal knowledge graphs from text and visuals to answer cross-chapter, global questions that flat chunk retrieval cannot reach. The hierarchy supports abstraction levels from high-level summaries to page-specific details while treating images as first-class graph nodes.
HippoRAG converts the corpus into a knowledge graph, then uses Personalized PageRank seeded from query concepts to traverse multi-hop paths in one step. It matches iterative retrieval while being 10-20x cheaper and 6-13x faster, with 20% better accuracy on multi-hop QA.
HGMem organizes retrieved evidence as hyperedges rather than flat lists or binary graphs, allowing three or more entities to bind into single relations without decomposition. This structure accumulates coherent knowledge across retrieval steps, trading representational complexity for constraint expressiveness.
StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
CoRAG extends chain-of-thought training to retrieval by using rejection sampling to generate intermediate retrieval chains. Test-time compute can scale through chain length and count, creating a compute dial—greedy decoding for speed or tree search for accuracy—just like reasoning-token scaling.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.