INQUIRING LINE

How does query decomposition reduce retrieval costs at inference?

This explores how breaking a complex question into smaller sub-queries can cut the compute spent on retrieval at inference time — and what the corpus says about when that pays off.


The corpus reframes the issue in a useful way: decomposition saves cost less by retrieving less and more by retrieving *smarter*, so you don't burn calls and context on the wrong thing. The clearest cost lever is separation of labor. When you split query planning from answer synthesis into distinct stages, the components stop interfering with each other and multi-hop accuracy improves (Do hierarchical retrieval architectures outperform flat ones on complex queries?). Decomposition is what makes that separation concrete. Notably, the corpus shows the right decomposition is question-dependent: non-factoid questions fall into five types, where evidence-based questions suit plain RAG, debate and comparison questions need aspect-specific retrieval, and experience and reason questions need decomposition or filtering (Does question type determine the right retrieval strategy?).

The second cost lever is doing the decomposition at query time instead of paying for it up front. LogicRAG builds a small directed acyclic graph of sub-questions from the query itself at inference, which avoids the overhead and staleness of pre-building a corpus-wide knowledge graph while keeping multi-hop reasoning intact (Can query-time graph construction replace pre-built knowledge graphs?). What you save is the up-front graph construction, much of which any given query never uses; what you spend is a few cheap planning tokens per query. Routing is a related move: instead of running one uniform retrieval over everything, a trained router sends each sub-demand to the structure that fits it (table, graph, algorithm, catalogue, or chunk), which is cheaper per unit of reasoning gained (Can routing queries to task-matched structures improve RAG reasoning?).
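To make the query-time graph idea concrete, here is a minimal sketch of resolving sub-questions in dependency order. It is not LogicRAG's implementation: the `resolve` function, the example sub-questions, and the `fake_retrieve` stub are all hypothetical, and a real system would have a planner emit the graph and a retriever answer each node.

```python
from graphlib import TopologicalSorter


def resolve(dag, retrieve):
    """Answer sub-questions in dependency order.

    dag maps each sub-question to the set of sub-questions it depends on.
    retrieve(sub_q, context) is any callable that returns an answer string.
    """
    answers = {}
    # static_order() yields every node after all of its dependencies.
    for sub_q in TopologicalSorter(dag).static_order():
        context = [answers[dep] for dep in dag.get(sub_q, ())]
        answers[sub_q] = retrieve(sub_q, context)
    return answers


# Hypothetical two-hop question, planned at query time (no corpus-wide graph).
dag = {
    "Who directed the film?": set(),
    "Where was that director born?": {"Who directed the film?"},
}

calls = []


def fake_retrieve(sub_q, context):
    # Stand-in for a real retriever call; it only records the call order.
    calls.append(sub_q)
    return f"answer({sub_q}) given {len(context)} prior answer(s)"


answers = resolve(dag, fake_retrieve)
```

The graph exists only for the lifetime of one query, so nothing goes stale and nothing is built that the query does not use. The price is one retrieval call per node, which is why the planner should keep the graph small.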

But the sharpest lesson is that more decomposition is not free, and sometimes the cheapest sub-query is no sub-query. Search behaves like a test-time scaling axis: answer quality rises with search budget along a monotonic but diminishing curve, much like reasoning tokens, so every extra retrieval round trades compute for shrinking returns (Does search budget scale like reasoning tokens for answer quality?). Decomposition reduces cost only when it lets you stop before you reach the flat part of that curve. Two notes make the trade-off concrete. First, calibrated token-probability uncertainty beats elaborate multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop, at a fraction of the LM and retriever calls; the model's own self-knowledge is a cheaper trigger than external heuristics (Can simple uncertainty estimates beat complex adaptive retrieval?). Second, on long-horizon tasks, capping reasoning *per turn* (not just overall) preserves the context budget that iterative sub-queries need, so unrestricted reasoning in one turn doesn't crowd out the evidence retrieved in the next (Does limiting reasoning per turn improve multi-turn search quality?).
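The two gates described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method from either note: the function names, the `0.6` confidence threshold, the `0.02` minimum gain, and every number in the example are hypothetical, and the geometric-mean token probability is a simple stand-in for a properly calibrated uncertainty estimate.

```python
import math


def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a draft answer, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs, threshold=0.6):
    """Gate 1: retrieve only when the model's own draft looks uncertain.

    The threshold is an assumed value; in practice it would be tuned
    (and the confidence calibrated) on held-out data.
    """
    return mean_token_confidence(token_logprobs) < threshold


def rounds_to_run(quality_by_round, min_gain=0.02):
    """Gate 2: choose a search budget from an offline quality curve.

    quality_by_round[k] is the measured quality after k + 1 rounds.
    Returns the number of rounds to keep: the point after which one
    more round improves quality by less than min_gain.
    """
    for i in range(1, len(quality_by_round)):
        if quality_by_round[i] - quality_by_round[i - 1] < min_gain:
            return i
    return len(quality_by_round)


# Illustrative inputs only, not measured results.
confident_draft = [-0.05, -0.10, -0.02]
uncertain_draft = [-1.2, -0.9, -1.5]
dev_set_curve = [0.50, 0.62, 0.68, 0.69, 0.695]

skip_retrieval = not should_retrieve(confident_draft)
do_retrieval = should_retrieve(uncertain_draft)
budget = rounds_to_run(dev_set_curve)
```

Gate 1 costs nothing beyond the log-probabilities the model already produced, which is the point of using self-knowledge as the trigger. Gate 2 is an offline budgeting step: quality is not observable at inference, so the curve would come from a development set.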

There's also a quieter cost that can be erased entirely: query augmentation. Fine-tuning the retrieval model on implicit queries matches augmented retrievers without expanding input length at all; the model learns to resolve ambiguity in its weights rather than at inference (Can fine-tuning replace query augmentation for retrieval?). That's the opposite end of the same dial: you can pay the cost once in training, or repeatedly at inference through decomposition and augmentation. The less obvious takeaway is that "reducing retrieval cost" isn't one move. It's a choice across at least four dials (split vs. flat, query-time vs. pre-built, route vs. uniform, train-once vs. decompose-each-time), and the corpus suggests the biggest wins come from matching the dial to the *shape of the question*, not from decomposing harder.
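The train-once versus pay-per-query dial comes down to a break-even count. A minimal sketch of that arithmetic, assuming costs are expressed in a single hypothetical unit (token-equivalents here); the function name and the example figures are illustrative and do not come from the corpus.

```python
import math


def break_even_queries(one_time_cost, per_query_extra_cost):
    """Smallest query count at which paying once is no worse than paying per query.

    one_time_cost: cost of fine-tuning (assumed unit: token-equivalents).
    per_query_extra_cost: extra inference cost that augmentation or
    decomposition adds to each query, in the same unit.
    """
    if per_query_extra_cost <= 0:
        raise ValueError("per_query_extra_cost must be positive")
    return math.ceil(one_time_cost / per_query_extra_cost)


# Hypothetical figures: 1,000,000 units to fine-tune vs. 40 extra units per query.
n = break_even_queries(1_000_000, 40)
```

Below that query volume, paying at inference is the cheaper end of the dial; above it, moving the cost into the weights wins, provided the fine-tuned model really does match the augmented one on your queries.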


Sources 8 notes

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Can query-time graph construction replace pre-built knowledge graphs?

LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic but diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can fine-tuning replace query augmentation for retrieval?

Fine-tuned semantic search models trained on implicit queries match the performance of augmented pretrained retrievers without expanding input length. The model learns to resolve ambiguity through training rather than requiring explicit augmentation.

Next inquiring lines