INQUIRING LINE

Why do question types determine retrieval and decomposition strategy in QA?

This explores why the kind of question being asked — not just its topic — dictates how a QA system should fetch evidence and break the problem apart.


The most direct answer in the collection comes from work showing that non-factoid questions split into roughly five types, and each type wants a different retrieval and aggregation recipe (see "Does question type determine the right retrieval strategy?"). An evidence-seeking question is well served by ordinary RAG: find the passage, return it. But a comparison or debate question needs aspect-specific retrieval (you have to gather each side), and an experience or reasoning question needs to be decomposed into sub-questions or filtered before retrieval even makes sense. The question type, in other words, encodes the shape of the answer, and retrieval has to match that shape.
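A minimal sketch of type-conditional retrieval planning, offered only as an illustration of the idea above. The five type names follow the note's wording; `plan_queries`, the `aspects` argument, and the `decompose` callable are hypothetical names introduced here, not an API from the cited work. The sketch covers decomposition for experience and reasoning questions and leaves the filtering alternative out.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence


class QuestionType(Enum):
    EVIDENCE = "evidence"
    COMPARISON = "comparison"
    DEBATE = "debate"
    EXPERIENCE = "experience"
    REASON = "reason"


def plan_queries(
    question: str,
    qtype: QuestionType,
    aspects: Sequence[str] = (),
    decompose: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Return the retrieval queries to issue for one question (illustrative)."""
    if qtype is QuestionType.EVIDENCE:
        # Ordinary RAG: one query, one pass.
        return [question]
    if qtype in (QuestionType.COMPARISON, QuestionType.DEBATE):
        # Aspect-specific retrieval: one query per side or item being compared.
        if not aspects:
            raise ValueError("comparison/debate questions need at least one aspect")
        return [f"{question} ({aspect})" for aspect in aspects]
    # EXPERIENCE / REASON: break into sub-questions before retrieving.
    if decompose is None:
        raise ValueError("experience/reason questions need a decompose function")
    return decompose(question)
```

In this sketch the classifier that assigns `qtype` is assumed to exist upstream; the point is only that the number and shape of queries depends on the type, not on the topic.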

What makes this lateral rather than narrow is that several other notes converge on the same underlying claim from different angles: the right strategy is conditional, and a system that applies one fixed pipeline to every question pays for it. DeepRAG frames each reasoning step as a decision about whether to retrieve at all or lean on the model's internal knowledge, and gets a ~22% accuracy lift precisely by switching strategy per step instead of retrieving uniformly (see "When should language models retrieve external knowledge versus use internal knowledge?"). A simpler line of work reaches a parallel conclusion: calibrated uncertainty estimates can decide *when* retrieval is worth the cost, beating heavier adaptive schemes on single-hop tasks and matching them on multi-hop with a fraction of the model and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Both say the same thing the question-type work says: the trigger for retrieval is a property of the question or step at hand, not a constant.
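The "retrieve only when uncertain" idea can be sketched as a simple gate on the token log-probabilities of a draft answer. This is an assumption-laden illustration, not the cited method: the mean negative log-likelihood is used here as a stand-in uncertainty score, it is not calibrated, and both function names and the `threshold` value of 0.5 are hypothetical.

```python
import math
from typing import Sequence


def mean_token_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood of a draft answer's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.5) -> bool:
    """Retrieve only when the draft answer looks uncertain (assumed rule)."""
    return mean_token_nll(token_logprobs) > threshold


# Hypothetical draft answers: four tokens each, with made-up probabilities.
confident_draft = [math.log(0.95)] * 4
unsure_draft = [math.log(0.30)] * 4

print(should_retrieve(confident_draft))  # False: answer from internal knowledge
print(should_retrieve(unsure_draft))     # True: worth paying for retrieval
```

A real system would presumably tune the threshold on held-out data and calibrate the score first; the sketch only shows that the retrieval decision is computed per question rather than fixed in the pipeline.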

Decomposition shows the same conditionality. Multi-hop and complex queries benefit from separating query planning from answer synthesis into distinct stages, which reduces interference and outperforms flat pipelines (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"). And the unit of retrieval itself should bend to the question: how-to and procedural questions are badly served by fixed-size chunks, which sever the step-to-step dependencies, so 'logic units' that preserve prerequisites and links between steps work far better for that question type (see "How do logic units preserve procedural coherence better than chunks?"). A factual lookup never needs that machinery; a procedure always does.
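To make the contrast with fixed-size chunks concrete, here is a rough sketch of a logic unit as a data structure, using the four parts named in the source note for this thread (prerequisite, header, body, linker). The field types, the outcome-keyed `linker` mapping, and the `walk` helper are assumptions made for illustration and are not taken from THREAD itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogicUnit:
    header: str                # short name of the step
    body: str                  # the instruction text for this step
    prerequisite: str = ""     # what must hold before the step applies
    # Assumed representation: observed outcome -> header of the next unit.
    linker: Dict[str, str] = field(default_factory=dict)


def walk(units: Dict[str, LogicUnit], start: str, outcomes: List[str]) -> List[str]:
    """Follow linkers from `start`, consuming one observed outcome per step."""
    path = [start]
    current = units[start]
    for outcome in outcomes:
        next_header = current.linker.get(outcome)
        if next_header is None:
            break  # no link for this outcome: the procedure ends here
        path.append(next_header)
        current = units[next_header]
    return path
```

Because each unit carries its own links, retrieval can proceed step by step along the procedure instead of pulling back whichever chunks happen to be semantically closest to the query.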

There's a quieter insight worth pulling out: format and framing shape strategy more than content does. One study found that the *format* a model was trained on (multiple-choice vs. free-form) shaped its reasoning style (breadth-first vs. depth-first) about 7.5 times more strongly than the subject domain (see "Does training data format shape reasoning strategy more than domain?"). That's the same principle running underneath the whole question: the structural type of a problem governs how it should be attacked, and topic is secondary. If you want to go further into why this matters upstream, the work on training models to ask good clarifying questions shows that even *recognizing* what type of question is at hand (its clarity, its specificity, what's missing) is itself a skill that has to be learned and decomposed (see "Can models learn to ask genuinely useful clarifying questions?").

The thing you didn't know you wanted to know: 'one good RAG pipeline' is a category error. The field is quietly converging on the idea that QA isn't a single task but a family of tasks wearing the same costume, and the first real move in answering well is classifying which one you're holding.


Sources (7 notes)

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

How do logic units preserve procedural coherence better than chunks?

THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a QA systems researcher. The question remains open: **Why do question types determine retrieval and decomposition strategy in QA?** Treat the findings below as dated claims (2024–2026) to be re-tested, not current truth.

**What a curated library found — and when (dated claims, not current truth):**
Findings span June 2024 to March 2026. The library converges on conditionality:
- Non-factoid questions split into ~5 types; each type requires a different retrieval and aggregation recipe—comparison/debate questions need aspect-specific retrieval, reasoning questions need decomposition before retrieval (Typed-RAG, ~2025).
- DeepRAG frames each reasoning step as a decision whether to retrieve or rely on internal knowledge, achieving ~22% accuracy lift by switching strategy per step rather than retrieving uniformly (~2025).
- Uncertainty estimates can decide *when* retrieval is worth the cost, outperforming heavier adaptive schemes (~2025).
- Procedural and how-to questions are poorly served by fixed-size chunks; 'logic units' preserving step-to-step dependencies work far better (~2024).
- Training data format (multiple-choice vs. free-form) shapes reasoning breadth-first vs. depth-first ~7.5× more strongly than domain content (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2503.15879 (Typed-RAG, March 2025)
- arXiv:2502.01142 (DeepRAG, February 2025)
- arXiv:2501.12835 (Adaptive Retrieval Without Self-Knowledge, January 2025)
- arXiv:2406.13372 (Logic-Based Data Organization, June 2024)

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, newer Claude), scaling, in-context learning, or multi-agent orchestration have since **relaxed or overturned** the per-step conditionality claim. Has frontier-scale reasoning reduced the need for question-type-aware routing, or deepened it? Separate the durable insight (question type as a structural signal) from the perishable limitation (specific accuracy gaps or retrieval costs).
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for work claiming unified pipelines, end-to-end learning, or routing-free RAG that bypass explicit type classification.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Do frontier LLMs learn to internally classify question type without supervision, making explicit routing obsolete?" or "Does continual latent reasoning (CLaRa, 2026) eliminate the need to decompose before retrieval?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
