How do taxonomy-based retrieval scaffolds improve model performance at inference time?
This explores whether organizing retrieved knowledge into the right structure — tables, graphs, hierarchies, catalogues — and routing each query to the structure that fits it actually helps a model reason better at inference time, rather than just dumping flat text into context.
The baseline for comparison is the standard approach of pulling flat passages and stuffing them into the prompt. The corpus points to a clear answer: structure matters, and the gains come not from retrieving *more* but from retrieving in a *shape* the question can use.
The sharpest evidence is StructRAG (see "Can routing queries to task-matched structures improve RAG reasoning?"), which trains a router to pick among tables, graphs, algorithms, catalogues, and plain chunks depending on what the query demands. It grounds the idea in cognitive fit theory: a task is easier when the information's form matches the reasoning the task requires. A comparison question wants a table; a multi-step dependency wants a graph. This is why flat retrieval underperforms: it forces every question through one representation regardless of fit. The same instinct shows up in long-context experiments (see "Can long-context LLMs replace retrieval-augmented generation systems?"): on the LOFT benchmark, simply giving a model a huge window matches RAG on semantic lookup but *fails* on relational queries that need joins across structured tables. Context length can't substitute for structure.
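The routing idea can be illustrated with a minimal sketch. This is not StructRAG's router, which the notes describe as trained with DPO; the keyword cues and the `route` function below are hypothetical stand-ins chosen only to show the shape of the decision: one query in, one structure type out, with plain chunks as the fallback.

```python
# Hypothetical keyword cues standing in for a trained router (assumption, not from the source).
CUES = {
    "table": ("compare", "versus", "difference between"),
    "graph": ("depends on", "leads to", "connected to"),
    "algorithm": ("steps to", "procedure", "plan for"),
    "catalogue": ("list all", "summarize", "overview of"),
}


def route(query: str) -> str:
    """Pick the structure type whose cues appear most often; fall back to plain chunks."""
    q = query.lower()
    scores = {name: sum(cue in q for cue in cues) for name, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # No cue matched at all: keep the retrieved text as flat chunks.
    return best if scores[best] > 0 else "chunk"


print(route("Compare the revenue of A versus B"))            # table
print(route("Which service depends on the auth database?"))  # graph
print(route("Who wrote Hamlet?"))                            # chunk
```

A learned router would replace the cue table with a model, but the interface stays the same: the retrieved material is restructured only after the query has been classified.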
A second thread is *where* the scaffold gets built. Hierarchical architectures (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?") separate query planning from answer synthesis into distinct stages, which cuts interference and lifts multi-hop performance. Here the scaffold is the separation of concerns itself. LogicRAG (see "Can query-time graph construction replace pre-built knowledge graphs?") pushes this further: instead of pre-building a giant corpus-wide knowledge graph that goes stale, it constructs a small directed acyclic graph *from the query at inference time*, giving you query-specific reasoning logic without the construction cost. So the scaffold doesn't have to be a static taxonomy baked in advance; it can be assembled on the fly, per question.
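A query-time graph of this kind can be sketched with the standard library. The sub-questions and their dependencies below are written by hand as a hypothetical example; in a real system a model would produce the decomposition, and this sketch does not reproduce LogicRAG's actual procedure. It only shows how a small directed acyclic graph yields an order in which to retrieve and answer.

```python
from graphlib import TopologicalSorter


def plan_subquestions(dependencies: dict[str, set[str]]) -> list[str]:
    """Order sub-questions so each is handled after the ones it depends on.

    `dependencies` maps a sub-question to the set of sub-questions it needs first.
    graphlib raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical decomposition of:
# "Which university did the founder of the company that makes X attend?"
deps = {
    "which company makes X?": set(),
    "who founded that company?": {"which company makes X?"},
    "which university did the founder attend?": {"who founded that company?"},
}

for step in plan_subquestions(deps):
    print(step)
```

Because the graph is built from one query and discarded afterwards, nothing has to be kept in sync with the corpus.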
The corpus also names why you'd bother, by cataloguing how unstructured retrieval breaks. Retrieval failures (see "Where do retrieval systems fail and why?") are described as *architectural*: embeddings measure topical association, not task relevance, and there's a mathematical ceiling on how many documents a fixed embedding dimension can even distinguish. These aren't fixed by tuning; they're fixed by changing the retrieval shape. And verification-style scaffolds (see "Can verification separate structural near-misses from topical matches?") make a related point: a second stage operating on full token-interaction patterns catches structural near-misses that compressed-vector similarity waves through.
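A toy version of the two-stage idea makes the near-miss problem concrete. Everything here is an illustrative assumption: bag-of-words counts stand in for pooled embeddings, and a simple positional check (`aligned_fraction`, a hypothetical name) stands in for the small Transformer verifier the notes describe. The point is only that a pooled vector discards token order, while a token-token map keeps it.

```python
import math
from collections import Counter


def pooled_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (token order is discarded)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def similarity_map(a: str, b: str) -> list[list[int]]:
    """Token-token map: cell [i][j] is 1 when token i of `a` equals token j of `b`."""
    ta, tb = a.lower().split(), b.lower().split()
    return [[int(x == y) for y in tb] for x in ta]


def aligned_fraction(sim: list[list[int]]) -> float:
    """Stand-in verifier score: share of query tokens matched at the same position."""
    if not sim:
        return 0.0
    hits = sum(row[i] for i, row in enumerate(sim) if i < len(row))
    return hits / len(sim)


query = "dog bites man"
candidates = ["man bites dog", "dog bites man"]

# Stage 1: recall on pooled similarity. Both candidates pass; they share every token.
recalled = [c for c in candidates if pooled_cosine(query, c) > 0.9]
# Stage 2: verify on the token-token map. Only the structurally matching one survives.
verified = [c for c in recalled if aligned_fraction(similarity_map(query, c)) > 0.9]

print(recalled)  # ['man bites dog', 'dog bites man']
print(verified)  # ['dog bites man']
```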
The quiet caveat worth knowing: scaffolds reorganize knowledge the model already has; they don't install new knowledge. Prompt-side work (see "Can prompt optimization teach models knowledge they lack?") shows optimization can only activate what's in the training distribution, never supply what's missing. Taxonomy scaffolds are best read the same way: they're a better *access path* to latent capability, not a source of new facts. That reframes the whole question: a retrieval scaffold improves performance by lowering the reasoning load of *using* retrieved information, not by adding to it.
Sources (7 notes)
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
The LOFT benchmark shows that long-context language models (LCLMs) match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.