What makes query complexity a better routing signal than response quality?
This explores why systems that route a query *before* generating an answer—by predicting how hard the query is—often beat systems that judge the answer *after* it's produced.
Routing on query complexity (deciding which model or structure to use before any text is generated) tends to win over routing on response quality (generating something, then evaluating it). The corpus frames this less as 'complexity is smarter' and more as a matter of *when the decision happens* and *what it costs*. Routing is fundamentally a pre-generation move: RouteLLM and Hybrid-LLM cut cost by 40–50% by predicting query difficulty up front and sending each query to a single model, never producing a response just to score it (see "Can routers select the right model before generation happens?"). Response-quality signals, by contrast, are reward-model territory: you can only evaluate an answer that already exists, which means you have already paid for the expensive generation you were trying to avoid. Complexity routing is cheap precisely because it skips that step.
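As a rough illustration of the pre-generation decision described above, the sketch below scores a query and picks a single model before any response exists. Everything in it is an assumption made for illustration: the model names, the `toy_difficulty` heuristic, and the threshold are hypothetical, and this is not the RouteLLM or Hybrid-LLM implementation, which learn the difficulty predictor from data.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model names; substitute whatever weak/strong pair is deployed.
WEAK_MODEL = "weak-model"
STRONG_MODEL = "strong-model"


@dataclass
class Route:
    model: str
    difficulty: float


def toy_difficulty(query: str) -> float:
    """Stand-in for a learned difficulty predictor; returns a score in [0, 1].

    Assumption: a real router would train this predictor on data.
    The keyword-and-length heuristic only keeps the sketch runnable.
    """
    words = [w.strip(".,?!").lower() for w in query.split()]
    hard_markers = {"prove", "derive", "optimize", "compare"}
    length_score = min(len(words) / 40.0, 1.0)
    marker_score = 1.0 if hard_markers.intersection(words) else 0.0
    return 0.5 * length_score + 0.5 * marker_score


def route(
    query: str,
    predictor: Callable[[str], float] = toy_difficulty,
    threshold: float = 0.4,
) -> Route:
    """Choose one model from the query alone; no response is generated here."""
    score = predictor(query)
    model = STRONG_MODEL if score >= threshold else WEAK_MODEL
    return Route(model=model, difficulty=score)


if __name__ == "__main__":
    for q in ("What is the capital of France?",
              "Prove that the sum of two even integers is even."):
        print(q, "->", route(q).model)
```

The point of the design is that `route` only ever sees the query, so its cost is one cheap prediction regardless of which model ends up answering.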
The deeper reason complexity works as a signal is that it's *predictable from the query alone*, and the corpus shows it generalizes beyond cost-cutting. Avengers-Pro routes by semantic cluster and reaches 7% higher accuracy than GPT-5-medium, or matches it at 27% lower cost, suggesting that picking the right specialist per query is a stronger lever than scaling one model (see "Can routing beat building one better model?"). StructRAG pushes the same idea inside retrieval: a query's demands determine whether you should reach for a table, a graph, an algorithm, or plain chunks, and a router trained to read those demands outperforms uniform retrieval (see "Can routing queries to task-matched structures improve RAG reasoning?"). In both cases the routing signal is a property of the *question*, available before any answer, not a verdict on an answer after the fact.
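To make the cluster-based idea concrete, here is a minimal sketch of nearest-centroid routing with an accuracy-versus-cost trade-off. It is an illustration under stated assumptions, not Avengers-Pro's actual procedure: the two-dimensional "embeddings", the cluster names, the model names, and every number in `PROFILE` are made up, and a real system would obtain query vectors from an embedding model and estimate the per-cluster profiles from held-out data.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical per-cluster profile: model -> (accuracy, relative cost).
PROFILE = {
    "math": {"model-a": (0.90, 1.0), "model-b": (0.70, 0.2)},
    "code": {"model-a": (0.80, 1.0), "model-b": (0.85, 0.2)},
}


def nearest_cluster(vec):
    """Return the name of the centroid closest to the query vector."""
    return min(CENTROIDS, key=lambda name: math.dist(vec, CENTROIDS[name]))


def pick_model(vec, cost_weight=0.0):
    """Pick the model with the best accuracy-minus-weighted-cost in the query's cluster.

    cost_weight=0.0 optimizes accuracy only; larger values favor cheaper models.
    """
    cluster = nearest_cluster(vec)
    scores = {
        model: accuracy - cost_weight * cost
        for model, (accuracy, cost) in PROFILE[cluster].items()
    }
    return cluster, max(scores, key=scores.get)


if __name__ == "__main__":
    print(pick_model((0.9, 0.1)))                   # accuracy-only choice
    print(pick_model((0.9, 0.1), cost_weight=0.5))  # cost-aware choice
```

The single `cost_weight` knob is one simple way to express the two operating points mentioned above (higher accuracy, or similar accuracy at lower cost); other formulations are possible.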
Where response quality does carry real signal, the corpus suggests it's most useful *during* generation rather than after a complete answer. Step-level confidence filtering catches reasoning breakdowns mid-trace that whole-answer averaging masks, and lets you stop early, reaching majority-vote accuracy with far fewer traces (see "Does step-level confidence outperform global averaging for trace filtering?"). That's the interesting twist: 'response quality' isn't worthless, but global, after-the-fact quality scores hide exactly the breakdowns you'd want to route around, while a complexity signal read up front sidesteps the problem entirely.
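The difference between global and step-level confidence can be shown with a small sketch. It assumes each reasoning step comes with a confidence score in [0, 1] and uses a sliding-window mean as the local signal; the window size, the threshold, and the example traces are hypothetical choices for illustration, not values from the cited note.

```python
def global_confidence(step_conf):
    """Mean confidence over the whole trace (the whole-answer signal)."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return global_confidence(step_conf)
    means = [
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    ]
    return min(means)


def early_stop_step(step_conf, window=3, threshold=0.7):
    """Number of steps generated before a low-confidence window triggers a stop.

    Returns None if no window ever falls below the threshold.
    """
    for end in range(window, len(step_conf) + 1):
        if sum(step_conf[end - window:end]) / window < threshold:
            return end
    return None


# Hypothetical traces: the second one breaks down in the middle, then recovers.
steady = [0.90, 0.90, 0.85, 0.90, 0.88, 0.90]
broken = [0.95, 0.95, 0.95, 0.30, 0.30, 0.35, 0.95, 0.95, 0.95, 0.95]

if __name__ == "__main__":
    for name, trace in (("steady", steady), ("broken", broken)):
        print(name,
              round(global_confidence(trace), 3),
              round(lowest_window_confidence(trace), 3),
              early_stop_step(trace))
```

With these made-up numbers, both traces clear a 0.7 cutoff on the global mean, but only the steady one clears it on the windowed minimum, and the broken trace would be abandoned partway through rather than generated to completion.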
There's a structural argument underneath all of this. Several notes converge on the idea that separating the *decision* from the *execution* is what pays off: hierarchical research architectures that split query planning from answer synthesis reduce interference on multi-hop queries (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"), and capability-driven coordination treats matching a query to the right agent as a first-class operation rather than something inferred from outputs (see "Can semantic capability vectors replace manual agent routing?"). Complexity routing is one instance of that principle: it makes the 'which path' decision a clean, upfront classification problem instead of an expensive, after-the-fact evaluation. The less obvious takeaway is that the win isn't that complexity is a richer signal than quality. It's that quality signals arrive too late and cost too much to be good routers, while complexity is legible from the query before you've spent anything.
Sources (6 notes)
RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.