SYNTHESIS NOTE

Can routing beat building one better model?

Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? The question is whether model improvement or model selection drives performance gains.

Synthesis note · 2026-02-23 · sourced from Routers

Avengers-Pro demonstrates that routing queries to different models based on semantic clustering can exceed the performance of any individual model in the pool, including frontier models. The mechanism has four steps: embed each incoming query, assign it to a cluster by semantic similarity, look up each model's performance-efficiency score for that cluster, and route the query to the highest-scoring model.
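The mechanism above can be sketched in a few lines. This is a minimal illustration, not the Avengers-Pro implementation: the 2-D embeddings, the cluster names, the model names (`model_a`, `model_b`), and every score are hypothetical placeholders, and a real system would obtain embeddings from an embedding model and centroids from an offline clustering step.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical per-cluster scores for each model (higher is better).
CLUSTER_SCORES = {
    "math": {"model_a": 0.82, "model_b": 0.64},
    "code": {"model_a": 0.58, "model_b": 0.91},
}


def nearest_cluster(embedding):
    """Return the name of the centroid closest to the query embedding."""
    return min(CENTROIDS, key=lambda name: math.dist(embedding, CENTROIDS[name]))


def route(embedding):
    """Return (cluster, model): the query's cluster and its top-scoring model."""
    cluster = nearest_cluster(embedding)
    scores = CLUSTER_SCORES[cluster]
    return cluster, max(scores, key=scores.get)
```

Calling `route((0.9, 0.1))` sends the query to the model that scores best on the "math" cluster, while a query near the "code" centroid goes to a different model. No single model has to be best everywhere for the pool to win overall.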

Three results establish the claim:

Performance: +7% average accuracy over GPT-5-medium (the strongest individual model in the pool) across 6 benchmarks
Efficiency at parity: matches GPT-5-medium accuracy at 27% lower cost
Efficiency at near-parity: reaches ~90% of GPT-5-medium performance at 63% lower cost
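The three results read as different operating points on one accuracy-cost curve. One way to produce such points is a trade-off weight that blends normalized accuracy with normalized cost. The sketch below assumes a simple weighted sum; the note does not state the exact scoring formula, and the model names, accuracies, and costs are invented for illustration.

```python
def normalize(values):
    """Min-max normalize a dict of numbers to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return {k: (v - lo) / span for k, v in values.items()}


def tradeoff_scores(accuracy, cost, alpha):
    """Blend accuracy and cheapness; alpha=1 ignores cost, alpha=0 ignores accuracy."""
    acc_n = normalize(accuracy)
    cost_n = normalize(cost)
    return {m: alpha * acc_n[m] + (1 - alpha) * (1 - cost_n[m]) for m in accuracy}


def pick(accuracy, cost, alpha):
    """Return the model with the highest blended score."""
    scores = tradeoff_scores(accuracy, cost, alpha)
    return max(scores, key=scores.get)


# Invented numbers for a single cluster: accuracy in [0, 1], cost in arbitrary units.
ACCURACY = {"large": 0.80, "mid": 0.74, "small": 0.60}
COST = {"large": 10.0, "mid": 3.0, "small": 1.0}
```

With these numbers, `pick(ACCURACY, COST, 1.0)` selects `"large"`, `pick(ACCURACY, COST, 0.0)` selects `"small"`, and `pick(ACCURACY, COST, 0.5)` selects `"mid"`. Sweeping the weight is what lets one router serve both a maximum-accuracy setting and a cost-reduced setting.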

The earlier Avengers work made an even more striking claim: ten models of ~7B parameters each, with routing, surpassed GPT-4.1 and 4.5 across 15 datasets. This suggests the performance gain from optimal model selection can be comparable to the gap between model generations.

The architecture is lightweight: three operations at inference time (embedding, nearest-cluster lookup, score aggregation). The heavy work — fitting the clustering model and estimating per-cluster performance statistics — happens offline on a calibration set (70% for fitting, 30% for evaluation). This makes the approach deployable as a thin routing layer atop any model API ecosystem.
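The offline/online split can be sketched as follows. This is an illustration under stated assumptions: centroids are taken as given (for example from a k-means fit, not shown), each calibration record is a hypothetical `(embedding, {model: correct})` pair, and "score aggregation" is interpreted here as summing scores over the `top_k` nearest clusters, which the note does not spell out.

```python
import math
import random


def split_calibration(records, fit_fraction=0.7, seed=0):
    """Shuffle and split records into a fitting part and a held-out part."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fit_fraction)
    return shuffled[:cut], shuffled[cut:]


def nearest(centroids, embedding, k=1):
    """Indices of the k centroids closest to the embedding."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(embedding, centroids[i]))
    return order[:k]


def estimate_cluster_scores(records, centroids):
    """Offline: mean correctness of each model within each cluster."""
    totals = [dict() for _ in centroids]
    counts = [0] * len(centroids)
    for embedding, correct_by_model in records:
        c = nearest(centroids, embedding)[0]
        counts[c] += 1
        for model, correct in correct_by_model.items():
            totals[c][model] = totals[c].get(model, 0.0) + float(correct)
    return [{m: s / counts[c] for m, s in totals[c].items()}
            for c in range(len(centroids))]


def route(embedding, centroids, cluster_scores, top_k=2):
    """Online: look up the nearest clusters, sum their scores, take the best model."""
    summed = {}
    for c in nearest(centroids, embedding, top_k):
        for model, score in cluster_scores[c].items():
            summed[model] = summed.get(model, 0.0) + score
    return max(summed, key=summed.get)
```

In use, `estimate_cluster_scores` would run once on the fitting part returned by `split_calibration`, and the held-out part would be used to check routing quality. Only `route` runs per query, and its cost is one embedding call plus a distance computation against the centroids.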

Relative to "Can we allocate inference compute based on prompt difficulty?", Avengers-Pro adds a complementary optimization axis. Compute-optimal scaling asks "how much inference budget per query?" Routing asks "which model per query?" The two choices are independent, so a routing layer could be composed with per-query compute allocation for a two-dimensional Pareto optimization. Routing also extends the argument in "Can inference compute replace scaling up model size?": you may need neither a bigger model nor more compute, only the right model for the specific query type.
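The two-dimensional idea can be made concrete by treating each (model, inference budget) pair as one configuration and keeping only those on the Pareto front. The sketch below is a hypothetical illustration: the configuration grid, the `samples` field standing in for inference budget, and all costs and accuracies are invented, and the note does not report such a combined experiment.

```python
def pareto_front(configs):
    """Keep configurations not dominated on (lower cost, higher accuracy)."""
    front = []
    for c in configs:
        dominated = any(
            o["cost"] <= c["cost"] and o["accuracy"] >= c["accuracy"]
            and (o["cost"] < c["cost"] or o["accuracy"] > c["accuracy"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["cost"])


def best_under_budget(configs, max_cost):
    """Most accurate Pareto-optimal configuration that fits the cost limit."""
    affordable = [c for c in pareto_front(configs) if c["cost"] <= max_cost]
    return max(affordable, key=lambda c: c["accuracy"]) if affordable else None


# Hypothetical (model, samples) grid with invented numbers for one query type.
CONFIGS = [
    {"model": "small", "samples": 1, "cost": 1.0, "accuracy": 0.55},
    {"model": "small", "samples": 8, "cost": 8.0, "accuracy": 0.70},
    {"model": "large", "samples": 1, "cost": 5.0, "accuracy": 0.72},
    {"model": "large", "samples": 4, "cost": 20.0, "accuracy": 0.78},
]
```

In this invented grid, giving the small model eight samples is dominated by a single call to the large model, so it drops off the front. A router that searched only one axis (model choice alone, or budget alone) would not see that comparison.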

The implication challenges the frontier model race: rather than building one model that dominates on everything, assembling a diverse pool of partly specialized models with good routing may be both cheaper and more effective. This aligns with the heterogeneous architecture thesis in "Can small language models handle most agent tasks?": routing is what makes the heterogeneous approach practical.

Related concepts in this collection 4

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
complementary axis: compute allocation + model selection = two-dimensional optimization
Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
routing extends substitution: right model > bigger model
Can small language models handle most agent tasks? Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
routing is the mechanism enabling heterogeneous architectures
Can routers select the right model before generation happens? Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
single-model routing is the base case; this note extends it to multi-model pools

Original note title

test-time model ensembling via embedding-cluster routing surpasses any individual frontier model — model selection is a stronger lever than model improvement
