Can routing beat building one better model?
Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? The question tests whether performance gains come mainly from improving one model or from selecting the right model for each query.
Avengers-Pro demonstrates that routing queries to different models based on semantic clustering can exceed the performance of any individual model in the pool — including frontier models. The mechanism: embed incoming queries, cluster by semantic similarity, evaluate per-cluster model performance-efficiency scores, and route each query to the highest-scoring model for its cluster.
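The per-cluster scoring step can be sketched as follows. This is a minimal illustration, not the Avengers-Pro implementation: the weighted blend of min-max-normalized accuracy and cost, the `alpha` trade-off parameter, the record layout, and every function name are assumptions made for the example, and cluster assignments are taken as given.

```python
from collections import defaultdict


def min_max(values):
    """Scale a {key: float} dict to [0, 1]; a constant input maps to all 0.0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def cluster_scores(records, alpha=0.5):
    """records: iterable of (cluster_id, model, correct, cost) calibration rows.

    Returns {cluster_id: {model: score}}. The score blends normalized accuracy
    (weight alpha) with normalized cheapness (weight 1 - alpha); this blend is
    an assumption of the sketch, not a confirmed formula.
    """
    hits, costs = defaultdict(list), defaultdict(list)
    for cluster, model, correct, cost in records:
        hits[(cluster, model)].append(1.0 if correct else 0.0)
        costs[(cluster, model)].append(float(cost))

    accuracy, mean_cost = defaultdict(dict), defaultdict(dict)
    for (cluster, model), h in hits.items():
        accuracy[cluster][model] = sum(h) / len(h)
        mean_cost[cluster][model] = sum(costs[(cluster, model)]) / len(h)

    scores = {}
    for cluster in accuracy:
        acc_n = min_max(accuracy[cluster])
        cost_n = min_max(mean_cost[cluster])
        scores[cluster] = {
            m: alpha * acc_n[m] + (1.0 - alpha) * (1.0 - cost_n[m])
            for m in acc_n
        }
    return scores


def best_model_per_cluster(scores):
    """Pick the highest-scoring model for each cluster."""
    return {cluster: max(s, key=s.get) for cluster, s in scores.items()}


# Made-up calibration rows for illustration only (not results from the paper).
toy_records = [
    ("math", "large", True, 10.0), ("math", "large", True, 10.0),
    ("math", "small", True, 1.0), ("math", "small", False, 1.0),
    ("chat", "large", True, 10.0), ("chat", "large", True, 10.0),
    ("chat", "small", True, 1.0), ("chat", "small", True, 1.0),
]
accuracy_first = best_model_per_cluster(cluster_scores(toy_records, alpha=0.9))
cost_first = best_model_per_cluster(cluster_scores(toy_records, alpha=0.1))
```

With these toy rows, a high `alpha` sends the "math" cluster to the large model while "chat", where both models are equally accurate, goes to the cheaper one; a low `alpha` sends both clusters to the small model.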
Three results establish the claim:
- Performance: +7% average accuracy over GPT-5-medium (the strongest individual model in the pool) across 6 benchmarks
- Efficiency at parity: matches GPT-5-medium accuracy at 27% lower cost
- Efficiency at near-parity: reaches ~90% of GPT-5-medium performance at 63% lower cost
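The two efficiency results can be restated as accuracy per unit of cost, relative to GPT-5-medium. The sketch below only reworks the figures quoted above; treating "~90%" as exactly 0.90 is an assumption, and `accuracy_per_cost` is a hypothetical helper name.

```python
def accuracy_per_cost(relative_accuracy, cost_reduction):
    """Accuracy delivered per unit of spend, with the baseline model at 1.0."""
    return relative_accuracy / (1.0 - cost_reduction)


# Parity: same accuracy at 27% lower cost.
parity_ratio = accuracy_per_cost(1.00, 0.27)       # 1.00 / 0.73

# Near-parity: roughly 90% of the accuracy at 63% lower cost.
near_parity_ratio = accuracy_per_cost(0.90, 0.63)  # 0.90 / 0.37
```

Under these figures the routed system delivers about 1.37 times the baseline's accuracy per unit cost at parity, and about 2.43 times at near-parity.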
The earlier Avengers work made an even more striking claim: ten models of ~7B parameters each, with routing, surpassed GPT-4.1 and GPT-4.5 across 15 datasets. This suggests the performance gain from optimal model selection can be comparable to the gap between model generations.
The architecture is lightweight: three operations at inference time (embedding, nearest-cluster lookup, score aggregation). The heavy work — fitting the clustering model and estimating per-cluster performance statistics — happens offline on a calibration set (70% for fitting, 30% for evaluation). This makes the approach deployable as a thin routing layer atop any model API ecosystem.
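The three inference-time operations can be sketched as below, assuming the offline stage has already produced cluster centroids and per-cluster model scores. Euclidean distance, aggregating over the `top_k` nearest clusters by summing scores, and all names here (`route`, `nearest_clusters`, `embed_fn`, the toy artifacts) are assumptions for illustration; `embed_fn` stands in for whatever embedding model is actually used.

```python
import math


def nearest_clusters(vector, centroids, top_k):
    """Return the ids of the top_k centroids closest to the query vector."""
    ranked = sorted((math.dist(vector, c), cid) for cid, c in centroids.items())
    return [cid for _, cid in ranked[:top_k]]


def route(query, embed_fn, centroids, scores, top_k=2):
    """Pick a model for one query using precomputed offline artifacts."""
    vector = embed_fn(query)                                # 1. embedding
    clusters = nearest_clusters(vector, centroids, top_k)   # 2. nearest-cluster lookup
    totals = {}
    for cid in clusters:                                    # 3. score aggregation
        for model, score in scores[cid].items():
            totals[model] = totals.get(model, 0.0) + score
    return max(totals, key=totals.get)


def identity_embed(vector):
    """Toy stand-in for an embedder: the 'query' is already a 2-D vector."""
    return vector


# Hypothetical offline artifacts (made up for the example).
toy_centroids = {"math": (1.0, 0.0), "code": (0.8, 0.6), "chat": (0.0, 1.0)}
toy_scores = {
    "math": {"large": 0.9, "small": 0.1},
    "code": {"large": 0.6, "small": 0.5},
    "chat": {"large": 0.0, "small": 0.8},
}

math_like = route((0.9, 0.1), identity_embed, toy_centroids, toy_scores)
chat_like = route((0.1, 0.9), identity_embed, toy_centroids, toy_scores)
```

Because the router only reads two lookup tables, refreshing the calibration statistics or swapping a model in the pool changes the tables, not the inference path.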
Relative to "Can we allocate inference compute based on prompt difficulty?", Avengers-Pro adds a complementary optimization axis. Compute-optimal scaling asks "how much inference budget per query?" Routing asks "which model per query?" The two choices are independent, so a routing layer could be composed with per-query compute allocation for a two-dimensional Pareto optimization. Relative to "Can inference compute replace scaling up model size?", routing extends the substitution argument: you need neither a bigger model nor more compute, only the right model for the specific query type.
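The two-dimensional optimization can be pictured as a Pareto filter over (model, compute budget) configurations. The sketch below is illustrative only: the configuration names, costs, and accuracies are invented, and `pareto_front` is a hypothetical helper rather than anything from the Avengers-Pro work.

```python
def pareto_front(configs):
    """configs: list of (name, cost, accuracy) tuples.

    Keep every configuration that no other configuration beats on both axes,
    meaning no other one is at least as cheap and at least as accurate while
    being strictly better on one of the two. Result is sorted by cost.
    """
    front = []
    for name, cost, acc in configs:
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in configs
        )
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda item: item[1])


# Invented numbers: each entry pairs a model choice with a sampling budget.
toy_configs = [
    ("small x1", 1.0, 0.60),
    ("small x4", 4.0, 0.72),
    ("small x16", 16.0, 0.75),
    ("large x1", 8.0, 0.80),
    ("large x4", 32.0, 0.86),
]
front_names = [name for name, _, _ in pareto_front(toy_configs)]
```

In this toy grid "small x16" drops out because "large x1" is both cheaper and more accurate, which is the kind of trade a joint router-plus-budget policy would exploit.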
The implication challenges the frontier model race: rather than building one model that dominates on everything, assembling a diverse pool of partly specialized models with good routing may be both cheaper and more effective. This aligns with the heterogeneous architecture thesis in "Can small language models handle most agent tasks?": routing is what makes the heterogeneous approach practical.
Inquiring lines that use this note as a source (62)
This note is a source for these synthesized inquiries.
- Why do naive baselines outperform trained models in entity-level CRS evaluation?
- Why does single-model routing beat ensemble and cascade approaches on latency?
- What makes query complexity a better routing signal than response quality?
- Can routing enable heterogeneous SLM-first architectures at scale?
- Should model routing decisions account for prompt-tier dependencies?
- Can this distillation pattern apply beyond e-commerce to other latency-constrained domains?
- Why is latency budget a constraint for e-commerce rankers?
- How does nesting optimization levels improve on traditional network depth?
- Can semantic clustering of stakeholders preserve meaningful evaluative diversity without manual curation?
- Can bilevel autoresearch discover new search mechanisms for the inner research loop?
- Can bilevel autoresearch succeed when the inner and outer loops use different models?
- Can model routing and compute allocation work together as independent optimizations?
- Why do hybrid paradigms outperform pure autoregressive or pure diffusion approaches?
- What capability risks emerge when models are optimized for single domains?
- What hidden costs emerge when you fine-tune models for a single domain?
- How do multi-agent systems improve on single frontier models?
- Can routing systems prevent expert models from failing outside their specialty?
- Do different domains require different types of model investment?
- Can smaller models actually perform well on specific downstream tasks?
- How should query augmentation strategies be properly evaluated against baselines?
- How can smaller models help select useful data for larger models?
- How do hierarchical query planning architectures improve multi-hop retrieval?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- How should topology routing adapt to different task types?
- Can hierarchical vector routing reduce context overhead while maintaining tool coverage?
- Why do production teams choose expensive frontier models over fine-tuning?
- How do different feed-weighting schemes construct distinct network topologies at population scale?
- Why does revision often make reasoning accuracy worse in frontier models?
- How do routers decide when to escalate from small to large models?
- Do small models show different parameter efficiency patterns than large models?
- Can multiple small models outperform a single large model with good routing?
- Which architectural choices matter most when a model must fit one billion parameters?
- How does semantic clustering help decide which model handles each query?
- Can compute allocation and model routing be combined for better results?
- Why might diverse smaller models with routing beat one giant model?
- What makes routing a better investment than training larger models?
- What consumption data would validate the limited-consumption model in production systems?
- How do feature-based approaches compare to aggregation methods for cold-start?
- Why do frontier models corrupt more documents than weaker models during workflows?
- What makes capability vectors a better coordination substrate than topic-based routing?
- Can embedding-cluster routing outperform a single frontier model?
- How does routing decide between models before generation happens?
- When does clustering users by preference overcome the aggregation dilemma?
- Does model capability still matter once coordination infrastructure is optimized?
- How do sharded HNSW indices preserve capability distinctions at scale?
- Can semantic routing couple similarity matching with resource constraints?
- How does workflow scale change the failure modes of frontier models?
- Can review effort alone keep pace with frontier model degradation?
- How do pre-training and distillation enable minimal routing signals to work?
- Why do frontier models remain cost-effective despite higher token prices in production?
- Which aggregation method best exploits diversity in generated solutions?
- How much does workflow architecture matter compared to raw model capability in forecasting?
- Can external managers optimize context better than the model itself?
- What organizational bottlenecks emerge when expertise concentrates in few specialists?
- How does upward distillation transfer knowledge from smaller to larger networks?
- Can a single Elo ranking represent multidimensional model capability?
- How do search and reasoning workflows improve forecasting performance over base models?
- Can the same problem be solved by multiple evolutionary search strategies?
- Can scaling data alone solve performance gaps on long-tail concepts?
- Can smaller models produce skill updates as useful as frontier model updates?
- What makes mixture-of-experts routing learn token-level specialization effectively?
- Why does Branch-Train-Merge fail without learned routing between experts?
Related concepts in this collection (4)
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  complementary axis: compute allocation + model selection = two-dimensional optimization
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  routing extends substitution: right model > bigger model
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  routing is the mechanism enabling heterogeneous architectures
- Can routers select the right model before generation happens?
  Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
  single-model routing as the base case this extends to multi-model pools
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- RouteLLM: Learning to Route LLMs with Preference Data
- MasRouter: Learning to Route LLMs for Multi-Agent Systems
- RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
- Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- Test-Time Scaling with Reflective Generative Model
Original note title
test-time model ensembling via embedding-cluster routing surpasses any individual frontier model — model selection is a stronger lever than model improvement