SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Agentic Systems and Tool Use

Can routing beat building one better model?

Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? The question tests whether performance gains are driven more by improving models or by selecting among them.

Synthesis note · 2026-02-23 · sourced from Routers

Avengers-Pro demonstrates that routing queries to different models based on semantic clustering can exceed the performance of any individual model in the pool — including frontier models. The mechanism: embed incoming queries, cluster by semantic similarity, evaluate per-cluster model performance-efficiency scores, and route each query to the highest-scoring model for its cluster.
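A minimal sketch of the routing step, in stdlib Python. Everything here is hypothetical: the model names, the 2-D toy vectors standing in for real query embeddings, the centroid and per-cluster accuracy/cost tables, and the scoring rule. The score assumes a "performance-efficiency" value that blends min-max-normalized accuracy and normalized cheapness within each cluster using a weight `alpha`; it is not presented as Avengers-Pro's exact formula.

```python
import math

# Hypothetical offline outputs (toy values): cluster centroids in a 2-D
# stand-in embedding space, and per-cluster accuracy / mean cost per model.
MODELS = ["model_a", "model_b"]
CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]
ACCURACY = [[0.90, 0.60],  # cluster 0: model_a is more accurate
            [0.55, 0.85]]  # cluster 1: model_b is more accurate
COST = [[1.0, 0.2],        # model_b is cheaper in both clusters
        [1.0, 0.2]]


def minmax(xs):
    """Min-max normalize a list to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def perf_eff_scores(acc, cost, alpha):
    """Assumed score: alpha * normalized accuracy + (1 - alpha) * normalized cheapness."""
    acc_n, cost_n = minmax(acc), minmax(cost)
    return [alpha * a + (1 - alpha) * (1 - c) for a, c in zip(acc_n, cost_n)]


def nearest_cluster(embedding):
    """Index of the centroid closest to the query embedding (Euclidean)."""
    return min(range(len(CENTROIDS)), key=lambda k: math.dist(embedding, CENTROIDS[k]))


def route(embedding, alpha=0.7):
    """Send the query to the highest-scoring model for its nearest cluster."""
    k = nearest_cluster(embedding)
    scores = perf_eff_scores(ACCURACY[k], COST[k], alpha)
    best = max(range(len(MODELS)), key=lambda m: scores[m])
    return MODELS[best]


print(route((0.3, -0.1)))             # cluster 0, accuracy-weighted -> model_a
print(route((4.8, 5.2)))              # cluster 1 -> model_b
print(route((0.3, -0.1), alpha=0.2))  # cost-weighted -> model_b
```

Lowering `alpha` shifts the trade-off toward cheaper models: with `alpha=0.2`, the same cluster-0 query is sent to the cheaper `model_b` instead of the more accurate `model_a`.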

Three results establish the claim:

The earlier Avengers work made an even more striking claim: ten models of ~7B parameters each, with routing, surpassed GPT-4.1 and GPT-4.5 across 15 datasets. This suggests the performance gain from optimal model selection can be comparable to the gap between model generations.

The architecture is lightweight: three operations at inference time (embedding, nearest-cluster lookup, score aggregation). The heavy work — fitting the clustering model and estimating per-cluster performance statistics — happens offline on a calibration set (70% for fitting, 30% for evaluation). This makes the approach deployable as a thin routing layer atop any model API ecosystem.
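The offline calibration step can be sketched the same way. This stdlib-only Python version assumes a toy calibration set (invented 2-D "embeddings" with per-model correct/incorrect labels), plain Lloyd's k-means with a deterministic farthest-first initialization, a deterministic 70/30 split by index, and per-cluster accuracy alone as the score (cost is left out for brevity). All function names are hypothetical.

```python
import math

MODELS = ["model_a", "model_b"]


def make_calibration_set(n=100):
    """Hypothetical toy data: (embedding, {model: correct?}) pairs in two regions."""
    data = []
    for i in range(n):
        off = 0.0 if i % 2 == 0 else 5.0
        emb = (off + 0.1 * (i % 7), off + 0.1 * (i % 5))
        if i % 2 == 0:  # region where model_a tends to be right
            correct = {"model_a": i % 10 != 0, "model_b": i % 4 == 0}
        else:           # region where model_b tends to be right
            correct = {"model_a": i % 4 == 1, "model_b": i % 10 != 1}
        data.append((emb, correct))
    return data


def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to its group mean; keep it in place if the group is empty.
        centroids = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids


def calibrate(data, k=2):
    """Fit clusters and per-cluster stats on 70% of the data; return the 30% held out."""
    fit = [d for i, d in enumerate(data) if i % 10 < 7]
    held_out = [d for i, d in enumerate(data) if i % 10 >= 7]
    centroids = kmeans([emb for emb, _ in fit], k)
    hits = [[0] * len(MODELS) for _ in range(k)]
    counts = [0] * k
    for emb, correct in fit:
        c = nearest(emb, centroids)
        counts[c] += 1
        for m, name in enumerate(MODELS):
            hits[c][m] += correct[name]
    # Best model per cluster = highest accuracy on that cluster's fit queries.
    best = [max(range(len(MODELS)), key=lambda m: hits[c][m] / max(counts[c], 1))
            for c in range(k)]
    return centroids, best, held_out


def held_out_accuracy(centroids, best, held_out):
    """Accuracy of the router vs. each single model on the held-out split."""
    n = len(held_out)
    router = sum(correct[MODELS[best[nearest(emb, centroids)]]] for emb, correct in held_out)
    single = {name: sum(correct[name] for _, correct in held_out) / n for name in MODELS}
    return router / n, single


centroids, best, held_out = calibrate(make_calibration_set())
print([MODELS[b] for b in best])  # -> ['model_a', 'model_b']
print(held_out_accuracy(centroids, best, held_out))
```

On this constructed data the router picks a different model per cluster and its held-out accuracy exceeds either single model's; that outcome is built into the toy labels and says nothing about real benchmarks.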

Following the question "Can we allocate inference compute based on prompt difficulty?", Avengers-Pro adds a complementary optimization axis. Compute-optimal scaling asks "how much inference budget per query?"; routing asks "which model per query?" The two are independent, so a routing layer could be composed with per-query compute allocation to optimize over a two-dimensional Pareto frontier. Relative to "Can inference compute replace scaling up model size?", routing extends the argument further: you need neither a bigger model nor more compute, only the right model for the specific query type.
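One way to picture the two-dimensional optimization is to treat each (model, compute budget) pair as a configuration with an estimated accuracy and cost for a query cluster, keep only the Pareto-optimal pairs, and then pick the most accurate one under a cost cap. The sketch below assumes such per-cluster estimates exist; the model names, budgets, and numbers are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    model: str       # which model the router would pick
    budget: int      # hypothetical test-time compute knob, e.g. samples per query
    accuracy: float  # estimated accuracy on one query cluster
    cost: float      # estimated cost per query (arbitrary units)


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost))


def pareto_front(configs: List[Config]) -> List[Config]:
    """Non-dominated (model, budget) pairs, cheapest first."""
    front = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(front, key=lambda c: c.cost)


def best_under_cap(configs: List[Config], max_cost: float) -> Optional[Config]:
    """Most accurate Pareto-optimal config whose cost fits under max_cost."""
    feasible = [c for c in pareto_front(configs) if c.cost <= max_cost]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None


# Invented estimates for a single query cluster.
CONFIGS = [
    Config("small", 1, 0.62, 1.0),
    Config("small", 4, 0.71, 4.0),
    Config("large", 1, 0.70, 5.0),
    Config("large", 4, 0.78, 20.0),
]

for c in pareto_front(CONFIGS):
    print(c.model, c.budget, c.accuracy, c.cost)
print(best_under_cap(CONFIGS, 6.0))
```

In this toy grid the large model at budget 1 is dominated by the small model at budget 4, the kind of trade-off that only shows up when model choice and compute allocation are searched jointly.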

The implication challenges the frontier-model race: rather than building one model that dominates on everything, assembling a diverse pool of loosely specialized models with good routing may be both cheaper and more effective. This aligns with the heterogeneous-architecture thesis in "Can small language models handle most agent tasks?": routing is what makes the heterogeneous approach practical.


Original note title

test-time model ensembling via embedding-cluster routing surpasses any individual frontier model — model selection is a stronger lever than model improvement