INQUIRING LINE

How does routing decide between models before generation happens?

This explores how a router picks which model to use *before* any text is generated — predicting which model fits a query rather than running several and judging the outputs afterward.


The corpus draws a sharp line here: routing is a *pre-generation* decision, fundamentally different from a reward model that scores responses after the fact. Systems like RouteLLM and Hybrid-LLM estimate a query's difficulty up front and send it to a single model, cutting cost by 40–50% while keeping latency low, because they commit to one model instead of running an ensemble or cascade and paying for each model that runs (see "Can routers select the right model before generation happens?"). The whole bet is that you can predict how hard a question is before answering it.
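The decide-then-generate bet can be made concrete with a minimal sketch. It is not RouteLLM's or Hybrid-LLM's actual method: the model names, cue words, weights, and threshold below are all hypothetical, and the hand-written heuristic stands in for a learned difficulty predictor. The point it illustrates is structural: one score, one threshold, one model chosen before any text is generated.

```python
from dataclasses import dataclass

# Hypothetical model names; any cheap/expensive pair would do.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

# Hypothetical cue phrases assumed to mark harder queries.
HARD_CUES = ("prove", "derive", "optimize", "step by step", "why")


def estimate_difficulty(query: str) -> float:
    """Return a score in [0, 1]; a stand-in for a learned difficulty predictor."""
    text = query.lower()
    cue_hits = sum(cue in text for cue in HARD_CUES)
    length_term = min(len(text.split()) / 50.0, 1.0)
    return min(0.25 * cue_hits + 0.5 * length_term, 1.0)


@dataclass(frozen=True)
class Route:
    model: str
    difficulty: float


def route_by_difficulty(query: str, threshold: float = 0.3) -> Route:
    """Commit to exactly one model before any generation happens."""
    score = estimate_difficulty(query)
    model = STRONG_MODEL if score >= threshold else CHEAP_MODEL
    return Route(model=model, difficulty=score)


print(route_by_difficulty("What is the capital of France?").model)  # small-model
```

Because the router returns a single model, only one generation call is ever paid for; the threshold is the knob that trades cost against quality.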

How does the router make that prediction? Two flavors show up. The difficulty-estimation approach asks 'is this query hard enough to need the expensive model?' The semantic approach skips difficulty entirely and asks 'what *kind* of query is this?' Avengers-Pro clusters queries by meaning and routes each cluster to the model that does best on it, beating GPT-5-medium by 7% on accuracy or matching it at 27% lower cost (see "Can routing beat building one better model?"). The striking claim there is that *which* model you pick can be a stronger lever than building a bigger one: ten 7B models with routing previously surpassed GPT-4.1 and 4.5. On this evidence, selection can beat scaling.
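The semantic flavor can be sketched in the same spirit. This is an illustration of cluster-then-route in general, not Avengers-Pro's implementation: the bag-of-words embedding is a toy substitute for a real embedding model, and the clusters, example queries, model names, and per-cluster scores are all invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real router would use a learned embedder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical clusters, each defined by a few example queries.
CLUSTER_EXAMPLES = {
    "code": ["write a python function to sort a list", "fix this bug in my code"],
    "math": ["solve the equation for x", "compute the integral of the function"],
}

# Hypothetical per-cluster accuracy, assumed to be measured offline.
CLUSTER_SCORES = {
    "code": {"model-a": 0.81, "model-b": 0.64},
    "math": {"model-a": 0.58, "model-b": 0.77},
}


def centroid(examples: list) -> Counter:
    """Sum the example embeddings to get one vector per cluster."""
    total = Counter()
    for example in examples:
        total.update(embed(example))
    return total


CENTROIDS = {name: centroid(ex) for name, ex in CLUSTER_EXAMPLES.items()}


def route_by_cluster(query: str) -> tuple:
    """Pick the nearest cluster, then the model that scores best on it."""
    q = embed(query)
    cluster = max(CENTROIDS, key=lambda name: cosine(q, CENTROIDS[name]))
    scores = CLUSTER_SCORES[cluster]
    return cluster, max(scores, key=scores.get)
```

Note that difficulty never appears: the router only asks which cluster the query resembles, and the offline score table does the rest. A cost-aware variant could rank models by a weighted mix of accuracy and price instead of accuracy alone.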

Routing also gets more complicated than 'pick one model.' In multi-agent settings the pre-generation decision expands into four entangled choices made at once: how the agents should collaborate, how many you need, what role each plays, and which LLM backs each one. MasRouter learns all four jointly through a cascaded controller, surpassing single-model routing by 3.51% accuracy while cutting HumanEval costs by 49% (see "What decisions must multi-agent routing systems optimize simultaneously?"). So 'routing' is less a single switch than a small upstream planning problem.
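To show what 'cascaded' means in practice, here is a minimal sketch in which each of the four decisions is conditioned on the ones before it. It is not MasRouter's architecture: the rules, mode names, role names, and model names are hypothetical lookup tables standing in for learned policies, and the decision order simply follows the order listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AgentSpec:
    role: str
    llm: str


@dataclass(frozen=True)
class Plan:
    collaboration: str
    agents: Tuple[AgentSpec, ...]


# Hypothetical tables standing in for learned policies.
ROLES_BY_MODE = {
    "single": ["solver"],
    "chain": ["planner", "solver", "checker"],
    "debate": ["proposer", "critic", "judge"],
}
LLM_BY_ROLE = {"planner": "large-model", "judge": "large-model"}
DEFAULT_LLM = "small-model"


def choose_collaboration(query: str) -> str:
    """Decision 1: how the agents collaborate."""
    if len(query.split()) < 8:
        return "single"
    return "debate" if "compare" in query.lower() else "chain"


def choose_agent_count(query: str, mode: str) -> int:
    """Decision 2, conditioned on the mode: how many agents to run."""
    if mode == "single":
        return 1
    return len(ROLES_BY_MODE[mode]) if len(query.split()) >= 15 else 2


def choose_roles(mode: str, count: int) -> List[str]:
    """Decision 3, conditioned on mode and count: which roles to fill."""
    return ROLES_BY_MODE[mode][:count]


def choose_llm(role: str) -> str:
    """Decision 4, conditioned on the role: which LLM backs the agent."""
    return LLM_BY_ROLE.get(role, DEFAULT_LLM)


def plan(query: str) -> Plan:
    """Run the four decisions in sequence, before any agent generates text."""
    mode = choose_collaboration(query)
    count = choose_agent_count(query, mode)
    roles = choose_roles(mode, count)
    return Plan(mode, tuple(AgentSpec(r, choose_llm(r)) for r in roles))
```

The cascade keeps the search space manageable: later choices only consider options that make sense given earlier ones, which is one reason to treat the four decisions as a single planning problem rather than four independent switches.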

The most surprising corner is that routing doesn't only happen *between* models; it can happen *inside* one. Hybrid reasoning models route at the token level, deciding which tokens deserve expensive deliberation, and recover ~91% of the gains of full reasoning that way. This connects to a deeper finding: RL post-training seems to teach a model *when* to reason rather than *how*, meaning the reasoning ability already exists and the real skill being learned is a routing-like deployment decision (see "Does RL post-training create reasoning or just deploy it?").

It is worth knowing where this idea bends. The clean 'decide-then-generate' picture assumes you can judge a query before touching it, but a related thread argues that a model's *partial* output reveals information the original query couldn't express, and that uncertainty mid-generation is a better trigger for fetching help than any up-front guess (see "Can a model's partial response guide what to retrieve next?" and "When should retrieval happen during model generation?"). That's the live tension: pure pre-generation routing is cheap and fast, but some of the best signal about what a query needs only appears once generation is already underway.
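The mid-generation alternative can be sketched as a confidence check on a draft sentence. This is a simplified illustration of low-probability-triggered retrieval, not FLARE's code: the threshold value, the token format, and the choice to build the search query from only the confident tokens are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# A draft sentence as (token text, model probability) pairs.
Token = Tuple[str, float]


def needs_retrieval(sentence: List[Token], threshold: float = 0.4) -> bool:
    """Trigger help only when some token in the draft is low-confidence."""
    return any(prob < threshold for _, prob in sentence)


def retrieval_query(sentence: List[Token], threshold: float = 0.4) -> str:
    """Build a search query from the confident tokens of the draft."""
    return " ".join(tok for tok, prob in sentence if prob >= threshold)


def check_draft(sentence: List[Token], threshold: float = 0.4) -> Optional[str]:
    """Return a retrieval query if the draft is uncertain, else None."""
    if needs_retrieval(sentence, threshold):
        return retrieval_query(sentence, threshold)
    return None


confident = [("Paris", 0.95), ("is", 0.99), ("in", 0.98), ("France", 0.97)]
uncertain = [("The", 0.9), ("bridge", 0.8), ("opened", 0.85), ("in", 0.9), ("1937", 0.2)]

print(check_draft(confident))  # None
print(check_draft(uncertain))  # The bridge opened in
```

The contrast with pre-generation routing is that the signal here (a shaky date token) does not exist until the model has started answering, which is exactly the information an up-front router cannot see.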


Sources (6 notes)

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

What decisions must multi-agent routing systems optimize simultaneously?

MasRouter shows that routing in multi-agent systems must jointly optimize collaboration topology, agent count, role allocation, and per-agent LLM assignment through a cascaded controller. This unified approach surpasses single-model routing by 3.51% accuracy while cutting HumanEval costs by 49%.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

When should retrieval happen during model generation?

Active retrieval triggered by low token probability improves both accuracy and efficiency compared to one-shot or continuous retrieval. FLARE demonstrates that models signal genuine knowledge gaps through low confidence, enabling dynamic budget allocation to actual information needs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM systems researcher evaluating pre-generation routing claims. The core question: **How do routers decide which model to use before generation starts, and has that regime shifted?**

What a curated library found — and when (findings span 2023–2026; treat as dated claims, not current truth):
• Difficulty-estimation routers cut costs 40–50% by predicting query hardness up front and sending to one model, vs. running an ensemble or cascade (2024–2025).
• Semantic/embedding-cluster routing beats frontier models by 7% accuracy or matches at 27% lower cost; ten small models + good router previously outperformed much larger systems (2024).
• Multi-agent routing requires *four simultaneous decisions* (collaboration, count, roles, LLM per agent) learned jointly; MasRouter optimizes all four (2025).
• Token-level reasoning routing inside a single model recovers ~91% of full reasoning gains; RL post-training teaches *when* to reason, not *how* — routing is the real deployment skill (2025–2026).
• **Tension:** pure pre-generation routing is fast/cheap, but partial-output uncertainty mid-generation reveals information the original query could not, triggering better help-fetching than up-front guesses (2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2404.14618 — Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing (2024).
• arXiv:2502.11133 — MasRouter: Learning to Route LLMs for Multi-Agent Systems (2025).
• arXiv:2604.15726 — LLM Reasoning Is Latent, Not the Chain of Thought (2026).
• arXiv:2305.06983 — Active Retrieval Augmented Generation (2023).

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For difficulty-estimation and semantic routing, verify whether advances in embedding quality, in-context learning, instruction tuning, or test-time compute since mid-2025 have lifted the 40–50% cost ceiling or the 27% efficiency gap. Separately: has token-level reasoning routing been superseded by end-to-end reasoning scaling (e.g., o1-style reasoning tokens)? Distinguish the durable question (can pre-generation signals beat mid-generation ones?) from the perishable limitation (current routers are weak at predicting hard queries).
(2) **Surface the strongest contradicting work from the last ~6 months.** Look for papers arguing reasoning quality *requires* full generation or that mid-generation uncertainty is the only reliable signal. Flag disagreement on whether RL teaches routing vs. capability.
(3) **Propose 2 research questions that assume the regime may have moved:** (a) Does in-context routing (e.g., teaching a router via in-context examples without retraining) close the pre- vs. mid-generation gap? (b) Can a router learn to predict not just *which* model, but *which reasoning depth per token*, from a single unified model with variable compute?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
