How do pre-training and distillation enable minimal routing signals to work?
This explores why a router can pick the right model from a cheap signal — query difficulty or an embedding cluster — rather than having to run the models first, and how pre-training and distillation are what make that thin signal enough.
The question is why routing can work from such a minimal signal: an estimate of a query's complexity, made *before* anything is generated, that sends the query to the right place. The corpus suggests the answer is that the hard work has already happened upstream. The routing decision is cheap precisely because pre-training and distillation have already loaded the capability into the candidate models. The router isn't deciding what a model knows; it's only deciding which already-capable model to wake up.
Start with what 'minimal' means. The note "Can routers select the right model before generation happens?" shows that routers like RouteLLM and Hybrid-LLM cut cost by 40–50% by predicting query difficulty before generation, never evaluating a response. "Can routing beat building one better model?" pushes this further: Avengers-Pro routes on nothing more than which semantic cluster a query's embedding falls into, and the note reports that ten 7B models with routing surpassed GPT-4.1 and 4.5. The signal is tiny (a complexity score, a cluster id), but it works because each destination model is already a finished, competent system. Selection becomes a stronger lever than scaling.
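To make the two signals concrete, here is a minimal sketch of both routing styles. It is not the RouteLLM or Avengers-Pro implementation: the centroids, the 2-D toy embeddings, the model names, and the 0.5 threshold are all hypothetical placeholders, and a real system would learn the difficulty score and the cluster-to-model table from data.

```python
from math import dist

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical table: which model scored best on each cluster offline.
BEST_MODEL = {"math": "small-math-7b", "code": "small-code-7b"}


def route_by_cluster(embedding):
    """Pick the model assigned to the nearest centroid (Euclidean distance)."""
    cluster = min(CENTROIDS, key=lambda name: dist(embedding, CENTROIDS[name]))
    return BEST_MODEL[cluster]


def route_by_difficulty(score, threshold=0.5):
    """Send a query to the strong model only if its predicted difficulty is high."""
    return "strong-model" if score >= threshold else "weak-model"
```

Neither function ever looks at a model's output. Everything the router knows about the destinations is baked into a lookup table and a threshold, which is why the decision can be made before generation.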
That competence is where pre-training enters. "Does RL training collapse format diversity in pretrained models?" is a useful tell: RL post-training mostly amplifies one format already latent in the pre-trained distribution rather than installing something new. In other words, the behaviors a router selects between were largely set during pre-training; a model's 'specialty' is a pre-existing region of its distribution, not something the router conjures. "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" reinforces the point from the other side: the valuable knowledge lives in the base weights, and light decoding-time steering (proxy-tuning, in that note) can redirect style and reasoning without disturbing it. A thin external signal is enough to shift behavior because the substrate is already rich.
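The decoding-time steering mentioned above can be sketched as logit arithmetic. This assumes the usual description of proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"). The function names and the three-token vocabulary are illustrative only.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base, expert, anti_expert):
    """Shift base logits by (expert - anti_expert), then normalize.

    The base model's weights are never touched; only its output
    distribution is nudged at decoding time.
    """
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)
```

If the expert and anti-expert agree, the shift is zero and the base distribution passes through unchanged, which is one way to see why the knowledge stored in the base weights is left intact.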
Distillation is what makes the *cheap* destinations worth routing to. "Can small models match large models on function calling?" shows small models, trained on a large teacher's correct and incorrect examples, matching big models on function calling: the teacher's capability is compressed into a model small enough to be one of many in a routing pool. This is why a fleet of 7B models plus a router can rival GPT-4-class systems: distillation manufactures specialists cheaply, and routing only has to point at them. "Can continuous reasoning avoid forgetting in instruction-tuned models?" echoes the architecture: SoftCoT freezes the capable backbone and attaches a small trained helper, the recurring pattern of keeping the expensive knowledge intact while a lightweight component does the steering or selecting.
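The source note describes the function-calling training as DPO on correct and incorrect teacher examples. As a rough illustration of how a rejected example enters the objective, here is the standard per-pair DPO loss written out in plain Python. The log-probability inputs and the `beta=0.1` default are assumptions for the sketch, not values taken from the note.

```python
from math import exp, log


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    policy_* are log-probs under the small model being trained;
    ref_* are log-probs under a frozen reference copy.
    """
    # How much more the policy favors the correct call over the
    # incorrect one, relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return log(1.0 + exp(-beta * margin))
```

The loss falls only when the model raises the correct call relative to the incorrect one, so the negative examples contribute directly to the gradient rather than being ignored as they would be under plain supervised fine-tuning.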
The less obvious takeaway: routing, distillation, and decoding-time tuning are three versions of the same bet, namely that the costly, knowledge-bearing computation should happen once, up front, and that everything after can be a thin, cheap signal riding on top. The router's minimalism isn't a limitation; it's evidence of how much pre-training and distillation already settled.
Sources (6 notes)
RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.