Can specialized components replace single fully-trained models in deployment?
This explores whether you can build deployments out of small, specialized parts — little models, swappable expert skills, task-specific training — instead of relying on one big model that does everything.
More precisely: can small models, swappable expert vectors, and task-decomposed training stand in for a single fully-trained generalist at deployment time? The corpus leans toward yes, with sharp conditions: the economics favor specialization and the architecture allows it, but the gains come with hidden costs you have to manage.
The strongest economic case is that most of what agents actually do is repetitive and well-defined, and small language models handle exactly that work at 10–30× lower cost than a frontier model. The rational pattern isn't one big model; it's a heterogeneous mix, with small models by default and a large one called in only when needed (see "Can small language models handle most agent tasks?"). That mix only holds up if the small models are genuinely competent, and there's evidence they can be: small models trained with DPO on a teacher's correct-and-incorrect examples match much larger models on function calling, because the explicit negative examples fix the rigid-format failures that ordinary fine-tuning leaves behind (see "Can small models match large models on function calling?"). Decomposing a capability also helps: breaking function calling into seven granular subtasks and training across them generalizes better than one umbrella dataset and closes the gap with GPT, Claude, and Gemini (see "Can breaking function calling into subtasks improve model generalization?").
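The small-by-default pattern described above can be sketched as a simple escalation router. This is a minimal illustration, not a method from the cited notes: the `Reply.confidence` field, the `threshold`, the stub models, and the relative prices (`SMALL_COST`, `LARGE_COST`) are all assumptions chosen for the example. The only figure taken from the text is that small models are somewhere in the 10–30× cheaper range.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Reply:
    text: str
    confidence: float  # hypothetical score in [0, 1]; real models may not expose one


# Assumed relative prices per call (the text only gives a 10-30x range).
SMALL_COST = 1.0
LARGE_COST = 20.0


def route(prompt: str,
          small: Callable[[str], Reply],
          large: Callable[[str], Reply],
          threshold: float = 0.8) -> Tuple[Reply, float]:
    """Try the small model first; escalate only when its confidence is low."""
    reply = small(prompt)
    cost = SMALL_COST
    if reply.confidence < threshold:
        reply = large(prompt)
        cost += LARGE_COST  # an escalated request pays for both calls
    return reply, cost


def blended_cost(escalation_rate: float) -> float:
    """Expected cost per request when a given fraction of requests escalate."""
    return SMALL_COST + escalation_rate * LARGE_COST


# Stub models, for demonstration only.
def small_stub(prompt: str) -> Reply:
    # Pretend the small model is confident on short, well-defined requests.
    return Reply("small:" + prompt, 0.95 if len(prompt) < 40 else 0.3)


def large_stub(prompt: str) -> Reply:
    return Reply("large:" + prompt, 0.99)
```

Under these assumed prices, a 10% escalation rate gives `blended_cost(0.1) == 3.0` per request, against 20.0 for sending everything to the large model. The saving shrinks quickly as the escalation rate rises, which is why the pattern depends on the small models being genuinely competent.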
The most striking version of "components instead of one model" doesn't even need separate models. Transformer² tunes only the singular values inside weight matrices to produce expert vectors that mix dynamically at inference, composing the right specialists on the fly without the interference you'd normally get, and outperforming LoRA with fewer parameters (see "Can models dynamically activate expert skills at inference time?"). This reframes the question: maybe the single model and the specialized components are the same artifact, with specialization selected at runtime rather than baked in. That view is reinforced by evidence that RL post-training mostly teaches a model *when* to reason rather than *how*: the reasoning is already latent, and hybrid routing recovers 91% of the gains by just deciding which tokens get the deeper treatment (see "Does RL post-training create reasoning or just deploy it?").
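To make the singular-value idea concrete, here is a toy sketch in pure Python. It is not the Transformer² implementation: the 2×2 matrix, the expert names, their values, and the hand-supplied mixing weights are all hypothetical. It only illustrates the shape of the technique described above: the base weight stays frozen in factored form `W = U · diag(S) · Vt`, each expert is a vector with one scale per singular value, and experts are combined at inference by mixing those vectors.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diag(values: List[float]) -> Matrix:
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Frozen base weight in factored form W = U @ diag(S) @ VT.
# U and VT are trivially orthogonal matrices picked for illustration.
U: Matrix = [[0.0, 1.0], [1.0, 0.0]]
S: List[float] = [3.0, 1.0]
VT: Matrix = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical expert vectors: one multiplicative scale per singular value.
EXPERTS: Dict[str, List[float]] = {
    "math": [1.5, 0.5],
    "code": [0.5, 2.0],
}


def mix_experts(weights: Dict[str, float]) -> List[float]:
    """Weighted sum of expert vectors; the weights are supplied by the caller."""
    return [sum(w * EXPERTS[name][i] for name, w in weights.items())
            for i in range(len(S))]


def adapted_weight(z: List[float]) -> Matrix:
    """Rebuild the weight with singular values rescaled by z; U and VT are untouched."""
    scaled = [s * zi for s, zi in zip(S, z)]
    return matmul(matmul(U, diag(scaled)), VT)


base = adapted_weight([1.0, 1.0])                       # z = 1 recovers the base weight
blend = adapted_weight(mix_experts({"math": 0.5, "code": 0.5}))
```

Each expert here stores only `len(S)` numbers rather than a full matrix, which is the intuition behind the parameter savings the text mentions. How the mixing weights are chosen at inference is outside the scope of this sketch.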
The catch is that specialization is never free. Every domain adaptation method has a narrow sweet spot, and visible performance gains often hide degradation in reasoning faithfulness, capability transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?"). Training pressure can actively narrow a model: RL collapses the diversity of output formats onto a single dominant one within the first epoch (see "Does RL training collapse format diversity in pretrained models?"), and overly hard training samples can teach degenerate shortcuts that contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So a specialized component can be sharper on its task while quietly worse at the things you didn't measure.
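The source notes attribute part of the hard-sample problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A small worked example shows why. The sketch assumes a common form of that normalization (reward minus group mean, divided by group standard deviation); the group size of 16 and the binary rewards are illustrative choices, not figures from the notes.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# 16 rollouts on a nearly impossible problem: a single accidental success.
hard_group = [1.0] + [0.0] * 15

# 16 rollouts on a moderately hard problem: half succeed.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

# Advantage assigned to a successful rollout in each group.
lucky_success = hard_adv[0]         # about sqrt(15), roughly 3.87
ordinary_success = moderate_adv[0]  # about 1.0
```

With binary rewards and success rate p, the advantage of a success works out to sqrt((1 - p) / p), so the rarer the success, the larger the update it receives. If that lone success came from a shortcut such as repeating an answer or skipping computation, the shortcut is what gets reinforced most strongly.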
Two deeper cautions reframe the whole bet. First, swapping in a specialized model is never just swapping a model: turning an LLM into an action-capable agent requires transforming the surrounding pipeline (data, grounding, memory, tools, safety), and the harness is what determines whether actions are grounded or hallucinated (see "Can you turn an LLM into an agent by just fine-tuning?"). Second, if you were hoping to get diversity by ensembling many different models, the corpus warns of an "Artificial Hivemind": 70+ models converge on strikingly similar outputs because they share training data and alignment, so a pile of nominally distinct models may not buy you the variety you assumed (see "Do different AI models actually produce diverse outputs?"). The surprise here is that the real competition isn't "specialized parts vs. one model"; it's runtime composition vs. baked-in training, and the cheaper, more flexible answer keeps turning out to be selecting specialization at inference rather than committing to it up front.
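Before counting on an ensemble for variety, it may be worth measuring how much the members actually differ. The sketch below uses mean pairwise Jaccard overlap of word sets as a crude similarity score. This is an assumed stand-in for illustration, not the measurement used in the cited study, and the model names and outputs are invented.

```python
from itertools import combinations
from typing import Dict, Set


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens; deliberately simplistic."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; 1.0 means identical vocabularies."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(outputs: Dict[str, str]) -> float:
    """Average Jaccard score over every pair of model outputs."""
    pairs = list(combinations(outputs.values(), 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented outputs from three hypothetical models answering the same open-ended prompt.
sample = {
    "model_a": "time is a river",
    "model_b": "time is a river",
    "model_c": "time is a thief",
}
score = mean_pairwise_similarity(sample)
```

A score near 1.0 across many prompts would suggest the ensemble is paying for several models while getting roughly one opinion. Word overlap misses paraphrases, so an embedding-based similarity would likely be a better choice in practice.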
Sources (10 notes)
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.