How do ensemble methods apply within a single model?
This explores how the ensemble idea — combining many models for better, more robust answers — gets folded *inside* one model: composing specialist sub-skills, searching weight space, or routing subtasks, rather than running separate networks in parallel.
The corpus has a surprisingly rich answer, and it starts with a warning: classic ensembling only pays off when the members *disagree*, because combining outputs can cancel errors only if those errors differ. INFINITY-CHAT's study of 70+ models found an "Artificial Hivemind": different LLMs independently converge on nearly identical outputs because they share training data and alignment recipes (Do different AI models actually produce diverse outputs?). If your ensemble members all say the same thing, you've paid for diversity you didn't get. That's the motivation for moving the ensemble *into* one model, where you can engineer the diversity deliberately.
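To make the "paid for diversity you didn't get" point concrete, here is a small arithmetic sketch. It assumes a three-member majority vote where each member is right 70% of the time; both numbers are illustrative and do not come from INFINITY-CHAT. With independent errors the vote beats any single member, while with identical members it cannot.

```python
from itertools import product


def majority_vote_accuracy(p: float, n: int) -> float:
    """Accuracy of an n-member majority vote (n odd) when each member is
    right with probability p and members err independently."""
    total = 0.0
    for outcome in product([True, False], repeat=n):
        if sum(outcome) > n // 2:  # a strict majority got it right
            prob = 1.0
            for ok in outcome:
                prob *= p if ok else (1.0 - p)
            total += prob
    return total


def identical_members_accuracy(p: float, n: int) -> float:
    """If all n members always give the same answer, the vote is just one member."""
    return p


independent = majority_vote_accuracy(0.7, 3)
hivemind = identical_members_accuracy(0.7, 3)
print(f"independent: {independent:.3f}, identical: {hivemind:.3f}")
# prints "independent: 0.784, identical: 0.700"
```

Real models sit somewhere between these two extremes, and the hivemind finding suggests they sit closer to the identical end than ensemble builders would hope.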
The cleanest version of this is composing experts at inference time. Transformer² tunes only the singular values inside weight matrices to produce "expert vectors" that mix dynamically when a prompt arrives, effectively assembling a task-specific ensemble on the fly from one model's parameters, without the interference that plagues stacking adapters (Can models dynamically activate expert skills at inference time?). A more exotic cousin treats a population of models as a swarm: "particles" in the style of particle swarm optimization (PSO) move through weight space and discover composed experts that can answer questions *none* of the starting experts could, using only 200 validation examples and no gradient training (Can language models discover new expertise through collaborative weight search?). Both are ensembles in spirit, with many specialists combined, but the combination happens in one model's weights.
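The singular-value idea can be sketched in a few lines. This is a toy under stated assumptions, not the Transformer² implementation: the 2x2 factors `U`, `sigma`, and `Vt`, the expert vectors, and the mixing weights `alpha` are all hypothetical, and mixing the experts by a weighted sum is an assumed rule for illustration.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rebuild(u, s, vt):
    """Rebuild W = U diag(s) V^T from its factors."""
    scaled = [[u[i][k] * s[k] for k in range(len(s))] for i in range(len(u))]
    return matmul(scaled, vt)


def mix_experts(expert_vectors, weights):
    """Weighted sum of per-singular-value scaling vectors (assumed mixing rule)."""
    dim = len(next(iter(expert_vectors.values())))
    return [sum(weights[name] * vec[i] for name, vec in expert_vectors.items())
            for i in range(dim)]


# Hypothetical frozen factors of one weight matrix; a rotation keeps U orthonormal.
c, s_ = math.cos(0.3), math.sin(0.3)
U = [[c, -s_], [s_, c]]
Vt = [[1.0, 0.0], [0.0, 1.0]]
sigma = [2.0, 0.5]

# Hypothetical expert vectors: each one only rescales the singular values.
experts = {"math": [1.2, 0.8], "code": [0.9, 1.5]}
# Hypothetical prompt-dependent mixing weights.
alpha = {"math": 0.75, "code": 0.25}

z = mix_experts(experts, alpha)
adapted_sigma = [sv * zi for sv, zi in zip(sigma, z)]
W_adapted = rebuild(U, adapted_sigma, Vt)
print(z, W_adapted)
```

Because every expert touches only the singular values, an expert is a vector as long as the matrix rank rather than a full adapter, which is one plausible reading of why such experts compose without stepping on each other.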
The other route is decomposition: split a task into subtasks and let one model train across all of them, which is multi-task learning doing the work an ensemble would. Granite-20B-FunctionCalling breaks function calling into seven granular subtasks and generalizes better than training on umbrella datasets such as ToolLLM (Can breaking function calling into subtasks improve model generalization?). Nexus splits forecasting into contextualization, macro/micro outlook, and synthesis stages so one system doesn't have to juggle numerical and contextual reasoning at once (Can decomposing forecasting into stages unlock numerical and contextual reasoning?). The catch is interference: cramming many tasks into shared weights makes them step on each other. The fix reported in the corpus is structural isolation: identify the core parameters each task needs, freeze those, and merge the rest, which beats plain multi-task fine-tuning (Can isolating task-specific parameters prevent multi-task fine-tuning interference?).
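A minimal sketch of that freeze-and-merge step, under loud assumptions: "core" parameters are picked here as the `k` weights that moved most during fine-tuning, and plain averaging stands in for the geometric merging the source note describes. The function names and the four-weight toy model are hypothetical.

```python
def core_indices(base, tuned, k):
    """Indices of the k parameters that moved most during fine-tuning
    (an assumed selection rule, chosen for illustration)."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])


def isolate_and_merge(base, tuned_by_task, k):
    """Keep each task's core parameters untouched and average everything else."""
    cores = {task: core_indices(base, w, k) for task, w in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        owners = [task for task, idx in cores.items() if i in idx]
        if len(owners) == 1:
            # Core for exactly one task: keep that task's value as-is.
            merged.append(tuned_by_task[owners[0]][i])
        else:
            # Non-core or contested: simple average across all tasks.
            values = [w[i] for w in tuned_by_task.values()]
            merged.append(sum(values) / len(values))
    return merged, cores


base = [0.0, 0.0, 0.0, 0.0]
tuned = {
    "task_a": [1.0, 0.1, 0.0, 0.2],
    "task_b": [0.1, 0.0, -0.8, 0.4],
}
merged, cores = isolate_and_merge(base, tuned, k=1)
print(cores)   # task_a owns index 0, task_b owns index 2
print(merged)
```

The point of the toy is the asymmetry: index 0 keeps task A's large update intact instead of being diluted by task B's small one, which is the interference that averaging everything would cause.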
The thread that ties these together, and the thing worth taking away, is that an in-model ensemble lives or dies on *preserved diversity*. The same corpus shows how easily single-model training collapses it: RL post-training amplifies one pretraining format and suppresses the rest within a single epoch (Does RL training collapse format diversity in pretrained models?), and training-time critique exists largely to counteract that narrowing and keep solutions varied (Do critique models improve diversity during training itself?). So the question of how ensembles fit in one model becomes: can you keep enough internal disagreement alive, across experts, parameters, or subtasks, that combining them still buys you something? The methods above are all different bets on yes.
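One hedged way to put a number on "preserved diversity" is the Shannon entropy of the output formats a model produces. The sketch below uses made-up sample counts to contrast a model that spreads evenly over four formats with one that has collapsed onto a single format; the format labels and counts are hypothetical and not taken from the cited experiments.

```python
import math
from collections import Counter


def format_entropy(samples):
    """Shannon entropy in bits of the empirical distribution over labels."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical format labels for 100 sampled outputs before and after RL.
before = ["prose", "code", "latex", "list"] * 25
after = ["code"] * 97 + ["prose", "latex", "list"]

print(f"before: {format_entropy(before):.2f} bits")
print(f"after:  {format_entropy(after):.2f} bits")
```

Four evenly used formats give 2 bits, the maximum for four options, and the collapsed sample lands far below that. Tracking a number like this across training iterations is one simple way to see whether an in-model ensemble still has anything left to combine.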
Sources (8 notes)
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.