INQUIRING LINE


Ensembles work by disagreement — so what happens when you try to build that diversity inside a single model?

How do ensemble methods apply within a single model?

This explores how the ensemble idea — combining many models for better, more robust answers — gets folded *inside* one model: composing specialist sub-skills, searching weight space, or routing subtasks, rather than running separate networks in parallel.

The ensemble idea usually means "run several models and combine them"; here it gets pulled inside a single model. The corpus has a surprisingly rich answer, and it starts with a warning: classic ensembling assumes the members *disagree*. INFINITY-CHAT's study of 70+ models found an "Artificial Hivemind": different LLMs independently converge on nearly identical outputs because they share training data and alignment recipes (see "Do different AI models actually produce diverse outputs?"). If your ensemble members all say the same thing, you've paid for diversity you didn't get. That's the motivation for moving the ensemble *into* one model, where you can engineer the diversity deliberately.
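Why disagreement matters can be made concrete with the ambiguity decomposition for a uniformly averaged ensemble under squared error: the ensemble's error equals the average member error minus the average spread of the members around the ensemble mean. The sketch below is only an illustration of that identity on a numeric toy problem. The numbers and the function name `ambiguity_decomposition` are made up for this example and are not taken from INFINITY-CHAT, which studies open-ended text rather than numeric predictions.

```python
from statistics import fmean

def ambiguity_decomposition(predictions, target):
    """Return (ensemble_error, mean_member_error, mean_disagreement)
    for a uniformly averaged ensemble under squared error."""
    ensemble = fmean(predictions)
    ensemble_error = (ensemble - target) ** 2
    mean_member_error = fmean((p - target) ** 2 for p in predictions)
    mean_disagreement = fmean((p - ensemble) ** 2 for p in predictions)
    return ensemble_error, mean_member_error, mean_disagreement

target = 10.0
diverse = [8.0, 9.5, 11.0, 12.5]   # hypothetical members that err in different directions
hivemind = [8.0, 8.0, 8.0, 8.0]    # hypothetical members that share the same mistake

for name, preds in [("diverse", diverse), ("hivemind", hivemind)]:
    ens, avg, amb = ambiguity_decomposition(preds, target)
    # Identity being illustrated: ens == avg - amb (up to float rounding).
    print(f"{name}: ensemble={ens:.4f} avg_member={avg:.4f} disagreement={amb:.4f}")
```

In the second case the disagreement term is zero, so averaging buys nothing over any single member, which is the numeric analogue of the hivemind warning above.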

The cleanest version of this is composing experts at inference time. Transformer² tunes only the singular values inside weight matrices to produce "expert vectors" that mix dynamically when a prompt arrives, effectively assembling a task-specific ensemble on the fly from one model's parameters, without the interference that plagues stacked adapters (see "Can models dynamically activate expert skills at inference time?"). A more exotic cousin treats a population of models as a swarm: PSO-style "particles" move through weight space and discover composed experts that can answer questions *none* of the starting experts could, using only 200 validation examples and no gradient-based training (see "Can language models discover new expertise through collaborative weight search?"). Both are ensembles in spirit, with many specialists combined, but the combination happens in one model's weights.
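The swarm idea can be sketched with textbook particle swarm optimization on a toy problem. This is a minimal illustration under stated assumptions, not the paper's procedure: the "experts" here are two-dimensional weight vectors of a linear model, the utility is negative mean squared error on a synthetic validation set, and the update rule is the standard inertia plus personal-best plus global-best form, which may differ from the rule the paper uses. All names (`utility`, `swarm_search`, the toy data) are hypothetical.

```python
import random

def utility(weights, validation):
    """Negative mean squared error of a linear model on validation pairs."""
    total = 0.0
    for features, label in validation:
        pred = sum(w * x for w, x in zip(weights, features))
        total += (pred - label) ** 2
    return -total / len(validation)

def swarm_search(experts, validation, steps=50, inertia=0.6,
                 c_personal=1.2, c_global=1.2, seed=0):
    """Textbook PSO over weight vectors: no gradients, only utility calls."""
    rng = random.Random(seed)
    positions = [list(e) for e in experts]
    velocities = [[0.0] * len(e) for e in experts]
    personal_best = [list(p) for p in positions]
    personal_score = [utility(p, validation) for p in positions]
    best_idx = max(range(len(positions)), key=personal_score.__getitem__)
    global_best = list(personal_best[best_idx])
    global_score = personal_score[best_idx]

    for _ in range(steps):
        for i, pos in enumerate(positions):
            for d in range(len(pos)):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c_personal * r1 * (personal_best[i][d] - pos[d])
                                    + c_global * r2 * (global_best[d] - pos[d]))
                pos[d] += velocities[i][d]
            score = utility(pos, validation)
            if score > personal_score[i]:
                personal_best[i], personal_score[i] = list(pos), score
            if score > global_score:
                global_best, global_score = list(pos), score
    return global_best, global_score

# Hypothetical toy setup: the target needs both "skills" (weights 1.0 and 1.0),
# and no starting expert has both.
true_w = (1.0, 1.0)
data_rng = random.Random(1)
validation = []
for _ in range(200):
    x = (data_rng.uniform(-1, 1), data_rng.uniform(-1, 1))
    validation.append((x, sum(w * v for w, v in zip(true_w, x))))

experts = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.0)]
start_best = max(utility(e, validation) for e in experts)
composed, score = swarm_search(experts, validation)
print(f"best starting expert: {start_best:.4f}, composed: {score:.4f}")
```

Because the global best is only ever replaced by a higher-scoring position, the returned score can never be worse than the best starting expert. Whether it is strictly better depends on the search, which is the bet the swarm approach makes.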

The other route is decomposition: split a task into subtasks and let one model train across all of them, so that multi-task learning does the work an ensemble would. Granite-20B-FunctionCalling breaks function calling into seven granular subtasks and generalizes better than umbrella datasets do (see "Can breaking function calling into subtasks improve model generalization?"). Nexus splits forecasting into contextualization, macro/micro outlook, and synthesis stages so one system doesn't have to juggle numerical and contextual reasoning at once (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"). The catch is interference: cramming many tasks into shared weights makes them step on each other. The fix that keeps showing up is structural isolation: identify the core parameters each task needs, freeze those, and merge the rest, which beats plain multi-task fine-tuning (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?").
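The isolate-then-merge step can be sketched on flat parameter lists. This is a rough illustration under several assumptions that the notes do not confirm: "core" parameters are picked by update magnitude relative to the base weights, "freezing" is modeled as copying the owning task's tuned value through unchanged, and plain averaging stands in for the geometric merge of non-core parameters. The task names, values, and function names are hypothetical.

```python
def core_indices(base, tuned, k):
    """Indices of the k parameters that moved most during task fine-tuning.
    Assumption: update magnitude is the 'core' criterion."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=deltas.__getitem__, reverse=True)
    return set(ranked[:k])

def isolate_and_merge(base, tuned_by_task, k):
    """Keep each task's core parameters as tuned; average everything else."""
    cores = {task: core_indices(base, w, k) for task, w in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        owners = [task for task, idx in cores.items() if i in idx]
        # Core position: use only the owning task(s); otherwise use all tasks.
        sources = owners if owners else list(tuned_by_task)
        merged.append(sum(tuned_by_task[t][i] for t in sources) / len(sources))
    return merged, cores

base = [0.0, 0.0, 0.0, 0.0]
tuned_by_task = {
    "nested_calls":   [0.9, 0.1, 0.0, 0.2],    # hypothetical fine-tuned weights
    "name_detection": [0.1, 0.0, -0.8, 0.4],   # hypothetical fine-tuned weights
}
merged, cores = isolate_and_merge(base, tuned_by_task, k=1)
print(cores)
print(merged)
```

In this toy case, index 0 keeps the `nested_calls` value of 0.9 and index 2 keeps the `name_detection` value of -0.8, where plain averaging across tasks would have diluted them to 0.5 and -0.4. Positions claimed by more than one task are averaged over the claimants here, which is a simplification of the task clustering the note describes.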

The thread that ties these together, and the thing worth taking away, is that an in-model ensemble lives or dies on *preserved diversity*. The same corpus shows how easily single-model training collapses it: RL post-training amplifies one pretraining format and suppresses the rest within the first epoch (see "Does RL training collapse format diversity in pretrained models?"), and training-time critique exists largely to counteract that narrowing and keep solutions varied (see "Do critique models improve diversity during training itself?"). So the question of how ensembles fit in one model becomes: can you keep enough internal disagreement alive, across experts, parameters, or subtasks, that combining them still buys you something? The methods above are all different bets on yes.

Sources (8 notes)

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.


Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence (1.71 match)
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (1.70 match)
Transformer²: Self-adaptive LLMs (1.67 match)
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration (1.63 match)
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains (1.61 match)
Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration (1.57 match)
Nexus: An Agentic Framework for Time Series Forecasting (0.90 match)
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (0.90 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining the question: **How do ensemble methods apply within a single model?** treating it as still-open despite recent progress.

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2024–2026; treat all as perishable constraints to re-test.
- Multiple LLMs converge on nearly identical outputs due to shared training data and alignment, eroding classical ensemble diversity gains (~2025).
- Transformer² dynamically composes expert vectors at inference by tuning singular values, assembling task-specific ensembles without adapter stacking interference (~2025).
- Swarm intelligence in weight space discovers adapted experts via PSO-style particle search, solving questions no starting expert could answer, using ~200 validation examples (~2024).
- Multi-task decomposition (function calling into 7 subtasks; forecasting into contextualization/outlook/synthesis) outperforms umbrella datasets but suffers task interference in shared weights (~2024–2026).
- Core-parameter isolation—freezing task-critical parameters and merging the rest—beats plain multi-task fine-tuning by preventing interference (~2025).
- RL post-training converges on a single dominant pretraining format within one epoch, suppressing diversity; training-time critique counteracts this narrowing (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2501.06252 (Transformer², Jan 2025)
- arXiv:2410.11163 (Model Swarms, Oct 2024)
- arXiv:2508.21741 (Core-Parameter Isolation, Aug 2025)
- arXiv:2504.07912 (Echo Chamber / RL Post-training, Apr 2025)

**Your task:**
(1) **Re-test each constraint.** For each method above (expert vectors, swarm search, multi-task decomposition, isolation), assess whether newer model scales, training algorithms (e.g., post-RL refinement, continued pretraining), orchestration (e.g., dynamic routing, retrieval-augmented merging), or evals have **relaxed or inverted** the claimed interference or diversity-collapse problems. Does the "homogeneity problem" still bind, or have recent scaling/alignment methods broken it? Separate the durable question ("how to preserve internal disagreement") from perishable claims ("RL post-training *necessarily* narrows format diversity").

(2) **Surface the strongest contradicting or superseding work** from the last ~6 months. Look for: (a) evidence that unified models outperform in-model composition on standard benchmarks, (b) new routing/gating schemes that preserve diversity at scale, (c) post-hoc merging methods that beat parameter isolation.

(3) **Propose 2 research questions that assume the regime may have moved:**
   - Q1: Can adaptive routing policies (e.g., learned via validation or online reward) sustain diversity in multi-task weights better than static isolation?
   - Q2: Does mixture-of-experts style conditioning on input *at parameter scale* (not just adapter/router scale) resurrect single-model ensembling in very large models where diversity was previously eroded?

**Guardrail:** Cite arXiv IDs; flag anything you cannot ground in a real paper.
