INQUIRING LINE

Merging model weights across tasks always loses something — so when is that loss too costly to accept?

When should model isolation be preferred over weight-averaging approaches?

This explores a practical choice in continual and multi-task learning: when is it better to fence off separate parameters per task (isolation) versus blending parameters together into shared weights (averaging/merging)?

The corpus suggests the answer turns on one thing: how much your tasks actually conflict. The clearest case for isolation comes from streaming recommendation. The note "Can model isolation solve streaming recommendation better than replay?" describes DEGC, which gives each task its own parameters precisely because the alternatives, replay and distillation, can't offer explicit control over the stability-plasticity trade-off. Isolation lets you preserve old patterns *exactly* while growing new capacity for emerging preferences. That word 'exactly' is the crux: averaging is lossy by design, and when forgetting old behavior is unacceptable, lossy is disqualifying.
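A toy sketch of what 'exactly' means in the paragraph above, assuming a model is just a weight vector and a task's behavior is a dot product with a fixed probe input. The vectors, the `predict` helper, and the dictionary layout are hypothetical illustrations, not DEGC's actual mechanism.

```python
# Toy illustration only: weights and probe are made-up numbers.

def predict(weights, x):
    """A task's 'behavior' is modeled as a plain dot product."""
    return sum(w * xi for w, xi in zip(weights, x))


old_task_weights = [1.0, -2.0, 0.5]
new_task_weights = [0.2, 1.5, -1.0]
probe = [1.0, 1.0, 1.0]

# Averaging: both tasks collapse into one shared vector.
averaged = [(a + b) / 2 for a, b in zip(old_task_weights, new_task_weights)]

# Isolation: each task keeps its own untouched copy of its parameters.
isolated = {"old": list(old_task_weights), "new": list(new_task_weights)}

old_before = predict(old_task_weights, probe)
old_after_averaging = predict(averaged, probe)       # shifted by the merge
old_after_isolation = predict(isolated["old"], probe)  # bit-for-bit unchanged
```

In this toy setup the isolated copy reproduces the old task's output exactly, while the averaged vector does not. The cost is that isolation stores one parameter set per task.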

But the most useful note here refuses the binary entirely. "Can isolating task-specific parameters prevent multi-task fine-tuning interference?" shows the winning recipe is *both at once*: identify the small core region each task truly depends on, freeze those core parameters in isolation, and geometrically merge only the non-core parameters. Scheduling tasks over time, without structural isolation, wasn't enough. So the real rule isn't 'isolate vs. average'; it's 'isolate the parameters that carry irreplaceable task identity, average the rest.' Weight-averaging fails when it blends parameters that were doing genuinely incompatible jobs; it's safe on the parameters that weren't.
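The 'isolate the core, average the rest' rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the cited note: core parameters are picked by update magnitude, and a plain arithmetic mean stands in for the geometric merge the note mentions. The function names, `core_fraction`, and the toy weights are all hypothetical.

```python
def core_mask(base, tuned, core_fraction):
    """Flag the top `core_fraction` of parameters by |tuned - base| as core.

    Magnitude-based selection is an assumption made for this sketch.
    """
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    k = max(1, round(core_fraction * len(deltas)))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    core = set(ranked[:k])
    return [i in core for i in range(len(deltas))]


def hybrid_merge(base, tuned_by_task, core_fraction=0.25):
    """Keep each task's core values; average everything else across tasks."""
    masks = {name: core_mask(base, w, core_fraction)
             for name, w in tuned_by_task.items()}
    shared = []
    for i, base_value in enumerate(base):
        # Only tasks for which position i is non-core contribute to the merge.
        donors = [w[i] for name, w in tuned_by_task.items()
                  if not masks[name][i]]
        shared.append(sum(donors) / len(donors) if donors else base_value)
    # Each task keeps its own frozen core and reads merged values elsewhere.
    per_task = {
        name: [w[i] if masks[name][i] else shared[i] for i in range(len(base))]
        for name, w in tuned_by_task.items()
    }
    return per_task, masks


base = [0.0, 0.0, 0.0, 0.0]
tuned = {
    "task_a": [0.9, 0.1, 0.0, 0.1],
    "task_b": [0.1, 0.0, -0.8, 0.3],
}
merged, masks = hybrid_merge(base, tuned, core_fraction=0.25)
```

Here `task_a` keeps its large update at position 0 and `task_b` keeps its large update at position 2, while the remaining positions hold the same merged values in both task views. A real system would operate on tensors per layer and would need a principled way to choose the core.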

Why is the conflicting core so small? "Does reinforcement learning update only a small fraction of parameters?" offers a striking clue: reinforcement learning naturally concentrates its changes into just 5–30% of parameters, and those sparse updates are nearly identical across random seeds, which makes them structural rather than arbitrary. That's an argument *for* isolation being cheap and *for* averaging being dangerous: the parameters that matter are few and consistent, so you can wall them off without much overhead, but blindly averaging over them would smear away exactly the structure that does the work.
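Both quantities in the paragraph above, the fraction of parameters an update touches and how consistent that set is across seeds, are easy to measure. The sketch below assumes flat weight lists and uses Jaccard overlap as the consistency metric; the tolerance, the helper names, and the toy checkpoints are assumptions, not the cited paper's procedure.

```python
def update_mask(base, tuned, tol=1e-8):
    """True where fine-tuning moved a parameter by more than `tol`."""
    return [abs(t - b) > tol for b, t in zip(base, tuned)]


def update_fraction(mask):
    """Share of parameters that were actually updated."""
    return sum(mask) / len(mask)


def jaccard(mask_a, mask_b):
    """Overlap of two update masks: |intersection| / |union|."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 1.0


# Made-up checkpoints standing in for two runs with different random seeds.
base = [0.0] * 10
seed_1 = [0.0, 0.3, 0.0, 0.0, -0.2, 0.0, 0.0, 0.1, 0.0, 0.0]
seed_2 = [0.0, 0.25, 0.0, 0.0, -0.3, 0.0, 0.0, 0.0, 0.2, 0.0]

mask_1 = update_mask(base, seed_1)
mask_2 = update_mask(base, seed_2)

sparsity_1 = update_fraction(mask_1)
seed_overlap = jaccard(mask_1, mask_2)
```

A high overlap across seeds would support treating the updated set as a stable core worth isolating; a low overlap would suggest the set is arbitrary and less useful as an isolation target.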

There's a deeper reason averaging disappoints, visible from a different corner of the collection. The supposed appeal of merging models is diversity: combine many and get the best of each. But "Do different AI models actually produce diverse outputs?" documents an 'Artificial Hivemind' in which models trained on overlapping data converge on near-identical outputs anyway. If your ingredients have already collapsed toward the same point, averaging them buys you nothing; isolation at least preserves whatever distinct behavior survives. Relatedly, "Can models be smart without organized internal structure?" warns that two models with the same accuracy can have fractured internal organization. Averaging their weights assumes their representations are commensurable, so it can quietly produce something brittle that benchmarks won't catch.

If you want a single heuristic: prefer isolation when forgetting is unacceptable, when tasks genuinely interfere, or when you need explicit dials on what's preserved versus adapted. Prefer averaging on the large remainder of parameters that don't carry conflicting task identity. And for the broader question of touching weights at all, "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" and "Does staying close to the base model preserve learning ability?" are worth the detour. They suggest that staying close to the base distribution (whether by not editing weights, or by minimizing drift) protects the model's future ability to keep learning. Isolation is one way to honor that principle; averaging, done carelessly, violates it.
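'Staying close to the base distribution' is usually quantified as a KL divergence between the adapted model's output distribution and the base model's. The sketch below computes KL(adapted || base) over a toy three-token vocabulary; the direction of the divergence, the probabilities, and the variable names are assumptions for illustration, and the notes do not specify how drift was measured.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
base_dist = [0.50, 0.30, 0.20]
low_drift = [0.48, 0.32, 0.20]   # adapted model that stayed near the base
high_drift = [0.10, 0.10, 0.80]  # adapted model that moved far from the base

drift_low = kl_divergence(low_drift, base_dist)
drift_high = kl_divergence(high_drift, base_dist)
```

In practice the divergence would be averaged over many prompts and token positions; a merge or fine-tune that pushes this number up sharply is the kind of drift the two notes warn about.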

Sources 7 notes

Can model isolation solve streaming recommendation better than replay?

DEGC uses per-task parameter isolation to handle streaming recommendation, providing explicit stability-plasticity trade-offs that experience replay and knowledge distillation methods cannot match. This approach preserves older patterns exactly while allowing new parameters to capture emerging preferences.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Learning, Fast and Slow: Towards LLMs That Adapt Continually (1.67 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.63 match)
Dynamically Expandable Graph Convolution for Streaming Recommendation (0.91 match)
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (0.90 match)
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (0.90 match)
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (0.87 match)
NoveltyBench: Evaluating Language Models for Humanlike Diversity (0.86 match)
Tuning Language Models by Proxy (0.85 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a synthesis researcher re-testing constraints on parameter isolation vs. weight-averaging in multi-task and continual learning regimes. The question remains open: when should we isolate task-specific parameters rather than merge them?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, tracking evolving views on parameter isolation as a design choice:
• Isolation excels in streaming settings where forgetting is unacceptable; explicit control over stability-plasticity trade-off beats replay or distillation (~2023).
• The winning recipe is *hybrid*: isolate only the small core parameters each task truly depends on (~5–30% for RL, nearly identical across seeds); average the rest (~2025).
• Models trained on overlapping data converge toward near-identical outputs anyway ('Artificial Hivemind'), so averaging buys little diversity; isolation preserves whatever distinct behavior survives (~2025).
• Two models with identical accuracy can have very different internal representations, one of them fractured, so averaging weights on the assumption that they are commensurable risks silent brittleness (~2025).
• Staying close to the base model distribution (minimizing KL drift, not editing weights, or proxy-tuning at decode-time) protects future plasticity better than direct weight fine-tuning (~2025–2026).

Anchor papers (verify; mind their dates):
• 2303.11700 — Dynamically Expandable Graph Convolution for Streaming Recommendation (2023)
• 2505.11711 — Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (2025)
• 2510.22954 — Artificial Hivemind: The Open-Ended Homogeneity of Language Models (2025)
• 2605.12484 — Learning, Fast and Slow: Towards LLMs That Adapt Continually (2026)

Your task:
(1) RE-TEST THE CORE TENSION. For each claim above, judge whether recent work on model merging (e.g., newer LoRA variants, adapter composition), in-context learning, or scaling of continual fine-tuning has *dissolved* the isolation advantage or *reinforced* it. Separate the durable finding (tasks do interfere in specific parameter regions) from the perishable constraint (isolation is the *only* solution). Cite what has since changed the trade-off.
(2) Surface the strongest *superseding* work from the last 6 months — does it argue *for* averaging in cases the library dismissed, or refine *when* isolation is cheap enough to always prefer?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can adaptive, learned routing or layer-wise merging schedules outperform static isolation? (b) Does scaling to 100+ tasks flip the isolation-vs-averaging calculus?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
