Why does specializing to one task make future task learning harder?
This explores catastrophic forgetting and plasticity loss — why tuning a model hard on one task tends to erode its ability to learn the next one — and what the corpus suggests is actually causing it.
This question reads as: when you specialize a model on Task A, why does Task B become harder to learn afterward? The intuition is that the model 'uses up' its capacity. But the most striking thread across this corpus is a reframing: forgetting isn't an inherent cost of specialization, it's a misallocation problem. Fast-Slow Training (Can splitting adaptation into two channels reduce forgetting?) shows that if you route task-specific lessons into the prompt (a fast, disposable channel) while keeping the underlying weight changes minimal, you reach equivalent performance 1.4–3x faster and with substantially less forgetting and plasticity loss. The damage comes from where the learning lands, not from learning itself: when every lesson gets written into shared parameters, later tasks overwrite earlier ones, and the network's plasticity degrades.
If the problem is shared parameters colliding, the obvious lever is to stop them from colliding. Core parameter isolation (Can isolating task-specific parameters prevent multi-task fine-tuning interference?) identifies the specific weight regions each task depends on, freezes those, and merges the rest. It outperforms standard multi-task tuning precisely because it prevents the interference that makes future learning destructive. Transformer² (Can models dynamically activate expert skills at inference time?) pushes the same idea further: tune only the singular values of weight matrices, producing composable 'expert vectors' that mix at inference without stepping on each other, so specialization can continue without each new skill eroding the last. Both say the same thing from different angles: keep specializations structurally separate and the second task stops paying for the first.
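The isolate-then-merge idea above can be sketched in a few lines. This is a toy illustration, not the method from the note: weights are flat lists, a task's 'core' region is assumed to be its largest-magnitude updates, and the non-core remainder is combined with a plain average rather than the geometric merge the source describes. The names `core_mask`, `isolate_and_merge`, and `keep_fraction` are hypothetical.

```python
def core_mask(delta, keep_fraction=0.25):
    """Flag the largest-magnitude updates as a task's 'core' parameters.

    Assumption: update magnitude is used as a stand-in for importance.
    """
    k = max(1, int(len(delta) * keep_fraction))
    top = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:k]
    mask = [False] * len(delta)
    for i in top:
        mask[i] = True
    return mask


def isolate_and_merge(base, task_deltas, keep_fraction=0.25):
    """Keep each task's core updates intact; average everything else."""
    masks = [core_mask(d, keep_fraction) for d in task_deltas]
    merged = []
    for i, w in enumerate(base):
        owners = [t for t, m in enumerate(masks) if m[i]]
        if len(owners) == 1:
            # Exactly one task claims this weight: protect its update.
            merged.append(w + task_deltas[owners[0]][i])
        else:
            # Unclaimed or contested weight: plain average of all updates
            # (a simplification swapped in for geometric merging).
            merged.append(w + sum(d[i] for d in task_deltas) / len(task_deltas))
    return merged, masks


base = [0.0, 0.0, 0.0, 0.0]
task_a = [1.0, 0.1, 0.0, 0.0]   # task A leans on weight 0
task_b = [0.0, 0.1, 0.0, -2.0]  # task B leans on weight 3
merged, masks = isolate_and_merge(base, [task_a, task_b])
```

With naive averaging, task A's update to weight 0 and task B's update to weight 3 would each be halved. Here both survive at full strength because each weight is claimed by exactly one task.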
There's a subtler mechanism too: specialization can quietly collapse the very flexibility a future task needs. Omni-Thinker (Does training order reshape how models handle different task types?) shows structured tasks (math, code) drive a model's output entropy down, while open-ended tasks need entropy up. If structured training is allowed to collapse entropy unchecked, the open-ended capabilities that depend on that entropy are damaged, so the order of training mechanically shapes what you can still learn. Scheduling structured tasks first and open-ended ones afterward yields a 6.2% gain over joint training, which means 'future task learning' isn't just about preserved weights, it's about preserved exploratory range.
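To make 'output entropy' concrete, here is a minimal sketch that computes Shannon entropy (in nats) for two toy next-token distributions. The logits are invented for illustration and do not come from any of the cited work; `softmax` and `entropy` are ordinary helper functions defined here.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical distributions over four candidate tokens.
peaked = softmax([8.0, 0.0, 0.0, 0.0])  # one answer dominates, as in math or code
flat = softmax([1.0, 1.0, 1.0, 1.0])    # many continuations remain plausible

low_entropy = entropy(peaked)
high_entropy = entropy(flat)
```

The peaked distribution has entropy close to zero, while the flat one reaches the maximum of log(4) for four options. A model pushed toward the first shape has little exploratory range left for tasks that need the second.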
The corpus also offers an escape hatch: don't write skills into weights at all. VOYAGER (Can agents learn new skills without forgetting old ones?) stores executable skills in an external, indexed library and composes new ones from old, learning continuously without the forgetting that weight-update methods suffer. Agent Workflow Memory (Can agents learn reusable sub-task routines from past experience?) does the analogous thing with reusable sub-task routines, and notably the gains grow as the gap between past and future tasks widens. The lesson hiding here is that specialization-then-forgetting is largely an artifact of one storage choice (overwriting shared weights). Move the specialization into prompts, isolated parameter regions, composable vectors, or external libraries, and the second task stops being harder.
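An external skill library of the kind described above might look roughly like the following. This is a sketch under loud assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `SkillLibrary`, the skill descriptions, and the inventory lists are all hypothetical rather than VOYAGER's actual interface.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Stores executable skills outside the model, indexed by description."""

    def __init__(self):
        self._skills = []  # (description, embedding, callable)

    def add(self, description, fn):
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query):
        q = embed(query)
        return max(self._skills, key=lambda s: cosine(q, s[1]))[2]


library = SkillLibrary()


def chop_wood(inventory):
    return inventory + ["wood"]


def craft_planks(inventory):
    return inventory + ["planks"] if "wood" in inventory else inventory


library.add("chop wood from a tree", chop_wood)
library.add("craft planks from wood", craft_planks)


def build_table(inventory):
    # A new skill composed from previously stored skills.
    inventory = library.retrieve("chop wood")(inventory)
    inventory = library.retrieve("craft planks")(inventory)
    return inventory + ["table"] if "planks" in inventory else inventory


library.add("build a crafting table", build_table)
result = library.retrieve("build a table")([])
```

Adding the third skill never touches the first two: they stay in the library unchanged and are reused by retrieval, which is the property that weight updates cannot guarantee.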
One caution complicates the whole picture: instruction tuning research (Does instruction tuning teach task understanding or output format?) finds that what a model often absorbs during specialization is the output format, not deep task understanding. If specialization is partly just narrowing the output distribution, then 'harder future learning' may sometimes be the model locked into the wrong output shape rather than genuine capacity loss, which is a different problem with a different fix than forgetting.
Sources (7 notes)
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with a reported 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.