INQUIRING LINE

Why does mixed instruction data sometimes hurt specific model capabilities?

This explores why blending many kinds of instruction data into one training mix can quietly degrade a particular skill — and what the corpus says is actually happening inside the model when that occurs.


The corpus points to a consistent culprit: examples meant to teach one capability can actively pull the model's reasoning strategy away from what another task needs. The clearest statement comes from gradient-based data selection work, where training on just the 5% of examples most similar to a target capability beat training on the full dataset, precisely because the discarded majority contained instructions that hinder the target skill by nudging the model toward the wrong reasoning approach (Can we train better models on less data?). Mixing isn't neutral; some data fights other data.
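To make the selection idea concrete, here is a minimal sketch of similarity-based filtering. It is not the LESS implementation: it assumes each training example already has a precomputed gradient-feature vector, it assumes cosine similarity as the scoring rule, and the names `select_top_fraction`, `features`, and `target` are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_top_fraction(train_features, target_feature, fraction=0.05):
    """Return indices of the training examples most aligned with the target.

    train_features: one gradient-feature vector per example (assumed precomputed).
    target_feature: a gradient-feature vector for the target capability.
    """
    scores = [cosine(f, target_feature) for f in train_features]
    keep = max(1, int(fraction * len(scores)))  # always keep at least one example
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]

# Toy data: 20 examples with 2-dimensional features.
# Most examples point away from the target; only example 3 points toward it.
features = [[-1.0, 0.5] for _ in range(20)]
features[3] = [2.0, 0.1]
target = [1.0, 0.0]

selected = select_top_fraction(features, target, fraction=0.05)
print(selected)
```

In this toy setup the examples with negative similarity are the ones that would pull the model in the opposite direction from the target, which is the sense in which the discarded data "fights" the skill you care about.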

A mechanistic reason this happens is that different tasks compete for the same internal real estate. When you fine-tune on many tasks at once, their updates collide in shared parameter regions, and the result is interference rather than accumulation. Isolating each task's 'core' parameters and merging only the non-core ones consistently beats standard multi-task fine-tuning, which tells you the damage in a naive mix is structural, a tug-of-war over weights, and not just a data-quality issue (Can isolating task-specific parameters prevent multi-task fine-tuning interference?). The same pressure shows up as catastrophic forgetting: adding new instruction-following ability can overwrite pre-trained reasoning unless you explicitly protect the original weights, as SoftCoT does by freezing the backbone entirely and offloading new behavior to a small auxiliary model (Can continuous reasoning avoid forgetting in instruction-tuned models?).
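The tug-of-war over weights can be illustrated with a toy sketch. This is not the method from the cited work: it assumes a task's core parameters are simply its largest-magnitude updates, and it swaps in plain averaging for the geometric merge the source describes. The function names and the four-weight "model" are hypothetical.

```python
def core_mask(delta, top_fraction=0.25):
    """Mark the largest-magnitude updates in a task's weight delta as its 'core'."""
    k = max(1, int(top_fraction * len(delta)))
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    core = set(ranked[:k])
    return [i in core for i in range(len(delta))]

def merge_with_isolation(base, delta_a, delta_b, top_fraction=0.25):
    """Merge two fine-tuning deltas, letting each task keep its own core weights."""
    core_a = core_mask(delta_a, top_fraction)
    core_b = core_mask(delta_b, top_fraction)
    merged = []
    for w, da, db, ca, cb in zip(base, delta_a, delta_b, core_a, core_b):
        if ca and not cb:
            merged.append(w + da)               # weight owned by task A
        elif cb and not ca:
            merged.append(w + db)               # weight owned by task B
        else:
            merged.append(w + 0.5 * (da + db))  # non-core or contested: average
    return merged

base = [0.0, 0.0, 0.0, 0.0]
delta_a = [1.0, 0.1, -0.1, 0.0]   # task A mostly moves weight 0
delta_b = [-0.8, 0.1, 0.1, 0.9]   # task B mostly moves weight 3 and opposes A on weight 0

# Naive mixing: average every update, so opposing updates nearly cancel.
naive = [w + 0.5 * (da + db) for w, da, db in zip(base, delta_a, delta_b)]
isolated = merge_with_isolation(base, delta_a, delta_b)

print("naive:   ", naive)
print("isolated:", isolated)
```

Under naive averaging, task A's most important update (weight 0) is almost erased by task B's opposing update, and task B's key update (weight 3) is halved. With the core weights walled off, both survive intact.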

There's a subtler reason worth sitting with: instruction tuning may not be teaching what we think. Models trained on semantically empty or deliberately *wrong* instructions perform about as well as those trained on correct ones: what transfers is the shape of the output space, not task understanding (Does instruction tuning teach task understanding or output format?). If instruction data mostly teaches 'how answers should look,' then a mixed corpus is really teaching a blended output format. That blend can crowd out the specific format a niche capability depends on, echoing how reinforcement learning collapses onto a single dominant format within an epoch and suppresses the alternatives (Does RL training collapse format diversity in pretrained models?).

The deepest version of the problem is that handling conflicting signals is sometimes the whole point of a skill. In heuristic-override tasks, removing 'distracting' cues *hurts* performance, because the skill is composing conflicting signals, not filtering them, so data that trains a clean filtering habit actively damages a capability that needs the opposite (Why does removing spurious cues sometimes hurt model performance?). Relatedly, work separating a 'decomposer' from a 'solver' found that keeping planning and execution in one model lets them interfere, while splitting them improves accuracy and generalization (Does separating planning from execution improve reasoning accuracy?). The thread running through all of these: capabilities aren't simply additive. Mixing data assumes they stack; the corpus keeps finding they collide.

The useful takeaway (the thing you might not have known you wanted to know) is that the fix is rarely 'more data' or even 'cleaner data.' It's *separation*: select the slice that matches your target (Can we train better models on less data?), wall off the parameters each task owns (Can isolating task-specific parameters prevent multi-task fine-tuning interference?), or freeze what already works and bolt new behavior on beside it (Can continuous reasoning avoid forgetting in instruction-tuned models?). Hurt capabilities are usually the symptom of forced sharing.


Sources (7 notes)

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions; instruction-tuned models score 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic-override tasks, the opposite of what shortcut learning predicts. The failure lies in integrating conflicting signals, not in ignoring distractors: a frame problem, not a feature-selection problem.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing dated claims about instruction-data interference in LLMs. The question: Why does mixed instruction data sometimes hurt specific model capabilities?

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; treat all as perishable.
• Gradient-based selection of just 5% of instruction data outperforms full-dataset training on target tasks, because the other 95% actively pulls reasoning toward wrong strategies (arXiv:2402.04333, ~2024).
• Parameter interference in multi-task fine-tuning is structural: isolating each task's core weights and merging only non-core ones beats naive mixing (arXiv:2508.21741, ~2025).
• Instruction tuning may teach output-format distribution, not task understanding — correct vs. wrong instructions perform similarly (arXiv:2305.11383, ~2023).
• RL post-training collapses onto a single dominant pretraining format within an epoch, suppressing alternatives (arXiv:2504.07912, ~2025).
• Freezing a backbone and offloading new behavior to auxiliary models prevents catastrophic forgetting (arXiv:2502.12134, ~2025).

Anchor papers (verify; mind their dates): arXiv:2402.04333 (LESS, 2024), arXiv:2305.11383 (Instruction Tuning, 2023), arXiv:2508.21741 (Smart Isolation, 2025), arXiv:2502.12134 (SoftCoT, 2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 5% selection win: does newer data-curation tooling (e.g., synthetic-data filters, embedding-based pruning) or scaling laws now flatten that advantage? For parameter isolation: have model-merging or adapter-based methods since made the core/non-core split cheaper or unnecessary? For format-only teaching: do recent mechanistic probes of attention or SAEs show instruction data *does* encode task structure, contradicting the 2023 finding?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does arXiv:2507.11538 (How Many Instructions Can LLMs Follow?) or arXiv:2509.09677 (Long Horizon Execution) suggest the interference problem dissolves at scale or with architecture changes?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If instruction data now teaches task structure, not just format, do mixed corpora still collide, or does semantic understanding mitigate interference? (b) Do modern retrieval-augmented or in-context learning approaches render fine-tuning-time mixing moot?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
