INQUIRING LINE

Bigger models handle rare tasks better not by being smarter, but because extra capacity stops common examples from overwriting rare ones.

How do task frequency and complexity interact with model capacity during training?

This explores how three things — how often a task shows up in training, how hard it is, and how much raw capacity the model has — pull against each other, and what the corpus says about that tug-of-war.

The most useful reframing in the corpus is that task frequency, task difficulty, and model capacity aren't three independent dials but a single competition for gradient attention. The clearest result is that bigger models learn rare tasks better not because they can *represent* things smaller models can't, but because their spare capacity weakens the gradient pull of common tasks, so frequent examples stop overwriting the slowly accumulating features that rare tasks depend on (see "Why do larger models learn rare tasks better?"). In other words, capacity buys you *reduced interference*, not new expressivity. The striking practical implication: if rare-task failure is an interference problem, then reweighting your data mixture may be far cheaper than scaling the model.
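As a rough illustration of what reweighting a data mixture could look like, the sketch below uses temperature-style sampling, where each task's sampling probability is proportional to its example count raised to a power `alpha`. This scheme is an assumption chosen for illustration, not something the source note prescribes. The function name `mixture_probs`, the task names, and the counts are all hypothetical.

```python
def mixture_probs(counts, alpha=1.0):
    """Sampling probability per task, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw frequencies; alpha < 1.0 flattens the
    mixture so rare tasks are sampled more often than their raw share.
    """
    weights = {task: n ** alpha for task, n in counts.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


# Hypothetical corpus: one common task and one rare task.
counts = {"common": 1_000_000, "rare": 1_000}

raw = mixture_probs(counts, alpha=1.0)
flattened = mixture_probs(counts, alpha=0.5)

print(f"raw rare share:       {raw['rare']:.4f}")
print(f"flattened rare share: {flattened['rare']:.4f}")
```

With these made-up counts, the rare task's share of sampled examples rises from roughly 0.1% to roughly 3% without touching model size. Whether that is enough to protect rare-task features is an empirical question the sketch does not answer.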

That frames frequency and complexity as forces that can either reinforce or sabotage each other. On the complexity side, pushing difficulty too high backfires: training on near-impossible RLVR (reinforcement learning with verifiable rewards) problems doesn't just fail to teach; it teaches degenerate shortcuts (answer repetition, computation-skipping) that then contaminate capabilities the model already had, because group-relative normalization treats rare accidental successes as high-value trajectories worth amplifying (see "Do overly hard RLVR samples actually harm model capabilities?"). So the same scarcity that *helps* a rare-but-learnable skill survive can *hurt* when the task is too hard to genuinely solve: the rare signal you reinforce is a fluke, not a competence.
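To make the group-relative normalization point concrete, here is a minimal sketch that standardizes rewards within a group of rollouts, in the style of GRPO-like methods. It assumes binary verifier rewards, a population standard deviation, and a small `eps` for numerical stability. The function name and both reward groups are hypothetical, not taken from the paper.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary (0/1) verifier rewards.
hard_group = [1.0] + [0.0] * 15        # one lucky success on a near-impossible problem
medium_group = [1.0] * 8 + [0.0] * 8   # half the rollouts succeed

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(f"advantage of the lone success:  {hard_adv[0]:.2f}")
print(f"advantage of a typical success: {medium_adv[0]:.2f}")
```

With binary rewards and success rate p, a successful rollout's normalized advantage works out to sqrt((1 - p) / p), so it grows as successes get rarer. Here the lone success in the hard group gets nearly four times the advantage of a success in the medium group, whether or not it came from sound reasoning.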

The order you feed tasks in matters too, because different task types move the model in opposite directions. Structured domains (math, code) drive output entropy down; open-ended creative domains drive it up. Training the structured tasks first prevents entropy collapse from quietly destroying open-ended ability, a gain of 6.2% over naive joint training (see "Does training order reshape how models handle different task types?"). Capacity competition isn't only about how many examples each task gets; it's about whether one task type's dynamics steamroll another's. A related collapse shows up in format: RL post-training tends to amplify a single dominant format inherited from pretraining and suppress the alternatives, and which format wins depends on model scale rather than necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?").
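For readers who want a handle on what "output entropy" measures here, the sketch below computes the Shannon entropy of a next-token distribution. The two distributions are hypothetical stand-ins for a model that has become very confident and one that still spreads probability across options. They are not measurements from the cited work.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a four-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic output
flat = [0.25, 0.25, 0.25, 0.25]     # maximally diverse output

print(f"peaked entropy: {shannon_entropy(peaked):.3f} nats")
print(f"flat entropy:   {shannon_entropy(flat):.3f} nats")
```

Entropy collapse, in this picture, is a drift toward the peaked case across many prompts. That suits tasks with a single correct answer but leaves little room for the variety that open-ended tasks need.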

There's a deeper twist worth knowing: much of what training appears to "teach" on frequent tasks may be narrower than it looks. Instruction tuning, for instance, transfers knowledge of the *output space* far more than task understanding: models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). And several independent methods suggest the heavy lifting on reasoning is *elicitation* of capability already latent in the base model, not acquisition of anything new (see "Do base models already contain hidden reasoning ability?"). If frequent, simple training mostly selects and formats existing capability while capacity mostly protects rare features from being overwritten, then the real design question isn't "train more" but "which competitions am I letting the data mixture and task schedule decide for me?"

Two escape hatches sit at the edges of this. If interference is the enemy, you can sidestep it architecturally: tuning only the singular values of weight matrices produces composable expert vectors that mix at inference without stepping on each other, giving specialization without the usual cross-task overwriting (see "Can models dynamically activate expert skills at inference time?"). And capacity isn't destiny: small models trained with DPO on a teacher's correct-and-incorrect pairs can match much larger ones on function calling, because explicit negative examples target the exact format-rigidity failures that scale would otherwise have to brute-force (see "Can small models match large models on function calling?").
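The singular-value idea can be sketched on a toy 2x2 weight matrix. The code below is an illustrative reconstruction under stated assumptions, not the published implementation. It treats the weight as W = U diag(s) V^T with U and V frozen, and lets each "expert" be a vector z that rescales the singular values. Rotation matrices stand in for the orthogonal factors a real SVD would produce, and every name and number is hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def adapt(u, singular_values, v, z):
    """Rebuild W' = U diag(s * z) V^T, scaling each singular value by z."""
    n = len(z)
    scaled = [[singular_values[i] * z[i] if i == j else 0.0 for j in range(n)]
              for i in range(n)]
    return matmul(matmul(u, scaled), transpose(v))


def mix(expert_vectors, alphas):
    """Weighted sum of expert vectors, one coefficient per expert."""
    return [sum(a * vec[i] for a, vec in zip(alphas, expert_vectors))
            for i in range(len(expert_vectors[0]))]


# Hypothetical frozen factors of a weight matrix W = U diag(s) V^T.
u, v, s = rotation(0.3), rotation(1.1), [3.0, 1.0]

# Hypothetical expert vectors; in this scheme only these would be trained.
z_math, z_code = [1.2, 1.0], [1.0, 0.5]

z_mixed = mix([z_math, z_code], alphas=[0.5, 0.5])
w_base = adapt(u, s, v, [1.0, 1.0])   # all-ones vector leaves W unchanged
w_mixed = adapt(u, s, v, z_mixed)     # blended expert applied at inference
print(z_mixed)
```

Because each expert is only a vector of scale factors over shared, frozen directions, blending experts is a weighted sum of short vectors rather than a merge of full weight matrices. The sketch shows the mechanics only; it does not demonstrate that the blended experts avoid interfering with each other.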

Sources (8 notes)

Why do larger models learn rare tasks better?

Larger models succeed at rare tasks not because they can represent solutions smaller models cannot, but because abundant capacity weakens gradients on common tasks, preventing them from overwriting slowly-accumulating rare-task features. Data-mixture design may be cheaper than scaling.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and instruction tuning scores only 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (match score 2.57)
A Survey on Post-training of Large Language Models (match score 1.70)
Eliciting Reasoning in Language Models with Cognitive Tools (match score 1.70)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (match score 1.69)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (match score 1.69)
Exploring Format Consistency for Instruction Tuning (match score 1.69)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (match score 1.68)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (match score 1.68)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: How do task frequency, complexity, and model capacity interact during training — and can we predict or design around interference rather than simply scaling?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as historical constraints, not current ceiling.
- Larger models learn rare tasks better via *reduced interference* from frequent tasks, not greater expressivity; capacity protects slow-accumulating features (~2026).
- Training on near-impossible tasks teaches degenerate shortcuts (repetition, skipping) that contaminate existing capability via group-relative normalization treating rare flukes as high-value (~2026).
- Task order matters: structured domains (math, code) lower entropy; creative domains raise it. Training structured-first prevents entropy collapse, worth ~6% over naive joint training (~2025).
- RL post-training converges on a single dominant pretraining format by model scale, suppressing alternatives, regardless of true performance (~2025).
- Instruction tuning teaches output-space distribution, not task understanding; models trained on wrong instructions match correct-instruction baselines (~2023).

Anchor papers (verify; mind their dates):
- arXiv:2605.29548 (2026) — Why Larger Models Learn More
- arXiv:2625.28388 (2026) — Sample Difficulty in RLVR
- arXiv:2501.06252 (2025) — Transformer2: Self-adaptive LLMs
- arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors

Your task:
(1) RE-TEST EACH CONSTRAINT. For interference, reduced entropy collapse, and format convergence: have newer training methods (e.g., mixture-of-experts routing, curriculum learning with adaptive reweighting, or multi-objective gradient surgery) since dissolved these limits? Separate the durable question (interference as a fundamental cross-task phenomenon) from perishable limitation (whether current methods *must* allow it). Ground what resolved it in real work.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any that shows capacity *does* buy new expressivity, or interference doesn't actually dominate rare-task failure.
(3) Propose 2 research questions assuming the regime has moved: (a) If interference can be architecturally sidestepped (expert vectors, modular routing), does task frequency still matter at all? (b) If instruction tuning teaches format not semantics, what *does* teach task understanding, and can it coexist with high-frequency training?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
