How do task frequency and complexity interact with model capacity during training?
This explores how three things — how often a task shows up in training, how hard it is, and how much raw capacity the model has — pull against each other, and what the corpus says about that tug-of-war.
The most useful reframing in the corpus is that task frequency, task difficulty, and model capacity aren't three independent dials but a single competition for gradient attention. The clearest result is that bigger models learn rare tasks better not because they can *represent* things smaller models can't, but because their spare capacity weakens the gradient pull of common tasks, so frequent examples stop overwriting the slowly accumulating features that rare tasks depend on (see "Why do larger models learn rare tasks better?"). In other words, capacity buys you *reduced interference*, not new expressivity. The practical implication is striking: if rare-task failure is an interference problem, then reweighting your data mixture may be far cheaper than scaling the model.
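To make "reweighting your data mixture" concrete, here is a minimal sketch of exponent-based mixture flattening. This is a common generic technique swapped in for illustration, not the method of the cited note; the function name `reweight_mixture`, the exponent `alpha`, and the task counts are all hypothetical.

```python
def reweight_mixture(task_counts, alpha=0.5):
    """Return sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces the raw task frequencies; alpha < 1.0 flattens
    the mixture, so rare tasks are sampled more often than their raw share.
    (Illustrative assumption: the source does not prescribe this scheme.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled = {task: count ** alpha for task, count in task_counts.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}


# Hypothetical corpus: one very common task and one rare task.
counts = {"common_task": 9900, "rare_task": 100}

raw = reweight_mixture(counts, alpha=1.0)   # raw frequencies
flat = reweight_mixture(counts, alpha=0.5)  # flattened mixture

print(f"rare-task share, raw:       {raw['rare_task']:.3f}")
print(f"rare-task share, flattened: {flat['rare_task']:.3f}")
```

With these made-up counts the rare task moves from 1% of samples to roughly 9%, at no cost in model size. Whether that is enough to offset interference in a real run is an empirical question the sketch does not answer.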
That frames frequency and complexity as forces that can either reinforce or sabotage each other. On the complexity side, pushing difficulty too high backfires: training on near-impossible RLVR (reinforcement learning with verifiable rewards) problems doesn't just fail to teach; it teaches degenerate shortcuts (answer repetition, computation-skipping) that then contaminate capabilities the model already had, because group-relative normalization treats rare accidental successes as high-advantage trajectories worth amplifying (see "Do overly hard RLVR samples actually harm model capabilities?"). So the same scarcity that *helps* a rare-but-learnable skill survive can *hurt* when the task is too hard to genuinely solve: the rare signal you reinforce is a fluke, not a competence.
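The arithmetic behind "rare accidental successes become high-advantage trajectories" can be sketched as follows. This assumes a GRPO-style normalization in which each reward is standardized against its own sampling group, with a population standard deviation; real implementations differ in details, and the group sizes and rewards here are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampling group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical near-impossible prompt: 1 lucky success in 16 rollouts.
hard_group = [0.0] * 15 + [1.0]
# Hypothetical learnable prompt: 8 successes in 16 rollouts.
medium_group = [0.0] * 8 + [1.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(f"advantage of the lone success (hard prompt): {hard_adv[-1]:.2f}")
print(f"advantage of a success (medium prompt):      {medium_adv[-1]:.2f}")
```

Under these assumptions, the single fluke on the hard prompt receives an advantage several times larger than a routine success on the learnable prompt, which is the amplification the paragraph above describes.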
The order you feed tasks in matters too, because different task types move the model in opposite directions. Structured domains (math, code) drive output entropy down, while open-ended creative domains drive it up. Training the structured tasks first prevents entropy collapse from quietly destroying open-ended ability, a gain of about 6.2% over naive joint training (see "Does training order reshape how models handle different task types?"). Capacity competition isn't only about how many examples there are; it's about whether one task type's dynamics steamroll another's. A related collapse shows up in format: RL post-training tends to amplify a single dominant format inherited from pretraining and suppress the alternatives, and which format wins depends on model scale rather than on which format actually performs best (see "Does RL training collapse format diversity in pretrained models?").
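For readers who want the quantity behind "output entropy" pinned down, here is a small sketch using Shannon entropy over a next-token distribution. The two distributions are invented for illustration and are not measurements from the cited notes.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a four-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # confident, as after structured-task training
flat = [0.25, 0.25, 0.25, 0.25]    # diverse, as in open-ended generation

print(f"peaked entropy: {shannon_entropy(peaked):.3f} nats")
print(f"flat entropy:   {shannon_entropy(flat):.3f} nats")
```

A training signal that keeps pushing distributions toward the peaked shape lowers entropy; if that happens before the model has been trained on open-ended tasks, there is less diversity left for those tasks to work with.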
There's a deeper twist worth knowing: much of what training appears to "teach" on frequent tasks may be narrower than it looks. Instruction tuning, for instance, transfers knowledge of the *output space* far more than task understanding: models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). And several independent methods suggest the heavy lifting on reasoning is *elicitation* of capability already latent in the base model, not acquisition of anything new (see "Do base models already contain hidden reasoning ability?"). If frequent, simple training mostly selects and formats existing capability while capacity mostly protects rare features from being overwritten, then the real design question isn't "train more" but "which competitions am I letting the data mixture and task schedule decide for me?"
Two escape hatches sit at the edges of this. If interference is the enemy, you can sidestep it architecturally: tuning only the singular values of the weight matrices produces composable expert vectors that mix at inference without stepping on each other, giving specialization without the usual cross-task overwriting (see "Can models dynamically activate expert skills at inference time?"). And capacity isn't destiny: small models trained with DPO on a teacher's correct-and-incorrect pairs can match much larger ones on function calling, because explicit negative examples target the exact format-rigidity failures that scale would otherwise have to brute-force (see "Can small models match large models on function calling?").
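To show how an explicit negative example enters the objective, here is a minimal sketch of the standard per-pair DPO loss. The log-probabilities are hypothetical placeholders; in practice they would be summed token log-probs from the trained policy and a frozen reference model, and `beta = 0.1` is an assumed setting, not one reported in the cited note.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is how far the policy has moved from the reference on that
    response. The rejected (incorrect) response appears explicitly, so the
    loss falls when the policy lowers its probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return softplus(-logits)  # equals -log(sigmoid(logits))


# Hypothetical sequence log-probs for one correct/incorrect function call pair.
untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy == reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)   # prefers the correct call
regressed = dpo_loss(-15.0, -10.0, -12.0, -12.0)  # prefers the incorrect call

print(f"untrained: {untrained:.4f}")
print(f"improved:  {improved:.4f}")
print(f"regressed: {regressed:.4f}")
```

When the policy equals the reference the loss is log 2; it drops when the policy shifts probability from the incorrect call to the correct one, and rises in the opposite case. Supervised fine-tuning on correct examples alone has no term that penalizes the incorrect call directly.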
Sources (8 notes)
Larger models succeed at rare tasks not because they can represent solutions smaller models cannot, but because abundant capacity weakens gradients on common tasks, preventing them from overwriting slowly-accumulating rare-task features. Data-mixture design may be cheaper than scaling.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and the reported score of 43% is barely above the random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.