INQUIRING LINE

Why does instruction-following capability decrease as models scale stronger?

This explores why models that get better at reasoning often get worse at doing exactly what you told them — and whether that's a genuine trade-off or a fixable training artifact.


Scaling a model's reasoning power often comes at the cost of obedience, and the corpus suggests this is less a paradox than a predictable structural tension. The clearest evidence comes from work showing that training models to reason harder actively lowers their instruction adherence: advanced reasoning models reach only about 50% compliance on math tasks, and adherence falls further as the chain-of-thought grows longer (see: Why do more capable reasoning models ignore your instructions?; Why do better reasoning models ignore instructions?). The proposed mechanism is contextual distance: every extra step of reasoning the model generates pushes the original instruction further back in the context, diluting the attention paid to it. The model isn't ignoring you out of defiance; it's getting absorbed in its own train of thought.
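As a rough illustration of the dilution argument above, the toy calculation below shows how the share of softmax attention mass on the instruction shrinks as reasoning tokens accumulate. It is a sketch under strong assumptions, not a model of any real transformer: every instruction token is given one identical attention logit, every reasoning token another, and the function name and the `logit_gap` parameter are hypothetical.

```python
import math


def instruction_attention_share(n_instruction, n_reasoning, logit_gap=0.0):
    """Fraction of softmax attention mass that lands on instruction tokens.

    Toy assumption: all instruction tokens share one attention logit and all
    reasoning tokens share another; `logit_gap` is (instruction - reasoning).
    """
    instruction_mass = n_instruction * math.exp(logit_gap)
    reasoning_mass = float(n_reasoning)  # exp(0) per reasoning token
    return instruction_mass / (instruction_mass + reasoning_mass)


if __name__ == "__main__":
    # A 20-token instruction followed by ever longer chains of thought.
    for n_reasoning in (0, 100, 1000, 10000):
        share = instruction_attention_share(20, n_reasoning, logit_gap=1.0)
        print(f"{n_reasoning:>6} reasoning tokens: instruction share {share:.3f}")
```

Even when each instruction token is favoured over each reasoning token (a positive `logit_gap`), the instruction's share still falls as the chain grows, because the softmax denominator grows with every token added.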

The degradation also scales with how much you ask at once, not just how smart the model is. The IFScale benchmark finds three distinct failure curves: small models degrade linearly with instruction density, mid-range models degrade exponentially, and reasoning models hold steady up to around 150 instructions before collapsing steeply. Even the best models reach only 68% accuracy at maximum density (see: How does instruction density affect model performance?). So 'stronger' models don't follow instructions more reliably across the board; they follow more instructions before hitting a wall, then fail steeply rather than gracefully.
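The three curve shapes described above can be sketched as simple functions of instruction count. This only illustrates the shapes and is not a fit to IFScale data: the function names, slopes, and rates are made-up assumptions, and only the threshold of roughly 150 instructions comes from the note.

```python
import math


def linear_decay(n, slope=0.0015):
    """Small-model pattern: accuracy falls by a fixed amount per instruction."""
    return max(0.0, 1.0 - slope * n)


def exponential_decay(n, rate=0.005):
    """Mid-range pattern: accuracy falls by a fixed fraction per instruction."""
    return math.exp(-rate * n)


def threshold_decay(n, threshold=150, rate=0.004):
    """Reasoning-model pattern: flat up to a threshold, then a steep drop."""
    if n <= threshold:
        return 1.0
    return math.exp(-rate * (n - threshold))


if __name__ == "__main__":
    # Print the three illustrative curves side by side.
    print(f"{'n':>4} {'linear':>8} {'expon.':>8} {'thresh.':>8}")
    for n in (10, 50, 150, 300):
        print(f"{n:>4} {linear_decay(n):>8.2f} "
              f"{exponential_decay(n):>8.2f} {threshold_decay(n):>8.2f}")
```

The point of the comparison is the shape: the threshold curve looks best at low density, yet past the threshold it loses accuracy faster than the linear curve does.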

A lateral clue about why this happens sits in what instruction tuning actually teaches. One striking result: models trained on semantically empty or even deliberately wrong instructions perform almost identically to those trained on correct ones. What transfers is knowledge of the output format, not understanding of the task (see: Does instruction tuning teach task understanding or output format?). If instruction-following is largely a learned surface behavior, it makes sense that heavy reasoning training, which optimizes for getting the answer right, would erode it, since the two objectives were never deeply integrated to begin with. This echoes the finding that skills scale unevenly: reasoning and knowledge improve continuously with size, while skills such as metacognition and logical efficiency plateau early (at roughly 7B and 30B parameters) and surface-level style can be imitated without the underlying reasoning (see: Do all AI skills improve equally as models scale?).

The same pattern shows up outside formal reasoning, in conversation. Models score ~90% on single-message instructions but drop to ~65% across multi-turn dialogue, locking into premature guesses and failing to course-correct, a habit traced to RLHF rewarding helpfulness over asking for clarification (see: Why do AI assistants get worse at longer conversations?; Why do language models fail in gradually revealed conversations?). The common thread across reasoning and conversation: training objectives that reward confident forward motion (long chains, helpful answers) quietly punish the behaviors instruction-following depends on (staying anchored, deferring, checking back).

What you didn't come for but might want: the corpus hints the fix may be architectural rather than just better training. Approaches that freeze the main model and delegate reasoning to a lightweight auxiliary module preserve the pre-trained behavior instead of overwriting it (see: Can continuous reasoning avoid forgetting in instruction-tuned models?), and DPO training with explicit negative examples directly targets the rigid format failures that plain fine-tuning leaves behind (see: Can small models match large models on function calling?). The implication: the reasoning-vs-obedience trade-off may not be a law of nature but a consequence of cramming both jobs into the same weights with objectives that pull against each other.
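To make the "explicit negative examples" point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, where the chosen response is a correct output and the rejected response is an incorrect one. The note does not give the cited work's training code, so the function name, argument names, and the `beta` default are illustrative assumptions. The inputs are summed log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed without overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability toward the chosen (correct) output.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Unlike plain supervised fine-tuning, which only raises the likelihood of good outputs, this loss falls when the policy moves probability away from the rejected output relative to the reference, which is how a malformed output can be penalized directly.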


Sources 9 notes

Why do more capable reasoning models ignore your instructions?

Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

How does instruction density affect model performance?

The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range models), and threshold decay (reasoning models hold steady up to ~150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings, caused by locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Next inquiring lines