INQUIRING LINE

Why do instruction following and reasoning capability trade off in training?

This explores why training a model to reason harder seems to make it worse at simply doing what you told it — and what the corpus says about the mechanism behind that tension.


The most direct evidence is the MathIF work: as models are trained (via supervised fine-tuning and reinforcement learning) to reason longer and deeper, their instruction adherence drops. Top reasoning models follow instructions only about half the time on math problems (Why do more capable reasoning models ignore your instructions?). The proposed mechanism is almost spatial: the longer the chain of thought, the more 'contextual distance' opens up between the original instruction and the position where the model is generating, so attention to the instruction gets diluted (Why do better reasoning models ignore instructions?). The model isn't choosing to disobey; the instruction simply fades into the background as the reasoning trace grows.
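The dilution idea can be illustrated with a toy calculation. This is a minimal sketch, not the analysis used in the MathIF work: it assumes a single softmax attention head in which every instruction token gets one fixed score and every reasoning token gets another. The function name, the scores, and the token counts are all hypothetical values chosen for illustration.

```python
import math


def instruction_attention_share(
    n_reasoning_tokens,
    n_instruction_tokens=10,   # hypothetical instruction length
    instruction_logit=2.0,     # hypothetical score per instruction token
    reasoning_logit=0.0,       # hypothetical score per reasoning token
):
    """Fraction of softmax attention mass that lands on the instruction.

    Toy model: all instruction tokens share one logit and all reasoning
    tokens share another, so the softmax reduces to a ratio of two sums.
    """
    instruction_mass = n_instruction_tokens * math.exp(instruction_logit)
    reasoning_mass = n_reasoning_tokens * math.exp(reasoning_logit)
    return instruction_mass / (instruction_mass + reasoning_mass)


if __name__ == "__main__":
    # The share shrinks as the reasoning trace grows, even though each
    # instruction token still scores higher than each reasoning token.
    for n in (0, 100, 1000, 10000):
        print(n, round(instruction_attention_share(n), 4))
```

Under these assumptions the instruction keeps a per-token advantage yet loses total attention mass simply because the reasoning tokens outnumber it. Real models have many heads and layers, so this only shows why length alone could plausibly produce the effect.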

But there's a deeper framing the corpus suggests: reasoning and instruction-following may be learned as fundamentally different *kinds* of things, which is why optimizing one can quietly degrade the other. Instruction tuning, it turns out, mostly teaches a model the shape of acceptable outputs rather than genuine task understanding. Models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones, because what actually transfers is knowledge of the output space, not comprehension (Does instruction tuning teach task understanding or output format?). If instruction-following is largely a learned formatting reflex, then training that pushes the model toward long exploratory reasoning is pulling against that reflex rather than building on it.

That tension sharpens when you consider what reasoning training is actually doing. Several lines of work argue that reasoning capability already lives latent in the base model, and post-training merely *selects* or elicits it rather than creating it (Do base models already contain hidden reasoning ability?). Meanwhile, chain-of-thought itself looks less like genuine inference and more like constrained imitation of reasoning *form*: illogical CoT exemplars perform nearly as well as valid ones (Does logical validity actually drive chain-of-thought gains?), and performance degrades predictably the moment you leave the training distribution (Does chain-of-thought reasoning actually generalize beyond training data?; Does chain-of-thought reasoning reveal genuine inference or pattern matching?). So reasoning training is teaching the model to commit to a long, self-generated procedural rollout, and that very commitment is what crowds out the short, externally imposed constraint the user supplied.

The interesting upshot is that the trade-off may be more about *length and ordering* than about an inherent conflict between intelligence and obedience. If verbosity is the culprit, it's partly steerable: reasoning verbosity appears to be governed by a single linear direction in activation space, and chains can be compressed by two-thirds without hurting accuracy (Can we steer reasoning toward brevity without retraining?). That implies you could shorten the chain (and shrink the contextual distance) without sacrificing the reasoning. How training is sequenced also matters: establishing reasoning foundations through imitation *before* refining with verifiable rewards beats either method alone (Does sequencing imitation then exploration training improve reasoning?), hinting that the order in which capabilities are layered shapes whether they cooperate or compete. The takeaway you might not have expected: the conflict isn't 'smarter models care less about you'. It's that current training makes reasoning *long and front-loaded*, and instruction-following is a fragile, format-level habit that gets buried as the trace runs on.
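To make the "single linear direction" idea concrete, here is a minimal sketch of difference-of-means activation steering. It is a generic illustration and not the Activation-Steered Compression implementation. The activations are tiny synthetic lists standing in for hidden states that would really be read from a model, and the helper names and the `alpha` strength are assumptions made for this example.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(verbose_acts, concise_acts):
    """Direction from the mean concise activation to the mean verbose one."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [v - c for v, c in zip(mean_verbose, mean_concise)]


def steer(hidden, direction, alpha):
    """Shift a hidden state against the verbosity direction by strength alpha."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Synthetic stand-ins for paired activations (verbose vs. concise traces).
# In this toy data only the second coordinate differs between the pairs.
verbose_acts = [[1.0, 2.0, 0.5], [1.2, 2.2, 0.5]]
concise_acts = [[1.0, 0.0, 0.5], [1.2, 0.2, 0.5]]

direction = steering_vector(verbose_acts, concise_acts)
steered = steer([1.1, 2.1, 0.5], direction, alpha=1.0)

if __name__ == "__main__":
    print("direction:", direction)
    print("steered:", steered)
```

In this toy setup the extracted direction is nonzero only along the coordinate that separates verbose from concise examples, so steering moves a verbose-looking state toward the concise cluster and leaves the other coordinates alone. Whether shorter chains also restore instruction adherence is the open question the paragraph above raises; the sketch does not test it.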


Sources (9 notes)

Why do more capable reasoning models ignore your instructions?

Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains: why do instruction following and reasoning capability trade off in training? A curated library found — spanning 2023–09 — these dated claims, not current truth:

• As models train to reason longer via SFT + RL, instruction adherence drops to ~50% during math problems; longer chains dilute attention to instructions via 'contextual distance' (2025-05).
• Instruction tuning teaches output-format distribution, not task understanding; models trained on semantically empty or wrong instructions perform equivalently (2023-05).
• Chain-of-thought is constrained imitation of reasoning form, not genuine inference; logically invalid CoT exemplars perform nearly as well as valid ones (2023-07, 2025-06).
• Reasoning capability already latent in base models; post-training selects rather than creates it (2024-11).
• CoT verbosity occupies a single linear direction in activation space, compressible by two-thirds without accuracy loss (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023-05) — Do Models Really Learn to Follow Instructions?
• arXiv:2307.10573 (2023-07) — Invalid Logic, Equivalent Gains
• arXiv:2505.14810 (2025-05) — Scaling Reasoning, Losing Control
• arXiv:2507.04742 (2025-07) — Activation Steering for Chain-of-Thought Compression

Your task:

(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods (e.g., novel RL objectives, structured reasoning architectures), tooling (e.g., execution guardrails), or evaluation have since relaxed or overturned it. Separate the durable question (likely: does longer generation inherently compete with instruction adherence?) from the perishable limitation (possibly: current training sequences or verbosity patterns force the trade-off). Cite what resolved each constraint; plainly state where it still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months. The library hints at solutions (curriculum learning, activation steering, self-supervised RL for reasoning + instruction fusion). Have any papers since Sep 2025 claimed to dissolve the trade-off entirely, or proved it's orthogonal under certain conditions?

(3) Propose 2 research questions that assume the regime may have moved: e.g., "If reasoning verbosity is compressible, does short-chain reasoning restore instruction fidelity without losing capability?" or "Can joint training on reasoning *and* instruction-following constraints yield cooperative rather than competitive gradients?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
