Why does stronger reasoning reduce model compliance with instructions?
This explores why training a model to reason harder — longer chains of thought, more RL and SFT for problem-solving — tends to make it worse at obeying the explicit instructions you gave it.
The apparent paradox is that the better a model gets at reasoning, the less reliably it does what you actually told it to do. The most direct evidence is a measured trade-off: on the MathIF benchmark, SFT and RL training that improves reasoning also degrades instruction adherence, and advanced reasoning models follow instructions only about half the time (50.71%) while working through math problems (see: Why do more capable reasoning models ignore your instructions?). The mechanism is almost spatial: the longer the chain of thought, the more 'contextual distance' opens up between the original instruction and the point where the model finally generates its answer, so the instruction's pull on the model's attention gets diluted (see: Why do better reasoning models ignore instructions?). The instruction doesn't get overruled so much as left behind.
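To make "instruction adherence" concrete, here is a minimal sketch of how a compliance rate could be measured and broken down by chain-of-thought length. It is not MathIF's scoring code: the function names, the two example constraints (a word limit and valid JSON), and the 500-token bucket size are all illustrative assumptions.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def follows_word_limit(response: str, max_words: int) -> bool:
    """Check a 'keep it under N words' style constraint."""
    return len(response.split()) <= max_words


def is_valid_json(response: str) -> bool:
    """Check an 'answer in JSON' style constraint."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def adherence_rate(responses: List[str], check: Callable[[str], bool]) -> float:
    """Fraction of responses that satisfy the constraint."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if check(r)) / len(responses)


def adherence_by_cot_length(
    samples: Iterable[Tuple[int, str]],
    check: Callable[[str], bool],
    bucket_size: int = 500,  # hypothetical bucket width, in CoT tokens
) -> Dict[int, float]:
    """Group (cot_token_count, response) pairs into length buckets and
    report the adherence rate per bucket, keyed by bucket start."""
    buckets: Dict[int, List[bool]] = defaultdict(list)
    for cot_tokens, response in samples:
        start = (cot_tokens // bucket_size) * bucket_size
        buckets[start].append(check(response))
    return {start: sum(flags) / len(flags) for start, flags in sorted(buckets.items())}
```

If the contextual-distance explanation holds, the per-bucket rates should fall as the bucket start grows. Token counts would come from whatever tokenizer the model under test uses.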
There's a deeper reason this happens, and it's about what reasoning training actually changes. Fine-tuning for reasoning loosens the causal link between the reasoning steps and the final answer: after fine-tuning, you can truncate, paraphrase, or stuff filler into the chain of thought and the answer stays the same more often than it did before, which suggests the visible reasoning has become performative rather than functional (see: Does fine-tuning disconnect reasoning steps from final answers?). If the chain of thought is steering the output less, then anything riding on that chain, including 'remember to answer in JSON' or 'keep it under 50 words', loses its grip too. The model is optimizing toward a reasoning pattern, not toward your constraints.
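The perturbation tests described above can be sketched as a small harness. This is an illustrative outline, not the cited study's code: `answer_from_cot` is a hypothetical callable that wraps your model and returns a final answer given a question and a (possibly perturbed) chain of thought. Only truncation and filler substitution are implemented here, since paraphrasing needs a second model; a paraphraser can be passed in as another perturbation.

```python
from typing import Callable, Dict, List, Tuple

Perturbation = Callable[[str], str]


def truncate_cot(cot: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first part of the reasoning steps
    (steps are assumed to be newline-separated)."""
    steps = cot.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler_cot(cot: str, filler: str = "...") -> str:
    """Filler substitution: replace every whitespace-separated token
    with a content-free placeholder."""
    return " ".join(filler for _ in cot.split())


def invariance_rates(
    examples: List[Tuple[str, str]],              # (question, chain_of_thought)
    answer_from_cot: Callable[[str, str], str],   # hypothetical model wrapper
    perturbations: Dict[str, Perturbation],
) -> Dict[str, float]:
    """For each perturbation, return the fraction of examples whose final
    answer is unchanged. Higher invariance means the chain of thought has
    less causal influence on the answer."""
    if not examples:
        raise ValueError("need at least one example")
    rates: Dict[str, float] = {}
    for name, perturb in perturbations.items():
        unchanged = 0
        for question, cot in examples:
            baseline = answer_from_cot(question, cot)
            perturbed = answer_from_cot(question, perturb(cot))
            unchanged += int(baseline == perturbed)
        rates[name] = unchanged / len(examples)
    return rates
```

Running the same harness on a base model and its fine-tuned counterpart, then comparing the two sets of rates, is the comparison the paragraph describes.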
This connects to a quieter finding about how instruction-following was ever learned in the first place. Instruction tuning largely teaches a model the *shape* of acceptable output, not the *meaning* of the instruction: models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones (see: Does instruction tuning teach task understanding or output format?). So compliance was always a relatively shallow surface behavior. When heavy reasoning training reshapes the model around a competing objective (get the math right), that shallow layer is exactly the kind of thing that gets overwritten.
The corpus also suggests the reasoning itself is more fragile than its competence implies, which compounds the problem. Reasoning models 'wander' and switch paths prematurely, abandoning good solutions mid-stream (see: Why do reasoning models abandon promising solution paths?), and chain-of-thought looks less like genuine inference than constrained imitation of reasoning *forms* learned in training (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?). A model busy reproducing a familiar reasoning script has less room to also honor an out-of-distribution formatting or behavioral constraint that the script never included.
The hopeful note is that this trade-off may not be fundamental to the architecture, only to current training. Verbosity turns out to be a single steerable direction in activation space: you can cut chain-of-thought length by about two-thirds (67%) while keeping accuracy, no retraining required (see: Can we steer reasoning toward brevity without retraining?). If shorter chains mean less contextual distance, then controllability and reasoning depth might be separable knobs rather than two ends of one slider, which would make the 'more reasoning, less listening' tax an artifact we can engineer around, not a law we're stuck with.
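As a rough illustration of what "a single steerable direction" means, the sketch below derives a vector from paired activations and adds it to a hidden state. It uses the common difference-of-means recipe, which is an assumption here and not necessarily how the cited method extracts its vector. The plain-list representation and the names `steering_vector` and `steer` are also illustrative.

```python
from typing import List, Sequence

Vector = List[float]


def steering_vector(
    verbose_acts: Sequence[Vector],
    concise_acts: Sequence[Vector],
) -> Vector:
    """Mean difference (concise minus verbose) over paired activations.
    Each pair is assumed to come from the same prompt answered with a
    long and a short chain of thought, read at the same layer."""
    if not verbose_acts or len(verbose_acts) != len(concise_acts):
        raise ValueError("need equal, non-zero numbers of paired activations")
    n = len(verbose_acts)
    dim = len(verbose_acts[0])
    return [
        sum(c[i] - v[i] for v, c in zip(verbose_acts, concise_acts)) / n
        for i in range(dim)
    ]


def steer(hidden: Vector, vector: Vector, strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction.
    `strength` is a hypothetical knob; a useful value would be tuned empirically."""
    return [h + strength * d for h, d in zip(hidden, vector)]
```

In a real model the activations would be tensors captured at a chosen layer, and the shift would be applied during generation, for example through a forward hook. The toy version above only shows the arithmetic.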
Sources (7 notes)
Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.