Can reasoning fine-tuning improve both capability and instruction compliance together?
This explores whether training a model to reason better also makes it follow instructions better — or whether the two goals quietly work against each other.
The corpus's honest answer is that these two goals often pull in opposite directions, so 'together' is the hard part. The most direct evidence comes from the MathIF benchmark, which finds that both supervised and reinforcement training improve reasoning while *reducing* a model's adherence to instructions. The effect gets worse as chain-of-thought grows longer, because a long reasoning chain puts contextual distance between the model and the original request, diluting its attention to what was actually asked (see 'Why do better reasoning models ignore instructions?'). So the naive expectation (train it to think harder, get a better all-around assistant) isn't what happens by default.
There's a deeper wrinkle: it's not even clear the 'capability' half is real gain rather than repackaging. Several notes argue that base models already carry latent reasoning, and that post-training mostly *selects* or *times* it rather than creating it: RL teaches a model *when* to reason, not *how* (see 'Does RL post-training create reasoning or just deploy it?'), and five independent methods all elicit reasoning that was already sitting in base-model activations (see 'Do base models already contain hidden reasoning ability?'). Worse, fine-tuning can inflate benchmark scores while hollowing out the reasoning behind them: the 'SFT accuracy trap' shows final-answer accuracy rising even as the quality of inferential steps, measured as Information Gain, drops by about 39%, meaning models reach right answers by post-hoc rationalization (see 'Does supervised fine-tuning improve reasoning or just answers?'). Fine-tuning also weakens the causal link between the stated reasoning steps and the final answer, making the chain performative rather than functional (see 'Does fine-tuning disconnect reasoning steps from final answers?'), and RL-tuned models still collapse on out-of-distribution variants, suggesting sharpened memorization over installed procedure (see 'Do fine-tuned language models actually learn optimization procedures?'). If the capability gain is partly an illusion, asking it to coexist with instruction-following is asking the wrong question.
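One of the faithfulness checks mentioned above, early termination, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the cited notes: `answer_fn` is a hypothetical callable standing in for a model that maps a question and a (possibly truncated) list of reasoning steps to a final answer, and the two toy "models" exist only to show the two extremes.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (question, reasoning_steps) -> final answer string.
AnswerFn = Callable[[str, Sequence[str]], str]


def truncation_invariance(question: str, steps: List[str], answer_fn: AnswerFn) -> float:
    """Fraction of truncated chains (proper prefixes of `steps`) whose answer
    matches the answer produced from the full chain.

    A value near 1.0 means the answer barely depends on the reasoning
    (performative); a value near 0.0 means the reasoning is doing real work.
    """
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(question, steps)
    prefixes = [steps[:k] for k in range(len(steps))]  # lengths 0 .. len(steps) - 1
    unchanged = sum(answer_fn(question, prefix) == full_answer for prefix in prefixes)
    return unchanged / len(prefixes)


def performative_model(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in: ignores the reasoning entirely."""
    return "12"


def functional_model(question: str, steps: Sequence[str]) -> str:
    """Toy stand-in: reads its answer off the last reasoning step."""
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


chain = ["3 * 2 = 6", "6 + 6 = 12"]
print(truncation_invariance("What is 3 * 2 + 6?", chain, performative_model))
print(truncation_invariance("What is 3 * 2 + 6?", chain, functional_model))
```

In this toy setup the model that ignores its reasoning scores as fully invariant, while the one that reads its answer from the chain does not. With a real model, `answer_fn` would wrap a generation call that forces an answer after the truncated chain.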
But the corpus also points to ways the two goals *can* be reconciled, mostly by changing what the reward or the architecture optimizes. The most promising is decomposing instruction quality into verifiable sub-criteria: checklist-based reward (RLCF/RaR) turns 'did it follow the instruction' into concrete checkable items, letting RL improve subjective instruction-following without overfitting to the superficial artifacts that fool holistic reward models (see 'Can breaking down instructions into checklists improve AI reward signals?'). This matters because instruction tuning on its own may teach surprisingly little about the task: models trained on semantically empty or wrong instructions perform about as well as those given correct ones, suggesting standard instruction tuning transfers output *format*, not understanding (see 'Does instruction tuning teach task understanding or output format?'). Reward design, not more tokens, is where compliance actually lives.
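The decomposition idea can be made concrete with a small sketch. The names here (`ChecklistItem`, `checklist_reward`) and the example instruction are hypothetical, and the checks are simple string predicates; the RLCF and RaR methods themselves may derive and score checklist items quite differently (for example with a judge model), so treat this only as an illustration of the principle.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    """One verifiable sub-criterion derived from an instruction (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]
    weight: float = 1.0


def checklist_reward(response: str, items: List[ChecklistItem]) -> float:
    """Weighted fraction of checklist items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    passed = sum(item.weight for item in items if item.check(response))
    return passed / total


# Illustrative instruction:
# "Answer in at most 20 words, mention Paris, and do not use bullet points."
items = [
    ChecklistItem("at most 20 words", lambda r: len(r.split()) <= 20),
    ChecklistItem("mentions Paris", lambda r: "paris" in r.lower()),
    ChecklistItem(
        "no bullet points",
        lambda r: not any(line.lstrip().startswith(("-", "*")) for line in r.splitlines()),
    ),
]

compliant = "The capital of France is Paris."
bulleted = "- Paris\n- It is the capital"
print(checklist_reward(compliant, items))
print(checklist_reward(bulleted, items))
```

Because each item is checked separately, a response that is otherwise acceptable but breaks one constraint loses only that item's share of the reward, which gives a denser and harder-to-game signal than a single holistic score.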
On the capability side, a couple of notes suggest you can add reasoning without paying the usual tax. SoftCoT freezes the main model and delegates continuous 'thought' generation to a small auxiliary module, preserving pretrained knowledge (and presumably instruction-following behavior) instead of overwriting it (see 'Can continuous reasoning avoid forgetting in instruction-tuned models?'). And RLAG rewards both answer accuracy *and* explanation rationality together, internalizing coherent knowledge more effectively than SFT's token-level correctness (see 'Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?'), a rare example of jointly optimizing two signals rather than trading one for the other. The point worth taking away: the failure isn't that reasoning and obedience are inherently incompatible. It's that the *default* training recipe optimizes a single proxy (final-answer accuracy) that rewards longer, more self-absorbed reasoning chains, and longer chains are exactly what erode instruction adherence and faithful inference. Fix what you reward, and the trade-off softens.
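To illustrate the difference between a single-proxy reward and a jointly optimized one, here is a minimal sketch. It is not RLAG's actual reward; `joint_reward` and the mixing weight `alpha` are assumptions introduced for illustration, and `compliance` is assumed to be some score in [0, 1], for instance from a checklist.

```python
def accuracy_only_reward(answer_correct: bool) -> float:
    """Single-proxy reward: only the final answer matters."""
    return float(answer_correct)


def joint_reward(answer_correct: bool, compliance: float, alpha: float = 0.5) -> float:
    """Blend final-answer accuracy with an instruction-compliance score.

    `alpha` (hypothetical) is the weight on accuracy; 1 - alpha goes to compliance.
    """
    if not 0.0 <= compliance <= 1.0:
        raise ValueError("compliance must be in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * float(answer_correct) + (1.0 - alpha) * compliance


# Two correct responses: one follows the instruction, one ignores it.
print(accuracy_only_reward(True), accuracy_only_reward(True))  # indistinguishable
print(joint_reward(True, compliance=1.0), joint_reward(True, compliance=0.0))
```

Under the accuracy-only reward the two responses are indistinguishable, so nothing in training pushes back against chains that drift away from the request. The blended reward separates them, at the cost of having to choose `alpha`.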
Sources (10 notes)
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with a reported 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.