INQUIRING LINE

Training an AI to reason better makes it smarter at hard problems — but worse at following your instructions.

Can reasoning fine-tuning improve both capability and instruction compliance together?

This explores whether training a model to reason better also makes it follow instructions better — or whether the two goals quietly work against each other.

The corpus's honest answer is that these two goals often pull in opposite directions, so 'together' is the hard part. The most direct evidence comes from the MathIF benchmark, which finds that both supervised and reinforcement training improve reasoning while *reducing* a model's adherence to instructions. The effect gets worse as chain-of-thought grows longer, because a long reasoning chain puts contextual distance between the model and the original request, diluting its attention to what was actually asked (Why do better reasoning models ignore instructions?). So the naive expectation (train it to think harder, get a better all-around assistant) isn't what happens by default.

There's a deeper wrinkle: it's not even clear the 'capability' half is real gain rather than repackaging. Several notes argue that base models already carry latent reasoning, and that post-training mostly *selects* or *times* it rather than creating it: RL teaches a model *when* to reason, not *how* (Does RL post-training create reasoning or just deploy it?), and five independent methods all elicit reasoning that was already sitting in base-model activations (Do base models already contain hidden reasoning ability?). Worse, fine-tuning can inflate benchmark scores while hollowing out the reasoning behind them: the 'SFT accuracy trap' shows final-answer accuracy rising even as the quality of inferential steps (measured as Information Gain) drops by about 39%, meaning models reach right answers by post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?). Fine-tuning also weakens the causal link between the stated reasoning steps and the final answer, making the chain performative rather than functional (Does fine-tuning disconnect reasoning steps from final answers?), and RL-tuned models still collapse on out-of-distribution variants, suggesting sharpened memorization over installed procedure (Do fine-tuned language models actually learn optimization procedures?). If the capability gain is partly an illusion, asking it to coexist with instruction-following is asking the wrong question.
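The 'performative rather than functional' claim can be made concrete with a small sketch. It mirrors two of the perturbation tests named in the source notes (early termination and filler substitution): perturb the reasoning chain and count how often the final answer stays the same. Everything here is an illustrative assumption, not the cited papers' code: the `AnswerFn` interface, the helper names, and the two toy "models" are hypothetical stand-ins for a real model call.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer string.
AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]


def fill(steps: List[str], filler: str = "...") -> List[str]:
    """Filler substitution: replace every step with a contentless token."""
    return [filler for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbation.

    A high rate suggests the chain is not driving the answer.
    """
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)


# Toy stand-ins for a model (assumptions for illustration only).
def chain_dependent(question: str, steps: List[str]) -> str:
    # Reads the answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip()


def chain_ignoring(question: str, steps: List[str]) -> str:
    # Answers without looking at the chain at all.
    return "4"


examples = [("What is 2 + 2?", ["add the numbers", "2 + 2 = 4"])]

for name, fn in [("chain_dependent", chain_dependent), ("chain_ignoring", chain_ignoring)]:
    print(name, invariance_rate(fn, examples, truncate), invariance_rate(fn, examples, fill))
```

In this toy setup the chain-ignoring function is fully invariant to both perturbations, which is the signature the faithfulness note associates with fine-tuned models.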

But the corpus also points to ways the two goals *can* be reconciled, mostly by changing what the reward or the architecture optimizes. The most promising is decomposing instruction quality into verifiable sub-criteria: checklist-based reward (RLCF/RaR) turns 'did it follow the instruction' into concrete checkable items, letting RL improve subjective instruction-following without overfitting to the superficial artifacts that fool holistic reward models (Can breaking down instructions into checklists improve AI reward signals?). This matters because instruction tuning on its own may teach surprisingly little about the task: models trained on semantically empty or wrong instructions perform about as well as those given correct ones, suggesting standard instruction tuning transfers output *format*, not understanding (Does instruction tuning teach task understanding or output format?). Reward design, not more tokens, is where compliance actually lives.
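A minimal sketch of the checklist idea, assuming the simplest possible form: each checklist item is a predicate over the response, and the reward is the weighted fraction of items satisfied. The `ChecklistItem` class, the `checklist_reward` function, and the string-based checks are hypothetical and chosen for illustration; the cited methods may score items differently (for example, with a judge model rather than string predicates).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    description: str
    check: Callable[[str], bool]  # True if the response satisfies this item
    weight: float = 1.0


def checklist_reward(response: str, items: List[ChecklistItem]) -> float:
    """Weighted fraction of checklist items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.check(response))
    return earned / total


# Hypothetical instruction: "Answer in at most 20 words, end with 'Done.',
# and do not use the word 'obviously'."
items = [
    ChecklistItem("at most 20 words", lambda r: len(r.split()) <= 20),
    ChecklistItem("ends with 'Done.'", lambda r: r.strip().endswith("Done.")),
    ChecklistItem("avoids 'obviously'", lambda r: "obviously" not in r.lower()),
]

good = "The sum is 4. Done."
bad = "Obviously the sum is 4."

print(checklist_reward(good, items))
print(checklist_reward(bad, items))
```

Because each item is scored separately, a response cannot earn full reward by looking polished overall while missing a stated constraint, which is the failure the paragraph above attributes to holistic reward models.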

On the capability side, a couple of notes suggest you can add reasoning without paying the usual tax. SoftCoT freezes the main model and delegates continuous 'thought' generation to a small auxiliary module, preserving pretrained knowledge (and presumably instruction-following behavior) instead of overwriting it (Can continuous reasoning avoid forgetting in instruction-tuned models?). And RLAG rewards both answer accuracy *and* explanation rationality together, internalizing coherent knowledge more effectively than SFT's token-level correctness (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?), a rare example of jointly optimizing two signals rather than trading one for the other. The thing worth knowing that you didn't know to ask: the failure isn't that reasoning and obedience are inherently incompatible. It's that the *default* training recipe optimizes a single proxy (final-answer accuracy) that rewards longer, more self-absorbed reasoning chains, and longer chains are exactly what erode instruction adherence and faithful inference. Fix what you reward, and the trade-off softens.
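A toy sketch of why the choice of proxy matters, under stated assumptions: two candidate responses both reach the right answer, but only one respects the instruction. An accuracy-only reward cannot tell them apart, while a joint reward can. The `Candidate` fields, the linear weighting, and the adherence scores are hypothetical illustration values, not any cited method's formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    answer_correct: bool
    adherence: float  # assumed score in [0, 1], e.g. from a checklist


def accuracy_only(c: Candidate) -> float:
    """Single-proxy reward: final-answer correctness and nothing else."""
    return 1.0 if c.answer_correct else 0.0


def joint_reward(c: Candidate, adherence_weight: float = 0.5) -> float:
    """Weighted mix of correctness and instruction adherence (assumed linear)."""
    return (1.0 - adherence_weight) * accuracy_only(c) + adherence_weight * c.adherence


candidates = [
    Candidate("long_chain", answer_correct=True, adherence=0.25),
    Candidate("compliant", answer_correct=True, adherence=1.0),
]

# Accuracy alone gives both candidates the same reward.
print([accuracy_only(c) for c in candidates])

# The joint reward separates them and prefers the compliant response.
print(max(candidates, key=joint_reward).name)
```

The linear mix is only one option. The design point is that adherence has to appear somewhere in the training signal for the optimizer to have any reason to preserve it.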

Sources 10 notes

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether reasoning fine-tuning can improve capability AND instruction compliance together. This question remains open despite recent work. Here's what a curated arXiv library found (2023–2025; treat as dated claims, not current truth):

— Scaling reasoning capability creates an instruction-following deficit: supervised and RL training both improve reasoning while reducing adherence to instructions; the effect worsens as chain-of-thought grows longer, creating contextual distance from the original request (2025-05).
— Base models may already possess latent reasoning; post-training selects *when* to reason, not *how*. Fine-tuning can inflate benchmark scores while reasoning quality drops ~39%, via post-hoc rationalization rather than functional inference (2024-11, 2025-08).
— Reward design, not scale, unlocks joint optimization: checklist-based reward (RLCF) decomposes instruction-following into verifiable sub-criteria, and RLAG rewards both accuracy AND explanation rationality together, avoiding the typical trade-off (2025-07, 2025-09).
— Architectural isolation (e.g., SoftCoT) preserves frozen pretrained knowledge while delegating continuous reasoning to an auxiliary module, sidestepping the overwrite problem (2025-02).
— Standard instruction tuning may teach output format, not task understanding; models trained on semantically empty instructions perform comparably to correct ones (2023-05).

Anchor papers (verify; mind their dates):
- arXiv:2505.14810 (2025-05): Scaling Reasoning, Losing Control
- arXiv:2507.18624 (2025-07): Checklists Are Better Than Reward Models
- arXiv:2509.20162 (2025-09): Embedding Domain Knowledge via RL from Augmented Generation
- arXiv:2512.07783 (2025-12): On the Interplay of Pre-Training, Mid-Training, and RL

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (O1-family, Gemini 2.0, Claude reasoning), improved training methods (multi-objective loss weighting, auxiliary classifiers for instruction fidelity), or architectural innovations (speculative reasoning, external constraint modules) have since RELAXED or OVERTURNED the capability–compliance trade-off. Separate the durable tension (instruction adherence vs. reasoning depth may still conflict at scale) from the perishable limitation (maybe solved by checklist rewards or modular design). Cite what resolved it; flag where the trade-off still holds.
(2) Surface the strongest contradicting or superseding work from the last 3–6 months—especially any that shows joint optimization WITHOUT decomposition or modularity.
(3) Propose 2 research questions that assume the regime may have moved: (a) Do recent reasoning-optimized models show *improved* instruction compliance out-of-the-box, or do they still require deliberate alignment design? (b) Can a single reward signal (e.g., outcome-based + instruction fidelity) replace the checklist/decomposition pattern, or is decomposition necessary?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
