INQUIRING LINE

Why does reasoning fine-tuning reduce a model's ability to abstain?

This explores why training a model to reason better makes it worse at saying 'I don't know' — and what that trade-off reveals about how reasoning fine-tuning actually changes a model.


The most direct answer in the corpus is also the most concrete: reasoning fine-tuning degrades a model's ability to abstain by roughly 24%, because the training signal rewards producing complete answers and quietly punishes 'I don't know' responses (see: Does reasoning fine-tuning make models worse at declining to answer?). The model isn't learning to be reckless on purpose; it is learning that, during training, confidently answering paid off and hedging never did. Abstention is the casualty of an optimization target that only counts finished answers.
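A toy calculation makes the incentive concrete. The sketch below is not taken from the cited note: the reward values and the names `expected_reward` and `best_action` are assumptions chosen for illustration. It compares the expected reward of committing to an answer against abstaining, first under a grader that only credits correct answers, then under one that penalizes wrong answers.

```python
def expected_reward(p_correct: float, r_correct: float, r_wrong: float) -> float:
    """Expected reward of committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct: float, r_correct: float = 1.0,
                r_wrong: float = 0.0, r_abstain: float = 0.0) -> str:
    """Pick the action with the higher expected reward (ties go to abstaining).

    The defaults model an answer-only grader (an assumption for illustration):
    a wrong answer and "I don't know" both score zero.
    """
    answer_value = expected_reward(p_correct, r_correct, r_wrong)
    return "answer" if answer_value > r_abstain else "abstain"


# Answer-only grading: even a 5% chance of being right beats abstaining.
low_confidence_choice = best_action(0.05)

# Penalizing wrong answers restores a threshold below which abstaining wins.
penalized_choice = best_action(0.05, r_wrong=-1.0)
```

Under the default rewards, answering wins whenever the model has any nonzero chance of being right, which is the "hedging never paid off" dynamic in miniature. With a wrong-answer penalty of -1, abstaining wins until the model is more than 50% likely to be correct.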

What makes this more than a one-off finding is that the same shape shows up across several other notes, under different names. Supervised fine-tuning raises benchmark accuracy while cutting Information Gain, a measure of reasoning-step quality, by 38.9%: models start producing correct-looking answers by post-hoc rationalization rather than genuine inference, and standard metrics miss it because they only score the final answer (see: Does supervised fine-tuning improve reasoning or just answers?). Relatedly, fine-tuning weakens the causal link between a model's reasoning chain and its output: you can truncate, paraphrase, or stuff filler into the reasoning and the answer stays the same more often than it did before fine-tuning, meaning the reasoning has become performance rather than function (see: Does fine-tuning disconnect reasoning steps from final answers?). Put these together and a picture emerges: fine-tuning optimizes for the appearance of a confident, complete answer, and 'committing to an answer no matter what' is exactly the disposition that erodes abstention.
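The perturbation idea behind those faithfulness tests can be sketched in a few lines. This is a minimal illustration, not the cited work's harness: the `Model` interface, the helper names, and the toy models are all hypothetical. Paraphrasing is left out because it would need an external paraphraser.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
Model = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the reasoning steps."""
    return steps[: len(steps) // 2]


def filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return ["..." for _ in steps]


def invariance_rate(model: Model, question: str, steps: List[str],
                    perturbations: List[Perturbation]) -> float:
    """Fraction of perturbations that leave the final answer unchanged.

    A rate near 1.0 suggests the answer does not depend on the reasoning.
    """
    baseline = model(question, steps)
    unchanged = sum(model(question, p(steps)) == baseline for p in perturbations)
    return unchanged / len(perturbations)


# Toy models (assumptions): one ignores its reasoning, one reads its last step.
def ignores_reasoning(question: str, steps: List[str]) -> str:
    return "42"


def uses_reasoning(question: str, steps: List[str]) -> str:
    return steps[-1] if steps else ""


chain = ["a", "b", "c", "4"]
performative = invariance_rate(ignores_reasoning, "q", chain, [truncate, filler])
functional = invariance_rate(uses_reasoning, "q", chain, [truncate, filler])
```

Here `performative` comes out at 1.0 and `functional` at 0.0. Real models fall between those extremes, and the finding above is that fine-tuning moves them toward the performative end.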

There's a deeper reason the trade-off is so stubborn. Several notes argue that reasoning was largely already present in the base model, and that post-training mostly selects *when* to deploy it rather than teaching it from scratch (see: Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). If training is tuning deployment behavior rather than installing new capability, then what it's really tuning is the model's *policy* about how eagerly to answer, and a policy optimized to always engage is a policy that has unlearned restraint. That's reinforced by evidence that RL fine-tuning often sharpens memorized template-matching rather than installing real procedures, so models over-commit to answers that collapse on out-of-distribution variants (see: Do fine-tuned language models actually learn optimization procedures?).

The same eagerness-over-judgment dynamic appears in adjacent failures, which is the surprising part: this isn't only an abstention problem. Scaling reasoning ability also degrades instruction-following, because longer chains of thought create 'contextual distance' that dilutes attention to the original request (see: Why do better reasoning models ignore instructions?). And reasoning models tend to wander or switch paths prematurely instead of knowing when to stop (see: Why do reasoning models abandon promising solution paths?; Do reasoning models switch between ideas too frequently?). Across all of these, the missing skill is the same one abstention requires, a calibrated sense of when to stop, when to obey, and when to decline.

If the cause is an over-eager deployment policy, the corpus also hints at the fix: don't bake the eagerness into the weights. Approaches that keep the backbone frozen and delegate reasoning to a light auxiliary model avoid the forgetting and over-commitment that full fine-tuning causes (see: Can continuous reasoning avoid forgetting in instruction-tuned models?), while decoupled-RL routing teaches a model to choose between thinking hard and answering briefly, a 'when to engage' signal that's exactly the missing calibration (see: Can models learn when to think versus respond quickly?). Even training-free activation steering can dial reasoning behavior up or down after the fact (see: Can we steer reasoning toward brevity without retraining?). The throughline worth taking away: a model that has been trained to always produce an answer has, almost by definition, been trained out of the judgment to withhold one, and the more promising designs treat 'whether to answer' as a separate, steerable decision rather than something to optimize away.
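To show what "dialing behavior up or down after the fact" can mean mechanically, here is a generic difference-of-means steering sketch. It is an assumption-laden illustration, not the procedure of the cited steering method: the function names, the toy two-dimensional activations, and the choice of a mean difference are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(verbose: Sequence[Vector], concise: Sequence[Vector]) -> Vector:
    """Steering direction: mean concise activation minus mean verbose activation."""
    dim = len(verbose[0])
    mean_verbose = [sum(a[i] for a in verbose) / len(verbose) for i in range(dim)]
    mean_concise = [sum(a[i] for a in concise) / len(concise) for i in range(dim)]
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction; alpha sets strength and sign."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up for illustration) from paired verbose/concise examples.
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[0.0, 2.0], [0.0, 4.0]]

direction = mean_difference(verbose_acts, concise_acts)
steered = steer([1.0, 1.0], direction, alpha=0.5)
untouched = steer([1.0, 1.0], direction, alpha=0.0)
```

The design point is that the model's weights never change. With `alpha` set to zero the hidden state is untouched, so the behavior stays a separate, adjustable decision rather than something trained into the backbone.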


Sources: 12 notes

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating claims about reasoning fine-tuning and abstention. The question remains: why does training a model to reason better degrade its ability to say 'I don't know'?

What a curated library found — and when (dated claims, not current truth):
Findings span late 2024 through the end of 2025. Key constraints reported:
- Reasoning fine-tuning reduces abstention capacity by ~24% because training rewards complete answers and punishes 'I don't know' (arXiv:2506.09038, ~2025-06).
- Supervised fine-tuning raises benchmark accuracy while cutting reasoning-step quality by ~39%; models post-hoc rationalize rather than genuinely infer (arXiv:2411.15382, ~2024-11).
- Fine-tuning weakens causal link between reasoning chain and output; reasoning becomes performance, not function (arXiv:2504.09762, ~2025-04).
- RL post-training sharpens memorized template-matching over real procedures, causing over-commitment on OOD variants (arXiv:2504.07912, ~2025-04).
- Scaling reasoning degrades instruction-following; longer CoT creates 'contextual distance' diluting attention to original request (arXiv:2505.14810, ~2025-05).

Anchor papers (verify; mind their dates):
- arXiv:2506.09038 (AbstentionBench, 2025-06)
- arXiv:2411.15382 (Fine-Tuning Impact on CoT, 2024-11)
- arXiv:2502.12134 (SoftCoT, 2025-02)
- arXiv:2505.13379 (Thinkless, 2025-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 24% abstention drop, the reasoning-quality drop, the CoT faithfulness erosion, and the instruction-following deficit: has any post-June 2025 work (newer scaling laws, training methods, decoding schemes, or evaluation harnesses) relaxed or overturned these? Separate durable findings (eager-answer-policy problem likely persists) from perishable limitations (e.g., if architectural changes now decouple reasoning from commitment). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — any claim that reasoning fine-tuning *preserves* or *improves* abstention, or that the policy problem is less binding than reported.
(3) Propose 2 research questions that assume the regime may have moved: (a) if newer models have learned to calibrate abstention despite extended reasoning, what architectural or training inversion caused it? (b) if the trade-off persists, can steering or auxiliary routing genuinely restore abstention *in deployed models*, not just in labs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
