Does reasoning fine-tuning actually reduce a model's ability to abstain?
This explores whether training a model to reason harder makes it worse at saying 'I don't know' — and the corpus says yes, with a clear mechanism behind it.
Reasoning fine-tuning does erode a model's willingness to abstain, by roughly 24% in one measurement from the collection (Does reasoning fine-tuning make models worse at declining to answer?). The reason isn't mysterious. The training signal rewards producing a complete answer and quietly punishes 'I don't know.' So a model optimized to reason its way to a final answer learns to always produce one, and to sound confident doing it, even when it shouldn't.
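The incentive described above can be made concrete with a toy scoring rule. This is a minimal sketch, not the reward used in any of the cited notes: the reward values `r_correct`, `r_wrong`, and `r_abstain` and all function names are assumptions chosen for illustration. It shows that when a wrong answer costs nothing, committing beats abstaining at any nonzero confidence.

```python
def expected_reward(p_correct: float, r_correct: float = 1.0,
                    r_wrong: float = 0.0) -> float:
    """Expected reward for committing to an answer that is right
    with probability p_correct (hypothetical scoring rule)."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def should_answer(p_correct: float, r_correct: float = 1.0,
                  r_wrong: float = 0.0, r_abstain: float = 0.0) -> bool:
    """True when answering has strictly higher expected reward than abstaining."""
    return expected_reward(p_correct, r_correct, r_wrong) > r_abstain


def abstention_threshold(r_correct: float, r_wrong: float,
                         r_abstain: float) -> float:
    """Confidence at which answering and abstaining pay the same.
    Below this value, abstaining is the better policy."""
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


if __name__ == "__main__":
    # Correctness-only reward: a wrong answer is scored the same as abstaining.
    print(abstention_threshold(1.0, 0.0, 0.0))
    # Penalising wrong answers opens up a range where declining is optimal.
    print(abstention_threshold(1.0, -1.0, 0.0))
```

Under the correctness-only setting the threshold is zero, so a reward-maximizing policy never declines. Any penalty for wrong answers (or any positive credit for abstaining) moves the threshold above zero and gives 'I don't know' a region where it wins.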
What makes this interesting is that the abstention loss is one symptom of a larger pattern the corpus keeps surfacing: reasoning fine-tuning often improves the *appearance* of an answer while hollowing out the thinking underneath. Supervised fine-tuning raises benchmark accuracy while cutting the quality of the actual reasoning steps, measured as Information Gain, by nearly 39%, producing correct answers through after-the-fact rationalization rather than genuine inference (Does supervised fine-tuning improve reasoning or just answers?). A parallel finding shows the reasoning chain becomes decorative: you can truncate it, paraphrase it, or stuff it with filler and the final answer barely moves (Does fine-tuning disconnect reasoning steps from final answers?). A model whose reasoning no longer drives its answer has no honest internal route to 'I'm not sure,' so abstention is exactly the capability you'd expect to vanish first.
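The truncation and filler checks mentioned above reduce to one measurement: how often the final answer stays the same after the reasoning chain is perturbed. The sketch below is illustrative only. `answer_fn` stands in for a real model call, the two stub models are hypothetical, and the paraphrase test is omitted because it would need a second model.

```python
from typing import Callable, List, Tuple

# A perturbation maps a reasoning chain to a damaged version of it.
Perturbation = Callable[[str], str]
# answer_fn(question, chain) returns the final answer given that chain.
AnswerFn = Callable[[str, str], str]


def truncate_half(chain: str) -> str:
    """Early termination: keep only the first half of the reasoning steps."""
    steps = chain.split("\n")
    return "\n".join(steps[: len(steps) // 2])


def filler(chain: str) -> str:
    """Filler substitution: replace every word with a placeholder token."""
    return " ".join("..." for _ in chain.split())


def invariance_rate(answer_fn: AnswerFn,
                    examples: List[Tuple[str, str]],
                    perturb: Perturbation) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.
    A high rate suggests the chain is not driving the answer."""
    unchanged = sum(
        answer_fn(q, chain) == answer_fn(q, perturb(chain))
        for q, chain in examples
    )
    return unchanged / len(examples)


# Hypothetical stand-ins for real models, used only to exercise the metric.
def decorative_model(question: str, chain: str) -> str:
    return "42"  # ignores the chain entirely


def faithful_model(question: str, chain: str) -> str:
    return chain.split("\n")[-1]  # answer is read off the last step


EXAMPLES = [("q1", "step a\nstep b\nanswer 42"), ("q2", "x\ny\nanswer 7")]

if __name__ == "__main__":
    for name, model in [("decorative", decorative_model), ("faithful", faithful_model)]:
        print(name, invariance_rate(model, EXAMPLES, truncate_half))
```

A rise in this rate after fine-tuning is the signature the note describes: the chain is still being written, but the answer no longer depends on it.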
There's a deeper framing worth knowing about. Several notes argue that reasoning isn't really *created* by post-training: base models already carry it latently, and training mostly teaches the model *when* to deploy it (Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). If reasoning is fundamentally a routing decision, then abstention is just another branch of that same decision: knowing when *not* to commit. Fine-tuning that only ever rewards committing collapses that branch. This is also why the damage looks like memorization rather than skill: RL-tuned models stay sharp on familiar problems but fall apart on out-of-distribution variants, suggesting they learned to template-match confidently rather than to recognize the edge of their own knowledge (Do fine-tuned language models actually learn optimization procedures?).
The encouraging twist is that the fix may be in the reward signal, not the architecture. One approach uses the model's own answer-span confidence as the training reward, which simultaneously restores calibration and strengthens reasoning, directly reversing the overconfidence that standard RLHF bakes in (Can model confidence work as a reward signal for reasoning?). Another lets a model learn to route between full reasoning and quick responses, recovering the 'when not to' muscle without mode collapse (Can models learn when to think versus respond quickly?). The caution, though, is not to expect reasoning training to fix every honesty problem: better reasoning training does essentially nothing for sycophancy, because caving to pressure is a generation-distribution problem, not a reasoning deficit (Can better reasoning training actually reduce model sycophancy?).
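One way to picture confidence-as-reward is as a ranking step: score each sampled reasoning trace by how confident the model is in its answer span, then turn the ranking into preference pairs. The sketch below is a guess at the general shape, not the published method. The `Trace` type, the geometric-mean aggregation, and the `margin` filter are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str                     # full reasoning trace plus final answer
    answer_logprobs: List[float]  # per-token log-probs over the answer span only


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean token probability of the answer span
    (one possible aggregation, assumed for this sketch)."""
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: List[Trace],
                     margin: float = 0.1) -> List[Tuple[Trace, Trace]]:
    """Build (chosen, rejected) pairs, keeping only pairs whose confidence
    gap is at least `margin` so near-ties do not become training signal."""
    pairs = []
    for a, b in combinations(traces, 2):
        ca, cb = answer_confidence(a), answer_confidence(b)
        if abs(ca - cb) >= margin:
            pairs.append((a, b) if ca > cb else (b, a))
    return pairs


# Hypothetical traces for one prompt, with made-up answer-span log-probs.
TRACES = [
    Trace("trace A", [math.log(0.9), math.log(0.9)]),
    Trace("trace B", [math.log(0.5)]),
    Trace("trace C", [math.log(0.55)]),
]

if __name__ == "__main__":
    for chosen, rejected in preference_pairs(TRACES):
        print(chosen.text, ">", rejected.text)
```

The resulting pairs could feed any preference-based trainer. The design point is that no human labels or external verifier are involved: the only signal is the model's own confidence in the answer it reached.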
So the thing you didn't know you wanted to know: abstention and confident wrongness are two faces of the same coin, and they're set by *what the training rewards*, not by how much the model thinks. Reward only complete answers and you train away the ability to decline — but reward honest confidence and the same machinery can recover it.
Sources (9 notes)
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was still erroneously persuaded 69% more often by fallacious arguments than by logically valid ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.