INQUIRING LINE

Training AI to reason better quietly teaches it to stop saying 'I don't know' — because rewards only count finished answers.

Does reasoning fine-tuning actually harm a model's ability to abstain?

This explores whether training a model to reason better makes it worse at saying 'I don't know' — and the corpus suggests the answer is yes, because abstention is collateral damage from how reasoning rewards are structured.

This reads the question as asking about a specific, measurable side effect: does optimizing a model to reason its way to answers quietly erode its willingness to decline? The corpus has a direct hit here. Reasoning fine-tuning degrades abstention capacity by roughly 24%, because the training signal rewards producing a complete answer and systematically punishes 'I don't know' (see: Does reasoning fine-tuning make models worse at declining to answer?). Models come out answering more questions while expressing unwarranted confidence. So the harm is real, but the more interesting finding is *why*: abstention isn't being attacked directly; it's being starved by a reward that only counts finished answers.
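The incentive described above can be made concrete with a small expected-reward calculation. This is a minimal sketch, not the reward used in any of the cited work: the reward values (1 for correct, 0 or -1 for wrong, 0 for abstaining) and the function names are assumptions chosen for illustration.

```python
def expected_answer_reward(p_correct, r_correct, r_wrong):
    """Expected reward of committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct, r_correct, r_wrong, r_abstain):
    """Return whichever of 'answer' or 'abstain' has the higher expected reward."""
    answer_value = expected_answer_reward(p_correct, r_correct, r_wrong)
    return "answer" if answer_value > r_abstain else "abstain"


# Hypothetical reward schemes (illustrative values only):
#   completion-only: correct = 1, wrong = 0,  abstain = 0
#   penalize-wrong:  correct = 1, wrong = -1, abstain = 0
for p in (0.1, 0.4, 0.6):
    completion_only = best_action(p, r_correct=1.0, r_wrong=0.0, r_abstain=0.0)
    penalize_wrong = best_action(p, r_correct=1.0, r_wrong=-1.0, r_abstain=0.0)
    print(f"p_correct={p}: completion-only -> {completion_only}, penalize-wrong -> {penalize_wrong}")
```

Under the completion-only scheme, answering beats abstaining for any nonzero chance of being right, so a policy trained on it has no reason to decline. Once wrong answers cost something, abstaining becomes the better choice below a confidence threshold (0.5 with these particular values).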

What makes this more than an isolated result is that the same training dynamic shows up across the collection as a family of 'looks better, reasons worse' failures. Supervised fine-tuning raises benchmark accuracy while cutting the actual inferential quality of reasoning steps by nearly 39%, with correct answers arriving through post-hoc rationalization rather than genuine inference (see: Does supervised fine-tuning improve reasoning or just answers?). Separately, fine-tuning loosens the causal link between a model's reasoning and its final answer: you can truncate, paraphrase, or insert filler into the chain and the answer barely changes, meaning the reasoning has become performative (see: Does fine-tuning disconnect reasoning steps from final answers?). Abstention failure fits this pattern exactly: a model that has learned to always produce a confident-looking output is, almost by definition, one that has lost the off-ramp of declining.

The common thread is calibration. When a model is rewarded for completion, its confidence detaches from its actual correctness, and once that detachment happens, abstaining (which requires knowing you don't know) stops being reliable. The corpus offers a counterpoint that supports this diagnosis by reversing it: using the model's own answer-span confidence as the reward signal restores calibration *while* improving reasoning, undoing the calibration damage that standard RLHF introduces (see: Can model confidence work as a reward signal for reasoning?). That's the tell: if changing the reward from 'complete the answer' to 'be well-calibrated' fixes it, then the abstention harm was a reward-design artifact, not an inherent cost of reasoning.
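One common way to quantify the gap between confidence and correctness is expected calibration error (ECE). The sketch below is a standard equal-width-bin ECE, offered as an illustration of what 'confidence detaches from correctness' means numerically; the notes do not say which calibration metric the underlying papers used, and the function name and sample values are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy across confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("confidences and correct must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Illustrative data: a model that is 90% confident but right only 1 time in 4,
# versus one that is 50% confident and right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, calibrated)
```

A model that answers everything with high confidence but is often wrong scores a large ECE; a model whose stated confidence tracks its hit rate scores near zero, which is the precondition for abstaining sensibly.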

There's a deeper reframe worth knowing about. Several notes argue that fine-tuning doesn't create reasoning at all: base models already hold latent reasoning ability, and post-training mostly selects *when* to deploy it rather than installing *how* (see: Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). If post-training is largely about deployment timing, then 'when to abstain' is exactly the kind of routing decision it should be able to learn. Indeed, decoupled-RL approaches train models to route between thinking hard and answering quickly without mode collapse (see: Can models learn when to think versus respond quickly?). The pessimistic readings reinforce why naive fine-tuning fails: RL often sharpens memorization and template-matching rather than installing real procedures (see: Do fine-tuned language models actually learn optimization procedures?), and better reasoning training doesn't even buy resistance to sycophantic pressure, because that's a generation-distribution problem, not a reasoning one (see: Can better reasoning training actually reduce model sycophancy?). The takeaway you didn't know you wanted: abstention, faithfulness, and calibration all degrade together under completion-rewarding fine-tuning. They're three faces of the same broken incentive, and fixing the reward, not the reasoning, is what restores them.

Sources (9 notes)

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
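The perturbation tests named above can be sketched as an answer-invariance measurement. This is a simplified illustration under stated assumptions: `answer_fn` is a hypothetical callable standing in for 'run the model on a question plus a (possibly perturbed) reasoning chain', the two toy models are invented for the demo, and paraphrasing is left out because it needs an external paraphraser (it could be passed in as another `perturb` callable).

```python
from typing import Callable, List, Sequence, Tuple

Chain = List[str]                       # a reasoning chain as a list of steps
AnswerFn = Callable[[str, Chain], str]  # hypothetical: (question, chain) -> final answer


def truncate(chain: Chain, keep_fraction: float = 0.5) -> Chain:
    """Early termination: keep only the first part of the chain."""
    return chain[: int(len(chain) * keep_fraction)]


def filler(chain: Chain, token: str = "...") -> Chain:
    """Filler substitution: replace every step with a contentless token."""
    return [token] * len(chain)


def invariance_rate(
    examples: Sequence[Tuple[str, Chain]],
    answer_fn: AnswerFn,
    perturb: Callable[[Chain], Chain],
) -> float:
    """Fraction of examples whose final answer is unchanged after perturbing the chain."""
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        1
        for question, chain in examples
        if answer_fn(question, perturb(chain)) == answer_fn(question, chain)
    )
    return unchanged / len(examples)


# Two toy stand-ins for a model (assumptions for the demo, not real models):
def chain_dependent(question: str, chain: Chain) -> str:
    return chain[-1] if chain else "unknown"  # the answer is read off the last step


def chain_ignoring(question: str, chain: Chain) -> str:
    return "4"  # the answer never depends on the chain


examples = [("What is 2 + 2?", ["2 + 2 means adding two and two", "so the answer is 4"])]
for name, model in (("chain-dependent", chain_dependent), ("chain-ignoring", chain_ignoring)):
    print(name, invariance_rate(examples, model, truncate), invariance_rate(examples, model, filler))
```

A higher invariance rate means the chain matters less to the output. The note's claim is that this rate goes up after fine-tuning.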

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
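Ranking traces by answer-span confidence to build preference pairs might look roughly like the sketch below. It is not the RLSF implementation: the confidence definition (geometric-mean token probability over the answer span), the `text` and `answer_logprobs` field names, and the all-pairs construction are assumptions made for illustration.

```python
import math
from itertools import combinations


def span_confidence(answer_logprobs):
    """Geometric-mean token probability over the answer span (assumed confidence measure)."""
    return math.exp(sum(answer_logprobs) / len(answer_logprobs))


def preference_pairs(traces):
    """Rank traces by answer-span confidence and emit (chosen, rejected) text pairs."""
    scored = [(span_confidence(t["answer_logprobs"]), t["text"]) for t in traces]
    scored.sort(key=lambda item: item[0], reverse=True)
    # combinations() preserves the sorted order, so the first item of each pair ranks higher.
    return [
        (hi_text, lo_text)
        for (hi_conf, hi_text), (lo_conf, lo_text) in combinations(scored, 2)
        if hi_conf > lo_conf  # skip ties: they carry no preference signal
    ]


# Hypothetical sampled traces for one prompt, with log-probs of the answer tokens only.
traces = [
    {"text": "trace A", "answer_logprobs": [-0.1, -0.1]},
    {"text": "trace B", "answer_logprobs": [-2.0, -1.0]},
    {"text": "trace C", "answer_logprobs": [-0.5]},
]
for chosen, rejected in preference_pairs(traces):
    print(f"prefer {chosen} over {rejected}")
```

The resulting pairs could feed a standard preference-optimization step. No human label or external verifier is involved, which matches the property the note highlights.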

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found GPT-4 was erroneously persuaded 69% more often when the opposing argument relied on logical fallacies than when it did not, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (5.06 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (3.44 match)
Eliciting Reasoning in Language Models with Cognitive Tools (3.44 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2.60 match)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (2.54 match)
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models (2.50 match)
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (1.75 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (1.75 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether reasoning fine-tuning harms abstention capacity in LLMs. The question remains open; treat the findings below as dated claims to be re-tested against current models and methods.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2024 but sharpen dramatically in 2025:
• Reasoning fine-tuning degrades abstention by ~24%; models answer more questions with unwarranted confidence, because training rewards completion and punishes 'I don't know' (2024–2025).
• Supervised fine-tuning raises benchmark accuracy while cutting inferential quality of reasoning by ~39%; correct answers arrive via post-hoc rationalization, not genuine inference (2024–2025).
• Fine-tuning weakens the causal link between reasoning steps and final answers; truncating, paraphrasing, or inserting filler into the chain barely changes the output, meaning reasoning becomes performative (2025).
• Using answer-span confidence as the reward signal restores calibration *while* improving reasoning, undoing RLHF damage (2025).
• Base models already possess latent reasoning; post-training selects *when* to deploy it rather than installing *how*; decoupled-RL approaches learn routing between thinking hard and answering quickly without mode collapse (2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.15382 — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (Nov 2024)
• arXiv:2506.09038 — AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (Jun 2025)
• arXiv:2505.13379 — Thinkless: LLM Learns When to Think (May 2025)
• arXiv:2512.07783 — On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (Dec 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 24% abstention drop and 39% reasoning-quality fall: have newer scaling methods, multi-turn RL orchestration, or mixed-reward training (e.g., confidence + correctness + refusal jointly) since overturned these? Check whether models trained via decoupled RL or self-feedback (arXiv:2507.21931, 2512.07783) still exhibit the same calibration collapse. Judge whether the constraint *that completion-rewarding fine-tuning breaks abstention* still holds or whether reward-design innovations have relaxed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming abstention is *not* harmed by reasoning fine-tuning, or that it's a data-quality or prompt-engineering problem, not a training-signal one.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can joint calibration+reasoning training preserve abstention while scaling reasoning depth? (b) Does routing-based fine-tuning (learning when to abstain as a learned routing decision) fundamentally differ from standard RL in this regard?

Cite arXiv IDs; flag anything you cannot ground in a real paper.