INQUIRING LINE

Training an AI to reason harder makes it significantly worse at admitting when it just doesn't know.

Does reasoning fine-tuning actually reduce a model's ability to abstain?

This explores whether training a model to reason harder makes it worse at saying 'I don't know' — and the corpus says yes, with a clear mechanism behind it.

The collection points to a direct answer: reasoning fine-tuning does erode a model's willingness to abstain, by roughly 24% in one measurement (Does reasoning fine-tuning make models worse at declining to answer?). The reason isn't mysterious. The training signal rewards producing a complete answer and quietly punishes 'I don't know.' So a model optimized to reason its way to a final answer learns to always produce one, and to sound confident doing it even when it shouldn't.
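To see why a correctness-only reward squeezes out 'I don't know,' it can help to compare expected rewards. The sketch below is illustrative only: the reward values (`r_correct`, `r_wrong`, `r_abstain`) and the function names are assumptions made for the example, not figures or code from the notes.

```python
def expected_reward(p_correct: float, r_correct: float, r_wrong: float) -> float:
    """Expected reward for committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct: float,
                r_correct: float = 1.0,
                r_wrong: float = 0.0,
                r_abstain: float = 0.0) -> str:
    """Return whichever action has the higher expected reward (ties go to 'answer').

    The defaults model a correctness-only signal (assumed values):
    a wrong answer costs nothing, and abstaining earns nothing.
    """
    if expected_reward(p_correct, r_correct, r_wrong) >= r_abstain:
        return "answer"
    return "abstain"


if __name__ == "__main__":
    for p in (0.05, 0.3, 0.8):
        print(p, best_action(p), best_action(p, r_wrong=-1.0))
```

With the default values, answering is never worse than abstaining, even when the model is almost certainly wrong. Once a wrong answer carries a penalty (here an assumed -1), abstaining becomes the better choice whenever the chance of being right falls below one half.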

What makes this interesting is that the abstention loss is one symptom of a larger pattern the corpus keeps surfacing: reasoning fine-tuning often improves the *appearance* of an answer while hollowing out the thinking underneath. Supervised fine-tuning raises benchmark accuracy while cutting the quality of actual reasoning steps by nearly 39%, producing correct answers through after-the-fact rationalization rather than genuine inference (Does supervised fine-tuning improve reasoning or just answers?). A parallel finding shows the reasoning chain becomes decorative: you can truncate it, paraphrase it, or stuff it with filler and the final answer barely moves (Does fine-tuning disconnect reasoning steps from final answers?). A model whose reasoning no longer drives its answer has no honest internal route to 'I'm not sure,' so abstention is exactly the capability you'd expect to vanish first.
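The truncation and filler checks described above can be sketched as an answer-invariance measurement. This is a minimal illustration under stated assumptions, not the protocol used in the cited work: the `Model` signature, the two perturbation functions, and both toy models are hypothetical. A paraphrase check would need a separate paraphraser and is left out.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model maps (question, reasoning steps) to a final answer.
Model = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Keep only the first half of the reasoning chain (at least one step)."""
    return steps[: max(1, len(steps) // 2)]


def filler(steps: List[str]) -> List[str]:
    """Replace every reasoning step with content-free filler."""
    return ["..." for _ in steps]


def invariance_rate(model: Model,
                    examples: Sequence[Tuple[str, List[str]]],
                    perturb: Perturbation) -> float:
    """Fraction of examples whose final answer is unchanged after perturbing the chain.

    A rate near 1.0 suggests the chain is not driving the answer.
    """
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        model(question, steps) == model(question, perturb(steps))
        for question, steps in examples
    )
    return unchanged / len(examples)


# Toy stand-ins for real models (assumptions for the demo only).
def decorative_model(question: str, steps: List[str]) -> str:
    return "42"  # ignores its reasoning entirely


def faithful_model(question: str, steps: List[str]) -> str:
    return steps[-1]  # reads the answer off the last reasoning step


if __name__ == "__main__":
    data = [("q1", ["a", "b", "c", "d"]), ("q2", ["x", "y"])]
    for name, m in (("decorative", decorative_model), ("faithful", faithful_model)):
        print(name, invariance_rate(m, data, truncate), invariance_rate(m, data, filler))
```

Comparing the rate before and after fine-tuning, on the same examples, is what would indicate whether the chain has become less functional.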

There's a deeper framing worth knowing about. Several notes argue that reasoning isn't really *created* by post-training: base models already carry it latently, and training mostly teaches the model *when* to deploy it (Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). If reasoning is fundamentally a routing decision, then abstention is just another branch of that same decision: knowing when *not* to commit. Fine-tuning that only ever rewards committing collapses that branch. This is also why the damage looks like memorization rather than skill: RL-tuned models stay sharp on familiar problems but fall apart on out-of-distribution variants, suggesting they learned to template-match confidently rather than to recognize the edge of their own knowledge (Do fine-tuned language models actually learn optimization procedures?).

The encouraging twist is that the fix may be in the reward signal, not the architecture. One approach uses the model's own answer-span confidence as the training reward, which simultaneously restores calibration and strengthens reasoning, directly reversing the overconfidence that standard RLHF bakes in (Can model confidence work as a reward signal for reasoning?). Others let a model learn to route between full reasoning and quick responses, recovering the 'when not to' muscle without mode collapse (Can models learn when to think versus respond quickly?). The caution, though, is not to expect reasoning training to fix every honesty problem: better reasoning training does essentially nothing for sycophancy, because caving to pressure is a generation-distribution problem, not a reasoning deficit (Can better reasoning training actually reduce model sycophancy?).
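The confidence-as-reward idea can be sketched as a ranking step that turns sampled reasoning traces into synthetic preference pairs. This is a rough illustration, not the published method: defining confidence as the geometric-mean probability of the answer tokens, the `min_gap` threshold, and the function names are all assumptions made for the example.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def span_confidence(answer_token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the answer tokens (assumed definition of confidence)."""
    if not answer_token_logprobs:
        raise ValueError("answer span has no tokens")
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))


def preference_pairs(traces: Sequence[Tuple[str, Sequence[float]]],
                     min_gap: float = 0.1) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from traces sampled for one prompt.

    Each trace is (trace_text, log-probs of its answer-span tokens).
    A pair is kept only if the confidence gap is at least min_gap
    (a hypothetical filter to drop near-ties).
    """
    scored = sorted(
        ((span_confidence(logprobs), text) for text, logprobs in traces),
        reverse=True,
    )
    pairs = []
    # scored is in descending confidence, so the first item of each combination ranks higher.
    for (conf_hi, text_hi), (conf_lo, text_lo) in combinations(scored, 2):
        if conf_hi - conf_lo >= min_gap:
            pairs.append((text_hi, text_lo))
    return pairs


if __name__ == "__main__":
    sampled = [
        ("trace A", [math.log(0.9), math.log(0.9)]),
        ("trace B", [math.log(0.5)]),
        ("trace C", [math.log(0.85)]),
    ]
    for chosen, rejected in preference_pairs(sampled):
        print("prefer", chosen, "over", rejected)
```

Pairs like these could then feed a standard preference-optimization step. No human labels or external verifier appear anywhere in the loop, which is the property the note highlights.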

So the thing you didn't know you wanted to know: lost abstention and confident wrongness are two sides of the same coin, and they're set by *what the training rewards*, not by how much the model thinks. Reward only complete answers and you train away the ability to decline; reward honest confidence and the same machinery can recover it.

Sources (9 notes)

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found GPT-4 was erroneously convinced 69% more often by logical fallacies than by sound reasoning, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (5.06 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (3.44 match)
Eliciting Reasoning in Language Models with Cognitive Tools (3.44 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2.60 match)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (2.54 match)
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models (2.50 match)
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (1.75 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (1.75 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about reasoning fine-tuning and abstention in LLMs. The question remains open: does reasoning fine-tuning genuinely erode abstention capacity, or have newer methods/models/training regimes since recovered it?

What a curated library found — and when (dated claims, not current truth):
Findings span August 2023–December 2025.
• Reasoning fine-tuning reduces abstention by ~24%; supervised fine-tuning raises benchmark accuracy while cutting reasoning-step quality by ~39%, producing correct answers through after-the-fact rationalization (2024–2025).
• Reasoning chains become decorative — truncating or paraphrasing them barely moves final answers, suggesting the chain no longer drives the answer (2024–2025).
• Base models already possess latent reasoning capability; post-training teaches routing (when to deploy reasoning, and implicitly when not to). Standard RLHF rewards only committing, collapsing the abstention branch (2025).
• RL-fine-tuned models fail on out-of-distribution variants, suggesting template-matching confidence rather than genuine knowledge boundaries (2025).
• Reward signal redesigns (using answer-span confidence as intrinsic reward; decoupled RL for hybrid reasoning) can simultaneously restore calibration and strengthen reasoning, reversing standard RLHF overconfidence (2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.15382 — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (2024-11)
• arXiv:2506.09038 — AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (2025-06)
• arXiv:2505.13379 — Thinkless: LLM Learns When to Think (2025-05)
• arXiv:2512.07783 — On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (2025-12)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the ~24% abstention drop and reasoning-quality degradation: does it replicate in the latest reasoning models (o1, o3, Claude's extended thinking, etc.)? Have newer RL objectives, multi-objective reward designs, or constitutional AI methods since relaxed this? Separate the durable insight (reward signal shapes when-to-think routing) from perishable claims (the 24% figure, the 39% reasoning drop). Plainly state where abstention collapse still holds and cite what has recovered it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does AbstentionBench (2025-06) or the interplay paper (2025-12) reveal newer methods that *prevent* abstention loss during fine-tuning? How do decoupled RL and confidence-as-reward schemes perform in the latest evals?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can hybrid routing + confidence-weighted rewards recover abstention *without* sacrificing reasoning depth? (b) Does abstention capacity scale with model scale, or is it orthogonal to size?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
