INQUIRING LINE

Training an AI to reason better secretly trains out its ability to say 'I don't know.'

Does training for better reasoning reduce an AI system's ability to abstain?

This explores whether optimizing a model to reason well comes with a hidden tax on its willingness to say 'I don't know' — and the corpus says yes, with a fairly clear mechanism for why.

The most direct evidence says yes: reasoning fine-tuning degrades abstention capacity, the ability to decline a question the model can't actually answer, by roughly 24 percent (see: Does reasoning fine-tuning make models worse at declining to answer?). Models tuned for reasoning performance answer *more* questions, but with unwarranted confidence, because the training signal rewards producing a complete answer and systematically punishes 'I don't know.' Abstention isn't lost by accident; it's trained away, because the optimization target never valued it.
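The incentive described above can be made concrete with a small expected-reward calculation. This is a minimal sketch, not the training setup of any cited paper: the function names (`expected_answer_reward`, `best_action`, `abstain_threshold`), the reward values, and the tie-breaking rule are all illustrative assumptions. It compares a correctness-only reward, where a wrong answer and 'I don't know' both score zero, with a hypothetical variant that penalizes wrong answers.

```python
def expected_answer_reward(p_correct, r_correct=1.0, r_wrong=0.0):
    """Expected reward for committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct, r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
    """Return the reward-maximizing action; ties go to 'answer' (an assumption)."""
    answer = expected_answer_reward(p_correct, r_correct, r_wrong)
    return "answer" if answer >= r_abstain else "abstain"


def abstain_threshold(r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
    """Probability of being correct below which abstaining has higher expected reward."""
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


if __name__ == "__main__":
    # Columns: p_correct, choice under correctness-only reward,
    # choice under a hypothetical -1 penalty for wrong answers.
    for p in (0.05, 0.3, 0.7):
        print(p, best_action(p), best_action(p, r_wrong=-1.0))
```

Under the correctness-only reward the threshold is 0, so answering is never worse than abstaining, however unlikely the answer is to be right. With the assumed penalty of -1 for a wrong answer, abstaining has the higher expected reward whenever the model's chance of being right falls below 0.5.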

What makes this more than a one-paper finding is that the same reward structure shows up as a recurring failure pattern across the collection. Supervised fine-tuning, for instance, raises benchmark accuracy while cutting the quality of the actual reasoning steps, measured as Information Gain, by nearly 39 percent: models learn to produce correct-looking final answers through post-hoc rationalization rather than genuine inference (see: Does supervised fine-tuning improve reasoning or just answers?). The common thread: when you reward the final answer, you get a model that always produces a final answer, confidently, whether or not it should. Abstention and reasoning honesty are casualties of the same blind spot in how we score success.

There's a cognitive-overreach version of this too. Push a model to 'think harder' and accuracy doesn't keep climbing. It peaks and then declines, because models overthink easy problems and underthink hard ones once you flood them with thinking tokens; in the cited result, going from ~1,100 to ~16K thinking tokens cut benchmark accuracy from 87.3% to 70.3% (see: Does more thinking time always improve reasoning accuracy?). More reasoning effort doesn't translate into better calibration about *when* to stop or stay silent. Effort and good judgment about one's own limits turn out to be different things.

The interesting counterpoint is that this damage isn't inevitable: it's a property of *how* you train, not of reasoning itself. Some methods explicitly teach a model when to engage extended thinking versus answer quickly, routing between modes without collapsing into always-on reasoning (see: Can models learn when to think versus respond quickly?). And RL training can redirect a model's extended thinking away from counterproductive self-doubt into productive gap analysis, suggesting the training signal mediates reasoning *quality*, not just quantity (see: Does extended thinking help or hurt model reasoning?). If a reward can teach a model to second-guess itself usefully, it can in principle teach it to abstain; the 24 percent drop reflects a reward that simply never asked for that.

The takeaway you might not have expected: abstention is a casualty of optimization targets, not of intelligence. A model that reasons more isn't a model that knows its limits better — those are separate capabilities, and current reasoning training buys the first while quietly selling off the second.

Sources 5 notes

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about reasoning-training trade-offs in LLMs. The question remains open: does training for better reasoning erode abstention capacity?

What a curated library found — and when (spanning Nov 2024–Sep 2025; claims are dated, not current truth):
• Reasoning fine-tuning degrades abstention by ~24% while raising answer-production confidence, because optimization rewards complete answers and punishes 'I don't know' (2024–2025).
• Supervised fine-tuning raises benchmark accuracy but cuts reasoning-step quality by ~39%, as models learn post-hoc rationalization instead of genuine inference; the same reward structure collapses both reasoning honesty and abstention (2024–2025).
• Extended thinking peaks then declines in accuracy; models overthink easy problems and underthink hard ones, failing to calibrate *when* to stop or stay silent (2025).
• Hybrid routing methods (decoupled RL, mode switching) and RL-redirected thinking suggest the damage is optimization-target-dependent, not inherent to reasoning — abstention is trainable as a distinct capability (2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.09038 (AbstentionBench, Jun 2025) — direct measurement of abstention failure on unanswerable questions
• arXiv:2506.04210 (Test-Time Scaling, Jun 2025) — reasoning accuracy degradation beyond token thresholds
• arXiv:2505.13379 (Thinkless, May 2025) — learned routing between extended and quick thinking
• arXiv:2510.01265 (RLP, Sep 2025) — RL as pretraining objective, potential signal reframing

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 24% abstention drop, the 39% reasoning-step collapse, and the accuracy-degradation curve: has newer tooling (e.g., SDK-level abstention penalties, steering techniques, multi-agent orchestration), training methods (RL variants, DPO, process supervision), or post-hoc evaluation since relaxed or overturned these? Separate the durable finding (reward structure mediates abstention) from the perishable limitation (current benchmarks don't measure it). Cite what moved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any paper argue reasoning and abstention are *not* coupled at the optimization level, or show a method that trains both simultaneously without trade-off?
(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., "Can process-level supervision (token-level rewards for 'I don't know') simultaneously improve reasoning *and* calibration?" or "Do multi-agent setups (e.g., verify→abstain agents) decouple the single-model trade-off?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
