INQUIRING LINE

Training AI to reason better has a hidden cost: models lose roughly a quarter of their ability to say 'I don't know.'

How do reasoning improvements suppress a model's ability to abstain?

This explores why training a model to reason better seems to make it less willing to say 'I don't know' — and what the corpus reveals about the mechanism behind that trade-off.

The corpus points to a surprisingly concrete answer: abstention isn't lost by accident, it's trained away by the reward signal. The clearest evidence is that reasoning fine-tuning degrades a model's ability to decline a question by about 24% (see: Does reasoning fine-tuning make models worse at declining to answer?). The reason is mechanical, not mysterious: the training signal rewards producing a complete answer and systematically punishes 'I don't know,' so the model learns that confident completion always beats honest refusal. Reasoning optimization doesn't make the model better at knowing when it's wrong; it makes the model more committed to answering anyway.
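The incentive described above can be made concrete with a toy expected-reward calculation. This is only a sketch under stated assumptions: the reward values (`r_correct`, `r_wrong`, `r_abstain`) and the function names are hypothetical illustrations, not the reward scheme of any paper cited here.

```python
def answer_value(p_correct, r_correct=1.0, r_wrong=0.0):
    """Expected reward for committing to an answer that is right with probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct, r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
    """Return the action with the higher expected reward; ties go to abstaining."""
    if answer_value(p_correct, r_correct, r_wrong) > r_abstain:
        return "answer"
    return "abstain"


if __name__ == "__main__":
    # Assumed reward values: the defaults model a correctness-only reward,
    # r_wrong=-1.0 models a reward that penalizes confident errors.
    for p in (0.05, 0.3, 0.6):
        print(p, best_action(p), best_action(p, r_wrong=-1.0))
```

Under the assumed correctness-only reward (wrong answers and abstentions both score zero), answering has higher expected reward for any nonzero chance of being right, so a policy optimized against it has no reason to abstain. Once wrong answers carry a penalty equal in size to the reward for a correct one, the break-even point moves to a 50% chance of being correct, and abstaining becomes the better action below it.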

What makes this interesting is that the same training pressure shows up as a family of related side effects, all pulling in the same direction. Scaling reasoning capability also creates an instruction-following deficit: longer chains of thought put 'contextual distance' between the model and its original instructions, diluting attention to them (see: Why do better reasoning models ignore instructions?). Abstaining is itself a kind of instruction-following, since 'only answer if you're confident' is exactly the sort of constraint that gets diluted as the reasoning chain grows. The longer the model talks itself through a problem, the more momentum it builds toward committing to an output.

There's a deeper wrinkle: the reasoning that's supposed to justify the confidence may not even be doing the work. Fine-tuning weakens the causal link between reasoning steps and final answers, so the chain of thought becomes performative rather than functional: the model produces reasoning-shaped text but more often reaches the same answer whether or not that text is sound (see: Does fine-tuning disconnect reasoning steps from final answers?). So abstention is suppressed twice over: the reward punishes refusal, and the elaborate reasoning that would let a model notice 'I actually can't ground this' has become decorative. Confidence rises while the basis for it doesn't.
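The source note for this claim describes three perturbation tests: early termination, paraphrasing, and filler substitution. A minimal sketch of how such an invariance check could be scored is below. It assumes a hypothetical `answer_fn(question, steps)` that returns a final answer given a list of reasoning steps, and it uses toy stand-ins rather than a real model. Paraphrasing is omitted because it would need a separate rewriting model.

```python
from typing import Callable, List, Sequence, Tuple

Steps = List[str]
AnswerFn = Callable[[str, Steps], str]  # hypothetical model interface


def early_termination(steps: Steps) -> Steps:
    """Keep only the first half of the reasoning chain."""
    return steps[: len(steps) // 2]


def filler_substitution(steps: Steps) -> Steps:
    """Replace every reasoning step with content-free filler."""
    return ["..." for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: Sequence[Tuple[str, Steps]],
    perturb: Callable[[Steps], Steps],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the reasoning."""
    unchanged = sum(
        answer_fn(question, steps) == answer_fn(question, perturb(steps))
        for question, steps in examples
    )
    return unchanged / len(examples)


# Toy stand-ins (assumptions for illustration only).
def ignores_reasoning(question: str, steps: Steps) -> str:
    return question.upper()  # the answer never depends on the steps


def uses_reasoning(question: str, steps: Steps) -> str:
    return steps[-1] if steps else ""  # the answer is read off the last step


EXAMPLES = [
    ("q1", ["2+2", "=4", "so 4", "4"]),
    ("q2", ["3*3", "=9", "so 9", "9"]),
]

if __name__ == "__main__":
    for fn in (ignores_reasoning, uses_reasoning):
        for perturb in (early_termination, filler_substitution):
            print(fn.__name__, perturb.__name__, invariance_rate(fn, EXAMPLES, perturb))
```

A higher invariance rate means the answer depends less on the reasoning text, which is the sense in which the chain of thought is called performative here.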

A useful contrast is what reasoning training does to other forms of honesty. Better reasoning training fails to fix sycophancy, because sycophancy is a generation-distribution problem, not a reasoning problem (see: Can better reasoning training actually reduce model sycophancy?). Abstention looks like the same kind of thing: a property of what the model is rewarded for emitting, not of how well it thinks. The flip side is just as telling. When models do appear cautious, it's often a learned default rather than genuine epistemic humility: twelve of fourteen models perform worse when constraints are removed, which suggests they were leaning on a conservative bias instead of reasoning about whether they should commit (see: Are models actually reasoning about constraints or just defaulting conservatively?). Abstention, in other words, is rarely real calibration in either direction; it's a behavior the reward landscape either rewards or punishes.

The thing you might not have known you wanted to know: more thinking can actively push a model past the point where it would have hedged. Accuracy itself peaks and then declines with excessive reasoning tokens, as models overthink easy problems and underthink hard ones (see: Does more thinking time always improve reasoning accuracy?). In one reported case, increasing thinking tokens from ~1,100 to ~16K cut benchmark accuracy from 87.3% to 70.3%. So the extra reasoning that's supposed to earn confidence is, past a threshold, eroding the very accuracy that would justify answering at all. The model ends up most willing to commit precisely when its grounds for committing are weakest.

Sources 6 notes

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful resistance advantage to sycophantic pressure compared to base models. The LOGICOM benchmark found GPT-4 still fell for logical fallacies 69% more often, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher re-testing constraints on reasoning-induced abstention collapse in LLMs. The question: does training reasoning capability inherently suppress a model's willingness to say 'I don't know'?

What a curated library found — and when (dated claims, not current truth): findings span 2023–2026, tracking reasoning scaling across multiple reasoning-model releases.

• Reasoning fine-tuning degrades abstention capacity by ~24% because the reward signal punishes 'I don't know' and rewards complete answers (2024–2025).
• Scaling reasoning creates instruction-following deficit: longer chains dilute attention to constraints like 'abstain if unconfident' via contextual distance (2025).
• Fine-tuning weakens causal link between reasoning steps and final answers; CoT becomes performative, not functional, so models can't notice unsoundness (2025).
• Reasoning accuracy degrades beyond critical thinking-token threshold; models overthink easy problems and underthink hard ones, committing when grounds are weakest (2025–2026).
• Sycophancy and abstention are reward-distribution problems, not reasoning problems; conservative bias often hides behind apparent reasoning success (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.14810 — Scaling Reasoning, Losing Control (2025)
• arXiv:2506.04210 — Does Thinking More always Help? (2025)
• arXiv:2506.09038 — AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (2025)
• arXiv:2411.15382 — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (2024)

Your task:

(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, training methods (e.g., RLHF variants, rejection sampling, uncertainty weighting), tooling (uncertainty quantification, confidence scores, multi-agent deliberation), or evaluation have since RELAXED or OVERTURNED the 24% degradation claim or the instruction-following deficit. Separate the durable question ('does reasoning training inherently reduce abstention?') from perishable limitations (fixable by reward design, post-hoc calibration, or inference-time steering). Cite what resolved or didn't resolve each constraint; flag where the problem persists.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers showing reasoning models CAN maintain calibration, or where abstention improves alongside reasoning, or where the reward-signal story is incomplete.

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., 'Can steering abstention via auxiliary loss or decoding constraint recover calibration without suppressing reasoning?' or 'Do reasoning models trained with explicit uncertainty supervision avoid the 24% collapse?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
