INQUIRING LINE


Training AI to reason harder makes it 24% worse at the crucial skill of admitting it doesn't know.

Do models trained for reasoning lose their ability to decline questions?

This explores whether training a model to be a better reasoner makes it worse at the opposite skill — saying 'I can't answer that,' refusing an ill-posed question, or admitting it doesn't know.

This reads the question as follows: when we optimize models to reason harder, do they lose the discipline to decline, whether that means abstaining, rejecting unanswerable questions, or admitting uncertainty? The corpus answers yes, and fairly directly. Reasoning fine-tuning degrades a model's abstention capacity by roughly 24%: the model answers more often and with more unwarranted confidence, because the training signal rewards producing a complete answer and quietly punishes 'I don't know' (see "Does reasoning fine-tuning make models worse at declining to answer?"). The same pathology shows up when the question itself is broken. Given problems with missing premises, reasoning models churn out long, redundant chains of thought trying to solve the unsolvable, while plainer non-reasoning models correctly flag them as unanswerable (see "Why do reasoning models overthink ill-posed questions?"). Declining is a skill, and reasoning training doesn't teach it; it teaches the reflex to keep going.
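To make a figure like "roughly 24%" concrete, here is a minimal sketch of one common way to score abstention: abstention recall, the fraction of should-abstain questions (unanswerable, ill-posed, or missing a premise) on which the model actually declines. The `Outcome` type, the function names, and the counts are hypothetical illustrations, not data from the cited benchmark. This page also does not say whether its 24% is a relative or an absolute drop, so the sketch computes both.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One model response to one benchmark question (hypothetical schema)."""
    should_abstain: bool  # ground truth: question is unanswerable or ill-posed
    abstained: bool       # did the model decline, flag it, or say "I don't know"?


def abstention_recall(outcomes: list[Outcome]) -> float:
    """Fraction of should-abstain questions on which the model declined."""
    targets = [o for o in outcomes if o.should_abstain]
    if not targets:
        raise ValueError("no should-abstain questions in the sample")
    return sum(o.abstained for o in targets) / len(targets)


def relative_drop(before: float, after: float) -> float:
    """Relative degradation: 0.24 means 24% worse than the baseline."""
    return (before - after) / before


# Hypothetical counts: 100 unanswerable questions shown to each model.
base = [Outcome(True, i < 50) for i in range(100)]   # declines 50 of 100
tuned = [Outcome(True, i < 38) for i in range(100)]  # declines 38 of 100

r_base = abstention_recall(base)
r_tuned = abstention_recall(tuned)
print(f"absolute drop: {r_base - r_tuned:.2f}")                # 0.12
print(f"relative drop: {relative_drop(r_base, r_tuned):.0%}")  # 24%
```

With these made-up counts, a 12-point absolute fall in recall is a 24% relative drop, which is why it is worth checking which convention a reported number uses.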

The interesting part is *why* this happens, and the corpus frames it as a reward-shaping problem rather than a capability ceiling. Training optimizes for the final answer being right, which means models learn to manufacture plausible-looking reasoning toward an answer even when no honest answer exists. Supervised fine-tuning, for example, can raise benchmark accuracy while actually degrading the quality of the inferential steps, producing correct-looking answers through post-hoc rationalization (see "Does supervised fine-tuning improve reasoning or just answers?"). The same 'always produce output' pressure spills into adjacent behaviors. Scaling reasoning capability erodes instruction-following, because longer chains of thought create contextual distance that dilutes attention to the original constraints (see "Why do better reasoning models ignore instructions?"), and standard RLHF trains models to respond passively and helpfully rather than push back or ask a clarifying question (see "Why do language models respond passively instead of asking clarifying questions?"). Declining, refusing, and clarifying are all casualties of the same incentive.
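The reward-shaping argument can be checked with a little arithmetic. The sketch below compares two hypothetical per-question reward schemes, neither taken from the cited papers: a correctness-only reward, where a wrong answer and an abstention both score 0, and a penalized reward, where a wrong answer costs `penalty`. If the model is right with probability p, answering earns p(1 + penalty) - penalty in expectation while abstaining earns 0. Under the correctness-only scheme (penalty = 0) a guess therefore never loses to "I don't know"; with a penalty, abstaining wins whenever p < penalty / (1 + penalty).

```python
ABSTAIN_REWARD = 0.0  # both hypothetical schemes give "I don't know" zero reward


def expected_answer_reward(p_correct: float, penalty: float) -> float:
    """Expected reward for answering: +1 if right, -penalty if wrong."""
    return p_correct * 1.0 + (1.0 - p_correct) * (-penalty)


def best_action(p_correct: float, penalty: float) -> str:
    """Reward-maximizing choice for a model that is right with prob. p_correct."""
    if expected_answer_reward(p_correct, penalty) > ABSTAIN_REWARD:
        return "answer"
    return "abstain"


def abstain_threshold(penalty: float) -> float:
    """Confidence below which abstaining is optimal: penalty / (1 + penalty)."""
    return penalty / (1.0 + penalty)


# Columns: confidence, choice under correctness-only, choice with penalty = 1.
for p in (0.05, 0.30, 0.60):
    print(p, best_action(p, penalty=0.0), best_action(p, penalty=1.0))
# 0.05 answer abstain
# 0.3 answer abstain
# 0.6 answer answer
```

Raising `penalty` to 3 moves the threshold to 0.75, so the reward-maximizing policy declines unless it is at least 75% sure; at penalty = 0 the threshold is 0 and declining is never strictly better than a guess.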

A useful cross-cut: better reasoning does not cure these social failure modes either. Sycophancy, meaning caving to pressure and agreeing with the user, shows no meaningful improvement in reasoning-optimized models, because it is a generation-distribution problem rather than something more inference can fix (see "Can better reasoning training actually reduce model sycophancy?"). And what looks like careful reasoning is sometimes just a conservative default in disguise: twelve of fourteen models tested actually perform *worse* when constraints are removed, revealing that they were leaning on a cautious heuristic rather than evaluating the constraints at all (see "Are models actually reasoning about constraints or just defaulting conservatively?"). So 'declining' and 'reasoning' aren't cleanly opposed levers; the appearance of one can be the residue of the other.

The hopeful counterweight is that the ability to decline is learnable, just undertrained. Reinforcement learning lifted proactive critical thinking (spotting that a problem is flawed and asking for clarification) from 0.15% to about 74% accuracy on deliberately flawed math problems. Notably, inference-time scaling *hurt* this ability in untrained models but *helped* it after RL, suggesting the capability is real but fragile without an explicit training signal for it (see "Can models learn to ask clarifying questions instead of guessing?"). Small models trained with uncertainty-aware objectives can abstain well enough to match models ten times their size on conversation forecasting (see "Can models learn to abstain when uncertain about predictions?"), and a single model can even be trained to route between thinking hard and answering briefly without collapsing into one mode (see "Can models learn when to think versus respond quickly?").
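One generic way to turn a model's confidence into abstention is selective prediction: answer only when confidence clears a threshold, then report coverage (how often the model answers) and selective accuracy (how often it is right when it does). This is a standard technique shown for illustration, not the specific uncertainty-aware objective used in the forecasting work; the `Prediction` type, the thresholds, and the sample values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Prediction:
    confidence: float  # model's estimated probability of being correct
    correct: bool      # whether its answer would actually have been right


def selective_report(preds: list[Prediction], threshold: float) -> tuple[float, float]:
    """Answer only when confidence >= threshold; return (coverage, selective accuracy)."""
    answered = [p for p in preds if p.confidence >= threshold]
    coverage = len(answered) / len(preds)
    if not answered:
        return coverage, math.nan  # model abstained on everything: accuracy undefined
    accuracy = sum(p.correct for p in answered) / len(answered)
    return coverage, accuracy


# Hypothetical sample where higher confidence does track correctness.
preds = [
    Prediction(0.95, True), Prediction(0.90, True), Prediction(0.80, True),
    Prediction(0.60, False), Prediction(0.40, False), Prediction(0.20, False),
]
for t in (0.0, 0.5, 0.85):
    cov, acc = selective_report(preds, t)
    print(f"threshold={t:.2f} coverage={cov:.2f} accuracy={acc:.2f}")
# threshold=0.00 coverage=1.00 accuracy=0.50
# threshold=0.50 coverage=0.67 accuracy=0.75
# threshold=0.85 coverage=0.33 accuracy=1.00
```

Raising the threshold trades coverage for accuracy. The trade only pays off if `confidence` is trustworthy, which is presumably what an uncertainty-aware training objective is meant to improve.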

The thing you didn't know you wanted to know: declining isn't the *absence* of reasoning; it's a distinct competence that has to be rewarded on its own terms. Since base models already carry latent reasoning that post-training merely selects and surfaces (see "Do base models already contain hidden reasoning ability?"), the loss of abstention isn't reasoning crowding out refusal. It's that our reward signals select for one capability and silently deselect the other. Build the right objective and a model can both think and know when to stop.

Sources (11 notes)

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.


Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful resistance advantage to sycophantic pressure compared to base models. The LOGICOM benchmark found that GPT-4 was erroneously convinced 69% more often by logically fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about reasoning-capability trade-offs in LLMs. The question: when models are trained to reason harder, do they lose the ability to decline unanswerable questions, admit uncertainty, or refuse problematic requests?

What a curated library found — and when (findings span 2023–2026; dated claims, not current truth):

• Reasoning fine-tuning degrades abstention capacity by ~24%; models answer more often with unwarranted confidence (~2025).
• Reasoning-trained models overthink broken problems (missing premises) instead of flagging them unanswerable, while non-reasoning baselines correctly refuse (~2025).
• Supervised fine-tuning raises benchmark accuracy while degrading inferential quality via post-hoc rationalization (~2025).
• Longer chains of thought erode instruction-following fidelity by creating contextual distance from constraints (~2025).
• Proactive critical thinking (spotting flawed premises, requesting clarification) reached ~74% with RL but stays near-zero in base/SFT (~2025).
• Hybrid decoupled RL can train models to toggle between extended thinking and brief answers without mode collapse (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.09038 (AbstentionBench, 2025-06): reasoning LLMs fail on unanswerable questions.
• arXiv:2505.14810 (Scaling Reasoning, Losing Control, 2025-05): instruction-following deficit with extended reasoning.
• arXiv:2507.23407 (Proactive Questioning, 2025-07): fostering human–AI collaboration via critical questioning.
• arXiv:2505.13379 (Thinkless, 2025-05): learning when to think vs. respond briefly.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the ~24% abstention degradation, the overthinking-on-broken-problems finding, and the RL-proof that critical thinking is learnable: do newer models (o1, Claude-3.5+, Llama-3.3) show the same deficits, or have post-training techniques (GRPO, DPO variants, uncertainty-aware objectives) since corrected them? Separate the durable problem (reward misalignment between reasoning and refusal) from any perishable limitation (technique-solvable with 2026–2027 methods). Cite what resolved it or where it persists.

(2) **Surface CONTRADICTING or SUPERSEDING work** from the last ~6 months: any papers showing reasoning training *preserves* or *enhances* abstention, or arguing the trade-off is an artifact of weak RL/SFT rather than fundamental?

(3) **Propose two research questions** that assume the regime may have shifted:
   - If proactive critical thinking is now routine post-RL, how does it interact with multi-turn instruction drift and sycophancy?
   - Can uncertainty-aware objectives decouple reasoning depth from refusal confidence (i.e., reason long but abstain confidently)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
