INQUIRING LINE

What training signals would teach models when not to reason?

This explores what kinds of training rewards or objectives would teach a model the skill of disengaging — knowing when a question is unanswerable, when to stay silent, when to abstain, or when to answer directly instead of grinding through reasoning steps.


The question is the inverse of reasoning: what training signals teach a model when *not* to spin up a chain of thought, when to say "I don't know," when to stay quiet, or when to flag that a question can't be answered at all. The corpus has a surprisingly coherent answer spread across several notes: models don't know when to stop because almost every standard training signal rewards them for producing more. The fix isn't better reasoning; it's a different reward shape entirely.

Start with the diagnosis, because it's sharper than you'd expect. Reasoning fine-tuning actively *degrades* a model's ability to decline: abstention capacity drops about 24% because the training signal rewards complete answers and punishes "I don't know" (see "Does reasoning fine-tuning make models worse at declining to answer?"). The same pressure shows up when a question is broken: reasoning models generate long, redundant traces for questions with missing premises, while non-reasoning models correctly flag them as unanswerable, because the objective optimizes for producing steps and never teaches disengagement (see "Why do reasoning models overthink ill-posed questions?"). And scaling reasoning quietly costs instruction-following: longer chains create "contextual distance" that dilutes attention to the original ask (see "Why do better reasoning models ignore instructions?"). So the question "what teaches a model not to reason" is partly "what undoes the reflex that reasoning training installs."
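The reward-shape argument can be made concrete with a little expected-value arithmetic. The sketch below is illustrative only: the function names, the +1 reward for a correct answer, the zero reward for abstaining, and the `wrong_penalty` parameter are all assumptions chosen for the example, not values taken from the notes. With a correctness-only reward (penalty of zero), answering never scores below abstaining, so nothing pushes a policy toward "I don't know." Once wrong answers cost something, abstaining wins below a confidence threshold.

```python
def expected_reward_answer(p_correct: float, wrong_penalty: float) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_abstain(p_correct: float, wrong_penalty: float,
                   abstain_reward: float = 0.0) -> bool:
    """Abstain only when abstaining has strictly higher expected reward."""
    return abstain_reward > expected_reward_answer(p_correct, wrong_penalty)


def abstention_threshold(wrong_penalty: float) -> float:
    """Confidence below which abstaining (reward 0) beats answering.

    Solves p - (1 - p) * penalty = 0 for p, giving penalty / (1 + penalty).
    """
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Hypothetical confidence levels, compared under two reward shapes.
    for p in (0.05, 0.3, 0.6):
        print(p, should_abstain(p, wrong_penalty=0.0),
              should_abstain(p, wrong_penalty=1.0))
```

In this toy setup the penalty is the only knob that creates an incentive to decline. How large it should be, and how a model's confidence would be estimated, are separate questions the sketch does not address.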

The most direct answers in the corpus reframe restraint as an explicit decision the model must learn to make, rather than a behavior we hope emerges. Thinkless trains a single model to *route* between extended reasoning and a direct response, using a decoupled RL scheme (DeGRPO) that separates the mode-selection signal from the answer-quality signal; without that decoupling, the model collapses into one mode (see "Can models learn when to think versus respond quickly?"). DiscussLLM does the conversational analog: it makes "stay silent" a first-class classification outcome alongside its five intervention types, training timing as an objective in its own right rather than a side effect (see "Can models learn when NOT to speak in conversations?"). The shared lesson: restraint has to be its own reward target, decoupled from the reward for being helpful or correct, or the helpfulness signal swamps it.
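One way to picture what "decoupling" buys is to compare two ways of weighting a single mode-selection token inside a policy-gradient objective. The sketch below is not the Thinkless implementation. It assumes each rollout begins with one mode token followed by response tokens, it uses a hypothetical `Rollout` type and `alpha` weight, and it leaves out the clipping and KL terms a real RL objective would carry. In the coupled version the mode token shares one length-normalized average with every response token, so its weight shrinks as traces get longer. In the decoupled version it gets its own term.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    mode_logprob: float             # log-prob of the mode token ("think" or "direct")
    response_logprobs: List[float]  # log-probs of the tokens that follow
    reward: float                   # scalar reward for the whole rollout


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def coupled_objective(group: List[Rollout]) -> float:
    """Mode token is averaged together with the response tokens."""
    adv = group_advantages([r.reward for r in group])
    terms = []
    for a, r in zip(adv, group):
        logps = [r.mode_logprob] + r.response_logprobs
        terms.append(a * sum(logps) / len(logps))
    return mean(terms)


def decoupled_objective(group: List[Rollout], alpha: float = 1.0) -> float:
    """Mode token gets its own term (weight alpha), independent of length."""
    adv = group_advantages([r.reward for r in group])
    terms = []
    for a, r in zip(adv, group):
        mode_term = alpha * a * r.mode_logprob
        resp_term = a * sum(r.response_logprobs) / max(len(r.response_logprobs), 1)
        terms.append(mode_term + resp_term)
    return mean(terms)


def mode_token_weight_coupled(response_len: int) -> float:
    """Per-rollout weight on the mode token under the coupled objective."""
    return 1.0 / (1 + response_len)
```

Under these assumptions, a 99-token reasoning trace leaves the mode token with one hundredth of the rollout's weight in the coupled objective, while a 1-token direct answer leaves it with half. That length-dependent imbalance is the kind of entanglement a separate mode-selection term is meant to remove.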

The deeper signal, though, is calibration. A model that knows *when* not to reason is really a model that knows when it doesn't know. Small open-source models trained with uncertainty-aware objectives and an explicit abstention option match pre-trained models 10x their size on conversation forecasting, which suggests calibration ability already exists in these networks but sits undertrained under standard recipes (see "Can models learn to abstain when uncertain about predictions?"). That connects to a broader finding: base models already contain latent reasoning that minimal signals merely *select* rather than create (see "Do base models already contain hidden reasoning ability?"). If reasoning is selected, not built, then restraint can probably be selected too: the capacity to abstain may be sitting there, suppressed by reward shapes that pay only for answers.

Two cautions are worth carrying. First, don't expect restraint to fall out of "more reasoning, but better": sycophancy research shows reasoning-optimized models resist social pressure no better than base ones, because the failure lives in the generation distribution, not the reasoning process (see "Can better reasoning training actually reduce model sycophancy?"). Restraint is the same kind of problem: a distribution that needs reshaping, not a reasoning chain that needs lengthening. Second, there's a clue about which signals are worth paying for. Models trained on corrupted or irrelevant reasoning traces learn about as well as those trained on correct ones, implying the trace is computational scaffolding rather than meaningful content (see "Do reasoning traces need to be semantically correct?"). If the correctness of the trace barely matters, then the signal worth investing in isn't "reason well" but "decide whether to reason at all," which is exactly what the routing, silence, and abstention objectives above target. It is also why multi-turn reward design (rewarding long-term interaction value over immediate helpfulness) keeps surfacing as the lever that lets models hold back and ask instead of charging ahead (see "Why do language models respond passively instead of asking clarifying questions?").
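The contrast between immediate helpfulness and long-term interaction value can be sketched as two scoring rules applied to the same conversations. This is a toy illustration, not CollabLLM's method. It assumes each candidate first response has a handful of sampled conversation continuations, each reduced to a list of per-turn rewards. The discount factor `gamma` and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def immediate_reward(turn_rewards: Sequence[float]) -> float:
    """Single-turn view: score only the first response."""
    return turn_rewards[0]


def discounted_return(turn_rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Multi-turn view: discounted sum of rewards over the whole conversation."""
    return sum((gamma ** t) * r for t, r in enumerate(turn_rewards))


def multiturn_value(rollouts: List[Sequence[float]], gamma: float = 0.9) -> float:
    """Average discounted return over sampled continuations."""
    return mean(discounted_return(r, gamma) for r in rollouts)
```

If a clarifying question scores low on its own turn but leads to better later turns, `immediate_reward` prefers charging ahead while `multiturn_value` can prefer asking. Which one wins depends entirely on the sampled continuations and the discount.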


Sources (10 notes)

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can models learn when NOT to speak in conversations?

DiscussLLM trains AI to decide between five intervention types or remaining silent using an 88K synthetic discussion dataset. A decoupled classifier-generator architecture achieves better computational efficiency, while end-to-end training better integrates when-to-speak and what-to-say decisions.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was erroneously convinced 69% more often by fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher tracking when LLMs learn to abstain, decline, or silence themselves. The question remains open: *what training signals teach models when NOT to reason?*

What a curated library found — and when (dated claims, not current truth):
Library findings span 2023–2026, with density in 2025–26. Key constraints identified:

• Reasoning fine-tuning actively degrades abstention capacity (~24% drop) because the reward shape punishes "I don't know" and rewards completing chains, even for unanswerable questions (~2025).
• Scaled reasoning creates instruction-following deficits: longer chains dilute attention to the original ask via contextual distance (~2025).
• Routing (Thinkless, DeGRPO) and explicit silence tokens (DiscussLLM) work because restraint requires a *decoupled* reward signal separate from answer quality (~2025–26).
• Calibration and abstention capacity already exist latent in base models; minimal signals select rather than build them (~2024–25).
• Sycophancy and reasoning-resistant failures live in the generation distribution, not the reasoning process; longer chains don't fix distribution problems (~2025).
• Multi-turn reward optimization (valuing long-term interaction over immediate helpfulness) is the lever that lets models hold back (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.13379 Thinkless: LLM Learns When to Think (2025-05)
• arXiv:2508.18167 DiscussLLM: Teaching Large Language Models When to Speak (2025-08)
• arXiv:2602.07338 Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation (2026-02)
• arXiv:2506.09038 AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (2025-06)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, newer Claude variants), training methods (newer RL schemes, inference-time scaling), tooling (verifiers, formal checkers, structured outputs), or orchestration (memory, multi-agent fallback, confidence thresholding) have since relaxed or overturned it. Separate the durable question—how do we shape rewards to teach restraint?—from perishable limitations (e.g., does the 24% degradation still hold under DeGRPO-style decoupling?). Cite what resolved each constraint.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has anyone shown that plain scaling or constitutional methods achieve abstention without explicit routing?
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Do inference-time verifiers now obviate the need for training-time abstention signals?" or "Can multi-agent orchestration (fallback to silence) replace single-model restraint training?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
