SYNTHESIS NOTE

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or an inherent limitation.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

The standard case for reasoning models: they think more, therefore they reason better. The missing-premise case inverts this completely.

When given questions with missing premises (MiP) — questions that are unanswerable because they lack necessary information — reasoning models produce responses that are drastically longer than for normal questions. The additional length is not useful thinking. It is redundant self-doubt: the model cycles through "alternatively," "wait," "check," and "but" without making progress, unable to resolve the contradiction introduced by the missing premise.
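
One rough way to make "redundant self-doubt" measurable is to count the marker words quoted above in a reasoning trace. The sketch below is only illustrative: the fixed whole-word list, the helper names, and the per-100-words normalization are assumptions made here, not a metric taken from the source.

```python
import re
from collections import Counter

# Marker words quoted in the note. Treating them as a fixed whole-word list
# is an assumption made for illustration.
SELF_DOUBT_MARKERS = ("alternatively", "wait", "check", "but")


def _words(trace: str) -> list[str]:
    """Lowercase the trace and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", trace.lower())


def count_markers(trace: str) -> Counter:
    """Count whole-word occurrences of each self-doubt marker."""
    return Counter(w for w in _words(trace) if w in SELF_DOUBT_MARKERS)


def marker_density(trace: str) -> float:
    """Self-doubt markers per 100 words; 0.0 for an empty trace."""
    words = _words(trace)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SELF_DOUBT_MARKERS)
    return 100.0 * hits / len(words)


# Toy trace written for this example, not real model output.
trace = "Wait, but the rate is missing. Alternatively, check again. But wait."
print(count_markers(trace))
print(round(marker_density(trace), 1))
```

A high density on its own does not prove that a trace is unproductive. It is a cheap signal for flagging traces that may deserve a closer look.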

Non-reasoning models behave differently. They produce shorter responses and are significantly more likely to identify the question as ill-posed. They achieve better abstain rates. They do not ruminate.
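
The two quantities compared here, abstain rate and response length, can be computed from labeled responses roughly as follows. This is a minimal sketch: the `Response` record, the function names, and the toy numbers are all assumptions for illustration and are not measurements from the source.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Response:
    tokens: int      # response length in tokens
    abstained: bool  # True if the response flags the question as unanswerable


def abstain_rate(responses: list[Response]) -> float:
    """Fraction of responses that identify the question as ill-posed."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r.abstained for r in responses) / len(responses)


def length_ratio(mip: list[Response], well_posed: list[Response]) -> float:
    """Mean length on MiP questions divided by mean length on normal ones."""
    if not mip or not well_posed:
        raise ValueError("both response sets must be non-empty")
    return mean(r.tokens for r in mip) / mean(r.tokens for r in well_posed)


# Toy numbers invented for the example; they are not measurements.
mip_responses = [Response(4000, False), Response(3000, False),
                 Response(500, True), Response(2500, False)]
normal_responses = [Response(800, False), Response(1200, False)]

print(abstain_rate(mip_responses))                    # 0.25
print(length_ratio(mip_responses, normal_responses))  # 2.5
```

Deciding whether a response counts as an abstention is the hard part in practice and is left outside this sketch.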

The mechanism: reasoning-specific training optimizes for generating thinking patterns — for using reasoning steps — but does not develop the meta-capability to recognize when thinking cannot help. The training signal rewards chains that lead to answers. Questions without valid answers do not provide this signal, so no training pressure develops the critical thinking capability to disengage.

Three observations deepen this:

Reasoning models show large increases in step count for MiP questions — most steps are redundant self-doubt
The overthinking is contagious through distillation — models distilled from reasoning model responses inherit the overthinking pattern
The problem generalizes beyond the "missing premises" framing — any question where the correct response is not to reason further will expose this deficit

This contradicts the naïve test-time scaling law assumption. Scaling thinking tokens is supposed to improve outcomes. For ill-posed questions, it does the opposite. The model is burning compute on questions that require no answer, only recognition.

The practical implication for deployed reasoning agents: well-formed questions from trusted sources are fine. Ill-formed, ambiguous, or manipulative questions are not, because the reasoning model will not disengage; it will overthink instead.

Prompting-level mitigation: ISP2 (Iterative Summarization Pre-Prompting) demonstrates that pre-reasoning information gathering can partially address the implicit/missing information problem. The technique extracts entities and their descriptions from the question, rates the reliability of these information pairs, then iteratively merges the lowest-reliability pairs into new descriptions, building a key information pair that is fed alongside the original question into reasoning. The principle is "understanding before reasoning": CoT emphasizes reasoning stages but neglects the critical prior step of gathering and extracting essential information. ISP2 addresses the missing-premise gap from the prompting side, while training-based approaches (see "Can models learn to ask clarifying questions instead of guessing?") address it from the capability side.
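
The extract, rate, and merge loop described above can be sketched as follows. This is not the ISP2 authors' implementation. The `extract`, `rate`, and `merge` callables are hypothetical stand-ins for model calls, and merging the two least reliable pairs per round, stopping when one pair remains, and the final prompt template are all assumptions of this sketch.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (entity, description)


def isp2_prompt(
    question: str,
    extract: Callable[[str], List[Pair]],
    rate: Callable[[str, Pair], float],
    merge: Callable[[str, Pair, Pair], Pair],
) -> str:
    """Build a pre-prompt by repeatedly merging the two least reliable pairs.

    The three callables stand in for model calls and are hypothetical.
    Stopping when one pair remains is an assumption of this sketch.
    """
    pairs = list(extract(question))
    if not pairs:
        return f"Question: {question}"
    while len(pairs) > 1:
        # Lowest reliability first.
        ranked = sorted(pairs, key=lambda p: rate(question, p))
        merged = merge(question, ranked[0], ranked[1])
        pairs = ranked[2:] + [merged]
    entity, description = pairs[0]
    return f"Key information: {entity}: {description}\nQuestion: {question}"


# Deterministic toy stand-ins so the sketch runs without a model.
def toy_extract(question: str) -> List[Pair]:
    return [("train", "leaves at noon"), ("speed", "unknown"), ("distance", "120 km")]


def toy_rate(question: str, pair: Pair) -> float:
    return float(len(pair[1]))  # pretend longer descriptions are more reliable


def toy_merge(question: str, a: Pair, b: Pair) -> Pair:
    return (f"{a[0]}+{b[0]}", f"{a[1]}; {b[1]}")


print(isp2_prompt("When does the train arrive?", toy_extract, toy_rate, toy_merge))
```

Passing the three steps in as callables keeps the control flow testable without a model. In a real pipeline each would presumably be a separate prompt to the model.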

QuestBench extends the picture from behavior to diagnostics: models often cannot even identify what information is missing. With accuracy of only 40-50% on logic and planning clarification tasks, the information-acquisition failure precedes the overthinking failure. See "Can models identify what information they actually need?". Together, the two findings describe a two-part deficit: (1) models cannot detect what information is needed, and (2) they cannot disengage when information is absent.

"When Prompts Go Wrong" (2025) extends this to code generation with a systematic taxonomy. Ambiguous descriptions (multiple plausible interpretations), contradictory descriptions (conflicting requirements), and incomplete descriptions (omitted constraints) each cause distinct failure modes. Contradictory descriptions result in the most logical errors — models attempt to satisfy incompatible requirements simultaneously. Incomplete descriptions cause models to make incorrect assumptions (e.g., assuming a base area is provided when "triangular" is omitted). Even larger, more resilient models are not immune. The finding generalizes the missing-premises problem: it is not specific to reasoning tasks but a fundamental vulnerability wherever task specifications are imperfect. Source: Arxiv/Prompts Prompting.

Inquiring lines that read this note 90

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can models identify insufficient information and respond appropriately without guessing?

How should models express uncertainty rather than forced confident answers?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why do reasoning models fail at systematic problem-solving and search?

Why do correct reasoning traces tend to be shorter than incorrect ones?

What makes specific clarifying questions more effective than generic ones?

How do training data properties shape reasoning capability development?

How does example difficulty affect learning efficiency in language models?

Why do models automatically adjust reasoning length to problem difficulty?

When do additional thinking tokens stop improving reasoning performance?

How does latent reasoning compare to verbalized chain-of-thought?

Can prompting inject entirely new knowledge into language models?

Does irrelevant context degrade reasoning even within model context limits?

Why do language models reinforce false assumptions instead of correcting them?

Why are false presuppositions harder to spot when they sound plausible?

How should iterative research systems allocate reasoning per search step?

How does overthinking in early turns degrade later retrieval rounds?

How do adversarial and manipulative prompts attack reasoning models?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Why do benchmark improvements fail to reflect actual reasoning quality?

What explains the gap between perplexity performance and actual reasoning capability?

How does reasoning effort affect AI theory of mind performance?

What makes reasoning models worse at understanding people?

How do training priors constrain what context information can override?

Why does monological training prevent models from overriding statistical priors?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do models skip steps that would make reasoning clearer?

Does alignment training create blind spots in detecting genuine safety threats?

Why do safety-trained models refuse questions they could actually answer well?

Can model confidence signals reliably improve reasoning quality and calibration?

Why does self-revision increase model confidence while degrading accuracy?

Does reinforcement learning teach reasoning or just when to reason?

Why does extended reasoning training improve exploration without adding new capabilities?

When should retrieval-augmented systems decide to fetch new information?

Why do external feature triggers outperform uncertainty on complex questions?

Can AI-generated outputs constitute genuine knowledge or valid claims?

What happens to long-tail reasoning when AI assists public deliberation?

When does architectural design matter more than raw model capacity?

Why do harder puzzles cause all models to collapse despite larger token budgets?

How should agents balance memory condensation to optimize context efficiency?

How can agents distinguish over-generalized lessons from genuinely useful long-tail knowledge?

Related concepts in this collection 11

Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
missing premises may push models past any threshold by making the threshold undefined
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
MiP is a particularly sharp falsification: more thinking is not just unhelpful, it actively produces worse behavior
Why do reasoning models fail under manipulative prompts? Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
same vulnerability pattern: reasoning models trained to use thinking are more susceptible to scenarios where thinking doesn't help
Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
connects directly: reasoning training reduces appropriate non-answering
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
diagnostic complement: models can't identify what's missing (QuestBench 40-50%), then overthink when it IS missing
How do users actually form intent when prompting AI systems? Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
when users provide incomplete intent (the default condition), reasoning models overthink rather than recognizing the gap and helping users mature their intent
Why do language models lose performance in longer conversations? Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create underspecification that reasoning models overthink rather than resolve; the Mediator-Assistant architecture separates the problem, preventing overthinking at the execution stage
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
active retrieval offers a constructive exit from the overthinking spiral: when uncertainty is detected, retrieve external information instead of generating more reasoning tokens; without this mechanism, the model can only ruminate
Can models reason without generating visible thinking steps? Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
latent recurrence with bounded depth provides an architectural constraint against rumination: verbalized reasoning models cannot stop token generation when premises are missing, but bounded latent iteration would naturally cap unproductive cycles rather than spiraling into self-doubt
Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create exactly the underspecification that triggers overthinking rather than clarification; the 39% degradation is the conversational cost of the critical thinking deficit
Why do users drift away from their original information need? When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
users in ASK states naturally produce the incomplete queries that trigger overthinking: they know they need something but cannot specify what, producing vague questions with implicit missing premises that reasoning models ruminate on rather than recognizing as underspecified

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

missing premises exacerbate overthinking — reasoning models lack critical thinking to reject ill-posed questions
