Why do reasoning models overthink ill-posed questions?
Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
The standard case for reasoning models: they think more, therefore they reason better. The missing-premise case inverts this completely.
When given questions with missing premises (MiP), that is, questions that are unanswerable because they lack necessary information, reasoning models produce responses that are drastically longer than for well-posed questions. The additional length is not useful thinking. It is redundant self-doubt: the model cycles through "alternatively," "wait," "check," and "but" without making progress, because no amount of further reasoning can supply the premise that is missing.
Non-reasoning models behave differently. They produce shorter responses and are significantly more likely to identify the question as ill-posed. They achieve better abstain rates. They do not ruminate.
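One rough way to make the two signals above measurable is to count the self-doubt markers in a response and to compute the share of responses that flag the question as unanswerable. The sketch below is illustrative only: the regex-based word matching, the `ABSTAIN_PHRASES` list, and the names `analyze` and `abstain_rate` are assumptions of this example, not the method used in the cited work.

```python
import re
from dataclasses import dataclass

# Markers named in the note; matching them as whole words is an assumption.
DOUBT_MARKERS = ("alternatively", "wait", "check", "but")

# Hypothetical phrases treated as a sign that the model flagged the question.
ABSTAIN_PHRASES = (
    "missing information",
    "not enough information",
    "cannot be determined",
    "ill-posed",
)


@dataclass
class TraceStats:
    words: int           # total word count of the response
    doubt_markers: int   # occurrences of the self-doubt markers
    abstained: bool      # True if any abstain phrase appears

    @property
    def doubt_density(self) -> float:
        """Self-doubt markers per word (0.0 for an empty response)."""
        return self.doubt_markers / self.words if self.words else 0.0


def analyze(response: str) -> TraceStats:
    """Compute length, marker count, and a crude abstention flag."""
    text = response.lower()
    words = re.findall(r"[a-z']+", text)
    markers = sum(1 for w in words if w in DOUBT_MARKERS)
    abstained = any(phrase in text for phrase in ABSTAIN_PHRASES)
    return TraceStats(len(words), markers, abstained)


def abstain_rate(responses: list[str]) -> float:
    """Fraction of responses that flag the question as unanswerable."""
    if not responses:
        return 0.0
    return sum(analyze(r).abstained for r in responses) / len(responses)
```

Comparing `doubt_density` and `abstain_rate` between a reasoning model and a non-reasoning model on the same MiP questions would give a first, admittedly coarse, view of the gap described here; a keyword match will miss paraphrased abstentions.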
The mechanism: reasoning-specific training optimizes for generating thinking patterns — for using reasoning steps — but does not develop the meta-capability to recognize when thinking cannot help. The training signal rewards chains that lead to answers. Questions without valid answers do not provide this signal, so no training pressure develops the critical thinking capability to disengage.
Three observations deepen this:
- Reasoning models show large increases in step count for MiP questions — most steps are redundant self-doubt
- The overthinking is contagious through distillation — models distilled from reasoning model responses inherit the overthinking pattern
- The problem generalizes beyond the "missing premises" framing — any question where the correct response is not to reason further will expose this deficit
This contradicts the naïve test-time scaling law assumption. Scaling thinking tokens is supposed to improve outcomes. For ill-posed questions, it does the opposite. The model is burning compute on questions that require no answer, only recognition.
The practical implication for deployed reasoning agents: well-formed questions from trusted sources are fine. Ill-formed, ambiguous, or manipulative questions are not: the reasoning model will not disengage; it will overthink.
Prompting-level mitigation: ISP2 (Iterative Summarization Pre-Prompting) demonstrates that pre-reasoning information gathering can partially address the implicit/missing information problem. The technique extracts entities and their descriptions from the question, rates the reliability of these information pairs, then iteratively merges the lowest-reliability pairs into new descriptions, building a key information pair that is fed alongside the original question into reasoning. The principle is "understanding before reasoning": CoT emphasizes reasoning stages but neglects the critical prior step of gathering and extracting essential information. ISP2 addresses the missing-premise gap from the prompting side, while training-based approaches such as the one discussed in "Can models learn to ask clarifying questions instead of guessing?" address it from the capability side.
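The extract, rate, and merge loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ISP2 authors' implementation: the `extract`, `rate`, and `merge` callables are hypothetical stand-ins for what would be LLM calls in practice, merging the two lowest-rated pairs per round is this sketch's reading of "merges the lowest-reliability pairs", and the prompt layout in `build_prompt` is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InfoPair:
    entity: str
    description: str


# Hypothetical signatures; each would typically wrap an LLM call.
Extractor = Callable[[str], list[InfoPair]]
Rater = Callable[[InfoPair, str], float]
Merger = Callable[[InfoPair, InfoPair, str], InfoPair]


def isp2_key_pair(
    question: str, extract: Extractor, rate: Rater, merge: Merger
) -> Optional[InfoPair]:
    """Reduce the extracted pairs to a single key information pair."""
    pairs = list(extract(question))
    if not pairs:
        return None
    while len(pairs) > 1:
        # Lowest reliability first; the two weakest pairs get merged.
        pairs.sort(key=lambda p: rate(p, question))
        low_a, low_b, *rest = pairs
        pairs = [merge(low_a, low_b, question), *rest]
    return pairs[0]


def build_prompt(question: str, key_pair: Optional[InfoPair]) -> str:
    """Feed the key pair alongside the original question (layout is assumed)."""
    if key_pair is None:
        return question
    return (
        f"Key information: {key_pair.entity}: {key_pair.description}\n"
        f"Question: {question}"
    )


# Toy stand-ins so the sketch runs without a model; purely illustrative.
def toy_extract(question: str) -> list[InfoPair]:
    return [
        InfoPair("train", "travels 60 km"),
        InfoPair("trip", "takes 2 hours"),
        InfoPair("speed", "unknown"),
    ]


def toy_rate(pair: InfoPair, question: str) -> float:
    return float(len(pair.description))  # longer description = "more reliable"


def toy_merge(a: InfoPair, b: InfoPair, question: str) -> InfoPair:
    return InfoPair(f"{a.entity}+{b.entity}", f"{a.description}; {b.description}")


if __name__ == "__main__":
    q = "How fast does the train go?"
    print(build_prompt(q, isp2_key_pair(q, toy_extract, toy_rate, toy_merge)))
```

Passing the three steps in as callables keeps the control flow visible and lets each step be swapped for a real model call or a test double; it says nothing about how the original method prompts the model at each stage.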
QuestBench extends the picture from behavior to diagnostics: models often cannot even identify what information is missing. With only 40-50% accuracy on logic and planning clarification tasks, the information acquisition failure precedes the overthinking failure. See "Can models identify what information they actually need?". Together, the two findings describe a two-part deficit: (1) models cannot detect what information is needed, and (2) they cannot disengage when information is absent.
"When Prompts Go Wrong" (2025) extends this to code generation with a systematic taxonomy. Ambiguous descriptions (multiple plausible interpretations), contradictory descriptions (conflicting requirements), and incomplete descriptions (omitted constraints) each cause distinct failure modes. Contradictory descriptions result in the most logical errors — models attempt to satisfy incompatible requirements simultaneously. Incomplete descriptions cause models to make incorrect assumptions (e.g., assuming a base area is provided when "triangular" is omitted). Even larger, more resilient models are not immune. The finding generalizes the missing-premises problem: it is not specific to reasoning tasks but a fundamental vulnerability wherever task specifications are imperfect. Source: Arxiv/Prompts Prompting.
Inquiring lines that use this note as a source (82)
This note is a source for these synthesized inquiries.
- Can AI systems identify important unanswered questions that emerge during reasoning?
- Why do models commit to answers early on easy versus hard tasks?
- Can penalizing reasoning transitions fix underthinking without fine-tuning models?
- Can models identify what information they are missing in underspecified problems?
- Can models identify information gaps without just guessing or refusing to answer?
- Why do reasoning models fail on structurally unfamiliar instances?
- What mechanism causes confident false answers under high cognitive load?
- Can extended thinking genuinely improve reasoning or just increase variance?
- What makes some clarifying questions more useful than others?
- Can testing prior knowledge and checking understanding improve explanation outcomes?
- Why do models automatically adjust reasoning length to problem difficulty?
- What triggers overthinking versus underthinking in reasoning models?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- Why do non-reasoning models work better under extreme decomposition than reasoning models?
- Does irrelevant context degrade reasoning even within model context limits?
- Do models trained for safety over-refuse compared to models trained for reasoning?
- Why do simple math problems get worse with longer reasoning chains?
- Can proactive critical thinking train models to request clarification actively?
- How does random walk length control reasoning complexity in question generation?
- How does ambiguity detection connect to models' ability to ask clarifying questions?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Why are false presuppositions harder to spot when they sound plausible?
- What makes correcting a false assumption harder than just detecting it?
- How does overthinking in early turns degrade later retrieval rounds?
- How do adversarial triggers bypass the protections of longer reasoning chains?
- Can models learn to identify what information is missing from questions?
- Does distillation from reasoning models spread overthinking to smaller models?
- What training signals would teach models when not to reason?
- Do reasoning models overthink ill-posed questions instead of recognizing incompleteness?
- Why does extended thinking increase output variance without improving reasoning quality?
- Do models trained for reasoning lose their ability to decline questions?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Why do introverted agents produce longer and more detailed reasoning traces?
- Can extended deliberation in agents become counterproductive like human overthinking?
- Why do reasoning models fail when input length increases even below context limits?
- Why do models overthink easy problems and underthink difficult ones?
- Can preference optimization reduce overthinking without sacrificing accuracy?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- What explains the gap between perplexity performance and actual reasoning capability?
- Why do reasoning models wander instead of searching systematically?
- What makes reasoning models worse at understanding people?
- Why does overthinking degrade performance at extreme recursion depths?
- Are reasoning models more vulnerable to persuasion than standard models?
- Why do models overthink underspecified problems instead of rejecting them?
- When should a system choose extended thinking versus quick responses?
- Why does monological training prevent models from overriding statistical priors?
- Can models learn to stop thinking when a question lacks necessary information?
- What makes a first answer so often the best answer a model produces?
- Why do models detect false assumptions but still fail to correct them appropriately?
- Why do models skip steps that would make reasoning clearer?
- Do search agents face their own overthinking threshold like reasoning models do?
- How much does extended thinking actually improve model reasoning ability?
- Why does more inference compute amplify wandering rather than solving it?
- How does RLHF training reward models for guessing over asking clarifying questions?
- Why do specific clarifying questions outperform rephrased versions of user needs?
- Why do safety-trained models refuse questions they could actually answer well?
- Can confidence levels reliably detect when a model is overthinking?
- Why do specific clarifying questions outperform generic requests for clarity?
- Can models overthink and underthink at the same time?
- Why are incorrect reasoning traces longer than correct ones?
- Can runtime confidence signals detect when reasoning has crossed the overthinking threshold?
- Why do different model training approaches produce different overthinking thresholds?
- Why do models struggle with asking questions in multi-turn conversational reasoning tasks?
- Can models learn to ask clarifying questions instead of making assumptions?
- Can reasoning models reject ill-posed questions or do they overthink?
- What causes reasoning quality to degrade during long research tasks?
- How do reasoning-related features behave when trained on near-impossible problems?
- Can conditioning generation on difficulty probes reduce overthinking on simple tasks?
- Why do reasoning models exhibit self-doubt about their own early assessments?
- Why do longer reasoning chains explore like tourists instead of scientists?
- Do reasoning models need to verbalize doubt to correct their own mistakes?
- Why do language models overthink simple questions when given extra time?
- Why does extended reasoning training improve exploration without adding new capabilities?
- Why do external feature triggers outperform uncertainty on complex questions?
- Are reasoning models more vulnerable to adversarial manipulation than standard models?
- What makes uncertainty calibration harder than expanding knowledge?
- Why does reasoning backward enable better forward reasoning performance?
- How does question difficulty and breadth affect what models learn to reason?
- Do models naturally learn to ask clarifying questions without explicit supervision?
- Which types of clarifying questions actually help users versus wasting their time?
- How can models select the optimal question to ask given multiple uncertainties?
Related concepts in this collection (11)
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  missing premises may push models past any threshold by making the threshold undefined
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  MiP is a particularly sharp falsification: more thinking is not just unhelpful, it actively produces worse behavior
- Why do reasoning models fail under manipulative prompts?
  Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
  same vulnerability pattern: reasoning models trained to use thinking are more susceptible to scenarios where thinking doesn't help
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  connects directly: reasoning training reduces appropriate non-answering
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  diagnostic complement: models can't identify what's missing (QuestBench 40-50%), then overthink when it IS missing
- How do users actually form intent when prompting AI systems?
  Users face a 'gulf of envisioning': they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
  when users provide incomplete intent (the default condition), reasoning models overthink rather than recognizing the gap and helping users mature their intent
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create underspecification that reasoning models overthink rather than resolve; the Mediator-Assistant architecture separates the problem, preventing overthinking at the execution stage
- When should retrieval happen during model generation?
  Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
  active retrieval offers a constructive exit from the overthinking spiral: when uncertainty is detected, retrieve external information instead of generating more reasoning tokens; without this mechanism, the model can only ruminate
- Can models reason without generating visible thinking steps?
  Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
  latent recurrence with bounded depth provides an architectural constraint against rumination: verbalized reasoning models cannot stop token generation when premises are missing, but bounded latent iteration would naturally cap unproductive cycles rather than spiraling into self-doubt
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create exactly the underspecification that triggers overthinking rather than clarification; the 39% degradation is the conversational cost of the critical thinking deficit
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  users in ASK states naturally produce the incomplete queries that trigger overthinking: they know they need something but cannot specify what, producing vague questions with implicit missing premises that reasoning models ruminate on rather than recognizing as underspecified
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Original note title
missing premises exacerbate overthinking — reasoning models lack critical thinking to reject ill-posed questions