INQUIRING LINE

Why do language models overthink simple questions when given extra time?

This explores why reasoning models burn extra compute on questions that don't need it — and what the corpus says is actually breaking when 'more thinking' makes answers worse, not better.


The corpus points to a surprising culprit: overthinking isn't a thinking problem, it's a *stopping* problem. Models are trained to generate reasoning steps but almost never trained on when to disengage. When a question is ill-posed or missing a premise, reasoning models keep elaborating, producing long, redundant chains, while plain non-reasoning models simply flag it as unanswerable (see "Why do reasoning models overthink ill-posed questions?"). Extra time doesn't buy more correctness here; it buys more rope.

The most counterintuitive finding is that giving a model more inference-time compute can actively *degrade* its judgment. On deliberately flawed math problems, scaling up thinking made untrained models *worse* at noticing the flaw, yet the same scaling helped after the model was explicitly trained to think critically; that reinforcement-learning training raised proactive critical-thinking accuracy from 0.15% to 73.98% (see "Can models learn to ask clarifying questions instead of guessing?"). So 'extra time' is not neutral. Without a learned sense of when to stop, more steps just amplify whatever the model was already doing, including chasing a malformed question deeper.

Part of the answer is that knowing-when-to-think is a separate skill that has to be trained in on purpose. Thinkless trains a single model to route between extended reasoning and a direct answer, decoupling the 'should I think?' decision from the 'what's the answer?' refinement so the model can self-calibrate by difficulty rather than defaulting to maximum effort on everything (see "Can models learn when to think versus respond quickly?"). Overthinking, in this light, is what happens when that routing layer is missing: the model has only one gear.
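To make the 'decoupling' idea concrete, here is a minimal toy sketch in Python. It is not the Thinkless or DeGRPO implementation; the `Rollout` class, the `alpha` weight, and the exact form of both losses are assumptions chosen only to illustrate why mode selection and answer refinement might be normalized separately. In a single token-averaged loss, the one control token that picks the mode is averaged in with every response token, so its weight shrinks as the chain gets longer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled response (illustrative only)."""
    mode_logprob: float          # log-prob of the control token (think vs. short)
    token_logprobs: List[float]  # log-probs of the response tokens that follow
    advantage: float             # scalar advantage assigned to the whole rollout


def coupled_loss(r: Rollout) -> float:
    """Single token-averaged loss: the control token is one term among many."""
    all_logprobs = [r.mode_logprob] + r.token_logprobs
    return -r.advantage * sum(all_logprobs) / len(all_logprobs)


def decoupled_loss(r: Rollout, alpha: float = 1.0) -> float:
    """Mode-selection and response terms are normalized separately.

    alpha is an assumed knob for how strongly the mode decision is weighted.
    """
    mode_term = -r.advantage * r.mode_logprob
    n = max(len(r.token_logprobs), 1)
    response_term = -r.advantage * sum(r.token_logprobs) / n
    return alpha * mode_term + response_term


def mode_weight_coupled(r: Rollout) -> float:
    """Weight the control token's log-prob receives under the coupled loss."""
    return 1.0 / (1 + len(r.token_logprobs))


# Two rollouts that differ only in response length (made-up numbers).
short = Rollout(mode_logprob=-1.0, token_logprobs=[-0.5] * 4, advantage=1.0)
long = Rollout(mode_logprob=-1.0, token_logprobs=[-0.5] * 999, advantage=1.0)

print(mode_weight_coupled(short), mode_weight_coupled(long))
print(decoupled_loss(short), decoupled_loss(long))
```

Under these assumptions, the coupled loss gives the mode token a far smaller weight on the long rollout than on the short one, while the decoupled form keeps the mode term's weight fixed at `alpha` regardless of length.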

There's also a deeper question of whether the long chain is even doing real work. Logit-lens analysis of models trained with hidden chain-of-thought tokens shows the correct answer being computed in the earliest layers (1-3) and then suppressed in the final layers in favor of format-compliant filler, so the visible reasoning isn't always where the answer comes from (see "Do transformers hide reasoning before producing filler tokens?"). And reasoning models don't break at a complexity threshold so much as at unfamiliar instances; they pattern-match to training examples rather than running a general algorithm, so a long chain succeeds or fails based on novelty, not length (see "Do language models fail at reasoning due to complexity or novelty?"). That reframes 'overthinking' as effort spent regardless of whether it's the kind of problem extra effort can solve.
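For readers unfamiliar with the logit lens, the following toy sketch shows the basic move: project each layer's hidden state through the output (unembedding) matrix and read off which token ranks first at that depth. Everything here is an assumption for illustration: the two-token vocabulary, the identity unembedding, and the hidden-state values are made up, and a real logit lens would normally apply the model's final normalization layer before unembedding, which this sketch omits.

```python
from typing import List


def logit_lens(hidden_by_layer: List[List[float]],
               unembed: List[List[float]],
               vocab: List[str]) -> List[List[str]]:
    """For each layer, rank vocabulary tokens by the logits obtained from
    projecting that layer's hidden state through the unembedding matrix."""
    rankings = []
    for hidden in hidden_by_layer:
        # One logit per vocabulary row: dot product of the row with the hidden state.
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        rankings.append([vocab[i] for i in order])
    return rankings


# Made-up toy values: "7" stands in for the answer token, "." for a filler token.
vocab = ["7", "."]
unembed = [[1.0, 0.0],   # row for "7"
           [0.0, 1.0]]   # row for "."
hidden_by_layer = [
    [0.9, 0.1],  # early layer: answer direction dominates
    [0.8, 0.3],  # middle layer: still the answer
    [0.2, 0.9],  # final layer: filler direction now dominates
]

rankings = logit_lens(hidden_by_layer, unembed, vocab)
for layer, ranked in enumerate(rankings, start=1):
    print(layer, ranked)
```

In this toy, the answer token ranks first in the early layers and drops to second place at the final layer, which is the shape of the pattern the note describes: the answer is still present, just outranked by filler.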

The through-line the corpus draws: models are optimized to *produce* reasoning, not to *withhold* it. Whether it's failing to reject a missing premise, failing to ignore a distractor, or failing to route to a quick answer, the same training gap appears — systems learn what to do far better than what *not* to do, and on simple questions, restraint is the missing instruction.


Sources (5 notes)

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher evaluating whether language models' overthinking on simple questions is a SOLVED or EVOLVING problem. The question: why do LLMs waste compute on trivial queries when given extra inference time?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Feb 2026. Key constraints identified:
• Overthinking is a *stopping problem*, not a thinking problem: models lack training signals for when to disengage, so extra time amplifies noise rather than improving correctness (2024–25).
• Scaling inference-time compute can *degrade* judgment on flawed premises; untrained models get worse at noticing missing information with more steps, but explicit critical-thinking training flips this (2025).
• Routing-based decoupling (e.g., Thinkless, 2025) can teach models to self-calibrate difficulty and choose direct vs. extended reasoning, suggesting the core gap is a missing routing layer.
• Logit-lens reveals transformers compute correct answers early, then overwrite with format-compliant filler; reasoning chains don't always drive the final answer (2024–25).
• Reasoning breakdown is instance-level (novelty) not task-level (complexity); models pattern-match to training examples rather than execute general algorithms (2025).

Anchor papers (verify; mind their dates):
• arXiv:2412.04537 (Understanding Hidden Computations in Chain-of-Thought Reasoning, 2024-12)
• arXiv:2505.13379 (Thinkless: LLM Learns When to Think, 2025-05)
• arXiv:2505.22354 (LLMs Struggle to Reject False Presuppositions, 2025-05)
• arXiv:2602.06176 (Large Language Model Reasoning Failures, 2026-02)

Your task:
(1) RE-TEST EACH CONSTRAINT. Has routing-based training (Thinkless-style decoupling) become standard practice, or do most reasoning models still lack this layer? Do newer architectures (post-2025) train stopping explicitly as a learned skill? Do inference-time scaling methods now include calibration checks? For each finding, separate the durable question (what drives overthinking?) from the perishable claim (does this model lack routing?) and cite what, if anything, has since resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown that overthinking can be useful under realistic conditions, or that the stopping problem is downstream of something else (e.g., prompt design, sampling strategy)?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If stopping is now trainable, what's the new bottleneck — data efficiency, generalization, or something deeper? (b) Can you design a unified metric for "wasted compute" that disentangles noise from valid but unnecessary reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
