INQUIRING LINE

What triggers overthinking versus underthinking in reasoning models?

This explores what actually flips a reasoning model between two opposite failure modes — burning tokens redundantly (overthinking) versus bailing on good ideas too early (underthinking) — and whether the triggers are about problem difficulty, training, or something in the decoding itself.


The corpus suggests the trigger is less about how much the model thinks and more about *calibration*: whether the model's confidence and its sense of when to stop match the difficulty of the problem in front of it.

The sharpest single finding is that the two failures map onto problem difficulty in opposite directions: models overthink easy problems and underthink hard ones [Does more thinking time always improve reasoning accuracy?]. Accuracy isn't monotonic in thinking length; it peaks at a task-specific token count and then drops sharply (one study clocks 87.3% down to 70.3% as tokens climb from ~1,100 to ~16,000), because extended thinking starts inflating output variance and introducing self-revision errors rather than fixing anything [When does thinking too much actually hurt reasoning?]. So overthinking isn't just wasted compute: past a threshold it can actively corrupt a correct answer.
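To put a number on that drop, here is a small sketch that uses only the two data points quoted above (the token counts are approximate in the source). The helper names `accuracy_drop` and `best_budget` are made up for illustration; with a fuller sweep of (tokens, accuracy) pairs the same `best_budget` call would pick the peak.

```python
def accuracy_drop(acc_before, acc_after):
    """Return the absolute drop (percentage points) and the relative drop (fraction)."""
    absolute = acc_before - acc_after
    relative = absolute / acc_before
    return absolute, relative


def best_budget(curve):
    """curve: list of (thinking_tokens, accuracy_percent) pairs; return the best budget."""
    return max(curve, key=lambda point: point[1])[0]


# The only two points reported in the notes (token counts are approximate).
reported = [(1_100, 87.3), (16_000, 70.3)]

absolute, relative = accuracy_drop(87.3, 70.3)
print(f"{absolute:.1f} points, {relative:.1%} relative")  # 17.0 points, 19.5% relative
print(best_budget(reported))  # 1100
```

Two points can't locate the true peak, so treat the result as "the better of the two measured budgets", not as the optimum.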

Underthinking has a more mechanical trigger: premature thought-switching. Models abandon promising reasoning paths mid-exploration, scattering tokens across incomplete approaches and exploring "like tourists, not scientists" [Why do reasoning models abandon promising solution paths?]. Strikingly, you can mitigate this at decoding time without retraining: a penalty on thought-transition tokens discourages the bailing and improves accuracy on hard math [Do reasoning models switch between ideas too frequently?]. That points to confidence as the underlying dial: when a model can't commit, it switches; when it's overconfident, it pads. ReBalance reads confidence variance and overconfidence directly as diagnostic signals, then applies training-free steering to suppress redundancy during overthinking and push exploration during underthinking [Can confidence patterns reveal overthinking versus underthinking?].
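A decoding-time penalty on thought-transition tokens can be sketched in a few lines. This is a toy illustration of the general idea, not the cited paper's implementation: the function name, the three-token vocabulary, and the parameters `alpha` (penalty strength) and `beta` (how many steps into a thought the penalty applies) are all assumptions made for the example.

```python
import math


def penalize_switch_logits(logits, switch_token_ids, steps_in_thought,
                           alpha=3.0, beta=600):
    """Return a copy of logits with switch tokens pushed down while the thought is young.

    alpha and beta are hypothetical knobs: alpha is subtracted from the logit of
    every thought-transition token during the first beta steps of a thought.
    """
    if steps_in_thought >= beta:
        return list(logits)  # thought is old enough, switching is allowed
    return [
        logit - alpha if token_id in switch_token_ids else logit
        for token_id, logit in enumerate(logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary: 0 = "so", 1 = "alternatively" (a switch token), 2 = "therefore".
logits = [1.0, 2.0, 1.5]
switch_ids = {1}

early = penalize_switch_logits(logits, switch_ids, steps_in_thought=10)
late = penalize_switch_logits(logits, switch_ids, steps_in_thought=600)

print(softmax(logits)[1] > softmax(early)[1])  # True: switching is less likely early on
print(late == logits)                          # True: no penalty once the window has passed
```

Because the penalty only touches logits at sampling time, the model weights stay untouched, which is the sense in which the fix is "training-free".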

Two deeper triggers sit underneath. First, training, not the sheer amount of thinking, decides whether thinking helps at all: vanilla models use "thinking mode" to induce self-doubt that *degrades* performance, and RL training turns the very same mechanism into productive gap analysis [Does extended thinking help or hurt model reasoning?]. Second, models lack a stop signal for ill-posed inputs: given a question with a missing premise, reasoning models spiral into long redundant chains while non-reasoning models simply flag it as unanswerable. Training optimizes for *producing* reasoning steps but never teaches *when to disengage* [Why do reasoning models overthink ill-posed questions?].

The unsettling thread, if you want to pull it: longer reasoning chains aren't just inefficient, they're a liability surface. Each extra step is another intervention point where a single corrupted step can propagate, which is why reasoning models are *more* vulnerable to manipulative multi-turn prompts than plain models, losing 25–29% accuracy [Why do reasoning models fail under manipulative prompts?]. And there's a measurement angle worth knowing about: a "deep-thinking ratio" measures the proportion of tokens whose predictions get significantly revised across model layers, distinguishing genuine reasoning effort from the appearance of it [Can we measure how deeply a model actually reasons?]. That is useful precisely because the visible length of a reasoning trace tells you almost nothing about whether real work is happening.
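To make the deep-thinking ratio concrete, here is a rough sketch under stated assumptions. It takes, for each token, a precomputed list of per-layer divergences between that layer's prediction and the final prediction, and counts a token as "deep" when its prediction only settles in the last stretch of layers. The divergence inputs, `threshold`, and `depth_fraction` are hypothetical choices for illustration, and the cited work's exact definition may differ.

```python
def settle_layer(divergences, threshold):
    """First layer index from which divergence to the final prediction stays below threshold."""
    settled = len(divergences)
    for layer in range(len(divergences) - 1, -1, -1):
        if divergences[layer] < threshold:
            settled = layer
        else:
            break  # an earlier layer still disagrees with the final prediction
    return settled


def deep_thinking_ratio(per_token_divergences, threshold=0.1, depth_fraction=0.8):
    """Fraction of tokens whose prediction settles only in the last layers.

    threshold and depth_fraction are illustrative values, not published settings.
    """
    if not per_token_divergences:
        return 0.0
    deep = sum(
        1
        for divergences in per_token_divergences
        if settle_layer(divergences, threshold) >= depth_fraction * len(divergences)
    )
    return deep / len(per_token_divergences)


# Two toy tokens over a 5-layer model (made-up numbers).
shallow_token = [0.9, 0.05, 0.02, 0.01, 0.0]  # settles by layer 1
deep_token = [0.9, 0.8, 0.7, 0.6, 0.0]        # keeps being revised until the last layer

print(deep_thinking_ratio([shallow_token, deep_token]))  # 0.5
```

The point of the toy is that two traces of identical length can score very differently, which is exactly why length alone is a poor proxy for effort.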


Sources (9 notes)

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model researcher. The question: what actually *flips* a model between overthinking (redundant token-burning that corrupts answers) and underthinking (premature abandonment of promising reasoning paths)? A curated library (arXiv, Jan 2025–Mar 2026) found:

— Overthinking and underthinking map inversely to problem difficulty: models overthink easy problems and underthink hard ones. Accuracy peaks at a task-specific token count (~1,100 on one benchmark), then degrades sharply (87.3% → 70.3% as tokens climb to ~16,000), driven by self-revision errors, not missing insight (2025–2026).
— Underthinking is mechanically triggered by premature thought-switching: models scatter tokens across incomplete reasoning paths. A decoding-time penalty on thought-transition tokens (no retraining) discourages bailing and improves hard-problem accuracy (2025).
— Confidence calibration is the continuous dial: ReBalance reads confidence variance to steer between suppressing redundancy (overthinking) and pushing exploration (underthinking) without retraining (2026).
— RL training flips the role of "thinking mode" from self-doubt (vanilla) to productive gap analysis (trained). Vanilla models use extended thinking to *induce* doubt; RL reverses it (2025).
— Reasoning models lack a stop signal for ill-posed inputs (missing premises, unanswerable questions): they spiral into long chains; non-reasoning models simply flag it (2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.04210 (Does Thinking More Always Help?, Jun 2025)
• arXiv:2505.20296 (Reasoning LLMs Are Wandering Solution Explorers, May 2025)
• arXiv:2603.12372 (Efficient Reasoning with Balanced Thinking, Mar 2026)
• arXiv:2602.13517 (Think Deep, Not Just Long, Feb 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the perishable claims (accuracy cliff at ~16K tokens; vanilla thinking = self-doubt; thought-transition penalties work training-free), judge whether post-Mar-2026 model releases, RL recipes, inference optimization, or steering methods have since relaxed or overturned them. Separate the durable insight (confidence miscalibration as root cause) from the time-locked finding (specific token thresholds, specific RL reversals). Say plainly what still holds and what may have shifted.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: any result that shows overthinking/underthinking is *not* primarily confidence-driven, or that longer chains do help in practice more than these papers claim, or that the accuracy cliff is shallower/avoidable.
(3) Propose 2 research questions that ASSUME the regime may have moved—e.g., do adaptive confidence thresholds per-task outperform global steering? Does agentic collaboration (human + model iterating) bypass the need to choose between overthinking and underthinking at inference time?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
