SYNTHESIS NOTE

Does more thinking time always improve reasoning accuracy?

Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The prevailing assumption that "more thinking tokens = better reasoning" is empirically false beyond a critical point. Pushing the average thinking token count from ~1,100 to ~15,980 reduced accuracy from 87.3% to 70.3% on the same benchmark.

This non-monotonic relationship — initial improvement followed by steady decline — is consistent across multiple tasks and datasets. The researchers call the degradation phase "overthinking," and it has been largely invisible in prior work because most studies only reported the improving phase of the curve.

The practical implication: there is a sweet spot, and token budgets above it actively harm performance. The current practice of using "more tokens" as a proxy for "more reasoning" is not just wasteful; it is counterproductive past the threshold. And as the related note "Does extended thinking actually improve reasoning or just increase variance?" suggests, the gains before the threshold are not even what they appear to be.
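One way to make the sweet spot concrete is to locate the peak of a budget-versus-accuracy sweep. The sketch below is illustrative only: `find_sweet_spot` is a hypothetical helper, not something from the source study, and it is fed just the two measurements quoted in this note, so the "peak" it reports is simply the better of those two points.

```python
from typing import Sequence, Tuple


def find_sweet_spot(sweep: Sequence[Tuple[int, float]]) -> Tuple[int, float, bool]:
    """Return (best_budget, best_accuracy, degrades_after_peak).

    `sweep` holds (average thinking tokens, accuracy in %) pairs measured
    on the same benchmark. Hypothetical helper for illustration.
    """
    if not sweep:
        raise ValueError("sweep must contain at least one measurement")
    ordered = sorted(sweep)  # ascending by token count
    best_budget, best_acc = max(ordered, key=lambda point: point[1])
    # Overthinking shows up as a larger budget scoring below the peak.
    degrades = any(
        tokens > best_budget and acc < best_acc for tokens, acc in ordered
    )
    return best_budget, best_acc, degrades


# The only two measurements quoted in this note.
reported = [(1_100, 87.3), (15_980, 70.3)]
print(find_sweet_spot(reported))  # (1100, 87.3, True)
```

A real sweep would need many more budgets between and below these two points before the location of the threshold could be estimated with any confidence.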

The bidirectional calibration failure (Between Underthinking and Overthinking): The relationship is not just non-monotonic — models miscalibrate in both directions. For easy questions, models often detect difficulty increases and extend reasoning appropriately. But for hard questions beyond their capability, models underthink — failing to recognize difficulty or lacking the knowledge to respond effectively, producing responses shorter than needed. The result: models overthink easy problems (generating unnecessarily long outputs) and underthink hard ones (failing to extend reasoning when most needed).

Length-based preference optimization provides a surprising intervention: fine-tuning to prefer shorter responses — using only unlabeled data, without ground-truth labels — maintains relatively strong accuracy while reducing token length. The reduction is disproportionately from incorrect responses (which are significantly longer), but 10-25% reduction on correct responses is also observed. This suggests models have latent ability to calibrate difficulty for easy problems but retain an overthinking tendency that preference optimization can reduce.
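The data-construction step for this kind of length preference can be sketched without any labels. The snippet below is a minimal sketch under stated assumptions: the note does not say how the original work pairs responses, so pairing every two samples for a prompt, the whitespace word count standing in for a tokenizer, and the names `build_length_preference_pairs` and `min_gap` are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def build_length_preference_pairs(
    samples_by_prompt: Dict[str, List[str]],
    min_gap: int = 1,
) -> List[dict]:
    """Build (chosen, rejected) pairs that prefer the shorter response.

    No ground-truth answers are consulted: length is the only signal.
    Length is approximated by whitespace word count (an assumption).
    """
    pairs = []
    for prompt, responses in samples_by_prompt.items():
        for a, b in combinations(responses, 2):
            len_a, len_b = len(a.split()), len(b.split())
            if abs(len_a - len_b) < min_gap:
                continue  # too close in length to express a preference
            chosen, rejected = (a, b) if len_a < len_b else (b, a)
            pairs.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )
    return pairs


# Toy samples, invented purely to show the output shape.
toy = {"2 + 2 = ?": ["4", "Let me check. 2 + 2 is 4. Verify again: 4."]}
for pair in build_length_preference_pairs(toy):
    print(pair["chosen"], "<-", pair["rejected"])
```

The resulting pairs have the usual prompt/chosen/rejected shape that preference-optimization trainers consume; which trainer and loss to use is left open here.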

PI framework: the attention-level mechanism behind the threshold. The PI (Test-time Prompt Intervention) framework explains why the threshold exists. Visualizing attention maps across reasoning steps reveals that verification and backtracking steps (e.g., steps 7-8 in a typical trace) receive minimal subsequent attention: the model generates them but barely reads them. After the step that produces the correct answer, all following steps attend predominantly to that pivotal step rather than to intermediate verification. The critical steps (those whose predecessors all receive high attention) can reproduce the reasoning with 75% fewer steps. This turns the behavioral observation (accuracy degrades with more tokens) into a mechanistic explanation: redundant tokens are attention-invisible, contributing neither signal nor structure to the final answer. The overthinking region is precisely where token generation has detached from the attention graph that actually drives outputs. Source: Prompts Prompting.
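The idea that some steps are "attention-invisible" can be illustrated with a step-level attention matrix. This is a rough sketch, not the PI framework's actual procedure: the aggregation (mean attention a step receives from all later steps), the threshold, and the names `attention_received` and `prune_steps` are assumptions, and the matrix values are toy numbers.

```python
from typing import List


def attention_received(attn: List[List[float]]) -> List[float]:
    """Mean attention each step receives from the steps that follow it.

    attn[i][j] is the attention step i pays to earlier step j (j < i).
    """
    n = len(attn)
    received = []
    for j in range(n):
        later = [attn[i][j] for i in range(j + 1, n)]
        received.append(sum(later) / len(later) if later else 0.0)
    return received


def prune_steps(attn: List[List[float]], threshold: float) -> List[int]:
    """Indices of steps to keep: well-attended steps plus the final step."""
    received = attention_received(attn)
    last = len(attn) - 1
    return [
        j for j, score in enumerate(received) if score >= threshold or j == last
    ]


# Toy 4-step trace: steps 1 and 2 are generated but barely read afterwards.
toy_attn = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
]
print(prune_steps(toy_attn, threshold=0.5))  # [0, 3]
```

In practice the step-level matrix would have to be aggregated from token-level attention across heads and layers, a choice this sketch leaves open.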

Optimal reasoning token ratio exists but models cannot reach it. ZebraLogic's analysis of constraint satisfaction problems shows that there exists an optimal ratio of reasoning tokens to problem complexity (measured by Z3 solver conflicts). O1-like models scale reasoning tokens with complexity and approach this optimal ratio for moderate problems, but cannot reach it when complexity is extremely high — the reasoning effort ceiling is below what the problem requires. Self-verification prompting provides only marginal improvement (31.7% → 33.0% → 32.1% on second iteration), suggesting the bottleneck is not insufficient verification but insufficient reasoning depth. The optimal ratio finding quantifies the threshold: the sweet spot is not just "not too many tokens" but a specific relationship between problem difficulty and reasoning budget.

S1-Bench (2025) reveals that LRMs can prejudge question simplicity — especially in Chinese — but thinking length does NOT shorten despite this prejudgment. Models generate unnecessary solution rounds after reaching the correct answer, repeatedly reverifying simple problems already solved. Models with longer thinking processes produce more excessive solution rounds. Furthermore, LRMs sometimes include incorrect intermediate conclusions in their reasoning even when ultimately reaching correct final answers, and sometimes reach the correct answer during reasoning but then deviate to produce incorrect final conclusions. The prejudgment finding is architecturally important: it suggests the overthinking mechanism is not caused by inability to assess difficulty, but by an inability to act on that assessment — the model "knows" the problem is simple but cannot truncate its reasoning accordingly. Source: Arxiv/Evaluations.
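The two behaviors described above (re-solving after the answer is found, and drifting away from a correct intermediate answer) can be counted once a trace has been split into solution rounds. The sketch below assumes that segmentation is already done and that each round yields one extracted answer string; `redundant_rounds` and `deviated_after_correct` are hypothetical names, not S1-Bench's own metrics.

```python
from typing import List, Optional


def redundant_rounds(round_answers: List[str], correct: str) -> Optional[int]:
    """Solution rounds generated after the first correct one.

    Returns None when no round ever reaches the correct answer.
    """
    for index, answer in enumerate(round_answers):
        if answer.strip() == correct.strip():
            return len(round_answers) - index - 1
    return None


def deviated_after_correct(round_answers: List[str], correct: str) -> bool:
    """True if a correct answer appeared but the final round is wrong."""
    reached = redundant_rounds(round_answers, correct) is not None
    return reached and round_answers[-1].strip() != correct.strip()


# Toy trace: the answer is found in round 1, then re-verified twice.
print(redundant_rounds(["4", "4", "4"], correct="4"))        # 2
print(deviated_after_correct(["4", "4", "5"], correct="4"))  # True
```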

S1-Bench, in more depth: difficulty is linearly decodable from hidden states; the failure is one of action, not perception. The full S1-Bench study (28 LRMs across multi-domain, multilingual questions that are simple for models) goes beyond the prejudgment-but-no-truncation observation. Using DS-R1-1.5B and DS-R1-7B as representative cases, a single-layer MLP trained on the final-layer hidden state of the last token in the encoded question predicts question difficulty, with accuracy increasing monotonically as difficulty rises. The structure is already there: implicit, linear, and decodable without specialized probes. Yet behaviorally, LRMs still produce redundant solution rounds with higher average token entropy on the same questions the probe correctly classifies as easy. The authors interpret this as architectural self-doubt: the model perceives simplicity, then second-guesses its own perception, leading to exploratory generation that overrides the implicit difficulty signal. This localizes the failure to the perception-to-action interface, not to representational capacity or difficulty assessment. The probe-vs-behavior gap is the diagnostic; it predicts that mechanistic interventions routing generation through the difficulty representation should outperform prompt-engineered "answer briefly" instructions, which target the wrong layer. Source: Reasoning Methods CoT ToT.
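A probe of this kind is, at its simplest, a single linear layer with a sigmoid trained on one vector per question. The sketch below is a toy stand-in under heavy assumptions: the 2-dimensional vectors are invented placeholders for real final-layer hidden states, the easy/hard labels are binary, and `train_linear_probe` and `probe_predict` are hypothetical names. It does not reproduce the study's setup or results.

```python
import math
from typing import List, Sequence, Tuple


def train_linear_probe(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit one linear layer + sigmoid by full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (hard) if the linear score is positive, else 0 (easy)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy "hidden states" (assumption): label 0 = easy, 1 = hard.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(states, labels)
print([probe_predict(w, b, s) for s in states])
```

With real models, each feature vector would be the final-layer hidden state at the last question token, and the probe would be scored on held-out questions rather than on its own training set.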

Inquiring lines that read this note 151

This note is a source for the research framings listed below, each of which belongs to a broader line of inquiry.

Why do benchmark improvements fail to reflect actual reasoning quality?

How should models express uncertainty rather than forced confident answers?

Why do models commit to answers early on easy versus hard tasks?

How does AI assistance affect human cognitive development and reasoning autonomy?

When do additional thinking tokens stop improving reasoning performance?

How do training data properties shape reasoning capability development?

How do neural networks separate factual knowledge from reasoning abilities?

How do verbose and concise reasoning occupy different regions in activation space?

How does latent reasoning compare to verbalized chain-of-thought?

What capability tradeoffs emerge when scaling model reasoning abilities?

How can models identify insufficient information and respond appropriately without guessing?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Why do correct reasoning traces tend to be shorter than incorrect ones?

How should inference compute be adaptively allocated based on prompt difficulty?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does fine-tuning degrade reasoning quality even as accuracy improves?

How does example difficulty affect learning efficiency in language models?

Why do models automatically adjust reasoning length to problem difficulty?

Do base models contain latent reasoning that training can unlock?

Why does self-revision increase model confidence while degrading accuracy?

Can model confidence signals reliably improve reasoning quality and calibration?

Can ensemble evaluation methods reduce bias more than single judges?

How does evaluation format change what we measure about model reasoning?

How should iterative research systems allocate reasoning per search step?

How does overthinking in early turns degrade later retrieval rounds?

What actually drives chain-of-thought reasoning improvements in language models?

How does reasoning effort affect AI theory of mind performance?

Can inference-time compute substitute for scaling up model parameters?

What properties determine whether reward signals teach genuine reasoning?

How do soft continuous representations explore multiple reasoning paths simultaneously?

How does soft thinking compare to sampling multiple independent reasoning paths?

Why do reasoning models fail at systematic problem-solving and search?

How can AI systems learn from failures without cascading errors?

Can prompting inject entirely new knowledge into language models?

Can next-token prediction alone produce genuine language understanding?

What other internal model decisions beyond attention could be optimized directly?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Can models internally identify which tokens matter most for reasoning?

How do self-generated feedback mechanisms enable effective model learning?

How much can externalized skills improve models before hitting diminishing returns?

Do language models learn genuine linguistic structure or just surface patterns?

Why do thinking models execute longer tasks than standard language models?

Does reinforcement learning teach reasoning or just when to reason?

Why does extended reasoning training improve exploration without adding new capabilities?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does early commitment in reasoning differ from early exploitation in planning?

Related concepts in this collection 11

Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
the mechanistic explanation for why this threshold exists
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the alternative strategy that avoids the overthinking trap
Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
supporting evidence from a different angle
Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
the complementary failure mode: insufficient depth per path, not just excessive total tokens
Can dialogue planning balance fast responses with strategic depth? Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
DPDP naturally avoids the overthinking threshold by restricting deep search (MCTS) to genuinely uncertain contexts via System 1/2 switching
Do personality types shape how AI agents make strategic choices? This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks.
personality priming modulates reasoning depth: Introversion produces longer, more elaborated rationales, potentially lowering the threshold at which overthinking degrades accuracy; personality conditioning is an unexamined variable in test-time compute allocation
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
retrieval-level analog: just as reasoning tokens past the threshold harm accuracy, retrieval at every step regardless of confidence wastes context and introduces noise; both findings argue for uncertainty-gated resource allocation rather than fixed budgets
Do large language models use one reasoning style or many? Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
cross-domain confirmation: in strategic games, top performers produce shortest CoT in their strongest game types while DeepSeek-R1 exhibits "repeated self-doubt" loops in competitive games that inflate tokens without improvement — the overthinking threshold extends to interactive reasoning
Why do reasoning models struggle with theory of mind tasks? Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
the overthinking threshold is categorically worse for social reasoning: reasoning effort shows zero or negative correlation with ToM performance, meaning extended thinking actively degrades social cognition rather than merely plateauing; social tasks may have a near-zero optimal thinking threshold
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
DTR explains the threshold mechanistically: tokens past the threshold should have low DTR (early layer stabilization = pattern-matching filler rather than genuine computation); Think@n provides a selection mechanism that avoids the overthinking region
Can models recognize question difficulty before they reason? Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?
the architectural mechanism behind S1-Bench's prejudgment-but-no-truncation finding: difficulty is implicitly encoded but generation overrides it

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

reasoning accuracy degrades beyond a critical thinking-token threshold
